
A couple of words about internationalization of applications

by admin

I have been reading Habr regularly for a long time and have noticed that there are few developer-oriented articles about software localization. From my experience of managing localization projects I can say that localization is not only about translating strings and adapting the application to a particular country, but also about constant confrontation (in ideal cases, equal cooperation) with developers.
In this article I will use examples to show how to write so-called localization-friendly code, that is, how to organize resources so that localizing the application becomes much easier and does not waste time and money.
Let me mention right away that we are going to talk about internationalization, by which I mean taking linguistic peculiarities into account at the development stage. If your project's resources were not originally designed for localization and you decide to localize later, making them ready can cost much more than the income localization brings.

Use Unicode

In most cases the question of UTF-8 (or UTF-16) encoding comes up when localizing into Asian languages, where the number of characters can run into the thousands. Even if you are not planning a Korean or Chinese localization at the moment, it is worth choosing a universal encoding in advance. If your localization strategy changes, switching encodings on the fly will be much harder. Tip: use Unicode by default for all resources, even if your project is still only in Russian, English or any other single language.
By the way, the JSON and YAML specifications (these formats are often used to store localizable resources) prescribe the use of Unicode.
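As a small illustration of the tip above, here is a sketch of writing and reading a JSON resource file with an explicit UTF-8 encoding. The file name, keys and strings are made up for the example; the point is only that the encoding is stated explicitly instead of being left to the platform default.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical resource strings; keys and values are illustrative only.
strings = {"save_as": "另存为", "pangram": "Zażółć gęślą jaźń"}


def save_resources(path: Path, data: dict) -> None:
    # ensure_ascii=False keeps the characters readable instead of \uXXXX escapes;
    # encoding="utf-8" avoids depending on the platform default encoding.
    path.write_text(json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8")


def load_resources(path: Path) -> dict:
    return json.loads(path.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    resource_file = Path(tmp) / "strings.zh.json"
    save_resources(resource_file, strings)
    restored = load_resources(resource_file)
```

The round trip returns exactly the strings that were saved, whatever the default encoding of the operating system is.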

Take care of the fonts

This seemingly small thing often turns out to be a critical obstacle to localization. Make sure that the fonts you use contain the characters needed for the target languages (again, primarily Asian languages, but also Hebrew, Arabic and the diacritics of European languages).
Remember that
ä, à or ą ≠ a
just as in Russian "е" does not always equal "ё".
In my practice there was a case where the developers drew their own font containing only the letters needed for English. When it came to localization into German and Polish, they had to draw the missing letters with diacritical marks.
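A check of this kind can be automated. The sketch below collects every character used in a set of translations and compares it with the characters a font covers. The font coverage here is a hand-written string standing in for real data (in practice it would be read from the font file with a font tool), and the German strings are my own examples.

```python
import unicodedata


def required_characters(translations: dict[str, str]) -> set[str]:
    """Collect every non-space character the font must be able to draw."""
    chars = set()
    for text in translations.values():
        # NFC composes "a" + combining mark into one precomposed letter
        # where such a letter exists.
        normalized = unicodedata.normalize("NFC", text)
        chars.update(ch for ch in normalized if not ch.isspace())
    return chars


def missing_glyphs(translations: dict[str, str], font_characters: str) -> list[str]:
    return sorted(required_characters(translations) - set(font_characters))


# Hypothetical font that covers basic Latin letters only.
english_only_font = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
german = {"save_as": "Speichern unter", "open": "Öffnen", "close": "Schließen"}
missing = missing_glyphs(german, english_only_font)
print(missing)  # ['Ö', 'ß']
```

Running such a check on every new language before translation starts is much cheaper than discovering empty boxes in the interface afterwards.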

Leave room for maneuver

Besides fonts, translated text sets another trap for the layout.
Compare how a single menu item is translated into different languages:
ru: Сохранить как
en: Save as
fi: Tallenna nimellä
zh: 另存为
Chinese needs only 3 characters, while Finnish needs 16! Besides the number of characters, the peculiarities of a particular font also matter.
Let's compare the length of the Finnish and Chinese strings (both set in Arial Unicode MS, 12 point): the Finnish text (114 pixels) is 2.5 times longer than the Chinese one (45 pixels).
That is why it is very important to leave some spare room in interface elements so that the displayed text is not cut off. If there is not enough room in some cases, you can use automatic text resizing. However, with this solution the text will very likely be displayed in different sizes in different interface elements.
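A rough way to catch such cases early is to compare the length of each translation with the source string. The sketch below does this with character counts, which is only a crude stand-in for rendered width in pixels; the 1.3 limit and the Finnish string are assumptions chosen for the example.

```python
def expansion_report(source: str, translations: dict[str, str],
                     limit: float = 1.3) -> dict[str, float]:
    """Return the locales whose text is longer than `limit` times the source."""
    report = {}
    for locale, text in translations.items():
        ratio = len(text) / len(source)
        if ratio > limit:
            report[locale] = round(ratio, 2)
    return report


# Illustrative translations of the "Save as" menu item.
save_as = {"fi": "Tallenna nimellä", "zh": "另存为"}
report = expansion_report("Save as", save_as)
print(report)  # {'fi': 2.29}
```

A report like this does not replace looking at the real interface, but it points at the strings that deserve a look first.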

Pseudolocalization

You can use pseudolocalization to find problems before translation starts. It is a testing method that checks whether an application is ready for localization. The idea is that, instead of a translation, the resources are filled with text in a pseudolanguage generated by a special algorithm (which one depends on the software used). The most primitive example: instead of the English text, a transliteration or transcription in Cyrillic letters is substituted:
Save as -> Саве ас
Save as -> Сэйв эз
This method allows you to check the following points:

  • whether diacritical marks (e.g. German, Polish) are displayed correctly;
  • whether languages with other fonts are displayed correctly (e.g., Chinese, Russian);
  • whether there are problems with the display of interface elements for languages with right-to-left direction of text (e.g., Arabic);
  • whether there are problems with displaying non-standard characters (e.g., in user names);
  • if all localizable resources are extracted to separate files (using text directly in the code has many problems, see below about hardcoding).
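To make the idea concrete, here is a minimal sketch of a pseudolocalization function. It is not the algorithm of any particular tool: the accent mapping, the `~` padding, the square brackets and the `{name}` placeholder syntax are all assumptions chosen for the example.

```python
import re

# Accented look-alikes; the exact mapping is an arbitrary choice for this sketch.
ACCENTED = str.maketrans("aeiouyAEIOUYcnsz", "àéîöûýÀÉÎÖÛÝçñšž")
PLACEHOLDER = re.compile(r"(\{[^{}]*\})")  # assumes {name}-style placeholders


def pseudolocalize(text: str, expansion: float = 0.3) -> str:
    # Split on placeholders so that variables such as {count} survive untouched.
    parts = PLACEHOLDER.split(text)
    converted = "".join(
        part if PLACEHOLDER.fullmatch(part) else part.translate(ACCENTED)
        for part in parts
    )
    # Padding imitates longer translations; brackets reveal truncated strings.
    padding = "~" * round(len(text) * expansion)
    return f"[{converted}{padding}]"


print(pseudolocalize("Save as"))         # [Sàvé àš~~]
print(pseudolocalize("Hello, {name}!"))  # [Héllö, {name}!~~~~]
```

Any string that still appears in the interface without brackets and accents has probably been hardcoded, and any string with a missing closing bracket has been cut off by the layout.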

Machine translation into the target language is often used for pseudolocalization. On the one hand, this is a simple solution when there are no special tools for generating pseudotranslations. On the other hand, I have seen many times how developers confused localized resources with pseudolocalized ones and even accidentally replaced a proper translation with machine translation in the repository. Furthermore, machine translation does not always let you check how every character of the language is rendered (for example, the letter œ is not that common in texts, but its rendering should be tested too).
For example, this is what the pseudo-translation plugin interface looks like in memoQ:
[Screenshot: pseudo-translation plugin settings in memoQ]
And this is what the result looks like with these settings:
[Screenshot: pseudo-translated text produced with these settings]

External Resources

To get a complete overview of the localizable material, you need to separate all resources from the code. Multimedia containing text (most often images, but also video and audio, in games for example) should also be stored separately and sorted by locale. First, this greatly simplifies the work of content creators: they will not have to dig through the code to fix a system message. Second, it lets the localization manager accurately estimate deadlines and budget for each language. Third, it makes working with multilingual content far more flexible.
The most popular formats for exchanging localizable data are XLIFF and PO files. In any case, modern translation automation systems can convert almost any file into a format translators can work with.
Google and Apple also strongly advise developers to move all localizable resources out of the code: see the recommendations for Android developers and Apple's recommendations for internationalization.
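A minimal sketch of what "resources outside the code" can look like: one JSON file per locale and a loader that falls back to the default language. The directory layout, file names and the `load_strings` helper are hypothetical, not a convention of any particular framework.

```python
import json
import tempfile
from pathlib import Path


def load_strings(resource_dir: Path, locale: str, default_locale: str = "en") -> dict:
    """Load the default strings, then overlay the requested locale on top."""
    default_file = resource_dir / f"{default_locale}.json"
    strings = json.loads(default_file.read_text(encoding="utf-8"))
    localized_file = resource_dir / f"{locale}.json"
    if locale != default_locale and localized_file.exists():
        strings.update(json.loads(localized_file.read_text(encoding="utf-8")))
    return strings


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "en.json").write_text(
        json.dumps({"save_as": "Save as", "quit": "Quit"}), encoding="utf-8")
    (root / "zh.json").write_text(
        json.dumps({"save_as": "另存为"}, ensure_ascii=False), encoding="utf-8")
    zh = load_strings(root, "zh")  # partially translated locale
    fr = load_strings(root, "fr")  # locale without a file yet
```

With this layout a translator receives a single file per language, and an untranslated key falls back to the default text instead of breaking the interface.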

Hardcoding in internationalization

There is one more important thing to add to the previous point. Localization involves not only translating words, but also adapting numbers, units, date and time formats and punctuation marks to local standards.

Punctuation marks

Many developers like to "sew" punctuation marks into the code, thinking that periods and question marks are exactly the same in all languages. But compare:
ru: Вы уверены?
en: Are you sure?
fr: Êtes-vous sûr ?
es: ¿Está seguro?
ar: هل أنت متأكد؟

In French the question mark is separated from the text by a space (by the way, Habr kept removing the space before the question mark, so I had to fiddle with the tags). In Spanish, a question is framed by an inverted question mark at the beginning and a normal one at the end, and in Arabic the question mark stands at the left end of the phrase and is mirrored. If the question mark comes from the code, it will look wrong to some users (unless you localize the mark itself, but why bother?).
Besides punctuation, be careful about letting the code insert spaces. After all, there are languages that do not put spaces between words, such as Japanese.
They say that localizing Japanese or Chinese apps into European languages can be a living hell if the developers did not take into account that other languages separate words with spaces.
So, punctuation is part of the text, and should be moved to external resources.
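The difference between the two approaches fits in a few lines. In the sketch below the strings and the helper names are my own illustration: the first function glues a question mark on in code, the second takes the whole sentence, punctuation included, from the resources.

```python
# Hypothetical resources: each string carries its own punctuation and spacing.
CONFIRM = {
    "en": "Are you sure?",
    "fr": "Êtes-vous sûr\u00a0?",  # non-breaking space before the question mark
    "es": "¿Está seguro?",
    "ar": "هل أنت متأكد؟",
}


def confirm_hardcoded(question_without_mark: str) -> str:
    # Anti-pattern: the code decides on the punctuation for every language.
    return question_without_mark + "?"


def confirm(locale: str) -> str:
    # The sentence comes from the resources as a whole; fall back to English.
    return CONFIRM.get(locale, CONFIRM["en"])


print(confirm_hardcoded("Está seguro"))  # Está seguro?
print(confirm("es"))                     # ¿Está seguro?
```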

Numbers

Numbers, like words, also need translation. Many developers forget about this and display numeric variables in the format they are used to. Let's compare:
ru: 18 765,22
en: 18,765.22
de: 18.765,22
he: 18,765.22
el: 18.765,22
fa: 18٬765٫22
Notice which symbols are used as the thousands and decimal separators. In English and Hebrew the period and the comma play the opposite roles to those in German and Greek. In Russian a non-breaking space serves as the thousands separator for numbers greater than 9999. Farsi has a special decimal separator, the momayyez (U+066B), and a dedicated thousands separator (U+066C), but usage is not uniform: a comma or a space can also appear as separators.
You can, of course, consider it a trivial matter and "those who need it will understand it that way". However, such little things can sometimes lead to serious misunderstandings, especially when it comes to prices or important engineering calculations.
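Here is a minimal sketch of locale-dependent number formatting. The separator table is hand-written from the comparison above and covers three locales only; a real project would more likely take this data from a locale database than maintain its own table.

```python
# (thousands separator, decimal separator) per locale; illustrative subset.
SEPARATORS = {
    "en": (",", "."),
    "de": (".", ","),
    "ru": ("\u00a0", ","),  # non-breaking space as the thousands separator
}


def format_number(value: float, locale: str) -> str:
    thousands, decimal = SEPARATORS[locale]
    # Format with "," and "." first, then swap the separators through a
    # temporary marker so the two replacements cannot clobber each other.
    text = f"{value:,.2f}"
    return text.replace(",", "\x00").replace(".", decimal).replace("\x00", thousands)


print(format_number(18765.22, "en"))  # 18,765.22
print(format_number(18765.22, "de"))  # 18.765,22
```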
Speaking of prices, let's compare:
ru: 2,25 €
en: €2.25
de-at: € 2,25
de-de: 2,25 €
lv: € 2,25
lt: 2,25 €
The position of the currency sign and the spacing around it differ from language to language, so the conclusion is that it is best not to hardcode these characters either. Moreover, as you can see, the norms differ not only between languages but also between variants of one language (Austria and Germany). Even neighboring Latvia and Lithuania have different norms.
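The same idea applies to prices: the pattern that places the currency sign belongs to the locale data, not to the code. The sketch below uses a hand-written table for three locales taken from the comparison above; the locale keys and the `format_price` helper are assumptions made for the example.

```python
# (pattern, decimal separator) per locale; illustrative subset.
PRICE_PATTERNS = {
    "en": ("{symbol}{amount}", "."),
    "de-DE": ("{amount}\u00a0{symbol}", ","),
    "de-AT": ("{symbol}\u00a0{amount}", ","),
}


def format_price(amount: float, symbol: str, locale: str) -> str:
    pattern, decimal = PRICE_PATTERNS[locale]
    number = f"{amount:.2f}".replace(".", decimal)
    # The pattern, not the code, decides where the currency sign goes.
    return pattern.format(symbol=symbol, amount=number)


print(format_price(2.25, "€", "en"))     # €2.25
print(format_price(2.25, "€", "de-AT"))  # € 2,25 (with a non-breaking space)
```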

Units of measure

Sometimes you need to adapt not only the appearance of a number to national standards, but the number itself. We are talking about units of measurement. If your project uses them, always find out which system is used in the target country so that you can tell the user clearly about speed, length, mass, temperature and so on.
The message "You are traveling at 62 miles per hour" will tell a driver from Pskov nothing, just as the message "You are traveling at a speed of 100 kilometers per hour" might throw a Chicago driver into a panic.
In this case it is not enough to output a numeric variable; you have to dig deeper and change the calculation formula depending on the locale. The ideal solution, though, is to let the user choose the measurement system in the application settings, independently of the locale. In any case, local units of measurement must be taken into account.
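A sketch of this approach: the speed is stored in one canonical unit and converted on output according to the measurement system chosen in the settings. The message keys and the `speed_message` helper are hypothetical.

```python
KM_PER_MILE = 1.609344

# Hypothetical message resources, one whole sentence per measurement system.
SPEED_MESSAGES = {
    "metric": "You are traveling at {value} kilometers per hour",
    "imperial": "You are traveling at {value} miles per hour",
}


def speed_message(speed_kmh: float, system: str) -> str:
    # Keep one canonical unit internally (km/h here) and convert only on output.
    value = speed_kmh if system == "metric" else speed_kmh / KM_PER_MILE
    return SPEED_MESSAGES[system].format(value=round(value))


print(speed_message(100, "metric"))    # You are traveling at 100 kilometers per hour
print(speed_message(100, "imperial"))  # You are traveling at 62 miles per hour
```

The `system` argument comes from the user's settings rather than from the interface language, so a user in Chicago can still ask for kilometers.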

Not all languages have the same grammar

Forced line splitting

Some developers, when organizing text strings, disregard the grammar of other languages and split the text of one string into several values. As a result, messages are assembled from several pieces according to the rules of Russian syntax (or of the developer's native language). In English you can still somehow work around this (and even that is rarely possible), but in German, with its strict word order, gluing the fragments into a single sentence produces complete nonsense. And in Arabic, where everything is written in the opposite direction, this way of organizing content is completely unacceptable.
Here is quite a common example. A Russian-speaking user sees the message: "There are 5 days left until the end of the test period. Please enter a valid key." In the resources this message looks like this:

 'trialexpires_1': "Until the end of the test period ",
 'trialexpires_2sg': "remains ",
 'trialexpires_2pl': "remain ",
 'trialexpires_4sg': " day.",
 'trialexpires_4pl2': " days.",
 'trialexpires_4pl3': " days.",
 'enterkey': "Please enter a valid key."

In principle, you can cheat and translate these shreds of text into English so that the result reads correctly. With Arabic, where the text runs in the other direction, the trick will not work. In German, separable verb prefixes tend to flee to the end of the sentence. Compare the same phrase in English and German and, by the way, note the length again: the German version is about 30% longer. Look at the verbs: in German they can consist of two parts (läuft ... ab, geben ... ein), one of which can end up quite far from the other.
en: Your trial period expires in 5 days. Please enter the valid key.
de: Ihre Testversion läuft in 5 Tagen ab. Bitte geben Sie einen gültigen Produktschlüssel ein.
Another disadvantage is that this representation does not always let the translator grasp the logic of the sentence and supply the correct translation. Imagine how easy it is to get lost in such fragments when there are thousands of them rather than five.
All this tells us that whole strings should go into the resources whenever possible, so that each string not only has the most universal format, but is also comprehensible to the person who will translate it.
The solution for the described situation would be the following:

 'trialexpires': "Until the end of the test period [count:remains|remain] {%n} [count:day|days|days].",
 'enterkey': "Please enter a valid key."

The count operator (or whatever you want to call it) substitutes the right text variant depending on the numeric variable %n. An Arabic translator, who writes from right to left, will have no problems with this representation either: he simply rearranges the variables.
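To show that such an operator is easy to support, here is a sketch of a renderer for the `[count:...]` syntax. The syntax itself is the invented one from the example above, and the `render` function, the plural rules and the sample strings are my own illustration, not an existing library.

```python
import re

COUNT = re.compile(r"\[count:([^\]]*)\]")


def plural_index_ru(n: int) -> int:
    """Russian: 0 for 1, 21, 31...; 1 for 2-4, 22-24...; 2 for everything else."""
    if n % 10 == 1 and n % 100 != 11:
        return 0
    if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
        return 1
    return 2


def plural_index_en(n: int) -> int:
    return 0 if n == 1 else 1


def render(template: str, n: int, plural_index) -> str:
    def choose(match):
        forms = match.group(1).split("|")
        # Fall back to the last form if the translator supplied fewer forms.
        return forms[min(plural_index(n), len(forms) - 1)]
    return COUNT.sub(choose, template).replace("{%n}", str(n))


RU = "До окончания тестового периода [count:остался|осталось|осталось] {%n} [count:день|дня|дней]."
EN = "Your trial period expires in {%n} [count:day|days]."

print(render(RU, 5, plural_index_ru))
print(render(EN, 1, plural_index_en))  # Your trial period expires in 1 day.
```

The translator sees the whole sentence, decides how many forms the language needs and where the number goes, and the code only picks a form by index.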

Layout with forced line break

A fairly common problem: developers try to make sure that text in the interface is laid out properly by breaking it into lines by force. Let me give you an example right away.
The user sees the text like this:
This text is so big,
and the window is so small,
that I have to break
it into lines.

In the resources it might look like this:

 'menubox_string1': "This text is so big, ",
 'menubox_string2': "and the window is so small, ",
 'menubox_string3': "that I have to break ",
 'menubox_string4': "it into lines."

A translator will spend several times longer on such an outrage, working out how to adapt it to his own language. If the translated text is longer (German or French), four lines may not be enough. If it is shorter (Japanese or Chinese), a couple of lines will be left empty. Not to mention that if translation memory technology is used (where the translation of each string is stored and reused for similar or identical strings), this kind of splitting does nothing to make localization efficient.
There are two possible solutions here: either adapt the text to the window size automatically, or use \n if you do not want to trust the machine.
Then the text in the resources will look like this:

 "This text is so big,\nand the window is so small,\nthat I have to break\nit into lines."

In this case the line breaks become more flexible. For example, you can tell the translator the maximum number of characters per line and ask him to place the line breaks where they make the most sense.
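Both options can be sketched in a few lines. The first function checks a translated string with manual \n breaks against a character limit; the second lets the standard `textwrap` module place the breaks. The limits of 25 and 30 characters are arbitrary, and character counts are only an approximation of the real width in the interface.

```python
import textwrap


def too_long_lines(text: str, max_chars: int) -> list[str]:
    """Return the lines of a translated string that exceed the limit."""
    return [line for line in text.split("\n") if len(line) > max_chars]


def auto_wrap(text: str, max_chars: int) -> str:
    # Ignore the translator's breaks and let the machine place new ones.
    return textwrap.fill(text.replace("\n", " "), width=max_chars)


message = ("This text is so big,\nand the window is so small,\n"
           "that I have to break\nit into lines.")

print(too_long_lines(message, 25))  # ['and the window is so small,']
print(auto_wrap(message, 30))
```

The check can run automatically on every delivered translation, so an overlong line is reported to the translator instead of being discovered in the interface.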

Excessive optimization

This mistake is made by overzealous content managers, especially those who optimize English-language texts. In over-optimized resources everything possible (all keywords and sometimes whole expressions) is replaced by constants, which are then reused in localization without regard to case, articles and other features of the target language's grammar. Of course, this gives better control over the consistency of terminology and can also significantly reduce translation costs. But any optimization must be reasonable. Let's look at an example.
The user sees the following text:
You can launch the application from the terminal. Press F2 to access the terminal.
In the resources it is assembled from the following pieces:

 'cmd': "the terminal",
 'app': "the application",
 'act_42': "Press F2",
 'run_from_terminal': "You can launch {app} from {cmd}. {act_42} to access {cmd}."

Suppose the interface uses a lot of words and phrases that the content manager has replaced with constants. He uses these constants in his texts because it is convenient. If one day you decide that the word "terminal" is unacceptable and "command line" should be used instead, or the terminal in the system is replaced by, say, a menu, you will not need to go through a huge body of text. It will be enough to change the value of one constant. An additional advantage is a lower total word count. After all, the cost of translation is usually calculated by the number of words (much less often by the number of lines), so the overall cost of localization can go down. But it is not that simple. Remember how I said that not all languages follow the same grammar rules? That is very important here too.
Let's see how resources in this form would be translated into Russian:

 'cmd': "терминал",
 'app': "приложение",
 'act_42': "Нажмите F2",
 'run_from_terminal': "Вы можете запустить {app} из {cmd}. {act_42}, чтобы открыть {cmd}."

The user will see the following:
"Вы можете запустить приложение из терминал. Нажмите F2, чтобы открыть терминал."
The word "терминал" (terminal) stays in the nominative, although after "из" (from) Russian requires the genitive "терминала".
If you replace the word "приложение" (application) with "программа" (program), it becomes even worse, but more obvious:
"Вы можете запустить программа из терминал. Нажмите F2, чтобы открыть терминал."
Here the verb requires the accusative "программу". Obviously, this approach does not take grammatical case into account.
You do not have to go far to find examples like this. Just look at the dreadfully localized Foursquare:
[Screenshot: Foursquare interface in Russian]
Or look at the filter names. Not all of them read as continuations of the phrase "Show Places…". They are probably constants used elsewhere as well. Or it is simply mindless translation and a lack of localization testing.
[Screenshot: Foursquare filter names in Russian]
Facebook keeps improving its localization with the help of volunteer users (not long ago they posted a vacancy for a localization manager, so hopefully it will get even better soon). Still, this line, for example, does not read like Russian yet; it is built according to the rules of the original language.
[Screenshot: Facebook profile line in Russian]
In the Russian version it would be better to write "Place of study: %VUZ%".
A similar example from another section:
[Screenshot: similar line from another Facebook section]
Conclusion: text constants are certainly useful, but they must take other grammars into account. The ideal approach is to use constants for numbers, units of measure (taking into account the grammar of each language; Russian, for example, has two plural forms: 1 уровень, 2 уровня, 5 уровней, that is, 1 level, 2 levels, 5 levels), proper names (names of software products) and keyboard shortcuts.

Conclusion

Traditionally, software localization has been kept separate from development; moreover, many product managers think of localization as simply replacing the original text with foreign text. As a result, the product as a whole suffers, because:

  • non-optimized resources increase the localization effort;
  • bugs found during localization delay the product's time to market and, again, take extra work to fix;
  • the localization budget keeps growing;
  • "crooked" localization reduces the number of purchases and downloads of the app in a particular region and gives competitors an extra chance. My personal opinion is that a poorly localized product is much worse than a product that is not localized at all.

Even if the application is written for the local market, localization may also be necessary. It is quite possible that in a couple of years there will be a great need for Yandex.Maps in Tajik in Moscow.
Try to develop your applications with internationalization in mind and cooperate with your localization manager or a translation agency already at the development stage to save yourself time, resources and money and to ensure the highest quality of local versions of your products.
