A Localization Horror Story: It Could Happen To You

266 points by gspyrou over 14 years ago

27 comments

jwr over 14 years ago

That is a very good article.

One architectural takeaway I've learned over the years, which is not obvious when reading that article:

"You should not assume that you can generate any part of any string visible to the user without the full context."

Whenever you design a localizable application, it isn't enough to provide a string that can be translated. You have to allow for delegation to the most specific piece of code dealing with the string, because only that piece of code will have the appropriate context to properly produce the string.

This means your code can't just assume it can generate strings somewhere deep inside the guts of a library. The programmer writing the final application that uses your library needs to be able to generate/override those strings on a per-case basis in the actual code that displays them to the user. The strings might be different between two UI windows.

Trust me, I know. I'm Polish. Few languages are as insane as my native tongue. If you don't believe me, take a peek at this concise 252-page introduction to Polish numerals: <a href="http://www.amazon.com/Liczebnik-grammar-numerals-exercises-language/dp/832420234X" rel="nofollow">http://www.amazon.com/Liczebnik-grammar-numerals-exercises-l...</a>
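
A minimal Python sketch of that delegation idea, under stated assumptions: `scan`, `default_summary` and `polish_summary` are hypothetical names invented for illustration, and counting the paths stands in for whatever the library really does. The library ships a bland default string but lets the calling application, which knows the UI context, supply its own wording.

```python
from typing import Callable, Sequence


def default_summary(file_count: int) -> str:
    """Library fallback: deliberately plain, with no grammar to get wrong."""
    return f"Files found: {file_count}"


def scan(paths: Sequence[str],
         describe: Callable[[int], str] = default_summary) -> str:
    """Hypothetical library call; counting the paths stands in for real work."""
    return describe(len(paths))


def polish_summary(n: int) -> str:
    """Application-side override that picks the Polish noun form for n."""
    if n == 1:
        noun = "plik"
    elif n % 10 in (2, 3, 4) and n % 100 not in (12, 13, 14):
        noun = "pliki"
    else:
        noun = "plików"
    return f"Znaleziono {n} {noun}"


print(scan(["a.txt", "b.txt"]))                           # library default
print(scan(["a.txt", "b.txt"], describe=polish_summary))  # caller's wording
```

A second window could pass a different `describe` function for the same count, which is the per-case override the comment argues for.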

patio11 over 14 years ago

My previous company produced a guide to internationalization/localization/etc for engineers (this is kinda helpful to have when a mixed team of Japanese, Koreans, Chinese, Indians, and one very out of place white guy are trying to make multilingual software on top of business processes not designed with diverse client populations in mind).

The guide was somewhat whimsically named bluepill.doc and subtitled Welcome To The Real World. You have no idea how deep this rabbit hole gets. I did this for years and I am regularly surprised by novel, hard problems. It is like security. (It even intersects with security sometimes: since approximately no application developers actually understand encoding issues, there are virtually boundless classes of vulnerabilities arising from their (mis)understandings not matching technical reality.)

(I only found out later that the blue pill was the escape-back-into-comfortable-fantasy option. Whoopsie.)

neilk over 14 years ago

MediaWiki is one of those websites that is translated into nearly every language known to humans, and it has a pretty elegant system for this. Not perfect, but good enough for almost anything. You can get away with minimal markup in the lexicon this way:

* In the code, messages are specified abstractly, e.g.
<pre><code> print getMessage('found_x_files_in_x_dirs', $fileCount, $dirCount); </code></pre>
* Languages each have their own class with a 'convertPlural' function that maps the quantity to the forms. So in English, that function might be simple; for Arabic, it's complex: <a href="http://svn.wikimedia.org/svnroot/mediawiki/trunk/phase3/languages/classes/LanguageAr.php" rel="nofollow">http://svn.wikimedia.org/svnroot/mediawiki/trunk/phase3/lang...</a>
* Lexicons use a simple wiki markup to define the different forms of their language. To illustrate that the arguments don't have to be used in order, I did it in the reverse of how the code passes arguments.
<pre><code> 'found_x_files_in_x_dirs' => "I searched $2 {{PLURAL:$2|directory|directories}} and found $1 {{PLURAL:$1|file|files}}" </code></pre>

So for a language like Arabic you write a similar pipe-delimited list of forms. You just have to know how to lay down the six different forms in the order that LanguageAr.php defined.

Note how this side-steps most (but not all) complicating issues like case or gender, so you don't have to mark it that way in the lexicon. If the word is used in the feminine gender, accusative case plural in the sentence, that's what the translator writes.

All this is mediated with the amazing <a href="http://translatewiki.net/" rel="nofollow">http://translatewiki.net/</a> website, run mostly by volunteers.
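
To see how the pieces fit together, here is a rough Python sketch of the scheme described above. It is not MediaWiki's implementation: `get_message`, `convert_plural_en` and the regular expressions are assumptions made for illustration, and only the English lexicon entry is taken from the comment.

```python
import re
from typing import Callable, Sequence


def convert_plural_en(count: int, forms: Sequence[str]) -> str:
    """English rule: first form for exactly 1, otherwise the last form."""
    return forms[0] if count == 1 else forms[-1]


# Matches {{PLURAL:$2|directory|directories}} style markup (assumed syntax subset).
PLURAL_RE = re.compile(r"\{\{PLURAL:\$(\d+)\|([^{}]*)\}\}")
ARG_RE = re.compile(r"\$(\d+)")


def get_message(template: str, *args: int,
                convert_plural: Callable[[int, Sequence[str]], str] = convert_plural_en) -> str:
    def expand_plural(match: "re.Match[str]") -> str:
        count = args[int(match.group(1)) - 1]   # $1 is the first argument
        forms = match.group(2).split("|")       # pipe-delimited forms
        return convert_plural(count, forms)

    text = PLURAL_RE.sub(expand_plural, template)
    # Substitute the remaining $1, $2, ... placeholders with the arguments.
    return ARG_RE.sub(lambda m: str(args[int(m.group(1)) - 1]), text)


MESSAGE = ("I searched $2 {{PLURAL:$2|directory|directories}} "
           "and found $1 {{PLURAL:$1|file|files}}")

print(get_message(MESSAGE, 3, 1))
```

A language with more plural forms would pass its own `convert_plural` function and list more forms between the pipes; the calling code stays the same.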

thomas11 over 14 years ago

A detailed, insightful and well-written article on the pitfalls and complications of internationalization in software, written by two linguists. Very good read, independent of Perl.

johkra over 14 years ago

Gettext [1] does actually offer a way around this problem, which works fairly well in practice.

You can define the number of plurals and rules to select a plural case. Then you have as many translations as plural forms in your translation (po) file.

Arabic for instance has 6 plural cases with the following rules [2]:
<pre><code> nplurals=6; plural= n==0 ? 0 : n==1 ? 1 : n==2 ? 2 : n%100>=3 && n%100<=10 ? 3 : n%100>=11 ? 4 : 5; </code></pre>

See also <a href="http://wiki.amule.org/index.php/Translations#Plural_forms" rel="nofollow">http://wiki.amule.org/index.php/Translations#Plural_forms</a> for an example of both a rule and the resulting code in the po file.

[1] <a href="http://www.gnu.org/software/gettext/manual/gettext.html#Plural-forms" rel="nofollow">http://www.gnu.org/software/gettext/manual/gettext.html#Plur...</a>
[2] <a href="http://translate.sourceforge.net/wiki/l10n/pluralforms" rel="nofollow">http://translate.sourceforge.net/wiki/l10n/pluralforms</a>
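
To make the Arabic rule above concrete, here is a small Python transcription of it. The function name `arabic_plural_index` and the placeholder labels in `FORMS` are assumptions for illustration; a real po file would hold six Arabic translations in those slots, and gettext evaluates the rule itself.

```python
def arabic_plural_index(n: int) -> int:
    """Index (0..5) of the plural form, following the rule quoted above."""
    if n == 0:
        return 0
    if n == 1:
        return 1
    if n == 2:
        return 2
    if 3 <= n % 100 <= 10:
        return 3
    if n % 100 >= 11:
        return 4
    return 5


# Placeholder labels only; they stand in for msgstr[0] .. msgstr[5].
FORMS = ["msgstr[0]", "msgstr[1]", "msgstr[2]",
         "msgstr[3]", "msgstr[4]", "msgstr[5]"]

for n in (0, 1, 2, 7, 11, 100, 103):
    print(n, FORMS[arabic_plural_index(n)])
```

Note that the rule depends on `n % 100`, so 103 selects the same form as 3 while 100 falls through to the last form.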

Nitramp over 14 years ago

His advice of thinking of phrases as functions is spot on. However I think he's missing the easiest way of formulating solutions that would be usable by many translators/linguists: pattern matching.

Imagine your phrase/function is called:

"Found %n1 matching files in %n2 directories"

You could pattern match for one particular language like this:
<pre><code>
%n1 == 0, %n2 == 0
%n1 == 1, %n2 == 1
%n1 > 1, %n2 == 1
... and so on ...
</code></pre>

With the matching being any (simple?) boolean function of the operator, applied in order. At the point where this becomes too cumbersome, you could fall back to proper code. I bet that this would be much easier to use for translators, with an optional fallback to a programmer if it gets too complex to spell out all combinations.
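
A minimal Python sketch of that ordered pattern matching, assuming a rule is a (predicate, template) pair and the first matching rule wins. `RULES_EN` and `render` are hypothetical names, and the English wordings are illustrative rather than taken from the article.

```python
from typing import Callable, List, Tuple

Rule = Tuple[Callable[[int, int], bool], str]

# Rules are tried top to bottom; the last one is a catch-all.
RULES_EN: List[Rule] = [
    (lambda n1, n2: n1 == 0, "Found no matching files"),
    (lambda n1, n2: n1 == 1 and n2 == 1, "Found one matching file in one directory"),
    (lambda n1, n2: n1 > 1 and n2 == 1, "Found {n1} matching files in one directory"),
    (lambda n1, n2: n1 == 1, "Found one matching file in {n2} directories"),
    (lambda n1, n2: True, "Found {n1} matching files in {n2} directories"),
]


def render(rules: List[Rule], n1: int, n2: int) -> str:
    """Return the template of the first rule whose predicate matches."""
    for predicate, template in rules:
        if predicate(n1, n2):
            return template.format(n1=n1, n2=n2)
    raise ValueError("no rule matched")


print(render(RULES_EN, 5, 1))
print(render(RULES_EN, 5, 2))
```

A translator would edit only the table; the "fall back to proper code" case corresponds to replacing a lambda with a full function when the conditions get too complex.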

metabrew over 14 years ago

At Last.fm we used PHP+Smarty with a gettext-esque pre-compilation step (a bit like IntSmarty) for templates that allowed translators to embed Smarty templating code, so you could write this in Smarty:
<pre><code> {l}Found {$d} directories{/l} </code></pre>
and the replacement for each language could have its own switch on $d to decide how to translate it.

One (unavoidable?) downside is that the translators have to know some basic if/then/Smarty syntax, and if they mess it up your template won't compile. Also you have to trust them somewhat, since they essentially get to execute PHP on your webserver.

mootothemax over 14 years ago

This is why I'm scared to localize any of my applications. I'm in a good position where one of my web apps is used by lots of South Americans and Spanish people, and a couple of users have offered to translate it for free into their local tongue. Which is great... but I'm terrified about the potential amount of work when it comes to "You have X [nouns] set up" in table headings and the like.

revorad over 14 years ago

I'd take the easy route: printf("Directories scanned: %g", $directory_count);

arethuza over 14 years ago

One of the things that makes large ERP systems so complex is that these localization requirements also apply to business rules (e.g. for tax, payroll, even some rather basic accounting processes).

stretchwithme over 14 years ago

It's amazing how many different ways to do something there can be, even when it may seem initially that the way you've been taught is the only possible one. The something in this case being language.

Which is one reason to speak or code in more than one language.

JabavuAdams over 14 years ago

And then you realize that your shiny custom font rendering assumes left-to-right and also that subsequent characters don't modify preceding character glyphs (cursive font).

cpeterso over 14 years ago

Localizing games is even more complicated. User applications typically "talk" to the user, but games also have a variety of characters speaking to each other. And in MMO/sandbox games, you may not be able to anticipate which characters will be conversing. Localizing inter-character conversations also introduces pronouns, which are not often used for user applications.

Game translators need to know all sorts of metadata about the speaker and listener characters' "social context", such as gender, age, and "honor". Is the old king speaking to a young peasant girl or an entire village? Is the young prince speaking to an old peasant woman or to his old grandmother? Is the grandmother speaking to her son, who happens to be the king?

bromley over 14 years ago

Ouch, languages are complicated. Two simple solutions that I think might work for a small packaged-software company like mine:

1. Just don't bother internationalizing. I think that's often not a bad solution for small software businesses anyway. It's not much use making a software package that works in French/German/Russian/Chinese/Swahili unless you have the language skills or partnerships to sell and support the software in those languages as well.

2. Design the software such that messages are whole sentences or phrases that stand alone and can be translated one-to-one. Nothing fancy like the %g type stuff for number inserts. Keep it simple, stupid.

Vivtek over 14 years ago

And this is to say nothing of the near impossibility of finding a commercial translator who would know even simple Perl.

Dammit, there has got to be a way for me to make money there.

frankc over 14 years ago

Not always appropriate, but what about simply not using proper sentences? What would be the implications of using this kind of form:

Directories scanned: %g
Files found: %g, Directories with files: %g

speleding over 14 years ago

It's a very good article. One thing that has bitten me while doing localization is the programmer instinct to reuse code as much as possible. If you have ten buttons labeled "Save" then you only need one translation for it, right? Wrong! After you have had to go back through the code and split out all those "Save" labels into different contexts the lesson is ingrained...
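
One common way to avoid that trap is to key translations by context as well as by source text. The sketch below is a Python illustration under stated assumptions: the catalog, the context names and the German wordings are invented for the example, and `translate` is a hypothetical helper rather than any library's API.

```python
from typing import Dict, Tuple

# Hypothetical German catalog: the same English label maps to different
# wording depending on where the button appears (illustrative wording only).
CATALOG_DE: Dict[Tuple[str, str], str] = {
    ("document", "Save"): "Speichern",
    ("settings", "Save"): "Übernehmen",
}


def translate(context: str, text: str,
              catalog: Dict[Tuple[str, str], str]) -> str:
    """Look up text within a context; fall back to the source text."""
    return catalog.get((context, text), text)


print(translate("document", "Save", CATALOG_DE))
print(translate("settings", "Save", CATALOG_DE))
print(translate("export", "Save", CATALOG_DE))  # untranslated context
```

Python's standard `gettext` module offers a comparable context-aware lookup through `pgettext` (available since Python 3.8).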

ajithvl over 14 years ago

I have been using grasshopper [1] for a project and it already allows having functions in translations. It was nice to see a framework in a young ecosystem like node.js which can already handle this :)

[1] <a href="https://github.com/virtuo/grasshopper" rel="nofollow">https://github.com/virtuo/grasshopper</a>

weavejester over 14 years ago

I'm rather of the opinion that the translations should be sandboxed scripts, rather than strings.

pronik over 14 years ago

A note to all the readers: the history part of the article is good, but the technical Perl part is very ill-advised, partly due to when it was written (1999, prior to Gettext's plurals support). Please don't follow that advice if you are writing Perl.

wlievens over 14 years ago

In Java, I would use FreeMarker as the template language for all sorts of things, including i18n, rather than the typical sprintf-like syntax, because FM templates can contain arbitrarily complex code, but the simple template cases work too.

jhrobert over 14 years ago

OTOH:

search result: directory: NN, file: NN.

as in "search result: directory: 1, file: 0." or "search result: directory: 4, file: 23."

i.e. nouns only, singular form, no verbs, no plural, etc.

Sure, it does not look "good", but it's probably much easier to translate.

Worse is better.

gregwebs over 14 years ago

Here is one possible solution: using a grammar <a href="http://www.grammaticalframework.org/" rel="nofollow">http://www.grammaticalframework.org/</a>

DenisM over 14 years ago

So the answer is to have translators write Perl? Really?

Is this the best we, as an industry, can do?

locopati over 14 years ago

For just this situation, Java offers the MessageFormat (and the ChoiceFormat refinement):

form.applyPattern( "There {0,choice,0#are no files|1#is one file|1<are {0,number,integer} files}.");

<a href="http://download.oracle.com/javase/6/docs/api/index.html?java/text/MessageFormat.html" rel="nofollow">http://download.oracle.com/javase/6/docs/api/index.html?java...</a>
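
For readers who don't use Java, here is a rough Python analogue of what that choice pattern selects. It is an illustration, not a port of ChoiceFormat: the `CHOICES` table and `files_message` are made-up names, and the lower-limit selection only mirrors the three branches of this particular pattern for non-negative integers.

```python
from typing import List, Tuple

# (lower limit, phrase). The pattern's "1<" means "strictly greater than 1",
# which for whole numbers is the same as a lower limit of 2.
CHOICES: List[Tuple[int, str]] = [
    (0, "are no files"),
    (1, "is one file"),
    (2, "are {n:,} files"),
]


def files_message(n: int) -> str:
    """Pick the phrase with the highest lower limit that n reaches."""
    selected = CHOICES[0][1]
    for limit, phrase in CHOICES:
        if n >= limit:
            selected = phrase
    return "There " + selected.format(n=n) + "."


for count in (0, 1, 1273):
    print(files_message(count))
```

As the replies in this thread point out for other approaches, a fixed set of numeric ranges like this covers English but not languages whose forms depend on the last digits of the number.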

Tichy over 14 years ago

Couldn't read it all. I guess the takeaway is that the localization system needs to be scriptable somehow; templates of the form "bla bla %1" are not sufficient.
