That is a very good article.<p>One architectural takeaway I've learned over the years, which is not obvious from reading the article:<p>"You should not assume that you can generate any part of any user-visible string without the full context."<p>When you design a localizable application, it isn't enough to provide a string that can be translated. You have to allow delegation to the most specific piece of code dealing with the string, because only that code has the context needed to produce it properly.<p>This means your code can't just generate strings somewhere deep inside the guts of a library. The programmer writing the final application that uses your library needs to be able to generate or override those strings case by case, in the code that actually displays them to the user. The same string might need to differ between two UI windows.<p>Trust me, I know. I'm Polish. Few languages are as insane as my native tongue. If you don't believe me, take a peek at this concise 252-page introduction to Polish numerals: <a href="http://www.amazon.com/Liczebnik-grammar-numerals-exercises-language/dp/832420234X" rel="nofollow">http://www.amazon.com/Liczebnik-grammar-numerals-exercises-l...</a>
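A rough sketch of the delegation idea, in Python. Everything here is hypothetical (the `delete_files` library function, the `describe` hook, the message wording); the point is only that the library exposes the count and lets the caller, who has the context, build the user-visible string.

```python
from typing import Callable, Optional, Sequence


def default_summary(count: int) -> str:
    # Fallback wording used when the caller supplies nothing better.
    return f"{count} file(s) deleted"


def delete_files(paths: Sequence[str],
                 describe: Optional[Callable[[int], str]] = None) -> str:
    # Hypothetical library function. The deletion itself is omitted;
    # only the message path matters for this sketch.
    formatter = describe if describe is not None else default_summary
    return formatter(len(paths))


def polish_summary(count: int) -> str:
    # Application-side override: Polish needs three noun forms.
    if count == 1:
        noun = "plik"
    elif count % 10 in (2, 3, 4) and count % 100 not in (12, 13, 14):
        noun = "pliki"
    else:
        noun = "plików"
    return f"Usunięto {count} {noun}"
```

The application calls `delete_files(paths, describe=polish_summary)`; a different window could pass a different formatter without touching the library.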
My previous company produced a guide to internationalization/localization/etc for engineers (this is kinda helpful to have when a mixed team of Japanese, Koreans, Chinese, Indians, and one very out of place white guy are trying to make multilingual software on top of business processes not designed with diverse client populations in mind).<p>The guide was somewhat whimsically named bluepill.doc and subtitled Welcome To The Real World. You have no idea how deep this rabbit hole gets. I did this for years and I am regularly surprised by novel, hard problems. It is like security. (It even intersects with security sometimes: since approximately no application developers actually understand encoding issues, there are virtually boundless <i>classes</i> of vulnerabilities arising from their (mis)understandings not matching technical reality.)<p>(I only found out later that the blue pill was the escape-back-into-comfortable-fantasy option. Whoopsie.)
MediaWiki is one of those websites that is translated into nearly every language known to humans, and it has a pretty elegant system for this. Not perfect, but good enough for almost anything. You can get away with minimal markup in the lexicon this way:<p>* In the code, messages are specified abstractly, e.g.<p><pre><code> print getMessage('found_x_files_in_x_dirs', $fileCount, $dirCount);
</code></pre>
* Each language has its own class with a 'convertPlural' function that maps a quantity to the right form. In English that function might be simple; for Arabic, it's complex: <a href="http://svn.wikimedia.org/svnroot/mediawiki/trunk/phase3/languages/classes/LanguageAr.php" rel="nofollow">http://svn.wikimedia.org/svnroot/mediawiki/trunk/phase3/lang...</a><p>* Lexicons use a simple wiki markup to define the different forms for their language. To illustrate that the arguments don't have to be used in order, I used them in the reverse of the order in which the code passes them.<p><pre><code> 'found_x_files_in_x_dirs' =>
"I searched $2 {{PLURAL:$2|directory|directories}}
and found $1 {{PLURAL:$1|file|files}}"
</code></pre>
So for a language like Arabic you write a similar pipe-delimited list of forms. You just have to know how to lay down the six different forms in the order that LanguageAr.php defined.<p>Note how this side-steps most (but not all) complicating issues like case or gender, so you don't have to mark it that way in the lexicon. If the word is used in the feminine gender, accusative case plural in the sentence, that's what the translator writes.<p>All this is mediated with the amazing <a href="http://translatewiki.net/" rel="nofollow">http://translatewiki.net/</a> website, run mostly by volunteers.
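To make the mechanism concrete, here is a minimal Python sketch of how a `{{PLURAL:$N|...}}` lexicon entry could be expanded. It is not MediaWiki's code: `get_message`, `convert_plural_en`, and the regex are assumptions made for illustration, and only the English two-form rule is shown.

```python
import re

# Matches {{PLURAL:$N|form|form|...}} where N is a 1-based argument index.
PLURAL_RE = re.compile(r"\{\{PLURAL:\$(\d+)\|([^{}]*)\}\}")


def convert_plural_en(count, forms):
    # English: first form for exactly one, last form otherwise.
    return forms[0] if count == 1 else forms[-1]


def get_message(template, *args, convert_plural=convert_plural_en):
    def expand(match):
        count = args[int(match.group(1)) - 1]
        forms = match.group(2).split("|")
        return convert_plural(count, forms)

    text = PLURAL_RE.sub(expand, template)
    # Substitute the positional parameters $1, $2, ... last.
    return re.sub(r"\$(\d+)", lambda m: str(args[int(m.group(1)) - 1]), text)


LEXICON_EN = {
    "found_x_files_in_x_dirs":
        "I searched $2 {{PLURAL:$2|directory|directories}} "
        "and found $1 {{PLURAL:$1|file|files}}",
}
```

Calling `get_message(LEXICON_EN["found_x_files_in_x_dirs"], 3, 1)` gives "I searched 1 directory and found 3 files". A language with more forms would supply its own `convert_plural` and a longer pipe-delimited list.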
A detailed, insightful and well-written article on the pitfalls and complications of internationalization in software, written by two linguists. Very good read, independent of Perl.
Gettext [1] does actually offer a way around this problem, which works fairly well in practice.<p>You can define the number of plurals and rules to select a plural case. Then you have as many translations as plural forms in your translation (po) file.<p>Arabic for instance has 6 plural cases with the following rules [2]:<p><pre><code> nplurals=6; plural= n==0 ? 0 : n==1 ? 1 : n==2 ? 2 : n%100>=3 && n%100<=10 ? 3 : n%100>=11 ? 4 : 5;
</code></pre>
See also <a href="http://wiki.amule.org/index.php/Translations#Plural_forms" rel="nofollow">http://wiki.amule.org/index.php/Translations#Plural_forms</a> for an example of both a rule and the resulting code in the po file.<p>[1] <a href="http://www.gnu.org/software/gettext/manual/gettext.html#Plural-forms" rel="nofollow">http://www.gnu.org/software/gettext/manual/gettext.html#Plur...</a>
[2] <a href="http://translate.sourceforge.net/wiki/l10n/pluralforms" rel="nofollow">http://translate.sourceforge.net/wiki/l10n/pluralforms</a>
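For anyone who finds the C-style ternary chain hard to read, here is the same Arabic rule transcribed into Python. The function names and the placeholder form strings are made up for this sketch; a real po file would hold six actual translations in this order.

```python
def arabic_plural_index(n: int) -> int:
    # Transcription of the gettext rule quoted above (nplurals=6).
    if n == 0:
        return 0
    if n == 1:
        return 1
    if n == 2:
        return 2
    if 3 <= n % 100 <= 10:
        return 3
    if n % 100 >= 11:
        return 4
    return 5


def pick_form(forms, n):
    # forms must hold exactly six entries, in the order the rule defines.
    if len(forms) != 6:
        raise ValueError("expected six plural forms")
    return forms[arabic_plural_index(n)].format(n=n)


# Placeholders only; these stand in for real translated strings.
PLACEHOLDER_FORMS = [f"<form {i}: {{n}}>" for i in range(6)]
```

Note that 100, 101, and 102 all fall through to the last form, while 103 goes back to form 3.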
His advice of thinking of phrases as functions is spot on. However I think he's missing the easiest way of formulating solutions that would be usable by many translators/linguists: pattern matching.<p>Imagine your phrase/function is called:<p>"Found %n1 matching files in %n2 directories"<p>You could pattern match for one particular language like this:<p><pre><code> %n1 == 0, %n2 == 0
%n1 == 1, %n2 == 1
%n1 > 1, %n2 == 1
... and so on ...
</code></pre>
With each match being any (simple?) boolean function of the arguments, applied in order. At the point where this becomes too cumbersome, you could fall back to proper code. I bet this would be much easier for translators to use, with an optional fallback to a programmer if it gets too complex to spell out all combinations.
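A sketch of what such an ordered rule table might look like in Python, for English only. The rule list, the `render` helper, and the exact wording are assumptions for illustration; a translator-facing format would presumably be a plain text file rather than lambdas.

```python
# Each rule is (predicate, template); the first predicate that holds wins.
RULES_EN = [
    (lambda n1, n2: n2 == 0, "No directories were searched"),
    (lambda n1, n2: n1 == 0 and n2 == 1, "Found no matching files in 1 directory"),
    (lambda n1, n2: n1 == 0, "Found no matching files in {n2} directories"),
    (lambda n1, n2: n1 == 1 and n2 == 1, "Found 1 matching file in 1 directory"),
    (lambda n1, n2: n1 == 1, "Found 1 matching file in {n2} directories"),
    (lambda n1, n2: n2 == 1, "Found {n1} matching files in 1 directory"),
    (lambda n1, n2: True, "Found {n1} matching files in {n2} directories"),
]


def render(rules, n1, n2):
    # Apply the rules in order and format the first matching template.
    for predicate, template in rules:
        if predicate(n1, n2):
            return template.format(n1=n1, n2=n2)
    raise LookupError("no rule matched")
```

The catch-all rule at the end plays the role of the default case; dropping it turns an unanticipated combination into an error instead of a wrong sentence.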
At Last.fm we used PHP+Smarty with a gettext-esque pre-compilation step (a bit like IntSmarty) for templates that allowed translators to embed smarty templating code, so you could write this in smarty:<p><pre><code> {l}Found {$d} directories{/l}
</code></pre>
and the replacement for each language could have its own switch on $d to decide how to translate it.<p>One (unavoidable?) downside is that the translators have to know some basic if/then/smarty syntax, and if they mess it up your template won't compile. Also you have to trust them somewhat, since they essentially get to execute PHP on your webserver.
This is why I'm scared to localize any of my applications. I'm in a good position where one of my web apps is used by lots of South Americans and Spanish people, and a couple of users have offered to translate it for free into their local tongue. Which is great... but I'm terrified about the potential amount of work when it comes to "You have X [nouns] set up" in table headings and the like.
One of the things that makes large ERP systems so complex is that these localization requirements also apply to business rules (e.g. for tax, payroll, even some rather basic accounting processes).
It's amazing how many different ways there can be to do something, even when it initially seems that the way you've been taught is the only possible one. The something in this case being language.<p>Which is one reason to speak or code in more than one language.
And then you realize that your shiny custom font rendering assumes left-to-right and also that subsequent characters don't modify preceding character glyphs (cursive font).
Localizing games is even more complicated. User applications typically "talk" to the user, but games also have a variety of characters speaking to each other. And in MMO/sandbox games, you may not be able to anticipate which characters will be conversing. Localizing inter-character conversations also introduces pronouns, which are not often used in user applications.<p>Game translators need to know all sorts of metadata about the speaker and listener characters' "social context", such as gender, age, and "honor". Is the old king speaking to a young peasant girl or an entire village? Is the young prince speaking to an old peasant woman or to his old grandmother? Is the grandmother speaking to her son, who happens to be the king?
Ouch, languages are complicated. Two simple solutions that I think might work for a small packaged-software company like mine:<p>1. Just don't bother internationalizing. I think that's often not a bad solution for small software businesses anyway. It's not much use making a software package that works in French/German/Russian/Chinese/Swahili unless you have the language skills or partnerships to sell and support the software in those languages as well.<p>2. Design the software such that messages are whole sentences or phrases that stand alone and can be translated one-to-one. Nothing fancy like the %g type stuff for number inserts. Keep it simple stupid.
<i>And this is to say nothing of the near impossibility of finding a commercial translator who would know even simple Perl.</i><p>Dammit, there has <i>got</i> to be a way for me to make money there.
Not always appropriate, but what about simply not using proper sentences? What would be the implications of using this kind of form:<p>Directories scanned: %g<p>Files found: %g, Directories with files: %g
It's a very good article. One thing that has bitten me while doing localization is the programmer instinct to reuse code as much as possible. If you have ten buttons labeled "Save" then you only need one translation for it, right? Wrong! After you have had to go back through the code and split out all those "Save" labels into different contexts the lesson is ingrained...
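A small Python sketch of keying translations by context as well as by source text, which is roughly what gettext's message contexts (msgctxt) are for. The catalog, the context names, and the helper are hypothetical, and it uses "Open" instead of "Save" because the German split is easy to show: the verb on a button versus the adjective in a status column.

```python
# Keys are (context, source text); the same English label maps to
# different German words depending on where it appears.
CATALOG_DE = {
    ("button", "Open"): "Öffnen",        # verb: open this file
    ("ticket_status", "Open"): "Offen",  # adjective: the ticket is open
}


def translate(catalog, context, text):
    # Fall back to the untranslated source text if the pair is missing.
    return catalog.get((context, text), text)
```

With a single shared "Open" key, one of those two places would be wrong no matter which translation was chosen.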
I have been using grasshopper [1] for a project, and it already allows functions in translations.
It was nice to see a framework in a young ecosystem like node.js that can already handle this :)<p>[1]<a href="https://github.com/virtuo/grasshopper" rel="nofollow">https://github.com/virtuo/grasshopper</a>
A note to all readers: the history part of the article is good, but the technical Perl part is very ill-advised, partly because of when it was written (1999, prior to Gettext's plurals support). Please don't follow that advice if you are writing Perl.
In Java, I would use FreeMarker as the template language for all sorts of things, including i18n, rather than the typical sprintf-like syntax, because FM templates can contain arbitrarily complex code, but the simple template cases work too.
OTOH:<p>search result: directory: NN, file: NN.<p>as in "search result: directory: 1, file: 0." or "search result: directory: 4, file: 23."<p>i.e. nouns only, singular form, no verbs, no plurals, etc.<p>Sure, it does not look "good", but it's probably much easier to translate.<p>Worse is better.
Here is one possible solution: using a grammar
<a href="http://www.grammaticalframework.org/" rel="nofollow">http://www.grammaticalframework.org/</a>
For just this situation, Java offers MessageFormat (and its ChoiceFormat refinement):<p>form.applyPattern(
"There {0,choice,0#are no files|1#is one file|1<are {0,number,integer} files}.");<p><a href="http://download.oracle.com/javase/6/docs/api/index.html?java/text/MessageFormat.html" rel="nofollow">http://download.oracle.com/javase/6/docs/api/index.html?java...</a>
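For readers who don't know the choice syntax, here is a Python sketch that mimics how such a pattern is selected. It is an approximation of the semantics, not Java's implementation: `choice` and `files_message` are invented names, the `{n}` placeholder replaces Java's `{0,number,integer}`, and only the `#` (at or above the limit) and `<` (strictly above the limit) separators are handled.

```python
import re

# One segment looks like "<limit>#<text>" or "<limit><<text>".
_PART_RE = re.compile(r"(-?\d+(?:\.\d+)?)([#<])(.*)", re.S)


def choice(value, pattern):
    chosen = None
    for part in pattern.split("|"):
        match = _PART_RE.fullmatch(part)
        if match is None:
            raise ValueError(f"bad choice segment: {part!r}")
        limit, op, text = float(match.group(1)), match.group(2), match.group(3)
        applies = value >= limit if op == "#" else value > limit
        # Limits ascend, so the last applicable segment wins; the first
        # segment is the fallback for values below every limit.
        if applies or chosen is None:
            chosen = text
    return chosen


def files_message(n):
    pattern = "0#are no files|1#is one file|1<are {n} files"
    return "There " + choice(n, pattern).format(n=n) + "."
```

This covers the English case neatly, but the choice is still driven by numeric ranges only, so languages whose forms depend on the last digits of the number need something more expressive.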
Couldn't read it all. I guess the takeaway is that the localization system needs to be scriptable somehow, templates of the form "bla bla %1" are not sufficient.