Thanks for the write-up. A very interesting read. It's an area that is ripe for innovation, and a massively growing industry.<p>The XLIFF and TMX formats also offer flexibility in the handling of translated data, as with .po files, but there are many problems still to be solved, as contingencies mentions.<p>As you mention, "Real people are still required to do the translations and verify them", and the army of professional translators and agencies in the market is on hand to do that, but developers often work in formats those translators are unfamiliar with.<p>The bulk of a freelancer's work is in MS Office files, run through a CAT (computer-assisted translation) tool, and the resulting file (and translation memory, TM) is delivered. When a developer needs a bunch of strings translated, they stray into unfamiliar territory for the average freelancer.<p>Specialists are out there, but a common format approach would help here. Most professional CAT tools (costing from 200-1000+ of your local currency units) can process .po files, which is a bonus, but doesn't solve many of the remaining problems out there.<p>A multi-language translation memory (i.e. several source/target combinations) would be useful in many cases, as would a simple 'export translatables' button in the admin dashboards of apps.<p>I hope more HN readers dig into the problems mentioned here, as technical solutions could have a big influence on the future of globalisation(-ization!).
Thanks for the write-up. I deal with the same issue in our company, and while we do work in Gettext with UTF-8 (which solves most basic issues just fine), it seems every project that does i18n is cooking it up in its own way, and I have not been able to find many references online. I will probably write an article describing our setup when I get around to polishing it.<p>The consensus around Transifex in #i18n@freenode seems to be that the open source version is old and not maintained and should not be used. The SaaS offering is much newer and packs quite a few more features.<p>The "good" open source offering appears to be Pootle [0].<p>Honestly, I would be very worried about depending on a cloud service such as Transifex for something that is so deeply embedded into our (pretty continuous) development process. This requires automation, and all the time invested in integrating with release processes and continuous integration can easily go overboard. Of course, if Transifex were seamlessly integrated with project management applications out of the box, then it wouldn't be such a risky proposition.<p>----<p>An interesting point about i18n that is quite independent from the tool selection is how you write your message identifiers. You can basically use labels (i.e., an ID for the string) or use the "original" string.<p>Here's the tradeoff: if you use an ID, translators must constantly refer to the application to understand what the translation should say (and in any non-trivial application, this is a huge burden for them), and there is either no string reuse (because places with the same intended content have used different IDs), or the need for a nitpicking curator to go around chastising developers ("the OK button should always be ACTION_BUTTON_LABEL_OK!! fix it!!").
On the other hand, if you use original strings in English you will run into language collisions (two places where the original string in English is the same, but the translated one is not), so you end up introducing artificial differences to make them unique (e.g. "Request (verb)" and "Request (noun)" instead of just "Request").<p>A hack that goes a long way, if your engineering team is based in a country that uses a Latin language, is to use that language instead of English for original strings. Latin languages are typically more inflected than English, so collisions are greatly reduced. Chances are your translation team is based in that country as well, so no harm done.<p>----<p>If you are doing branchy development, I put together a wiki page [1] on the Mercurial wiki with a script I use to merge translation catalogs (.po) seamlessly when doing branch merges. It can easily be used with git as well.<p>----<p>Links<p>[0] <a href="http://pootle.translatehouse.org/" rel="nofollow">http://pootle.translatehouse.org/</a><p>[1] <a href="http://mercurial.selenic.com/wiki/MergeGettext" rel="nofollow">http://mercurial.selenic.com/wiki/MergeGettext</a>
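<p>A minimal sketch of the collision problem described above, assuming a toy in-memory catalog: the `CATALOG_ES` dictionary and the `translate` helper are hypothetical names for illustration, not a gettext API. Keying entries on a (context, original string) pair lets the source string stay plain "Request" while still translating differently per use; the `msgctxt` field in .po files serves a similar purpose.

```python
from typing import Optional

# Hypothetical Spanish catalog keyed by (context, original English string).
# The same English word maps to different translations depending on context.
CATALOG_ES = {
    ("verb", "Request"): "Solicitar",
    ("noun", "Request"): "Solicitud",
    (None, "OK"): "Aceptar",
}


def translate(msgid: str, context: Optional[str] = None) -> str:
    """Look up msgid under the given context; fall back to the original string."""
    return CATALOG_ES.get((context, msgid), msgid)


print(translate("Request", context="verb"))  # Solicitar
print(translate("Request", context="noun"))  # Solicitud
print(translate("Cancel"))                   # untranslated, falls back to "Cancel"
```

The fallback to the original string mirrors the usual gettext behaviour for missing entries, which is one reason the "original string" style degrades more gracefully than opaque IDs.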
Great blog post about l10n and i18n! I'm working on improving that process in our company, and I'm currently choosing Zanata [0] as a (Java-based) translation platform because, compared with Transifex's no-longer-maintained community edition (how unfortunate!) and Pootle, Zanata's installation was actually painless and the community around it is very responsive!<p>Too bad I didn't stumble upon Weblate [1] first though, it looks promising (thanks onemorepassword).<p>I've set up an independent "localization server" that executes the following process:<p>1) Regularly pulls new revisions of the code and updates to the latest revision.<p>2) A Mercurial hook [2] is then called and the source strings are extracted from the code with xgettext [3], so that new POT gettext files are generated.<p>3) The POT files are finally pushed to Zanata's server via its API.<p>We currently do in-house translations for one locale, while others are managed by an external translation provider. Employees in our company can just log in (Zanata provides OpenID authentication) and collaboratively translate and review the application strings. Zanata can also be used to export resource files and push projects to our external translation provider's platform via their API.<p>But as others have said in this thread, l10n automation currently involves a lot of manual code gluing and adapting to your version control system.
There's definitely potential here, since the available solutions only address the translation step and don't go very far in automating the rest of the process.<p>I'd be more than glad to compare notes with others who have gone through the same experience!<p>---<p>Links<p>[0] <a href="http://zanata.org/" rel="nofollow">http://zanata.org/</a><p>[1] <a href="http://weblate.org/fr/" rel="nofollow">http://weblate.org/fr/</a><p>[2] <a href="http://mercurial.selenic.com/wiki/Hook" rel="nofollow">http://mercurial.selenic.com/wiki/Hook</a><p>[3] <a href="http://www.gnu.org/software/gettext/manual/html_node/xgettext-Invocation.html" rel="nofollow">http://www.gnu.org/software/gettext/manual/html_node/xgettex...</a>
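<p>The extraction step of the process above (step 2) might look roughly like the sketch below. It is not the actual setup described in this comment: the function names, the assumption that the code base is Python, and the `messages.pot` output path are all hypothetical. Only the xgettext options `--from-code`, `--language` and `--output` are taken from the xgettext manual. Pushing the result to Zanata is left out, since that depends on their API.

```python
import subprocess
from pathlib import Path
from typing import List


def build_xgettext_command(source_root: Path, pot_path: Path,
                           pattern: str = "*.py") -> List[str]:
    """Build the xgettext invocation for every matching file under source_root."""
    sources = sorted(str(p) for p in source_root.rglob(pattern))
    return [
        "xgettext",
        "--from-code=UTF-8",     # source files are assumed to be UTF-8
        "--language=Python",     # assumption: the project is written in Python
        f"--output={pot_path}",  # where the POT template is written
        *sources,
    ]


def extract_strings(source_root: Path, pot_path: Path) -> None:
    """Run xgettext; raises CalledProcessError if extraction fails."""
    subprocess.run(build_xgettext_command(source_root, pot_path), check=True)
```

A version control hook would then call something like `extract_strings(Path("."), Path("messages.pot"))` after each update, which requires xgettext to be installed on the localization server.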
"Finally the Java property file format was used (with UTF-8 encoding) which while having bugs in the import and export escaping these could at least be worked around."<p>The Java properties file format is ISO-8859-1, not UTF-8. I have to wonder whether that's the source of the bugs you hit. While you can have a file that is UTF-8, there are a couple of wrinkles with trying to use that with Java i18n.<p>See:
<a href="http://docs.oracle.com/javase/6/docs/api/java/util/ResourceBundle.html#getBundle%28java.lang.String,%20java.util.Locale,%20java.lang.ClassLoader%29" rel="nofollow">http://docs.oracle.com/javase/6/docs/api/java/util/ResourceB...</a><p>... when you load a ResourceBundle, it tries to load a properties file, and it ends up calling this method:<p><a href="http://docs.oracle.com/javase/6/docs/api/java/util/Properties.html#load%28java.io.InputStream%29" rel="nofollow">http://docs.oracle.com/javase/6/docs/api/java/util/Propertie...</a><p>... which mentions the encoding.<p>There are a couple of ways around this: one is to write a bunch of code to change how ResourceBundles are loaded; the other is to use Java's native2ascii tool in your build to provide files that are correctly escaped.
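<p>To make the escaping concrete, here is a rough Python sketch of the kind of transformation native2ascii performs: every non-ASCII character is replaced by `\uXXXX` escapes so the file survives being read as ISO-8859-1. The function name `escape_for_properties` is made up for this example, and it is only an approximation of the tool (lowercase hex digits are an assumption about its output; Java accepts either case).

```python
def escape_for_properties(text: str) -> str:
    """Replace every non-ASCII character with Java-style \\uXXXX escapes."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)  # plain ASCII passes through unchanged
            continue
        # Java escapes are UTF-16 code units, so characters outside the
        # Basic Multilingual Plane become a surrogate pair (two escapes).
        units = ch.encode("utf-16-be")
        for i in range(0, len(units), 2):
            out.append("\\u%04x" % int.from_bytes(units[i:i + 2], "big"))
    return "".join(out)


print(escape_for_properties("greeting=café"))  # greeting=caf\u00e9
```

Running translated values through a step like this before they land in a `.properties` file avoids relying on the encoding the loader assumes.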
Transifex looks nice, thanks for the tip, but it seems like you have to add a lot of glue to connect your own version control to their proprietary version control via their API.<p>What I would really like is something like Weblate (<a href="http://weblate.org" rel="nofollow">http://weblate.org</a>), that you can hook in directly to your code repo. Is there anything like that out there?
Whilst the key/value approach is solid, the 'industry standard' .po (GNU gettext / <a href="https://en.wikipedia.org/wiki/Gettext" rel="nofollow">https://en.wikipedia.org/wiki/Gettext</a>) format supports more features, like complex plural and ordinal/cardinal number support, which some languages require.<p>In addition, some of the biggest issues with internationalization in my experience (~exclusively i18n projects for 10+ years) are generally missing/broken support in certain components (great reasons to contribute resources upstream for open source projects!), managing translations over time, cultural issues, right-to-left text, differing program-level logic (e.g. maximum SMS message length variations based upon character set requirements), and differing seasons/days of operation/holidays. Calendars are of course a pain (though a solved one), as are timezones, for which a truly synchronized, global approach is frustratingly hard to deploy at the best of times.
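<p>As an illustration of why plain key/value pairs fall short for plurals, here is a small Python sketch of the plural-selection rules that a .po file's Plural-Forms header expresses. English needs two forms, while Russian needs three, chosen by the last digits of the number. The function names and the `FORMS_RU` list are hypothetical; the Russian rule follows the expression commonly given in the gettext manual.

```python
def plural_index_en(n: int) -> int:
    """English: form 0 for exactly one, form 1 otherwise."""
    return 0 if n == 1 else 1


def plural_index_ru(n: int) -> int:
    """Russian: three forms, selected by the last one or two digits."""
    if n % 10 == 1 and n % 100 != 11:
        return 0
    if 2 <= n % 10 <= 4 and (n % 100 < 10 or n % 100 >= 20):
        return 1
    return 2


# Hypothetical translations of "file" in the three Russian plural forms.
FORMS_RU = ["файл", "файла", "файлов"]


def format_count_ru(n: int) -> str:
    return f"{n} {FORMS_RU[plural_index_ru(n)]}"


for n in (1, 2, 5, 11, 21, 22):
    print(format_count_ru(n))
```

A single "files" key with one value per language cannot express this, which is why the catalog format has to carry both the rule and every form.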