Why does the Google NLP team congratulate itself so hard for handling studied-to-death Japanese text processing problems like Shift-JIS vs. Unicode, half-width vs. full-width kana, word boundary detection, and multiple readings for kanji? I wonder if they're ever going to try working on the languages of Ethiopia/Eritrea, with their <i>seventy</i> different encodings for Ge'ez, crazy morphology, and almost complete lack of English bitexts ...<p><a href="http://www.punchdown.org/rvb/papers/EriPaper3C.html" rel="nofollow">http://www.punchdown.org/rvb/papers/EriPaper3C.html</a>