I have the Kindle version of <i>The Seleucid Royal Economy</i>, which for obvious reasons includes Greek text.<p>It's been OCRed, and the Greek has been mangled beyond belief. Sometimes the OCR will split a single character.<p>No real point to the story, but it feels relevant here. I see Rescribe has already encountered the problem: "In the second step we run the OCR on the preprocessed files, using our specifically trained packages and adapting language and character settings to the document at hand."<p>(I'm only complaining to a very small degree. Having a low-quality OCRed ebook available is much better than having no ebook available. And what is normally displayed is the image of the text, not the OCRed nonsense, so it doesn't matter that the Greek has been transformed into gibberish until you encounter the odd mid-character word break.)
Is Tesseract any good yet? Last I heard they were experimenting with deep-learning-based recognition, but when I tried it before that, it didn't work at all. Kind of Pocketsphinx levels of rubbish.