Tesseract[0] is a system that is split into several parts: at least one does layout analysis and another does the actual OCR, and output is a separate layer again. I believe it is an open source adaptation of what Google used for its books project. A few years ago the interface was less than polished, to the point where getting it running at all was rather difficult. However, for multilingual work (including Chinese) it is probably ideal.[1] Note that if you are scanning books, there are now some interesting open hardware systems appearing online that turn pages and photograph them with cameras, so you can scan books at high resolution without cutting them up.<p>[0] <a href="https://github.com/tesseract-ocr/tesseract" rel="nofollow">https://github.com/tesseract-ocr/tesseract</a>
[1] <a href="https://github.com/tesseract-ocr/langdata" rel="nofollow">https://github.com/tesseract-ocr/langdata</a>
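For the multilingual case, here is a minimal sketch of calling the Tesseract command-line tool from Python. It assumes a Tesseract 4.x or newer binary on the PATH (older 3.x releases spell the page segmentation flag `-psm`) and that the traineddata files for the requested languages are installed. The helper names `build_tesseract_command` and `ocr_image` are made up for this example, and `page.png` is a placeholder file name.

```python
import shutil
import subprocess
from typing import List, Sequence


def build_tesseract_command(image_path: str,
                            languages: Sequence[str],
                            psm: int = 3) -> List[str]:
    """Build the argument list for a multilingual Tesseract run.

    Hypothetical helper for illustration; not part of Tesseract itself.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    # Tesseract accepts several languages joined with '+', e.g. "chi_sim+eng".
    lang_arg = "+".join(languages)
    # "stdout" as the output base makes Tesseract print the text
    # instead of writing a .txt file.
    return ["tesseract", image_path, "stdout", "-l", lang_arg, "--psm", str(psm)]


def ocr_image(image_path: str, languages: Sequence[str], psm: int = 3) -> str:
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, languages, psm),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract exits non-zero
    )
    return result.stdout


if __name__ == "__main__":
    # Simplified Chinese plus English; requires both language packs.
    print(build_tesseract_command("page.png", ["chi_sim", "eng"]))
```

Keeping the command construction separate from the subprocess call makes it easy to check the arguments without having Tesseract installed.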
Besides Tesseract, which was state-of-the-art OCR software from HP in the early nineties, was revived by Google a few years ago, and is open source, there is CuneiForm, a former main competitor to ABBYY FineReader.<p>CuneiForm was open sourced a few years ago, though in a sad state (the project files were in VS C++ 6 ('98), with comments in Russian), but a community fixed that and ported it to Linux. It is also probably the best one for the Russian language. It also has a UI and some advanced features that only ABBYY and CuneiForm have, but none of the other competitors (certainly no other open source OCR package). <a href="https://en.wikipedia.org/wiki/CuneiForm_(software)" rel="nofollow">https://en.wikipedia.org/wiki/CuneiForm_(software)</a>
Do you want to OCR several human languages, or do you want bindings/libraries in several programming languages? The question as written is a little ambiguous.
Tesseract can be (and has to be) trained, so it can effectively support anything.<p>OCR usually isn't tied to a language unless you are doing really high-end work where it does linguistic prediction, and you only need that if you are working with sources of really poor image quality.<p>Overall, OCR is "language" agnostic. It is usually not typeface agnostic, however, so what you would want to do is train it on whatever fonts are common for a particular language.<p>This gets slightly tricky if you have to transcribe handwriting or very stylized fonts, but in those cases the "language" is again not an issue, because your OCR program doesn't understand language to begin with.
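To make "linguistic prediction" a bit more concrete, here is a rough sketch of its simplest form: a dictionary lookup applied to OCR output after recognition. This is not how Tesseract or any particular engine does it internally. The function name `correct_tokens`, the tiny vocabulary and the 0.7 similarity cutoff are all assumptions chosen for illustration.

```python
import difflib
from typing import Iterable, List


def correct_tokens(tokens: Iterable[str],
                   vocabulary: List[str],
                   cutoff: float = 0.7) -> List[str]:
    """Replace each OCR token with its closest vocabulary word, if any.

    Hypothetical post-processing step: tokens already in the vocabulary
    are kept, tokens with no sufficiently similar word are left unchanged.
    """
    known = set(vocabulary)
    corrected = []
    for token in tokens:
        if token in known:
            corrected.append(token)
            continue
        # get_close_matches ranks candidates by SequenceMatcher similarity
        # and drops anything scoring below the cutoff.
        matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else token)
    return corrected


if __name__ == "__main__":
    vocab = ["modern", "system", "OCR"]
    # "rn" misread for "m" is a classic OCR confusion on poor scans.
    print(correct_tokens(["rnodern", "OCR", "systern", "xyz"], vocab))
```

This step is the only place where the language matters. The recognizer itself only sees glyph shapes, which is why it matters mostly for poor-quality sources where many glyphs are ambiguous.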
The best tool would be something that I can iteratively improve using some ML methods, that I would run on Linux and integrate into my programs. And open source, of course.
I know, I want too much :)
Is there anything great (even if potentially pricey) for ICR (individual handwritten characters, usually separated by boxes) or handwriting? Preferably as a service.