TechEcho

10 comments

contingencies over 8 years ago

Tesseract[0] is a system that is broken into different parts: at least one does layout analysis and another does the actual OCR. Output is a different layer again. I believe it is an open source adaptation of what Google used for its books project. The interface was less than polished a few years ago, to the point where getting it running at all was rather difficult. However, for multilingual work (including Chinese) it is probably ideal.[1]

Note that if you are scanning books there are now some interesting open hardware systems appearing online that turn pages and take photos with cameras, so you can scan books - without cutting them up - to a high resolution.

[0] https://github.com/tesseract-ocr/tesseract
[1] https://github.com/tesseract-ocr/langdata
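For anyone who wants to try the multilingual case from the command line, here is a minimal sketch in Python. It assumes the `tesseract` executable is installed and on the PATH, and that the language data for the codes you pass (for example `eng` and `chi_sim`) is installed too. The function names `build_tesseract_command` and `run_ocr` are made up for illustration; only the command-line shape (`tesseract image outputbase -l lang1+lang2 [config]`) comes from Tesseract itself.

```python
import shutil
import subprocess
from typing import List, Optional


def build_tesseract_command(image_path: str,
                            output_base: str,
                            languages: List[str],
                            output_format: Optional[str] = None) -> List[str]:
    """Build the argument list for one tesseract invocation (hypothetical helper)."""
    if not languages:
        raise ValueError("at least one language code is required")
    # Tesseract takes several languages joined with '+', e.g. eng+chi_sim.
    command = ["tesseract", image_path, output_base, "-l", "+".join(languages)]
    if output_format is not None:
        # A trailing config name such as 'hocr' or 'pdf' selects the output layer.
        command.append(output_format)
    return command


def run_ocr(image_path: str,
            output_base: str,
            languages: List[str],
            output_format: Optional[str] = None) -> List[str]:
    """Run tesseract if it is available; raises if the executable is missing."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    command = build_tesseract_command(image_path, output_base, languages, output_format)
    subprocess.run(command, check=True)
    return command


if __name__ == "__main__":
    # Only prints the command, so this runs even without tesseract installed.
    print(build_tesseract_command("page.png", "page", ["eng", "chi_sim"], "hocr"))
```

Keeping the command construction separate from the subprocess call makes it easy to check what would be run before pointing it at a whole directory of scans.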

frik over 8 years ago

Besides Tesseract, which was state-of-the-art OCR software developed by HP in the early nineties, was revived by Google a few years ago, and is open source, there is CuneiForm, a former main competitor to ABBYY FineReader. CuneiForm was open sourced a few years ago, though in a sad state (the project files were for Visual C++ 6 ('98), with comments in Russian), but a community fixed that and ported it to Linux. It is also probably the best one for the Russian language. It also has a UI and some advanced features that only ABBYY and CuneiForm have, but none of the other competitors (certainly no other open source OCR package). https://en.wikipedia.org/wiki/CuneiForm_(software)

deedubaya over 8 years ago

Sorry to hijack, but what about the best OCR service? I'd much rather farm the OCR work out to another service than trying to do it myself.

pgodzin over 8 years ago

I've used Tesseract with the Tess4J Java wrappers, which has been pretty good.

msandford over 8 years ago

Do you want to OCR several human languages, or do you want bindings/libraries in several programming languages? The question as written is a little ambiguous.

dogma1138 over 8 years ago

Tesseract can (and has to be) trained, so it can effectively support anything.

OCR usually isn't limited by language, unless you are doing some really high-end work where it does linguistic prediction, and you only need that if you are working with really poor (image) quality sources.

Overall, OCR is "language" agnostic. It is, however, usually not typeface agnostic, so what you would want to do is train it on whatever fonts are common for a particular language.

This gets slightly tricky if you have to do handwritten transcription or very stylized fonts, but in those cases the "language" again is not an issue, because your OCR program doesn't understand language to begin with.
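To make the "linguistic prediction" idea concrete, here is a rough sketch of one very simple form of it: snapping OCR output tokens to the nearest word in a known vocabulary. This is not how Tesseract or any particular engine does it internally; it is only an illustration using Python's standard `difflib`, and the function name `correct_tokens`, the vocabulary, and the 0.8 cutoff are all assumptions picked for the example.

```python
import difflib
from typing import Iterable, List


def correct_tokens(tokens: Iterable[str],
                   vocabulary: List[str],
                   cutoff: float = 0.8) -> List[str]:
    """Replace each token with its closest vocabulary word, if one is close enough."""
    corrected = []
    for token in tokens:
        if token in vocabulary:
            # Already a known word, keep it as recognized.
            corrected.append(token)
            continue
        # get_close_matches returns the best candidates whose similarity
        # ratio is at least `cutoff`; an empty list means nothing is close.
        matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else token)
    return corrected


if __name__ == "__main__":
    vocabulary = ["recognition", "of", "text"]
    print(correct_tokens(["recogniti0n", "of", "zzzz"], vocabulary))
```

This is the step that actually depends on the language (it needs a word list), which fits the point above: the character recognition itself does not, and the correction only earns its keep on poor quality sources.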

acd over 8 years ago

Deep learning with Caffe possibly outperforms Tesseract. https://christopher5106.github.io/computer/vision/2015/09/14/comparing-tesseract-and-deep-learning-for-ocr-optical-character-recognition.html

mynewtb over 8 years ago

Tesseract

postila over 8 years ago

The best tool would be something that I can iteratively improve using some ML methods, that I would run on Linux and integrate into my programs. And open source, of course. I know, I want too much :)

kondro over 8 years ago

Is there anything great (even if potentially pricey) for ICR (individual handwritten characters, usually separated by boxes) or handwriting? Preferably as a service.

Ask HN: What is the best open source OCR software supporting multiple languages?
