Ask HN: What is the best open source OCR software supporting multiple languages?

86 points by postila over 8 years ago

10 comments

contingencies over 8 years ago
Tesseract[0] is a system that is broken into different parts: at least one does layout analysis and another does the actual OCR. Output is a different layer again. I believe it is an open source adaptation of what Google used for its books project. The interface was less than polished a few years ago, to the point where getting it running at all was rather difficult. However, for multilingual work (including Chinese) it is probably ideal.[1] Note that if you are scanning books, there are now some interesting open hardware systems appearing online that turn pages and take photos with cameras, so you can scan books at high resolution without cutting them up.

[0] https://github.com/tesseract-ocr/tesseract
[1] https://github.com/tesseract-ocr/langdata
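
For the multilingual case mentioned above, a minimal sketch of calling the Tesseract command-line tool from Python. It assumes the `tesseract` binary is on PATH and that the traineddata files for the requested languages are installed; the function names and the `page.png` file name are hypothetical, chosen only for illustration.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, languages, executable="tesseract"):
    """Return the argv list for a multilingual Tesseract run that prints text to stdout."""
    if not languages:
        raise ValueError("at least one language code is required")
    # Tesseract accepts several language codes joined with '+', e.g. "eng+chi_sim".
    # Passing "stdout" as the output base makes it print the text instead of writing a file.
    return [executable, image_path, "stdout", "-l", "+".join(languages)]


def ocr_image(image_path, languages):
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, languages),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract exits with an error
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical input file; English plus Simplified Chinese.
    print(build_tesseract_command("page.png", ["eng", "chi_sim"]))
```

Running `tesseract --list-langs` shows which language codes are actually installed on a given machine.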
frik over 8 years ago
Besides Tesseract, which was state-of-the-art OCR software from HP in the early nineties, was revived by Google a few years ago, and is open source, there is CuneiForm, a former main competitor to ABBYY FineReader. CuneiForm was open sourced a few years ago, though in a sad state (project files were in VS C++ 6 ('98), comments in Russian), but a community fixed that and ported it to Linux. It's also probably the best one for the Russian language. It also has a UI and some advanced features that only ABBYY and CuneiForm have, but none of the other competitors (certainly no other open source OCR package). https://en.wikipedia.org/wiki/CuneiForm_(software)
deedubaya over 8 years ago
Sorry to hijack, but what about the best OCR service? I'd much rather farm the OCR work out to another service than try to do it myself.
pgodzin over 8 years ago
I've used Tesseract with the Tess4J Java wrappers, which have been pretty good.
msandford over 8 years ago
Do you want to OCR several human languages, or do you want bindings/libraries in several programming languages? The question as written is a little ambiguous.
dogma1138 over 8 years ago
Tesseract can (and has to) be trained, so it can effectively support anything.

OCR usually isn't limited by language unless you are doing some really high-end stuff where it does linguistic prediction, but you only need that if you are working with really poor (image) quality sources.

But overall OCR is "language" agnostic. It is, however, usually not typeface agnostic, so what you would want to do is train it for whatever fonts are common for a particular language.

This gets slightly tricky if you have to do handwritten transcription or very stylized fonts, but in those cases the "language" again is not an issue because your OCR program doesn't understand language to begin with.
acd over 8 years ago
Caffe deep learning possibly outperforms Tesseract.

https://christopher5106.github.io/computer/vision/2015/09/14/comparing-tesseract-and-deep-learning-for-ocr-optical-character-recognition.html
mynewtb over 8 years ago
Tesseract
postila over 8 years ago
The best tool would be something that I can iteratively improve using some ML methods, that I would run on Linux and integrate into my programs. And open source, of course. I know, I want too much :)
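
If the goal is to improve a tool iteratively, one common way to check whether each round actually helps is to track the character error rate against hand-corrected text. The following is a minimal sketch under that assumption, not something any of the tools named in this thread ships; the function names are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: insertions, deletions and substitutions each cost 1."""
    # previous[j] holds the distance between the reference prefix processed so far
    # and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the length of the hand-corrected reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Comparing this number on a fixed set of sample pages before and after each retraining round gives a rough signal of whether the change was an improvement.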
kondro over 8 years ago
Is there anything great (even if potentially pricey) for ICR (individual handwritten characters, usually separated by boxes) or handwriting? Preferably as a service.