Ask HN: What OCR tool do you use in your project?

8 points by vikasr111, almost 2 years ago

I am working on a project where I want to extract data from PDF documents. Sometimes these are scanned PDFs or forms.

I am looking for an OCR tool (paid or open source) that can effectively extract data from poorly scanned documents and forms. What do you use?

5 comments

sandreas, almost 2 years ago
It depends on the amount, format and quality of your input.

There are free / open source tools (like Tesseract), but if you want to use them, some manual or (semi-)automatic preprocessing steps (thresholding / binarization, deskewing, noise removal[1]) are very important to get results nearly comparable to commercial tools.

Some Tesseract-based solutions are better integrated with automatic preprocessing. You could take a look at Papermerge or other self-hosted document management solutions[2].

There are also commercial SDKs built around Tesseract at a good price point, like Vintasoft OCR[5], which supports automatic preprocessing and delivers decent quality.

If you don't mind a (free) clicking adventure with small numbers of documents, you could also try the free version of PDF X-Change viewer[3], which has a small but pretty good option that embeds OCR results as a PDF layer and makes PDFs "searchable". But the embedded OCR data cannot be easily extracted.

The best "no cloud" / offline solution I found was Abbyy FineReader[4], which also has a command line tool. But if you really want a ready-to-use, easy and good-quality solution, I would go with Google Lens (if you don't mind Google).

[1] https://towardsdatascience.com/pre-processing-in-ocr-fc231c6035a7

[2] https://github.com/awesome-selfhosted/awesome-selfhosted#document-management

[3] https://www.tracker-software.com/product/pdf-xchange-editor

[4] https://www.pdf-xchange.de/pdf-xchange-viewer/

[5] https://www.vintasoft.com/vsocr-dotnet-index.html
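To make the thresholding / binarization step mentioned above concrete, here is a minimal sketch of Otsu's method in plain Python. It is only an illustration, not what any of the tools named in this comment necessarily do internally. The function names `otsu_threshold` and `binarize` are made up for this example, and the image is assumed to be a nested list of 8-bit grayscale values. In a real pipeline you would more likely call an image library (for example OpenCV's `cv2.threshold` with the `cv2.THRESH_OTSU` flag) before handing the page to the OCR engine.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold (0-255) for an iterable of 8-bit gray values.

    Otsu's method picks the threshold that maximizes the between-class
    variance of the two groups (values <= t and values > t).
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = sum(hist)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels with value <= t
    sum_bg = 0  # intensity sum of those pixels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no pixels in the dark class yet
        w_fg = total - w_bg
        if w_fg == 0:
            break     # every pixel is already in the dark class
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between_var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(image):
    """Binarize a grayscale image given as a list of rows of 0-255 ints.

    Returns (binary_image, threshold); pixels above the threshold become
    255 (white), all others 0 (black).
    """
    t = otsu_threshold(p for row in image for p in row)
    binary = [[255 if p > t else 0 for p in row] for row in image]
    return binary, t
```

Calling `binarize(page)` on a scanned page loaded as grayscale rows gives a black-and-white image plus the threshold that was chosen. Deskewing and noise removal would be separate steps and are not covered by this sketch.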
beardyw, almost 2 years ago
A bit off topic, but I've just started using Google Lens to extract whole pages from books with my phone. The near-perfect conversion to text is great for taking notes.
smoldesu, almost 2 years ago
I still use Tesseract. It's not the fastest or most accurate anymore, but it gets what I need off of PDF files.
is_true, almost 2 years ago
We started using Tesseract for a project that needed to extract text from video frames. But in the end we moved to EasyOCR, as it needed less preprocessing for our use case.
itake, almost 2 years ago
What languages do you need to support? Off-the-shelf models don't work well on non-Latin languages. You may need to train your own.
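One rough way to answer the language question for your own documents is to take a few hand-typed sample texts and count which scripts they contain. The sketch below assumes that the first word of a character's Unicode name (such as `LATIN` or `CYRILLIC`) is a good enough proxy for its script, which is a simplification. The helper names `script_counts` and `has_non_latin` are hypothetical, not part of any OCR library.

```python
import unicodedata
from collections import Counter


def script_counts(text):
    """Count letters per script, using the first word of the Unicode name.

    This is a heuristic: for example, 'LATIN SMALL LETTER A' counts as
    'LATIN' and 'CYRILLIC SMALL LETTER DE' counts as 'CYRILLIC'.
    Digits, punctuation and whitespace are ignored.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        script = name.split(" ")[0] if name else "UNKNOWN"
        counts[script] += 1
    return counts


def has_non_latin(text):
    """Return True if any letter in the text is outside the Latin script."""
    return any(script != "LATIN" for script in script_counts(text))
```

If `has_non_latin` returns `True` for a meaningful share of your samples, that is a hint to test the candidate OCR tools on those scripts specifically before committing to one.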