科技回声

Ask HN: What OCR tool do you use in your project?

8 points by vikasr111, almost 2 years ago
I am working on a project where I want to extract data from PDF documents. Sometimes these are scanned PDFs or forms.

I am looking for an OCR tool (paid or open source) that can effectively extract data from poorly scanned documents and forms. What do you use?

5 comments

sandreas, almost 2 years ago
It depends on the volume, format, and quality of your input.

There are free / open source tools (like Tesseract), but if you want to use them, some manual or (semi-)automatic preprocessing steps (thresholding / binarization, deskew, noise removal [1]) are very important to get results nearly comparable to commercial tools.

Some Tesseract-based solutions are better integrated with automatic preprocessing; you could take a look at Papermerge or other self-hosted document management solutions [2].

There are also commercial SDKs built around Tesseract at a good price point, like Vintasoft OCR [5], which supports automatic preprocessing and delivers decent quality.

If you don't mind a (free) clicking adventure with small numbers of documents, you could also try the free version of PDF X-Change viewer [3], which has a small but pretty good option to OCR into an embedded PDF layer, which makes PDFs "searchable". But the embedded OCR data cannot be easily extracted.

The best "no cloud" / offline solution I found was Abbyy FineReader [4], which also has a command line tool, but if you really want a ready-to-use, easy, good-quality solution, I would go with Google Lens (if you don't mind Google).

[1] https://towardsdatascience.com/pre-processing-in-ocr-fc231c6035a7
[2] https://github.com/awesome-selfhosted/awesome-selfhosted#document-management
[3] https://www.tracker-software.com/product/pdf-xchange-editor
[4] https://www.pdf-xchange.de/pdf-xchange-viewer/
[5] https://www.vintasoft.com/vsocr-dotnet-index.html
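
The thresholding / binarization step mentioned above can be sketched roughly as follows. This is only an illustration, assuming the scanned page is already available as an 8-bit grayscale image stored as nested Python lists. It uses Otsu's method, which is one common way to pick a global threshold and is not necessarily what any tool named in this thread does internally. The function names `otsu_threshold` and `binarize` are hypothetical, and deskew and noise removal are not covered.

```python
def otsu_threshold(pixels):
    """Pick a global threshold (0-255) for a flat list of 8-bit gray values.

    Otsu's method: try every threshold and keep the one that maximizes
    the between-class variance of the dark and light pixel groups.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    weight_dark = 0      # number of pixels with value <= t
    sum_dark = 0         # sum of those pixel values
    best_t, best_var = 0, -1.0

    for t in range(256):
        weight_dark += hist[t]
        sum_dark += t * hist[t]
        if weight_dark == 0:
            continue                 # no dark pixels yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break                    # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        var_between = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image):
    """Map a grayscale image (list of rows) to pure black (0) / white (255)."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[255 if p > t else 0 for p in row] for row in image]


if __name__ == "__main__":
    # Tiny made-up "scan": dark ink (30-40) on a light background (200-220).
    page = [[30, 200], [220, 40]]
    print(binarize(page))
```

The binarized rows could then be written out with an imaging library and passed to an OCR engine. A single global threshold tends to struggle with unevenly lit scans, where adaptive (local) thresholding is usually a better fit.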
beardyw, almost 2 years ago
A bit off topic, but I've just started using Google Lens to extract whole pages from books with my phone. Near-perfect conversion to text is great for taking notes.
smoldesu, almost 2 years ago
I still use Tesseract. It's not the fastest or most accurate anymore, but it gets what I need off of PDF files.
is_true, almost 2 years ago
We started using Tesseract for a project that needed to extract text from video frames. But in the end we moved to easyocr, as it needed less preprocessing for our use case.
itake, almost 2 years ago
What languages do you need to support? Off-the-shelf models don't work well on non-Latin languages. You may need to train your own.