I am working on a project where I want to extract data from PDF documents. Sometimes these are scanned PDFs or forms.<p>I am looking for an OCR tool (paid or open source) which can effectively extract data from poorly scanned documents and forms. What do you use?
It depends on the amount, format and quality of your input.<p>There are free / open source tools (like Tesseract), but if you would like to use them, some manual or (semi-)automatic preprocessing steps (thresholding / binarization, deskewing, noise removal[1]) are very important to get results nearly comparable to commercial tools.<p>Some Tesseract-based solutions are better integrated with automatic preprocessing; you could take a look at Papermerge or other self-hosted document management solutions[2].<p>There are also commercial SDKs built around Tesseract at a good price point, like Vintasoft OCR[5], which supports automatic preprocessing and delivers decent quality.<p>If you don't mind a (free) clicking adventure with small amounts of documents, you could also try the free version of PDF X-Change viewer[3], which has a small but pretty good OCR option that embeds the recognized text as a PDF layer and makes PDFs "searchable". But the embedded OCR data cannot be easily extracted.<p>The best "no cloud" / offline solution I found was Abbyy FineReader[4], which also has a command line tool. But if you really want a ready-to-use, easy and good-quality solution, I would go with Google Lens (if you don't mind Google).<p>[1] <a href="https://towardsdatascience.com/pre-processing-in-ocr-fc231c6035a7" rel="nofollow noreferrer">https://towardsdatascience.com/pre-processing-in-ocr-fc231c6...</a><p>[2] <a href="https://github.com/awesome-selfhosted/awesome-selfhosted#document-management">https://github.com/awesome-selfhosted/awesome-selfhosted#doc...</a><p>[3] <a href="https://www.tracker-software.com/product/pdf-xchange-editor" rel="nofollow noreferrer">https://www.tracker-software.com/product/pdf-xchange-editor</a><p>[4] <a href="https://www.pdf-xchange.de/pdf-xchange-viewer/" rel="nofollow noreferrer">https://www.pdf-xchange.de/pdf-xchange-viewer/</a><p>[5] <a href="https://www.vintasoft.com/vsocr-dotnet-index.html" rel="nofollow noreferrer">https://www.vintasoft.com/vsocr-dotnet-index.html</a>
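To make the thresholding / binarization step mentioned above a bit more concrete, here is a minimal sketch of Otsu's method in plain Python. It assumes the page is already available as a flat list of 8-bit grayscale values (0 = black, 255 = white); the function names `otsu_threshold` and `binarize` are made up for this illustration. It does not cover deskewing or noise removal, and in a real pipeline you would more likely rely on an image library than on hand-written loops.

```python
def otsu_threshold(pixels):
    """Return the gray level t that best separates dark from light pixels.

    Picks the t maximizing the between-class variance of the two groups
    "value <= t" and "value > t" (Otsu's method).
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of gray levels of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue      # no dark pixels yet
        w1 = total - w0
        if w1 == 0:
            break         # no light pixels left
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        # Proportional to the between-class variance (constant factor dropped).
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, t):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [255 if p > t else 0 for p in pixels]


# Hypothetical scan fragment: dark ink around 20-30, paper around 200-220.
scan = [20, 25, 30, 200, 210, 220]
t = otsu_threshold(scan)
clean = binarize(scan, t)
```

The binarized values would then be written back into an image and handed to the OCR engine; how much this helps depends heavily on the scan quality.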
A bit off topic but I've just started using Google Lens to extract whole pages from books with my phone. Near perfect conversion to text is great for taking notes.
We started using Tesseract for a project that needed to extract text from video frames, but in the end we moved to EasyOCR, as it needed less preprocessing for our use case.