I had spectacular results from AWS Textract recently, which wasn't yet openly available when this article was written (2019).

I fed it thousands of pages of historical scanned documents, including handwritten journals from the 1800s, and it could read them better than I could!

I built a tool to use it (since running it in bulk against PDFs in a bucket took a few too many steps) and wrote about my experiences with it here: https://simonwillison.net/2022/Jun/30/s3-ocr/
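For anyone curious what the bulk workflow looks like underneath, here is a minimal sketch of the raw boto3 Textract calls a PDF-in-a-bucket run has to orchestrate. This is not s3-ocr's implementation, just the bare API steps; the bucket and key names are placeholders.

    # Minimal sketch of the asynchronous Textract calls for a PDF already in S3.
    # Not s3-ocr's code; bucket/key names are placeholders.
    import time
    import boto3

    textract = boto3.client("textract")

    def ocr_pdf_in_s3(bucket: str, key: str) -> str:
        # PDFs require the asynchronous API: start a job, then poll for it.
        job_id = textract.start_document_text_detection(
            DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
        )["JobId"]

        while True:
            resp = textract.get_document_text_detection(JobId=job_id)
            if resp["JobStatus"] in ("SUCCEEDED", "FAILED"):
                break
            time.sleep(5)  # an SNS notification is the nicer production route

        # Results are paginated; collect the LINE blocks from every page of output.
        lines = []
        while True:
            lines += [b["Text"] for b in resp.get("Blocks", []) if b["BlockType"] == "LINE"]
            token = resp.get("NextToken")
            if not token:
                break
            resp = textract.get_document_text_detection(JobId=job_id, NextToken=token)
        return "\n".join(lines)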
For comparison, I recently ran an OCR evaluation for some work for a professor. For context, all documents were 1960s-era typed or handwritten documents in English, specifically from this archive: http://allenarchive.iac.gatech.edu/. I hand-transcribed 50 documents to use as a baseline and ran them through the various OCR engines, getting the results below.

                        Overall           Typed             Handwritten
    OCR Engine          Leven    Cosine   Leven    Cosine   Leven    Cosine
    Amazon Textract     91.63%   98.14%   92.07%   98.76%   87.99%   92.10%
    Google Vision       93.05%   97.97%   93.84%   98.99%   85.86%   88.11%
    Microsoft Azure     80.32%   95.61%   80.65%   96.20%   79.14%   90.21%
    TrOCR               78.66%   93.97%   80.64%   96.65%   59.96%   67.89%
    PaddleOCR           84.82%   90.73%   88.60%   96.28%   49.64%   37.58%
    Tesseract           86.67%   89.53%   91.14%   95.63%   44.54%   31.39%
    EasyOCR             81.79%   85.07%   85.50%   91.89%   46.87%   19.23%
    Keras OCR           58.03%   83.57%   59.32%   89.98%   46.08%   21.20%
Leven is Levenshtein distance (normalized to a percentage here). Overall is a weighted average of typed vs. handwritten, 90/10 if I recall correctly. All results were run on my personal machine with a 5950X, 128 GB RAM, and an RTX 3080.

From my analysis, Amazon Textract was excellent, the best of all the paid ones. TrOCR and PaddleOCR were the best FOSS options, but the issue with them is that they *require* a GPU, while Tesseract I could run on CPU alone. For instance, to OCR all 50 documents:

    Tesseract     1:19
    TrOCR (GPU)   27:33
    TrOCR (CPU)   3:04:22
TrOCR is great if you only need to do a few documents or have GPUs to burn, but Tesseract is by far the better choice if you need "good enough" across a large volume of documents. For my project the intent was to make a software plugin that could be sent to libraries and universities, so CPU is king.
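For anyone who wants to reproduce scoring like this, here is a minimal sketch of the two metrics, assuming the python-Levenshtein and scikit-learn packages. It is not the evaluation code used above, and the ground_truth/ocr_output strings are placeholders.

    # Sketch of the two similarity metrics above (not the original evaluation code).
    # Assumes the python-Levenshtein and scikit-learn packages.
    import Levenshtein
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def levenshtein_score(ground_truth: str, ocr_output: str) -> float:
        """Levenshtein distance normalized to a 0-100% similarity score."""
        dist = Levenshtein.distance(ground_truth, ocr_output)
        return 100.0 * (1.0 - dist / max(len(ground_truth), len(ocr_output)))

    def cosine_score(ground_truth: str, ocr_output: str) -> float:
        """Cosine similarity between bag-of-words vectors, as a percentage."""
        vectors = CountVectorizer().fit_transform([ground_truth, ocr_output])
        return 100.0 * cosine_similarity(vectors[0], vectors[1])[0][0]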
The iOS / Apple OCR Swift API (the Vision framework) is drastically better than the ones I've tried online (e.g. Microsoft) or the open-source ones (Tesseract). Highly recommended. You can get fairly high throughput with M1 chips: the CNN is accelerated by the Neural Engine and the language model by the GPU.
I went looking for a similar comparison a few months ago and found this: https://research.aimultiple.com/ocr-accuracy/

It compared ABBYY FineReader 15, Amazon Textract, Google Cloud Platform Vision API, Microsoft Azure Computer Vision API, and the Tesseract OCR engine. I ended up using OCRmyPDF / Tesseract out of convenience, but doing a second pass with Google Cloud Vision, AWS Textract, or ABBYY is somewhere on my to-do list.
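For reference, the OCRmyPDF route really is only a couple of lines, either from the CLI or its Python API. The file names here are placeholders, and the options shown are just common ones, not a recommendation.

    # Sketch of an OCRmyPDF run via its Python API; the CLI equivalent is
    # `ocrmypdf --deskew --skip-text -l eng input.pdf output.pdf`.
    import ocrmypdf

    ocrmypdf.ocr(
        "input.pdf",          # placeholder input
        "output.pdf",         # searchable PDF with an added text layer
        deskew=True,          # straighten skewed scans before running Tesseract
        skip_text=True,       # skip pages that already contain a text layer
        language="eng",       # Tesseract language pack(s) to use
    )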
Several years ago, we did a project attempting to develop methods to OCR bilingual dictionaries. We just used Tesseract, because we were trying to develop methods to put stuff into particular fields (headword, part of speech, translations, etc.), not compare OCR methods. As you might guess, there were lots of problems. But what *really* surprised me was that it was completely inaccurate at detecting bold characters, whereas I could detect bolding while standing far enough away from an image that I couldn't make out individual characters. And bold detection was crucial for parsing out some of the fields. (A more recent version of Tesseract doesn't even try to detect bold, afaict.)

We had another project later on aimed simply at detecting bold text, with some success. But there is very little literature on this topic. Does anyone know of OCR tools that do detect bolding?
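Not from either of those projects, but one heuristic that comes up for this is treating bold as a stroke-width signal: estimate the ink stroke width inside each word box and flag outliers. A rough sketch with OpenCV, where the word boxes are assumed to come from whatever OCR engine is already in use:

    # Rough illustration of a stroke-width heuristic for bold detection (not the
    # method from the projects above). OpenCV + NumPy; `boxes` is assumed to be
    # word-level (x, y, w, h) boxes from an OCR engine's layout output.
    import cv2
    import numpy as np

    def stroke_width(word_crop: np.ndarray) -> float:
        """Median stroke width of the dark ink in a grayscale word crop."""
        _, ink = cv2.threshold(word_crop, 0, 255,
                               cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
        # Distance to the nearest background pixel peaks at about half the stroke width.
        dist = cv2.distanceTransform(ink, cv2.DIST_L2, 3)
        widths = dist[ink > 0]
        return 2.0 * float(np.median(widths)) if widths.size else 0.0

    def flag_bold(gray_page: np.ndarray, boxes, ratio: float = 1.3):
        """Mark each (x, y, w, h) box whose stroke width clearly exceeds the page median."""
        widths = [stroke_width(gray_page[y:y + h, x:x + w]) for x, y, w, h in boxes]
        positive = [w for w in widths if w > 0]
        page_median = float(np.median(positive)) if positive else 1.0
        return [w > ratio * page_median for w in widths]

The obvious failure mode is that larger type also has thicker strokes, so normalizing per line or per estimated font size helps.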
For STEM applications, nothing beats Mathpix OCR.

FB Research uses it, the London Stock Exchange uses it, Chegg uses it (in fact, it recently transitioned to Mathpix OCR from Google Vision), and many, many other companies and individuals.

Disclaimer: I'm the founder.
How about foreign languages? I've never had one good enough for Arabic. Three years ago, when I needed it for a project, no OCR I found could read a properly scanned Arabic page. I had to go on Fiverr and pay a transcriber instead.
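For what it's worth, the mechanical side with Tesseract is just its `ara` language pack; a minimal sketch is below (whether the output is usable on real scans is exactly the open question above, and the file name is a placeholder).

    # Minimal sketch: Tesseract's Arabic traineddata via pytesseract.
    # Requires the `ara` language pack to be installed; file name is a placeholder.
    import pytesseract
    from PIL import Image

    text = pytesseract.image_to_string(Image.open("arabic_scan.png"), lang="ara")
    print(text)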
I used ABBYY FineReader around 8 years ago to OCR an old EE textbook, and I was really impressed with the results back then. I hadn't heard any mention of the company since then until now, so it's interesting to see that they still seem to have some of the best available OCR tech. I've since tried to use Tesseract for small OCR jobs several times over the last few years, and have never found its results to be even remotely usable (which is a real shame).
What I really want is something with a similar set of convenient APIs and CLIs to ocrmypdf [1] that supports some of the more recent ML-based systems. Ocrmypdf has really good ergonomics for me in terms of ease of scripting.

Something like DocTR [2] with the same API would be fantastic.

[1] https://ocrmypdf.readthedocs.io/en/latest/

[2] https://mindee.github.io/doctr/
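As a point of comparison, here is a rough sketch of what a docTR run looks like as I understand its docs (the file name is a placeholder); the missing piece is exactly the CLI and PDF-output ergonomics that ocrmypdf provides.

    # Rough sketch of the docTR API as documented (not an ocrmypdf-style wrapper).
    from doctr.io import DocumentFile
    from doctr.models import ocr_predictor

    model = ocr_predictor(pretrained=True)          # detection + recognition pipeline
    doc = DocumentFile.from_pdf("scanned.pdf")      # placeholder path; from_images() also exists
    result = model(doc)
    print(result.render())                          # plain-text rendering of the recognized pages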
What do folks think about these document types as a corpus for comparing tools? It's missing images and handwriting samples, but those types of documents might just be too variable to make conclusions about.

I remember Baidu's OCR giving excellent English results, but it looks like their API is deprecated now. Out of curiosity, I ran these samples through EasyOCR by JaidedAI. Results at https://pastebin.com/RjzVd5Sf.
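For reproducibility, a run like that is only a few lines. This is a generic EasyOCR sketch, not the exact invocation behind the pastebin results, and the file name is a placeholder.

    # Generic EasyOCR sketch (not the exact invocation behind the pastebin results).
    import easyocr

    reader = easyocr.Reader(["en"])            # downloads/loads the English models on first use
    results = reader.readtext("sample.png")    # list of (bounding box, text, confidence)
    print("\n".join(text for _, text, _ in results))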
Found this comparison while researching OCR. It doesn't include the latest libraries like PaddleOCR, but the performance differences between OCR libraries are still quite apparent.
Interesting report. As far as I understand, overall none of the systems was really better in all categories (or did I miss something?). A summary would have been helpful. Also, it would be interesting to know whether the neural-network-based or the traditional Tesseract engine was used. I did similar experiments for a project six years ago and ended up with Tesseract and a custom traineddata file.
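For anyone wondering how that choice is made in practice: Tesseract exposes the engine via `--oem`, and a custom traineddata file is picked up by name from the tessdata directory. A small sketch via pytesseract, where the paths and the `mylang` model name are placeholders:

    # Sketch of selecting Tesseract's engine and a custom traineddata file via pytesseract.
    import pytesseract
    from PIL import Image

    img = Image.open("page.png")  # placeholder

    # --oem 0 = legacy engine, --oem 1 = LSTM (neural) engine, --oem 3 = default.
    # The legacy engine needs traineddata that still includes the old model.
    legacy = pytesseract.image_to_string(img, config="--oem 0")
    lstm = pytesseract.image_to_string(img, config="--oem 1")

    # A custom .traineddata file (here "mylang.traineddata") is selected by name.
    custom = pytesseract.image_to_string(
        img, lang="mylang", config='--oem 1 --tessdata-dir "./tessdata"'
    )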
Does anyone use OCR to convert Blu-ray subtitles (.sup) to plaintext .srt files? I've used tools like SupRip and BDSup2Sub, but they've all required pretty significant cleanup afterwards; 'l', '1', and 'I' especially get mixed up a lot.
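One mitigation for the l/1/I swaps is a post-OCR cleanup pass with context-aware substitutions. This is only an illustrative sketch, not what SupRip or BDSup2Sub do internally, and the file name is a placeholder.

    # Illustrative post-OCR cleanup for common l/1/I confusions in .srt text.
    # These rules are heuristics and will still need a manual check.
    import re

    FIXES = [
        (re.compile(r"\bl\b"), "I"),                        # a lone lowercase "l" is almost always "I"
        (re.compile(r"(?<=[a-z])I(?=[a-z])"), "l"),         # capital "I" inside a lowercase word
        (re.compile(r"(?<=[A-Za-z])1(?=[A-Za-z])"), "l"),   # digit "1" wedged between letters
    ]

    def clean_line(line: str) -> str:
        for pattern, replacement in FIXES:
            line = pattern.sub(replacement, line)
        return line

    with open("subtitles.srt", encoding="utf-8") as f:      # placeholder file name
        cleaned = [clean_line(line) for line in f]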