I had spectacular results from AWS Textract recently, which wasn't yet openly available when this article was written (2019).

I fed it thousands of pages of historical scanned documents, including handwritten journals from the 1800s, and it could read them better than I could!

I built a tool to use it (since running it in bulk against PDFs in a bucket took a few too many steps) and wrote about my experiences with it here: https://simonwillison.net/2022/Jun/30/s3-ocr/
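For anyone curious what the bulk workflow looks like underneath, here is a minimal sketch of the raw boto3 Textract calls a PDF-in-a-bucket run has to orchestrate. This is not s3-ocr's implementation, just the bare API steps; the bucket and key names are placeholders.

    # Minimal sketch of the asynchronous Textract calls for a PDF already in S3.
    # Not s3-ocr's code; bucket/key names are placeholders.
    import time
    import boto3

    textract = boto3.client("textract")

    def ocr_pdf_in_s3(bucket: str, key: str) -> str:
        # PDFs require the asynchronous API: start a job, then poll for it.
        job_id = textract.start_document_text_detection(
            DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
        )["JobId"]

        while True:
            resp = textract.get_document_text_detection(JobId=job_id)
            if resp["JobStatus"] in ("SUCCEEDED", "FAILED"):
                break
            time.sleep(5)  # an SNS notification is the nicer production route

        # Results are paginated; collect the LINE blocks from every page of output.
        lines = []
        while True:
            lines += [b["Text"] for b in resp.get("Blocks", []) if b["BlockType"] == "LINE"]
            token = resp.get("NextToken")
            if not token:
                break
            resp = textract.get_document_text_detection(JobId=job_id, NextToken=token)
        return "\n".join(lines)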
For comparison, I recently ran an OCR evaluation for some work for a professor. For context, all documents were 1960s-era typed or handwritten documents in English, specifically from this archive: http://allenarchive.iac.gatech.edu/. I hand-transcribed 50 documents to use as a baseline and ran them through the various OCR engines, getting the results below.

                        Overall           Typed             Handwritten
    OCR Engine          Leven    Cosine   Leven    Cosine   Leven    Cosine
    Amazon Textract     91.63%   98.14%   92.07%   98.76%   87.99%   92.10%
    Google Vision       93.05%   97.97%   93.84%   98.99%   85.86%   88.11%
    Microsoft Azure     80.32%   95.61%   80.65%   96.20%   79.14%   90.21%
    TrOCR               78.66%   93.97%   80.64%   96.65%   59.96%   67.89%
    PaddleOCR           84.82%   90.73%   88.60%   96.28%   49.64%   37.58%
    Tesseract           86.67%   89.53%   91.14%   95.63%   44.54%   31.39%
    EasyOCR             81.79%   85.07%   85.50%   91.89%   46.87%   19.23%
    Keras OCR           58.03%   83.57%   59.32%   89.98%   46.08%   21.20%
Leven is Levenshtein distance (normalized to a percentage here). Overall is a weighted average of typed vs. handwritten, 90/10 if I recall correctly. All results were run on my personal machine with a 5950X, 128 GB RAM, and an RTX 3080.

From my analysis, Amazon Textract was excellent, the best of all the paid ones. TrOCR and PaddleOCR were the best FOSS options, but the issue with them is that they *require* a GPU, while Tesseract I could run on CPU alone. For instance, to OCR all 50 documents:

    Tesseract     1:19
    TrOCR (GPU)   27:33
    TrOCR (CPU)   3:04:22
TrOCR is great if you only need to do a few documents or have GPUs to burn, but Tesseract is by far the better choice if you need "good enough" across a large volume of documents. For my project the intent was to make a software plugin that could be sent to libraries and universities, so CPU is king.
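For anyone who wants to reproduce scoring like this, here is a minimal sketch of the two metrics, assuming the python-Levenshtein and scikit-learn packages. It is not the evaluation code used above, and the ground_truth/ocr_output strings are placeholders.

    # Sketch of the two similarity metrics above (not the original evaluation code).
    # Assumes the python-Levenshtein and scikit-learn packages.
    import Levenshtein
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def levenshtein_score(ground_truth: str, ocr_output: str) -> float:
        """Levenshtein distance normalized to a 0-100% similarity score."""
        dist = Levenshtein.distance(ground_truth, ocr_output)
        return 100.0 * (1.0 - dist / max(len(ground_truth), len(ocr_output)))

    def cosine_score(ground_truth: str, ocr_output: str) -> float:
        """Cosine similarity between bag-of-words vectors, as a percentage."""
        vectors = CountVectorizer().fit_transform([ground_truth, ocr_output])
        return 100.0 * cosine_similarity(vectors[0], vectors[1])[0][0]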
The iOS / Apple OCR Swift API (the Vision framework) is drastically better than the ones I've tried online (e.g. Microsoft) or the open-source ones (Tesseract). Highly recommended. You can get fairly high throughput with M1 chips: the CNN is accelerated by the Neural Engine and the language model by the GPU.
I went looking for a similar comparison a few months ago and found this: https://research.aimultiple.com/ocr-accuracy/

It compared ABBYY FineReader 15, Amazon Textract, Google Cloud Platform Vision API, Microsoft Azure Computer Vision API, and the Tesseract OCR engine. I ended up using OCRmyPDF / Tesseract out of convenience, but doing a second pass with Google Cloud Vision, AWS Textract, or ABBYY is somewhere on my to-do list.
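For reference, the OCRmyPDF route really is only a couple of lines, either from the CLI or its Python API. The file names here are placeholders, and the options shown are just common ones, not a recommendation.

    # Sketch of an OCRmyPDF run via its Python API; the CLI equivalent is
    # `ocrmypdf --deskew --skip-text -l eng input.pdf output.pdf`.
    import ocrmypdf

    ocrmypdf.ocr(
        "input.pdf",          # placeholder input
        "output.pdf",         # searchable PDF with an added text layer
        deskew=True,          # straighten skewed scans before running Tesseract
        skip_text=True,       # skip pages that already contain a text layer
        language="eng",       # Tesseract language pack(s) to use
    )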
Several years ago, we did a project attempting to develop methods to OCR bilingual dictionaries. We just used Tesseract, because we were trying to develop methods to put stuff into particular fields (headword, part of speech, translations, etc.), not compare OCR methods. As you might guess, there were lots of problems. But what *really* surprised me was that it was completely inaccurate at detecting bold characters, whereas I could detect bolding while standing far enough away from an image that I couldn't make out individual characters. And bold detection was crucial for parsing out some of the fields. (A more recent version of Tesseract doesn't even try to detect bold, afaict.)

We had another project later on aimed simply at detecting bold text, with some success. But there is very little literature on this topic. Does anyone know of OCR tools that do detect bolding?
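Not from either of those projects, but one heuristic that comes up for this is treating bold as a stroke-width signal: estimate the ink stroke width inside each word box and flag outliers. A rough sketch with OpenCV, where the word boxes are assumed to come from whatever OCR engine is already in use:

    # Rough illustration of a stroke-width heuristic for bold detection (not the
    # method from the projects above). OpenCV + NumPy; `boxes` is assumed to be
    # word-level (x, y, w, h) boxes from an OCR engine's layout output.
    import cv2
    import numpy as np

    def stroke_width(word_crop: np.ndarray) -> float:
        """Median stroke width of the dark ink in a grayscale word crop."""
        _, ink = cv2.threshold(word_crop, 0, 255,
                               cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
        # Distance to the nearest background pixel peaks at about half the stroke width.
        dist = cv2.distanceTransform(ink, cv2.DIST_L2, 3)
        widths = dist[ink > 0]
        return 2.0 * float(np.median(widths)) if widths.size else 0.0

    def flag_bold(gray_page: np.ndarray, boxes, ratio: float = 1.3):
        """Mark each (x, y, w, h) box whose stroke width clearly exceeds the page median."""
        widths = [stroke_width(gray_page[y:y + h, x:x + w]) for x, y, w, h in boxes]
        positive = [w for w in widths if w > 0]
        page_median = float(np.median(positive)) if positive else 1.0
        return [w > ratio * page_median for w in widths]

The obvious failure mode is that larger type also has thicker strokes, so normalizing per line or per estimated font size helps.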
For STEM applications, nothing beats Mathpix OCR.

FB Research uses it, the London Stock Exchange uses it, Chegg uses it (in fact, it recently transitioned to Mathpix OCR from Google Vision), and many, many other companies and individuals.

Disclaimer: I'm the founder.
How about foreign languages? I've never had one good enough for Arabic. Three years ago, when I needed it for a project, no OCR I found could read a properly scanned Arabic page. I had to go on Fiverr and pay a transcriber instead.
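For what it's worth, the mechanical side with Tesseract is just its `ara` language pack; a minimal sketch is below (whether the output is usable on real scans is exactly the open question above, and the file name is a placeholder).

    # Minimal sketch: Tesseract's Arabic traineddata via pytesseract.
    # Requires the `ara` language pack to be installed; file name is a placeholder.
    import pytesseract
    from PIL import Image

    text = pytesseract.image_to_string(Image.open("arabic_scan.png"), lang="ara")
    print(text)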
I used ABBYY FineReader around 8 years ago to OCR an old EE textbook, and I was really impressed with the results back then. I hadn't heard any mention of the company since then until now, so it's interesting to see that they still seem to have some of the best available OCR tech. I've since tried to use Tesseract for small OCR jobs several times over the last few years, and have never found its results to be even remotely usable (which is a real shame).
What I really want is something with a similar set of convenient APIs and CLIs to ocrmypdf [1] that supports some of the more recent ML-based systems. Ocrmypdf has really good ergonomics for me in terms of ease of scripting.

Something like DocTR [2] with the same API would be fantastic.

[1] https://ocrmypdf.readthedocs.io/en/latest/

[2] https://mindee.github.io/doctr/
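As a point of comparison, here is a rough sketch of what a docTR run looks like as I understand its docs (the file name is a placeholder); the missing piece is exactly the CLI and PDF-output ergonomics that ocrmypdf provides.

    # Rough sketch of the docTR API as documented (not an ocrmypdf-style wrapper).
    from doctr.io import DocumentFile
    from doctr.models import ocr_predictor

    model = ocr_predictor(pretrained=True)          # detection + recognition pipeline
    doc = DocumentFile.from_pdf("scanned.pdf")      # placeholder path; from_images() also exists
    result = model(doc)
    print(result.render())                          # plain-text rendering of the recognized pages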
What do folks think about these document types as a corpus for comparing tools? It's missing images and handwriting samples, but those types of documents might just be too variable to make conclusions about.

I remember Baidu's OCR giving excellent English results, but it looks like their API is deprecated now. Out of curiosity, I ran these samples through EasyOCR by JaidedAI. Results at https://pastebin.com/RjzVd5Sf.
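For reproducibility, a run like that is only a few lines. This is a generic EasyOCR sketch, not the exact invocation behind the pastebin results, and the file name is a placeholder.

    # Generic EasyOCR sketch (not the exact invocation behind the pastebin results).
    import easyocr

    reader = easyocr.Reader(["en"])            # downloads/loads the English models on first use
    results = reader.readtext("sample.png")    # list of (bounding box, text, confidence)
    print("\n".join(text for _, text, _ in results))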
Found this comparison while researching OCR. It doesn't include the latest libraries like PaddleOCR, but the performance differences between OCR libraries are still quite apparent.
Interesting report. As far as I understand, overall none of the systems was really better in all categories (or did I miss something?). A summary would have been helpful. Also, it would be interesting to know whether the neural-network-based or the traditional Tesseract engine was used. I did similar experiments for a project six years ago and ended up with Tesseract and a custom traineddata file.
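For anyone wondering how that choice is made in practice: Tesseract exposes the engine via `--oem`, and a custom traineddata file is picked up by name from the tessdata directory. A small sketch via pytesseract, where the paths and the `mylang` model name are placeholders:

    # Sketch of selecting Tesseract's engine and a custom traineddata file via pytesseract.
    import pytesseract
    from PIL import Image

    img = Image.open("page.png")  # placeholder

    # --oem 0 = legacy engine, --oem 1 = LSTM (neural) engine, --oem 3 = default.
    # The legacy engine needs traineddata that still includes the old model.
    legacy = pytesseract.image_to_string(img, config="--oem 0")
    lstm = pytesseract.image_to_string(img, config="--oem 1")

    # A custom .traineddata file (here "mylang.traineddata") is selected by name.
    custom = pytesseract.image_to_string(
        img, lang="mylang", config='--oem 1 --tessdata-dir "./tessdata"'
    )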
Does anyone use OCR to convert Blu-ray subtitles (.sup) to plaintext .srt files? I've used tools like SupRip and BDSup2Sub, but they've all required pretty significant cleanup afterwards; 'l', '1', and 'I' especially get mixed up a lot.
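One mitigation for the l/1/I swaps is a post-OCR cleanup pass with context-aware substitutions. This is only an illustrative sketch, not what SupRip or BDSup2Sub do internally, and the file name is a placeholder.

    # Illustrative post-OCR cleanup for common l/1/I confusions in .srt text.
    # These rules are heuristics and will still need a manual check.
    import re

    FIXES = [
        (re.compile(r"\bl\b"), "I"),                        # a lone lowercase "l" is almost always "I"
        (re.compile(r"(?<=[a-z])I(?=[a-z])"), "l"),         # capital "I" inside a lowercase word
        (re.compile(r"(?<=[A-Za-z])1(?=[A-Za-z])"), "l"),   # digit "1" wedged between letters
    ]

    def clean_line(line: str) -> str:
        for pattern, replacement in FIXES:
            line = pattern.sub(replacement, line)
        return line

    with open("subtitles.srt", encoding="utf-8") as f:      # placeholder file name
        cleaned = [clean_line(line) for line in f]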