I'm a fan of the team at Allen AI and their work. Unfortunately, the benchmarking of olmocr against marker (<a href="https://github.com/VikParuchuri/marker">https://github.com/VikParuchuri/marker</a>) is quite flawed.<p>Throughput - they benchmarked marker's API cost against local inference cost for olmocr. In our testing, marker gets 20-120 pages per second locally on an H100 (without custom kernels, etc.). Olmocr in our testing gets between 0.4 (unoptimized) and 4 (sglang) pages per second on the same machine.<p>Accuracy - their quality benchmarks are based on win rate with only 75 samples, which differ between each tool pair. The samples were filtered down from a set of ~2000 based on opaque criteria. They then asked researchers at Allen AI to judge which output was better. When we benchmarked with our existing set and an LLM as judge, we got a 56% win rate for marker across 1,107 documents. We had to filter out non-English docs, since olmocr is English-only (marker is not).<p>Hallucinations/other problems - we noticed a lot of missing text and hallucinations with olmocr in our benchmark set. You can see sample output and LLM ratings here - <a href="https://huggingface.co/datasets/datalab-to/marker_benchmark_comparison_olmocr_llm" rel="nofollow">https://huggingface.co/datasets/datalab-to/marker_benchmark_...</a> .<p>You can see all benchmark code at <a href="https://github.com/VikParuchuri/marker/tree/master/benchmarks">https://github.com/VikParuchuri/marker/tree/master/benchmark...</a> .<p>Happy to chat more with anyone at Allen AI who wants to discuss this. I think olmocr is a great contribution, and I'm happy to help you benchmark marker more fairly.
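For context on the throughput numbers: we report aggregate pages per second over a corpus, i.e. total pages divided by wall-clock time. A minimal sketch of that measurement (convert_fn is a hypothetical stand-in for whichever converter is under test, not marker's or olmocr's real API; the actual harness is in the benchmarks directory linked above):

```python
import time

def pages_per_second(convert_fn, pdf_paths, total_pages):
    # Aggregate throughput: total pages in the corpus / wall-clock seconds.
    start = time.perf_counter()
    for path in pdf_paths:
        convert_fn(path)
    return total_pages / (time.perf_counter() - start)
```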
Good:<p>- no cloud service required, can run on a local Nvidia GPU<p>- outputs a single stream of text in the correct reading order (even for multi-column PDFs)<p>- recognizes handwriting and the like<p>Bad:<p>- doesn't seem to extract the text within diagrams (which I guess is fine, because that text would be useless to an LLM)<p>OP is the demo page, which lets you OCR 10 pages.<p>The code needs an Nvidia GPU to run: <a href="https://github.com/allenai/olmocr">https://github.com/allenai/olmocr</a><p>Not sure about the VRAM requirements because I haven't tried running it locally yet.
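If you want to check what your card has before attempting a local run, a quick PyTorch snippet (torch is a dependency you'd have installed for olmocr anyway):

```python
import torch

# Print the name and total VRAM of each visible CUDA device.
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GB")
else:
    print("No CUDA device visible")
```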
These "OCR" tools who are actually multimodals are interesting because they can do more than just text abstraction, but their biggest flaw is hallucinations and overall the nondeterministic nature. Lately, I've been using Gemini to turn my notebooks into Latex documents, so I can see a pretty nice usecase for this project, but it's not for "important" papers or papers that need 100% accuracy.
Very impressive, it's the only AI vision toolkit so far that actually recognizes Latin and medieval scripts. I've been trying to somehow translate public-domain medieval books (including the artwork and original layout) to PDF, so they can be re-printed, e.g. pages like this: <a href="https://i.imgur.com/YLuF9sa.png" rel="nofollow">https://i.imgur.com/YLuF9sa.png</a> - I tried a Google Vision + o1 solution, which did work to some extent, but not on the first try. This even recognizes the "E" of the artwork initial (or fixes it from context), which many OCR or AI solutions fail at.<p>The only thing I'd need now is a way to get the original font and artwork positions (which would be a great addition to OlmOCR). Potentially I could work up a solution to create the font manually (as most medieval books are written in the same writing style), then find the shapes of the glyphs in the original image once I have the text, and then mask out the artwork with some OpenCV magic.
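For the "mask out the artwork" step, a crude OpenCV sketch of the idea: binarize the scan, take connected components, and treat anything far larger than the typical glyph (like an illuminated initial or a border) as artwork. The size cutoff is a made-up heuristic you'd tune per book:

```python
import cv2
import numpy as np

# Binarize the scan: dark ink on light parchment -> white-on-black mask.
img = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Connected components: ordinary glyphs cluster around a small area,
# while an illuminated initial or border artwork is far larger.
n, labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
areas = stats[1:, cv2.CC_STAT_AREA]  # skip label 0 (background)
glyph_cutoff = np.median(areas) * 20  # heuristic multiplier, tune per book

artwork_mask = np.zeros_like(binary)
for i in range(1, n):
    if stats[i, cv2.CC_STAT_AREA] > glyph_cutoff:
        artwork_mask[labels == i] = 255

cv2.imwrite("artwork_mask.png", artwork_mask)
```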
Tested it with the following documents:<p>* Loan application form: it picks up checkboxes and handwriting, but it missed a lot of form fields. Not sure why.<p>* Edsger W. Dijkstra's handwritten notes (from the University of Texas archive): parsing is good.<p>* Badly (misaligned) scanned bill: parsing is good. Observation: there is a name field, but it produced a synonymous name instead of the name in the bill - hallucination?<p>* Investment fund factsheet: it could parse the bar charts and tables, but it arbitrarily excluded many vital data points from the document.<p>* Investment fund factsheet, complex tables: bad extraction; it could not extract merged tables, and again arbitrarily eliminated rows and columns.<p>Anyone curious, try LLMWhisperer[1] for OCR. It doesn't use LLMs, so there are no hallucination side effects. It also preserves the layout of the input document for more context and clarity.<p>There's also Docling[2], which is handy for converting tables from PDFs into markdown, though it uses Tesseract/EasyOCR under the hood, which can sometimes make the OCR results a bit less accurate (a minimal Docling sketch follows after the links).<p>[1] - <a href="https://pg.llmwhisperer.unstract.com/" rel="nofollow">https://pg.llmwhisperer.unstract.com/</a>
[2] - <a href="https://github.com/DS4SD/docling">https://github.com/DS4SD/docling</a>
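Docling usage is a few lines (this follows the project's documented quickstart; tables come through as markdown tables):

```python
from docling.document_converter import DocumentConverter

# Convert a PDF (tables included) and emit markdown.
converter = DocumentConverter()
result = converter.convert("factsheet.pdf")
print(result.document.export_to_markdown())
```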
I posted some notes on this here a couple of days ago: <a href="https://simonwillison.net/2025/Feb/26/olmocr/" rel="nofollow">https://simonwillison.net/2025/Feb/26/olmocr/</a>
Deployed a quick demo of this at <a href="https://olmocr.im" rel="nofollow">https://olmocr.im</a> if anyone wants to test. Handles multi-column PDFs surprisingly well (finally!), though YMMV with handwritten text. Feedback welcome.
Would like to know how this compares to <a href="https://github.com/tesseract-ocr/tesseract">https://github.com/tesseract-ocr/tesseract</a>
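For anyone who wants a quick Tesseract baseline on the same PDFs, a minimal sketch (assumes pytesseract and pdf2image are installed; the latter needs poppler):

```python
import pytesseract
from pdf2image import convert_from_path

# Rasterize each page and run Tesseract on it. Tesseract is layout-naive,
# so multi-column reading order is where the VLM-based tools differ most.
pages = convert_from_path("sample.pdf", dpi=300)
for i, page in enumerate(pages):
    print(f"--- page {i + 1} ---")
    print(pytesseract.image_to_string(page))
```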
Wasn't this linked here just a few days ago, with tests showing it has atrocious accuracy, misses a significant amount of text, and takes an order of magnitude more time (and several orders of magnitude more energy) compared to known OCR solutions?