OlmOCR: Open-source tool to extract plain text from PDFs

313 points by eamag · 3 months ago

17 comments

vikp · 3 months ago
I'm a fan of the team at Allen AI and their work. Unfortunately, the benchmarking of olmocr against marker (https://github.com/VikParuchuri/marker) is quite flawed.

Throughput - they benchmarked marker API cost vs local inference cost for olmocr. In our testing, marker locally gets 20-120 pages per second on an H100 (without custom kernels, etc.). Olmocr in our testing gets between 0.4 (unoptimized) and 4 (sglang) pages per second on the same machine.

Accuracy - their quality benchmarks are based on win rate with only 75 samples - which are different between each tool pair. The samples were filtered down from a set of ~2000 based on opaque criteria. They then asked researchers at Allen AI to judge which output was better. When we benchmarked with our existing set and an LLM as judge, we got a 56% win rate for marker across 1,107 documents. We had to filter out non-English docs, since olmocr is English-only (marker is not).

Hallucinations/other problems - we noticed a lot of missing text and hallucinations with olmocr in our benchmark set. You can see sample output and LLM ratings here: https://huggingface.co/datasets/datalab-to/marker_benchmark_comparison_olmocr_llm

You can see all benchmark code at https://github.com/VikParuchuri/marker/tree/master/benchmarks

Happy to chat more with anyone at Allen AI who wants to discuss this. I think olmocr is a great contribution - happy to help you benchmark marker more fairly.

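For reference, pages-per-second numbers like the ones above are usually measured with a simple timing loop over a fixed corpus. A minimal sketch - the `convert` callable is a hypothetical stand-in for either tool's conversion API, not something from the marker or olmocr codebases:

```python
import time
from pathlib import Path

def pages_per_second(convert, pdfs: list[Path]) -> float:
    """Measure OCR throughput over a corpus of PDFs.

    `convert` is a hypothetical stand-in: any callable that takes a
    PDF path and returns the number of pages it processed.
    """
    total_pages = 0
    start = time.perf_counter()
    for pdf in pdfs:
        total_pages += convert(pdf)
    elapsed = time.perf_counter() - start
    return total_pages / elapsed
```
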
rahimnathwani · 3 months ago
Good:

- no cloud service required, can run on a local Nvidia GPU

- outputs a single stream of text in the correct reading order (even for multi-column PDFs)

- recognizes handwriting and stuff

Bad:

- doesn't seem to extract the text within diagrams (which I guess is fine, because that text would be useless to an LLM)

OP is the demo page, which lets you OCR 10 pages.

The code needs an Nvidia GPU to run: https://github.com/allenai/olmocr

Not sure about the VRAM requirements, because I haven't tried running it locally yet.

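On the VRAM question: before trying, you can at least check what your card has. A quick sketch, assuming a CUDA-enabled PyTorch install:

```python
import torch  # assumes a CUDA-enabled PyTorch install

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    free_bytes, total_bytes = torch.cuda.mem_get_info(0)
    print(f"{props.name}: {total_bytes / 1024**3:.1f} GiB total, "
          f"{free_bytes / 1024**3:.1f} GiB free")
else:
    print("No CUDA GPU detected; olmocr needs an Nvidia GPU.")
```
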
chad1n · 3 months ago
These "OCR" tools, which are actually multimodal models, are interesting because they can do more than just text extraction, but their biggest flaw is hallucination and their overall nondeterministic nature. Lately I've been using Gemini to turn my notebooks into LaTeX documents, so I can see a pretty nice use case for this project, but it's not for "important" papers or papers that need 100% accuracy.

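One cheap guard against that nondeterminism is to run the same page through the model twice and diff the outputs; anything that changes between runs deserves a manual look. A stdlib-only sketch, where `ocr_page` is a hypothetical stand-in for whatever model call you use (Gemini, olmocr, ...):

```python
import difflib

def unstable_lines(ocr_page, image_path: str) -> list[str]:
    """Run a nondeterministic OCR model twice and report disagreements.

    `ocr_page` is a hypothetical stand-in for a model call that
    returns plain text for one page image.
    """
    first = ocr_page(image_path)
    second = ocr_page(image_path)
    diff = difflib.unified_diff(
        first.splitlines(), second.splitlines(), lineterm=""
    )
    # Keep changed lines, skip the "---"/"+++" file headers.
    return [
        line for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]
```
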
fschuett · 3 months ago
Very impressive. It's the only AI vision toolkit so far that actually recognizes Latin and medieval scripts. I've been trying to somehow translate public-domain medieval books (including the artwork and original layout) to PDF so they can be re-printed, i.e. pages like this: https://i.imgur.com/YLuF9sa.png - I tried a Google Vision + o1 solution, which did work to some extent, but not on the first try. This even recognizes the "E" of the artwork initial (or fixes it because of the context), which many OCR or AI solutions fail at.

The only thing I'd need now is a way to get the original font and artwork positions (would be a great addition to OlmOCR). Potentially I could work up a solution to create the font manually (as most medieval books are written in the same writing style), then find the shapes of the glyphs in the original image once I have the text, and then mask out the artwork with some OpenCV magic.

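The OpenCV part could be roughly a connected-components pass that keeps glyph-sized blobs and treats everything larger as artwork. A sketch, assuming a reasonably clean scan; the size thresholds are placeholders to tune per resolution:

```python
import cv2
import numpy as np

img = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
n, labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)

glyph_mask = np.zeros_like(binary)
for i in range(1, n):  # label 0 is the background
    x, y, w, h, area = stats[i]
    if 5 < h < 60 and area < 2000:  # glyph-sized; tune to scan resolution
        glyph_mask[labels == i] = 255

# Whatever is ink but not glyph-sized is treated as artwork.
artwork_mask = cv2.subtract(binary, glyph_mask)
cv2.imwrite("glyphs.png", glyph_mask)
cv2.imwrite("artwork.png", artwork_mask)
```
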
constantinum · 3 months ago
Tested it with the following documents:

* Loan application form: It picks up checkboxes and handwriting, but it missed a lot of form fields. Not sure why.

* Edsger W. Dijkstra's handwritten notes (from the University of Texas archive): Parsing is good.

* Badly scanned (misaligned) bill: Parsing is good. Observation: there is a name field, but it produced a synonymous name instead of the name in the bill - hallucination?

* Investment fund factsheet: It could parse the bar charts and tables, but it whimsically excluded many vital data points from the document.

* Investment fund factsheet, complex tables: Bad extraction; it could not extract merged tables, and again there was whimsical elimination of rows and columns.

Anyone curious, try LLMWhisperer[1] for OCR. It doesn't use LLMs, so there are no hallucination side effects. It also preserves the layout of the input document for more context and clarity.

There's also Docling[2], which is handy for converting tables from PDFs into markdown, though it uses Tesseract/EasyOCR under the hood, which can sometimes make the OCR results a bit less accurate.

[1] - https://pg.llmwhisperer.unstract.com/
[2] - https://github.com/DS4SD/docling

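For anyone trying Docling, the table-to-markdown path is short. A sketch based on its documented quickstart; the API may have changed since, and the input filename here is just an example:

```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("factsheet.pdf")      # local path or URL
print(result.document.export_to_markdown())     # tables come out as markdown
```
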
simonw · 3 months ago
I posted some notes on this here a couple of days ago: https://simonwillison.net/2025/Feb/26/olmocr/

mjnews · 2 months ago
Deployed a quick demo of this at https://olmocr.im if anyone wants to test. Handles multi-column PDFs surprisingly well (finally!), though YMMV with handwritten text. Feedback welcome.

TZubiri · 3 months ago
It's amazing how many of these solutions exist.

Such a hard problem that we create for ourselves.

zitterbewegung · 3 months ago
Would like to know how this compares to https://github.com/tesseract-ocr/tesseract

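A quick way to get a Tesseract baseline on the same PDFs, assuming pytesseract and pdf2image (which needs poppler installed) are available:

```python
import pytesseract
from pdf2image import convert_from_path  # requires poppler on the system

# Render each PDF page to an image, then OCR it with Tesseract.
pages = convert_from_path("paper.pdf", dpi=300)
text = "\n\n".join(pytesseract.image_to_string(page) for page in pages)
print(text)
```
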
Krasnol · 2 months ago
Make it an .exe file and storm the world's offices.

brianjking · 2 months ago
Has anyone figured out how to load this on a Huggingface endpoint?
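
Not an endpoint recipe, but loading the checkpoint locally with transformers looks roughly like this. Two assumptions to verify on the model card: the checkpoint id (`allenai/olmOCR-7B-0225-preview`) and that it is a Qwen2-VL fine-tune:

```python
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Assumptions to verify on the model card: the checkpoint id and the
# Qwen2-VL architecture it is reportedly fine-tuned from.
MODEL_ID = "allenai/olmOCR-7B-0225-preview"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)
```
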
johnthescott · 2 months ago
A surprising number of academic PDFs do not have the Title element set in the document info dictionary. Seems like a job for "ai".

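Detecting the missing /Title is easy; the "ai" part would be inferring a title from the first page. The plumbing with pypdf, using a hypothetical placeholder title:

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("paper.pdf")
title = reader.metadata.title if reader.metadata else None
if not title:
    writer = PdfWriter()
    writer.append(reader)  # copy all pages
    # The interesting ("ai") step would be inferring the title from
    # page one; here it is just a hypothetical placeholder.
    writer.add_metadata({"/Title": "Inferred Title Goes Here"})
    with open("paper_titled.pdf", "wb") as f:
        writer.write(f)
```
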
KennyBlanken · 2 months ago
Wasn't this just linked to here a few days ago, with tests showing it has atrocious accuracy, misses a significant amount of text, and takes an order of magnitude more time (and several orders of magnitude more energy) compared to known OCR solutions?

Zardoz84 · 2 months ago
I was expecting a tool to extract text from PDFs, not another LLM pretending to be a reliable OCR.

arnestrickmann · 2 months ago
That's so cool.

xz18r · 3 months ago
Why exactly does this need to be AI? OCR was a thing way before the boom and usually works pretty well. Seems like overkill.