OlmOCR: Open-source tool to extract plain text from PDFs

313 points by eamag · 3 months ago

17 comments

vikp · 3 months ago
I'm a fan of the team at Allen AI and their work. Unfortunately, the benchmarking of olmocr against marker (https://github.com/VikParuchuri/marker) is quite flawed.

Throughput - they benchmarked marker API cost vs local inference cost for olmocr. In our testing, marker locally gets 20-120 pages per second on an H100 (without custom kernels, etc.). Olmocr in our testing gets between 0.4 (unoptimized) and 4 (sglang) pages per second on the same machine.

Accuracy - their quality benchmarks are based on win rate with only 75 samples - which are different between each tool pair. The samples were filtered down from a set of ~2000 based on opaque criteria. They then asked researchers at Allen AI to judge which output was better. When we benchmarked with our existing set and an LLM as judge, we got a 56% win rate for marker across 1,107 documents. We had to filter out non-English docs, since olmocr is English-only (marker is not).

Hallucinations/other problems - we noticed a lot of missing text and hallucinations with olmocr in our benchmark set. You can see sample output and LLM ratings here: https://huggingface.co/datasets/datalab-to/marker_benchmark_comparison_olmocr_llm

You can see all benchmark code at https://github.com/VikParuchuri/marker/tree/master/benchmarks

Happy to chat more with anyone at Allen AI who wants to discuss this. I think olmocr is a great contribution - happy to help you benchmark marker more fairly.

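For reference, pages-per-second numbers like the ones above are usually measured with a simple timing loop over a fixed corpus. A minimal sketch - the `convert` callable is a hypothetical stand-in for either tool's conversion API, not something from the marker or olmocr codebases:

```python
import time
from pathlib import Path

def pages_per_second(convert, pdfs: list[Path]) -> float:
    """Measure OCR throughput over a corpus of PDFs.

    `convert` is a hypothetical stand-in: any callable that takes a
    PDF path and returns the number of pages it processed.
    """
    total_pages = 0
    start = time.perf_counter()
    for pdf in pdfs:
        total_pages += convert(pdf)
    elapsed = time.perf_counter() - start
    return total_pages / elapsed
```
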
rahimnathwani · 3 months ago
Good:

- no cloud service required, can run on a local Nvidia GPU

- outputs a single stream of text in the correct reading order (even for multi-column PDFs)

- recognizes handwriting and stuff

Bad:

- doesn't seem to extract the text within diagrams (which I guess is fine, because that text would be useless to an LLM)

OP is the demo page, which lets you OCR 10 pages.

The code needs an Nvidia GPU to run: https://github.com/allenai/olmocr

Not sure about the VRAM requirements, because I haven't tried running it locally yet.

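On the VRAM question: before trying, you can at least check what your card has. A quick sketch, assuming a CUDA-enabled PyTorch install:

```python
import torch  # assumes a CUDA-enabled PyTorch install

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    free_bytes, total_bytes = torch.cuda.mem_get_info(0)
    print(f"{props.name}: {total_bytes / 1024**3:.1f} GiB total, "
          f"{free_bytes / 1024**3:.1f} GiB free")
else:
    print("No CUDA GPU detected; olmocr needs an Nvidia GPU.")
```
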
chad1n · 3 months ago
These "OCR" tools, which are actually multimodal models, are interesting because they can do more than just text extraction, but their biggest flaw is hallucination and their overall nondeterministic nature. Lately I've been using Gemini to turn my notebooks into LaTeX documents, so I can see a pretty nice use case for this project, but it's not for "important" papers or papers that need 100% accuracy.

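One cheap guard against that nondeterminism is to run the same page through the model twice and diff the outputs; anything that changes between runs deserves a manual look. A stdlib-only sketch, where `ocr_page` is a hypothetical stand-in for whatever model call you use (Gemini, olmocr, ...):

```python
import difflib

def unstable_lines(ocr_page, image_path: str) -> list[str]:
    """Run a nondeterministic OCR model twice and report disagreements.

    `ocr_page` is a hypothetical stand-in for a model call that
    returns plain text for one page image.
    """
    first = ocr_page(image_path)
    second = ocr_page(image_path)
    diff = difflib.unified_diff(
        first.splitlines(), second.splitlines(), lineterm=""
    )
    # Keep changed lines, skip the "---"/"+++" file headers.
    return [
        line for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]
```
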
fschuett · 3 months ago
Very impressive. It's the only AI vision toolkit so far that actually recognizes Latin and medieval scripts. I've been trying to somehow translate public-domain medieval books (including the artwork and original layout) to PDF so they can be re-printed, i.e. pages like this: https://i.imgur.com/YLuF9sa.png - I tried a Google Vision + o1 solution, which did work to some extent, but not on the first try. This even recognizes the "E" of the artwork initial (or fixes it because of the context), which many OCR or AI solutions fail at.

The only thing I'd need now is a way to get the original font and artwork positions (would be a great addition to OlmOCR). Potentially I could work up a solution to create the font manually (as most medieval books are written in the same writing style), then find the shapes of the glyphs in the original image once I have the text, and then mask out the artwork with some OpenCV magic.

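The OpenCV part could be roughly a connected-components pass that keeps glyph-sized blobs and treats everything larger as artwork. A sketch, assuming a reasonably clean scan; the size thresholds are placeholders to tune per resolution:

```python
import cv2
import numpy as np

img = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
n, labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)

glyph_mask = np.zeros_like(binary)
for i in range(1, n):  # label 0 is the background
    x, y, w, h, area = stats[i]
    if 5 < h < 60 and area < 2000:  # glyph-sized; tune to scan resolution
        glyph_mask[labels == i] = 255

# Whatever is ink but not glyph-sized is treated as artwork.
artwork_mask = cv2.subtract(binary, glyph_mask)
cv2.imwrite("glyphs.png", glyph_mask)
cv2.imwrite("artwork.png", artwork_mask)
```
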
constantinum · 3 months ago
Tested it with the following documents:

* Loan application form: It picks up checkboxes and handwriting, but it missed a lot of form fields. Not sure why.

* Edsger W. Dijkstra's handwritten notes (from the University of Texas archive): Parsing is good.

* Badly scanned (misaligned) bill: Parsing is good. Observation: there is a name field, but it produced a synonymous name instead of the name in the bill - hallucination?

* Investment fund factsheet: It could parse the bar charts and tables, but it whimsically excluded many vital data points from the document.

* Investment fund factsheet, complex tables: Bad extraction; it could not extract merged tables, and again there was whimsical elimination of rows and columns.

Anyone curious, try LLMWhisperer[1] for OCR. It doesn't use LLMs, so there are no hallucination side effects. It also preserves the layout of the input document for more context and clarity.

There's also Docling[2], which is handy for converting tables from PDFs into markdown, though it uses Tesseract/EasyOCR under the hood, which can sometimes make the OCR results a bit less accurate.

[1] - https://pg.llmwhisperer.unstract.com/
[2] - https://github.com/DS4SD/docling

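For anyone trying Docling, the table-to-markdown path is short. A sketch based on its documented quickstart; the API may have changed since, and the input filename here is just an example:

```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("factsheet.pdf")      # local path or URL
print(result.document.export_to_markdown())     # tables come out as markdown
```
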
simonw · 3 months ago
I posted some notes on this here a couple of days ago: https://simonwillison.net/2025/Feb/26/olmocr/

mjnews · 2 months ago
Deployed a quick demo of this at https://olmocr.im if anyone wants to test. Handles multi-column PDFs surprisingly well (finally!), though YMMV with handwritten text. Feedback welcome.

TZubiri · 3 months ago
It's amazing how many of these solutions exist.

Such a hard problem that we create for ourselves.

zitterbewegung · 3 months ago
Would like to know how this compares to https://github.com/tesseract-ocr/tesseract

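A quick way to get a Tesseract baseline on the same PDFs, assuming pytesseract and pdf2image (which needs poppler installed) are available:

```python
import pytesseract
from pdf2image import convert_from_path  # requires poppler on the system

# Render each PDF page to an image, then OCR it with Tesseract.
pages = convert_from_path("paper.pdf", dpi=300)
text = "\n\n".join(pytesseract.image_to_string(page) for page in pages)
print(text)
```
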
Krasnol · 2 months ago
Make it an .exe file and storm the world's offices.

brianjking · 2 months ago
Has anyone figured out how to load this on a Huggingface endpoint?
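
Not an endpoint recipe, but loading the checkpoint locally with transformers looks roughly like this. Two assumptions to verify on the model card: the checkpoint id (`allenai/olmOCR-7B-0225-preview`) and that it is a Qwen2-VL fine-tune:

```python
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Assumptions to verify on the model card: the checkpoint id and the
# Qwen2-VL architecture it is reportedly fine-tuned from.
MODEL_ID = "allenai/olmOCR-7B-0225-preview"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)
```
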
johnthescott · 2 months ago
A surprising number of academic PDFs do not have the Title element set in the document info dictionary. Seems like a job for "ai".

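Detecting the missing /Title is easy; the "ai" part would be inferring a title from the first page. The plumbing with pypdf, using a hypothetical placeholder title:

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("paper.pdf")
title = reader.metadata.title if reader.metadata else None
if not title:
    writer = PdfWriter()
    writer.append(reader)  # copy all pages
    # The interesting ("ai") step would be inferring the title from
    # page one; here it is just a hypothetical placeholder.
    writer.add_metadata({"/Title": "Inferred Title Goes Here"})
    with open("paper_titled.pdf", "wb") as f:
        writer.write(f)
```
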
KennyBlanken · 2 months ago
Wasn't this just linked to here a few days ago, with tests showing it has atrocious accuracy, misses a significant amount of text, and takes an order of magnitude more time (and several orders of magnitude more energy) compared to known OCR solutions?

Zardoz84 · 2 months ago
I was expecting a tool to extract text from PDFs, not another LLM pretending to be a reliable OCR.

arnestrickmann · 2 months ago
That's so cool.

xz18r · 3 months ago
Why exactly does this need to be AI? OCR was a thing way before the boom and usually works pretty well. Seems like overkill.