Show HN: Qwen-2.5-32B is now the best open source OCR model

211 points by themanmaran about 1 month ago

Last week was big for open source LLMs. We got:

- Qwen 2.5 VL (72b and 32b)
- Gemma-3 (27b)
- DeepSeek-v3-0324

And a couple weeks ago we got the new mistral-ocr model. We updated our OCR benchmark to include the new models.

We evaluated 1,000 documents for JSON extraction accuracy. Major takeaways:

- Qwen 2.5 VL (72b and 32b) are by far the most impressive. Both landed right around 75% accuracy (equivalent to GPT-4o's performance). Qwen 72b was only 0.4% above 32b, which is within the margin of error.
- Both Qwen models passed mistral-ocr (72.2%), which is specifically trained for OCR.
- Gemma-3 (27B) only scored 42.9%. Particularly surprising given that its architecture is based on Gemini 2.0, which still tops the accuracy chart.

The data set and benchmark runner are fully open source. You can check out the code and reproduction steps here:

- https://getomni.ai/blog/benchmarking-open-source-models-for-ocr
- https://github.com/getomni-ai/benchmark
- https://huggingface.co/datasets/getomni-ai/ocr-benchmark
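For readers wondering what "JSON extraction accuracy" can mean in practice, here is a minimal sketch of one plausible field-level scoring scheme. It is an assumption for illustration, not the benchmark's actual scorer (see the linked repository for that); the names `flatten` and `field_accuracy` and the sample invoice data are hypothetical.

```python
from typing import Any


def flatten(value: Any, prefix: str = "") -> dict:
    """Flatten nested dicts and lists into a {path: leaf_value} mapping."""
    if isinstance(value, dict):
        items = [(f"{prefix}.{k}" if prefix else str(k), v) for k, v in value.items()]
    elif isinstance(value, list):
        items = [(f"{prefix}[{i}]", v) for i, v in enumerate(value)]
    else:
        return {prefix: value}
    flat = {}
    for path, child in items:
        flat.update(flatten(child, path))
    return flat


def field_accuracy(expected: Any, predicted: Any) -> float:
    """Fraction of ground-truth leaf fields that the prediction matches exactly.

    Assumption: extra fields in the prediction are ignored; a stricter
    scorer could penalize them as well.
    """
    truth = flatten(expected)
    guess = flatten(predicted)
    if not truth:
        return 1.0 if not guess else 0.0
    missing = object()  # sentinel so a missing field never equals a real value
    correct = sum(1 for path, v in truth.items() if guess.get(path, missing) == v)
    return correct / len(truth)


# Hypothetical ground truth and model output for one document.
expected = {"invoice": {"number": "INV-1", "total": 120.5},
            "items": [{"sku": "A"}, {"sku": "B"}]}
predicted = {"invoice": {"number": "INV-1", "total": 125.0},
             "items": [{"sku": "A"}, {"sku": "B"}]}

print(field_accuracy(expected, predicted))  # 3 of 4 fields match -> 0.75
```

Averaging a per-document score like this over all documents would give a single headline percentage per model, which is roughly how the numbers above can be read.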

13 comments

jauntywundrkind about 1 month ago

The 32b sounds like it has some useful small tweaks: more human-friendly output, better mathematical reasoning, better fine-grained understanding. https://qwenlm.github.io/blog/qwen2.5-vl-32b/ https://news.ycombinator.com/item?id=43464068

Qwen2.5-VL-72b was released two months ago (to little fanfare in submissions, I think, but with some very enthusiastic comments, such as rabid enthusiasm for handwriting recognition) and was already very interesting. It's actually one of the releases that kind of turned me on to AI, that broke through some of my skepticism & grumpiness. There are pretty good release notes detailing capabilities here; well done blog post. https://qwenlm.github.io/blog/qwen2.5-vl/

One thing that really piqued my interest was Qwen HTML output, where it can provide bounding boxes in HTML format for its output. That really closes the loop for me: it makes the output something I can imagine quickly building useful visual feedback around, or easily using the structured data from. I can't imagine an easier-to-use output format.
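To make the "easy to use" point concrete, here is a small sketch of pulling boxes and text out of that kind of HTML using only the Python standard library. It assumes each element carries a `data-bbox="x1 y1 x2 y2"` attribute, as in the examples in the Qwen2.5-VL blog post; the sample markup, the `BBoxParser` class, and the `extract_regions` helper are made up for illustration, so check real model output before relying on this.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class BBoxParser(HTMLParser):
    """Collect every element that has a data-bbox attribute, plus its text."""

    def __init__(self):
        super().__init__()
        self.regions = []  # one dict per element with a bounding box
        self._open = []    # stack of open elements (None if no bbox)

    def handle_starttag(self, tag, attrs):
        bbox = dict(attrs).get("data-bbox")
        region = None
        if bbox is not None:
            # Assumed attribute format: four space-separated integers.
            region = {"tag": tag, "bbox": tuple(int(n) for n in bbox.split()), "text": ""}
            self.regions.append(region)
        if tag not in VOID_TAGS:
            self._open.append(region)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._open:
            self._open.pop()

    def handle_data(self, data):
        # Text belongs to every enclosing element that has a bounding box.
        for region in self._open:
            if region is not None:
                region["text"] += data


def extract_regions(html):
    parser = BBoxParser()
    parser.feed(html)
    parser.close()
    return parser.regions


# Hypothetical model output for a one-page invoice.
sample = ('<h1 data-bbox="50 40 400 80">Invoice</h1>'
          '<p data-bbox="50 100 300 130">Total: <b>$120.50</b></p>')

for region in extract_regions(sample):
    print(region["bbox"], region["text"])
```

Each region could then be drawn as a rectangle over the page image, which is the kind of visual feedback loop described above.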

ks2048 about 1 month ago

I suppose none of these models can output bounding box coordinates for extracted text? That seems to be a big advantage of traditional OCR over LLMs.

For applications I'm interested in, until we can get to 95+% accuracy, it will require human double-checking / corrections, which seems infeasible without bounding boxes to quickly check for errors.


pmarreck about 1 month ago

Downloading the MLX version of "Qwen2.5-VL-32b-Instruct -8bit" via LM Studio right now since it's not yet available on Ollama and I can run it locally... I have an OCR side project for it to work on, want to see how performant it is on my M4... will report back


daemonologist about 1 month ago

You mention that you measured cost and latency in addition to accuracy - would you be willing to share those results as well? (I understand that for these open models they would vary between providers, but it would be useful to have an approximate baseline.)


fpgaminer about 1 month ago

I've been consistently surprised by Gemini's OCR capabilities. And yeah, Qwen is climbing the vision ladder _fast_.

In my workflows I often have multiple models competing side-by-side, so I get to compare the same task executed on, say, 4o, Gemini, and Qwen. And I deal with a very wide range of vision-related tasks. The newest Qwen models are not only overall better than their previous release by a good margin, but also much more stable (less prone to glitching) and easier to finetune. I'm not at all surprised they're topping the OCR benchmark.

What bugs me though is OpenAI. Outside of OCR, 4o is still king in terms of overall understanding of images. But 4o is now almost a year old, and in all that time they have improved neither the vision performance nor the OCR in any newer release. OpenAI's OCR has been bad for a long time, and it's both odd and annoying.

Take this with a grain of salt, since again I've only had it in my workflow for about a week or two, but I'd say Qwen 2.5 VL 72b beats Gemini for general vision. That lands it in second place for me. And it can be run _locally_. That's nuts. I'm going to laugh if Qwen drops another iteration in a couple months that beats 4o.

ks2048 about 1 month ago

I've been doing some experiments with the OCR API on macOS lately and wonder how it compares to these LLMs.

Overall, it's very impressive, but it makes some mistakes (on easy images, i.e. obviously wrong) that require human intervention.

I would like to compare it to these models, but this benchmark goes beyond OCR: it scores extracted structured JSON.

AndrewDucker about 1 month ago

Tesseract can manage 99% accuracy on anything other than handwriting. Without being an LLM.

Is there an advantage of using an LLM here?


CSMastermind about 1 month ago

I've been very impressed with Qwen in my testing; I think people are underestimating it.


WillAdams about 1 month ago

How does one configure an LLM interface using this to process multiple files with a single prompt?


codybontecou about 1 month ago

Nice work Tyler and team!

ianhawes about 1 month ago

Is there a reason Surya isn’t included?

sandreas about 1 month ago

What about mini cpm v2.6?

azinman2 about 1 month ago

News update: OCR company touts new benchmark that shows its own products are the most performant.

