Why LLMs still have problems with OCR

218 points | by ritvikpandey21 | 3 months ago
Document ingestion and the launch of Gemini 2.0 caused a lot of buzz this week. As a team building in this space, this is something we researched thoroughly. Here’s our take: ingestion is a multistep pipeline, and maintaining confidence from LLM nondeterministic outputs over millions of pages is a problem.

45 comments

michaelbuckbee | 3 months ago
I took a picture of a grocery list and then pasted it into ChatGPT to have it written out, and it worked flawlessly... until I discovered that I'd messed up the picture when I took it at an angle and had accidentally cut off the first character or two of the bottom half of the list.

ChatGPT just inferred that I wanted the actual full names of the items (aka "flour" instead of "our").

Depending on how you feel about it, this is either an absolute failure of OCR or wildly useful and much better.
coder543 | 3 months ago
I'm somewhat surprised neither this article nor the previous one mentions anything about the Florence-2 model series. I had thought that Florence-2 was not just surprisingly capable for this kind of work, but also *easily* fine-tunable for a particular kind of document, when you expect to process a lot of instances of that document and want to further optimize accuracy. It's extremely small (0.23B and 0.77B parameters), so it's easy to run, easy to fine-tune, and probably unlikely to overthink things.

https://arxiv.org/abs/2311.06242

https://huggingface.co/blog/finetune-florence2

https://blog.roboflow.com/florence-2-ocr/

https://www.assemblyai.com/blog/florence-2-how-it-works-how-to-use/

I don't personally deal with any OCR tasks, so maybe I misread the room, but it sounded promising, and I have seen some continuing interest in it online elsewhere.

In addition to the architectural issues mentioned in OP's article that are faced by most SOTA LLMs, I also expect that current SOTA LLMs like Gemini 2.0 Flash aren't being trained with very many document OCR examples... for now, it seems like the kind of thing that could benefit from fine-tuning on that objective, which would help emphasize to the model that it doesn't need to try to solve any equations or be helpful in any smart way.
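For anyone who wants to try this, a minimal sketch of running Florence-2 for plain OCR through Hugging Face transformers, based on the blog posts linked above; treat the model ID, the trust_remote_code requirement, and the "<OCR>" task prompt as assumptions from that documentation rather than anything tested here:

    # Minimal Florence-2 OCR sketch (assumes: transformers, Pillow, and a local image file).
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    model_id = "microsoft/Florence-2-base"  # 0.23B variant; Florence-2-large is 0.77B
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

    image = Image.open("scanned_page.png").convert("RGB")
    inputs = processor(text="<OCR>", images=image, return_tensors="pt")

    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    result = processor.post_process_generation(raw, task="<OCR>", image_size=image.size)
    print(result["<OCR>"])  # plain transcription of the page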
jll29 | 3 months ago
In case any scientist actually working on adaptive OCR is reading this: I was given a post-WWII newspaper archive (PDF scans, 1945-2006, German language) that I would like to OCR with the highest quality. Compute demands are not an issue; I've got an army of A100s available.

I played with OCR post-correction algorithms and invented one method myself in 1994, but haven't worked in that space since. Initial Tesseract and GPT-4o experiments disappoint. Any pointers (papers, software) & collaboration suggestions welcome.
jeswin | 3 months ago
If Pulse (which is a competing product, the premise of which is threatened by both closed and open models) wants to dispute the post from earlier this week, it should provide samples that fail in Claude and Gemini. The image [1] in the post is low-resolution and fuzzy. Claude's user manual specifically says: "Images uploaded on Claude.ai can be up to 30MB, and up to 8000x8000 pixels. We recommend avoiding small or low resolution images where possible."

> We have hundreds of examples like this queued up, so let us know if you want some more!

Link to them then, and let people verify.

I've pushed a lot of financial tables through Claude, and it gives remarkable accuracy (99%+) when the text size is legible to a mid-40s person like me. GPT-4o is far less accurate.

[1]: https://cdn.prod.website-files.com/6707c5683ddae1a50202bac6/67a51edca559b69e9663b3b7_AD_4nXf8rt2Tz1Pk_F9oyajrpZscKm4Q5weP9WkNQtsguIpwKfAlw3Q53qJMW1wUxCrI5kZlIX-NXWKVUEoOUKy7Pq2cXbXWmmDT_IqCxBOLai5g6T8tHbVe1KwabmGsBVU56OJCyBzOXA.png
password4321 | 3 months ago
As opposed to the discussion 2 days ago with 400+ comments: "Ingesting PDFs and why Gemini 2.0 changes everything"

https://news.ycombinator.com/item?id=42952605
mehulashah | 3 months ago
(CEO of Aryn here: https://aryn.ai)

Nice post and response to the previous one.

It's important to remember that the use cases for VLMs and document parsers are often different. VLMs definitely take a different approach than layout detection and OCR, but they're not mutually exclusive. VLMs are adaptable with prompting, e.g. "please pull out the entries related to CapEx and summarize the contributions." Layout parsers and OCR are often used for indexing and document automation. Each will have its own place in an enterprise stack.
snthd | 3 months ago
> Unlike traditional OCR systems that fail obviously when uncertain, LLMs make educated guesses that appear plausible but may be entirely wrong.

Except for a very special kind of bug:

https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_are_switching_written_numbers_when_scanning

> Xerox scanners/photocopiers randomly alter numbers in scanned documents
thorum | 3 months ago
This seems like a problem that will quickly fall to the new reinforcement learning methods introduced by DeepSeek. Just build a system to synthetically render a few million pages of insanely complex, hard-to-parse documents with different layouts, along with a JSON description of what the *correct* OCR should be, mix in some human-annotated datasets, then do RL against a verifier that insists on 100% accuracy.
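To make that concrete, a toy sketch of the kind of all-or-nothing verifier described above (entirely hypothetical; the function and field names are made up for illustration). The synthetic renderer gives you the ground-truth JSON for free, and the reward is binary:

    # Hypothetical exact-match reward for RL fine-tuning on synthetic OCR data.
    import json

    def ocr_reward(model_output: str, ground_truth: dict) -> float:
        """Return 1.0 only for an exact structural match against the ground truth, else 0.0."""
        try:
            parsed = json.loads(model_output)
        except json.JSONDecodeError:
            return 0.0
        return 1.0 if parsed == ground_truth else 0.0

    # One wrong digit means zero reward.
    truth = {"invoice_total": "1,234.56", "date": "2024-02-06"}
    print(ocr_reward('{"invoice_total": "1,234.56", "date": "2024-02-06"}', truth))  # 1.0
    print(ocr_reward('{"invoice_total": "1,234.58", "date": "2024-02-06"}', truth))  # 0.0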
osigurdson | 3 months ago
ChatGPT is also still hilariously bad at drawing diagrams - universally producing a silly cartoon with misspelled words. The rate of improvement over the past two years is effectively zero.
markisus | 3 months ago
I found this part questionable.

> Fixed patch sizes may split individual characters

> Position embeddings lose fine-grained spatial relationships, losing the ability to have human-in-the-loop evaluations, confidence scores, and bounding box outputs.

The author suggests that the standard ViT architecture is poorly suited for OCR because patches do not respect character boundaries and because the positional embeddings only encode the locations of patches, which are 16x16 pixels.

My mental model is that a token is a memory slot where computation results can be stored or retrieved from. There is no reason why the layout of these memory slots must mimic the layout of the document, except at the very first layer, because then we don't have to think too hard about how to encode the document.
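For readers who haven't looked inside a ViT, a toy sketch of the front end being debated here (not from the article; shapes follow the standard ViT-B/16 setup): the image is cut into 16x16 patches, each patch becomes one token, and one learned position embedding is added per patch rather than per pixel.

    # Toy ViT-style patchify step (assumes: torch installed).
    import torch
    import torch.nn as nn

    patch, dim = 16, 768
    image = torch.randn(1, 3, 224, 224)                  # batch, channels, height, width
    unfold = nn.Unfold(kernel_size=patch, stride=patch)
    patches = unfold(image).transpose(1, 2)              # (1, 196, 3*16*16): 14x14 patches
    project = nn.Linear(3 * patch * patch, dim)
    tokens = project(patches)                            # (1, 196, 768): one token per patch
    pos_embed = nn.Parameter(torch.zeros(1, tokens.shape[1], dim))
    tokens = tokens + pos_embed                          # position is per 16x16 patch, not per pixel
    print(tokens.shape)                                  # torch.Size([1, 196, 768])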
bambax | 3 months ago
I'm making a simple service that outputs layout-following ASCII from images, PDFs of images, or text PDFs. I too think the risk of hallucination is in many cases too great.

I fed my system the first image in the post [0] and got the text below in return.

I will be looking for beta testers next week... Email if interested!

    VEH YR MAKE MODEL IDENTIFICATION TYPE SYM ST TER USE CLASS ALARM
    2 02 HOND CIVIC EX 1HGEM22952L086006 PP 18 IL 37 L 887120
    LOSS PAYEE THAT APPLIES: 2
    3.02 HYUN SONATA / GL KMHWF25S72A671544 PP 16 IL 37 P 887120 H
    NO. COVERAGE DESCRIPTION LIABILITY LIMIT (S) DEDUCTIBLE PREMIUM
    2 Preferred Extra Auto
      Bodily Injury $ 250,000 / $ 500,000 $ 92.00
      Property Damage $ 100,000 $ 43.00
      Medical Payments $ 5,000 $ 13.00
      Uninsured Motorist $ 250,000 / $ 500,000 $ 62.00
      Undinsured Motor.-BI $ 250,000 / $ 500,000 INCL
      Collision $ 500 $ 141.00
      Other than Collision $ 250 $ 92.00
      TOTAL FOR UNIT 2 $ 443.00
    3- Preferred Extra Auto
      Bodily Injury $ 250,000 / $ 500,000 $ 92.00
      Property Damage $ 100,000 $ 43.00
      Medical Payments $ 5,000 $ 13.00
      Uninsured Motorist $ 250,000 / $ 500,000 $ 62.00
      Undinsured Motor. BI $ 250,000 / $ 500,000 INCL
      Collision $ 500 $ 136.00
      Other than Collision $ 250 $ 90.00
      TOTAL FOR UNIT 3 $ 436.00
    DRIVER INFORMATION
    DR VEH SEX MAR BIRTH G / S PRIN DVR LIC NO. NAME PTS

[0] https://i.imgur.com/sLWQoFG.jpeg
bryzaguy | 3 months ago
I wasn't seeing what OCR stands for; I believe it's Optical Character Recognition.
faebi | 3 months ago
Shouldn't it be easy to generate a lot of OCR data? Generate HTML, randomize it, render an image, apply noise, and let the model train on it.
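A toy sketch of that pipeline (hypothetical; it renders random strings with Pillow instead of HTML to keep the example self-contained - swap in an HTML renderer, real fonts, and real layouts for anything serious):

    # Synthetic OCR data sketch: render random text, add scan-like noise, keep the ground truth.
    import json
    import random
    import string

    import numpy as np
    from PIL import Image, ImageDraw, ImageFont

    def random_line(n_words: int = 6) -> str:
        return " ".join(
            "".join(random.choices(string.ascii_letters + string.digits, k=random.randint(3, 10)))
            for _ in range(n_words)
        )

    def make_sample(path: str) -> dict:
        lines = [random_line() for _ in range(random.randint(5, 15))]
        img = Image.new("L", (800, 30 * len(lines) + 40), color=255)
        draw = ImageDraw.Draw(img)
        font = ImageFont.load_default()  # randomize real fonts here for more variety
        for i, line in enumerate(lines):
            draw.text((20, 20 + 30 * i), line, fill=0, font=font)
        # Gaussian noise to mimic scan artifacts.
        arr = np.asarray(img, dtype=np.float32) + np.random.normal(0, 15, (img.height, img.width))
        Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8)).save(path)
        return {"image": path, "text": "\n".join(lines)}

    if __name__ == "__main__":
        with open("labels.jsonl", "w") as f:
            for i in range(10):
                f.write(json.dumps(make_sample(f"sample_{i}.png")) + "\n")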
apt-get | 3 months ago
Question to anyone with experience in this domain: I have CSAM spam problems on a forum I host, with bots putting link-shortener URLs embedded in images rather than in the post body. Traditional OCR software deals poorly with them due to font modifications and intentional text-edge modifications, and I'm obviously not gonna use a SaaS/closed-source model to upload a bunch of may-be-may-not-be-CSAM pictures, so I'm looking for a way to do this locally, with cheapish inference if possible (I don't mind spending a minute of compute to get the result out for one image, but I need to do it on the CPU).

Is there any small model that would do this effectively, with pure text extraction (without going for any kind of formatting or whatnot)?
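Not an endorsement, but one locally runnable, CPU-only option to try is EasyOCR, which can return bare text with no layout. A minimal sketch (the shortener list and filenames are illustrative):

    # Local, CPU-only text extraction sketch with EasyOCR (pip install easyocr).
    import easyocr

    reader = easyocr.Reader(["en"], gpu=False)        # gpu=False forces CPU inference
    lines = reader.readtext("suspect_image.png", detail=0)  # detail=0 -> just the strings
    text = "\n".join(lines)

    # Crude check for link-shortener domains hidden in image text.
    SHORTENERS = ("bit.ly", "t.co", "tinyurl", "goo.gl")
    if any(s in text.lower() for s in SHORTENERS):
        print("flagged for review")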
wkat4242 | 3 months ago
I noticed llama 3.2 8b has big problems reading white-on-black text. Black-on-white goes much better. But I think it makes sense: they don't look at text like a dedicated OCR algorithm does. I see the article elaborates on this very well.
llm_trw | 3 months ago
This is a response to: https://news.ycombinator.com/item?id=42952605

A fun thread to read for the current hype cycle.

You can tell who is working in the field by the fact that they don't use VLMs for OCR, and who isn't because they think it's a solved problem.

A question to the authors: do you have the resources to train any VLMs from scratch? They aren't quite the beasts the SOTA LLMs are, and I think they can be made a lot more useful with:

1) Better training data.

2) Larger vision parts of the model.

In short: 2D attention is not something that anyone's doing at scale - that I know of - and is a no-brainer for understanding images.
kyriakos | 3 months ago
I find that LLMs can read text off product label photos I can't even read myself.
julienchastang | 3 months ago
I've had limited but good experience (with both English and French text) with Tesseract, then getting ChatGPT to fix problems with clever prompting (e.g., "pretend you are an expert OCR corrector", blah blah blah).
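That two-stage workflow looks roughly like the sketch below (assumptions: a local Tesseract binary plus pytesseract, an OPENAI_API_KEY in the environment, and an illustrative model name; the system prompt is the "expert OCR corrector" trick described above):

    # Tesseract extraction followed by LLM post-correction (sketch, not a tested pipeline).
    import pytesseract
    from openai import OpenAI
    from PIL import Image

    raw_text = pytesseract.image_to_string(Image.open("scan.png"), lang="eng+fra")

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an expert OCR corrector. Fix obvious recognition errors in the "
                    "text you are given, but do not add, remove, or reorder content."
                ),
            },
            {"role": "user", "content": raw_text},
        ],
        temperature=0,
    )
    print(resp.choices[0].message.content)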
__rito__ | 3 months ago
I was just trying a bunch of models for OCR. I only have 4 GB of VRAM in my personal machine.

My goal was to run an OCR model locally and extract text from scanned PDFs.

Many models could not even be run. Among those that did run (thanks to Ollama), the experience was very poor: llava-llama3, phi3.5 vision, etc.

What worked really well, though still not up to the mark: Surya [0].

It works perfectly on screenshots from true text PDFs, but not on scanned PDFs. It also has much better performance for English than for Indian languages.

[0]: https://github.com/VikParuchuri/surya
edanm | 3 months ago
I'd just like to say this is a fantastic "marketing" blog post. Great explanation of an interesting problem that this company theoretically helps solve. Very well done!

One note - there was a callout at the end to "stay tuned" for a follow-up post about the actual solution. I may have missed it, but I don't see any way to actually sign up to the blog or newsletter or anything. That's a shame - I'd love to follow this topic and product (and potentially have a few real-world use cases for it).
practice9 | 3 months ago
I tried the square example from the paper mentioned with o1-pro and it had no problem counting 4 nested squares... and the 5-square variation as well.

So perhaps it is just a question of how much compute you are willing to throw at it.
pilooch | 3 months ago
It's good and useful to see empirical analyses like this. I use open and custom VLMs a lot. The point of VLMs is that OCR is not needed anymore: it's intrinsic to the model. For instance, at work we've developed a family of vision-based RAG systems, and their performance is twice that of a text-based one. The point I'd like to make here is that OCR is an intermediate step that, in many cases, is not explicitly needed anymore. My hunch is that pure OCR will go away.
nicodjimenez | 3 months ago
Check out mathpix.com - we have a hybrid approach to OCR that features accurate layout understanding (with accurate bounding boxes) plus accurate OCR outputs.

Disclaimer: I'm the founder and CEO.
levocardia | 3 months ago
LLMs do not struggle at all with raw text: they never lose decimal places or drop digits when transcribing a table from raw text. So the problem is not the internal representation. I do this all the time and all major LLMs work eminently well at it.

The problem comes from the vision part. Either (a) the ViT architecture needs a rework, or (b) the vision models need more training on tasks of the "copy this" nature versus the "do this" nature.
uri_merhav | 3 months ago
There are lots of hidden gotchas to this. Uploading a screenshot and asking an LLM to transcribe one page is generally OK. Give it a table that spans pages, or a 60-page doc, and you're in dire straits.

I cofounded DocuPanda to handle this issue specifically. Call me biased, but I do believe it's the best solution out there.
Zufriedenheit | 3 months ago
Is there an OCR arena out there, similar to lmarena? It would be very useful, but I couldn't find one yet.
WhitneyLand | 3 months ago
>> When an LLM processes a document image, it first embeds it into a high-dimensional vector space through the attention mechanism…

This is a confusing way to describe attention, and it gets a bit off topic; the attention mechanism is not really what's causing any of the issues in the article.
m3kw9 | 3 months ago
You don't really feed images to LLMs, but rather to a vision model within the multi-modal LLM.
martingoodson | 3 months ago
I've worked in data extraction from documents for a decade and have developed algorithms in the space. I've developed a product using LLMs for this purpose too.

This article is essentially correct.
gieksosz | 3 months ago
I just tried the rectangle test on 4o and it answered correctly.
jmartin2683 | 3 months ago
We use Claude 3.5 Sonnet to OCR and structure tabular data from PDFs and it's virtually flawless… orders of magnitude better than Textract (or pretty much any other LLM).
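For reference, the basic shape of that kind of call through the Anthropic Python SDK looks like the sketch below (the model string, prompt, and filenames are illustrative assumptions, not the commenter's actual setup):

    # Claude tabular-OCR sketch: send a page image, ask for an exact CSV transcription.
    import base64

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    with open("statement_page.png", "rb") as f:
        image_b64 = base64.standard_b64encode(f.read()).decode()

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64", "media_type": "image/png", "data": image_b64}},
                {"type": "text",
                 "text": "Extract every table on this page as CSV. Transcribe values exactly; "
                         "do not correct or infer anything."},
            ],
        }],
    )
    print(message.content[0].text)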
jrochkind1 | 3 months ago
LLMs seem to be really good at audio speech to text though. One would naively think these are similar problems, but apparently not?
mycall | 3 months ago
Ripcord demo'd their stack to me yesterday and the use of LLMs works great for OCR, so it is indeed possible.
akkad33 | 3 months ago
I use ChatGPT to convert tables in PNGs and PDFs to pandas data frames and it works very well.
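A sketch of that image-table-to-DataFrame flow (hypothetical; the model name, prompt, and filenames are illustrative): ask for CSV only, then load the reply into pandas and sanity-check it against the source.

    # Table image -> CSV via a vision model -> pandas DataFrame (sketch only).
    import base64
    import io

    import pandas as pd
    from openai import OpenAI

    client = OpenAI()
    with open("table.png", "rb") as f:
        b64 = base64.standard_b64encode(f.read()).decode()

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe the table in this image as CSV only, no commentary."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        temperature=0,
    )
    csv_text = resp.choices[0].message.content
    df = pd.read_csv(io.StringIO(csv_text))  # verify row/column counts against the source image
    print(df.head())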
iwangulenko | 3 months ago
Resume parsing has been a problem for decades, and even today it can never be done right, because SOME resumes are just so f***ed up.
lazyeye | 3 months ago
Is this just a training issue? Do they just need to train a model specifically for OCR?
callamdelaney | 3 months ago
To be fair, they would say that, since they are selling a competing thing.
salimmahboubi | 3 months ago
To me, the question is: why do we keep using PDFs that never get printed?
jebarker | 3 months ago
s/LLMs/VLMs/g
fpgaminer | 3 months ago
A lot of problems jump out to me with this article, particularly with the explanation of multi-modal LLMs. I'll say that I _do_ agree with the thrust of the article. Don't trust LLMs. But they probably should have argued legitimate issues with VLM-based OCR, rather than try to talk about how VLMs are somehow fundamentally flawed or something.

> LLMs process images through high-dimensional embeddings, essentially creating abstract representations that prioritize semantic understanding over precise character recognition.

This isn't true. CLIP and its derivatives don't prioritize semantic understanding. They are trained contrastively, which (very roughly speaking) means they need to be able to differentiate similar images. If two images are just white with a few words, the only way to differentiate them is to include the text in the embedding.

Pretrained CLIP models do tend to be a bit lossy in this department, but not by as much as you would think, considering they boil an entire image down to something on the order of 768 floats.

> Each step in this pipeline optimizes for semantic meaning while discarding precise visual information.

Again, that... doesn't make any sense. It's a bit foolhardy to even say _what_ the models do, given that not even the most brilliant ML researchers know. But as a broad _hypothesis_, the CLIP pipeline is optimizing being able to pair images with captions amongst a large number of possibilities. Which, again, requires them to surface all kinds of information from the image, and oftentimes requires surfacing specific text from the image. How else would it differentiate PowerPoint slides? Math problems in images? Etc.

> Fixed patch sizes may split individual characters

This doesn't matter. We know from empirical evidence. But even if it _did_, there are plenty of vision models that use overlapping patches.

> Position embeddings lose fine-grained spatial relationships

This isn't true. The model is fully aware of the position of pixels within patches, and the position embedding is merely to tell it the position of the patches themselves within the image. Therefore it can derive the absolute position of every pixel, if it needs to. In fact, we have proof they can and do.

> losing the ability to have human-in-the-loop evaluations, confidence scores, and bounding box outputs.

You get confidence scores for free because the model is explicitly trained to provide cosine similarity scores.

OWLv2 is a CLIP-based open-vocabulary bounding box model (from Google, makers of Gemini). It's finetuned from a standard, pretrained CLIP model. Nothing really special about the vision architecture; it just gets finetuned to output bounding boxes. And it beats the pants off YOLO while being open-vocabulary to boot. So not only are CLIP-like models capable of outputting bounding boxes, but OWLv2 was trained with human-in-the-loop processes and outputs confidence scores.

Oh, and there's Florence, which is a VLM trained on bounding boxes.

> Favor common words over exact transcription

Nothing about LLMs indicates that. In fact, pretrained LLMs favor exact transcription.

> "Correct" perceived errors in the source document

Which OCR systems need to do to be useful for many applications. I get the argument that LLMs are a black box in this regard, which is a legitimate criticism, but correcting mistakes is not fundamentally the issue.

It's better to say that LLMs _blindly_ correct issues. Whereas, perhaps, one could say a traditional OCR system can report "this is my exact transcription, I corrected it to this" and have various knobs to tweak thresholds. But there's no reason VLMs can't do that too.

> Merge or reorder information based on learned patterns

LLMs are perfectly capable of regurgitating data verbatim. That's perhaps the first thing they learn to do to get loss down. That's what all long-context models are benchmarked against.

> Produce different outputs for the same input due to sampling

You can turn off sampling, and then they are deterministic. Or you can output the logits to the user, which effectively gives you confidence scores on the transcription.

And a well-trained LLM for this task isn't really "probabilistic" in the sense that its outputs are completely different each time. If it's trained and prompted specifically to transcribe a document, that's what it's going to do. Any variations in output at that point are a result of real vagaries in the document, the vision, or the user request.

If a user wants consistency, they merely need to ask for it. Or the VLM needs to be trained better. In either case, these models are _capable_ of it.

It's most important to note here that, outside of pretrained LLMs, all LLMs that users interact with are reinforcement trained. So while they were trained on next-token prediction during _pretraining_, they get trained to seek reward in production. That vastly trims the logits and focuses the model explicitly on performing tasks. Well-trained, production LLMs only really put probability mass on tokens that are legitimately valid for the task at hand (bounded by the LLM's intelligence, of course).

> Unlike traditional OCR systems that fail obviously when uncertain, LLMs make educated guesses that appear plausible but may be entirely wrong. Consider the sequence "rn" versus "m". To a human reader scanning quickly, or an LLM processing image patches, these can appear nearly identical. The model, trained on vast amounts of natural language, will tend toward the statistically more common "m" when uncertain.

Again, LLMs don't just regurgitate the most "common" stuff. They are context specific. Besides, it's the vision module that would be making the differentiation here between rn and m. A vision module that is likely neither better nor worse than the vision modules traditional OCR systems are using. (Of course, the LLM may process the vision module's output and notice that perhaps it mis-transcribed "rn" vs "m" and "correct" it. But correct it based on _context_, not on some simplistic statistical model as suggested.)

> There's a great paper from July 2024 (millennia ago in the world of AI) titled "Vision language models are blind" that emphasizes shockingly poor performance on visual tasks a 5 year old could do

Absolutely. I work in this field, and these vision models are not at the same level as their language counterparts, due in large part to a lack of good data, good training processes, and good benchmarks. The Cambrian-1 paper is quite insightful here, as it studies the vision benchmarks themselves (https://arxiv.org/abs/2406.16860). The TLDR is that most of the vision benchmarks are actually just text benchmarks, and performance barely degrades when the model is blinded. I've found the same to be true of almost all publicly available training datasets for vision models, which is likely why these models don't learn good, robust visual understandings.

That doesn't really speak to the fundamental capabilities of the vision models. It speaks to the lack of training them well. So, if a model is explicitly trained to do OCR using lots of high-quality ground truth data (which is easy to get and generate), then its performance can, and does, excel.

---

Now, all of that said, I also don't agree with the prior post this post is in response to. I work with VLMs a lot as part of my research, and I can assure you that they are nowhere near human level on OCR. They can exceed human performance in very specific tasks at the moment, but that's about it.

Are they better than other OCR offerings? As of this moment, I would tend to trust someone who does OCR for a living, so if Pulse says VLMs aren't as good as their solution, I would probably trust that over someone else saying VLMs work for their specific application. And VLMs _absolutely_ come with a myriad of caveats. They aren't as reliable as a more mechanical OCR system. Expect something like GPT-4o to completely glitch 1 in every 10,000 queries. And expect them to be "weird". GPT-4o will tend to not fully follow instructions maybe 1 in 100 times, so you might get your document back in the wrong format, or have "Sure, I can help with that!" at the start of your document, etc. Gemini tends to have better instruction following, but I don't have a good assessment of its reliability yet.

If I, personally, had a small project that needed OCR, I'd use Tesseract if it's just PDFs or something like that with printed text. If it's something with weird fonts, fancy stuff, handwriting, math formulas, etc., I might give Gemini a try. If it's mission critical, pay an expert to do it, whether that's in-house or paying a service explicitly built for the purpose.

---

NOTE: One thing that got glossed over in the article is that VLMs are not trained on the "embeddings" of the vision model, per se. CLIP processes the images as N tokens across L layers. At the end, you have N embeddings. For traditional CLIP, the last (or first) embedding is used as the result. Modern CLIPs average the embeddings together. Tomato, tomato.

VLMs are not trained on that single embedding from CLIP. The "head" gets stripped off, and the VLMs get trained on all N processed tokens from CLIP. So they have access to much more information. The vision models also get finetuned during the training of the VLM, and, importantly, CLIP architectures use skip connections throughout. So there is a direct path for the LLM to access pretty much anything from the vision model that it needs, and to optimize for any information it needs.

The size of the embedded information given to the LLM, then, is almost about the same as the number of pixels in the source image. For example, it might be something like a 384x384x3 image (442,368 dimensions) getting baked down into something like a 150,000-dimensional vector. So it's really not a fundamentally lossy process at that point.
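On the "turn off sampling / output the logits" point above, a small sketch of what that looks like with Hugging Face transformers (gpt2 is just a stand-in model so the example stays runnable; the same generate flags apply to a VLM's language head):

    # Deterministic decoding plus per-token probabilities as rough confidence scores.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")            # stand-in model for illustration
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tok("Transcribe the document exactly:", return_tensors="pt")
    out = model.generate(
        **inputs,
        do_sample=False,               # greedy decoding -> deterministic output
        max_new_tokens=20,
        output_scores=True,            # keep the per-step logits
        return_dict_in_generate=True,
        pad_token_id=tok.eos_token_id,
    )
    text = tok.decode(out.sequences[0], skip_special_tokens=True)
    # Probability of each chosen token, usable as a rough per-token confidence.
    probs = [torch.softmax(s, dim=-1).max().item() for s in out.scores]
    print(text)
    print(probs)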
2-3-7-43-1807 | 3 months ago
I don't understand. What do LLMs have to do with OCR?
8338550bff96 | 3 months ago
February 6, 2024... okay grandpa
rhavaei | 3 months ago
Very nice blog post.
codingwagie | 3 months ago
Really? I have been using 4o, and it's flawless at OCR.