I've developed a Python API service that uses GPT-4o for OCR on PDFs. It features parallel processing and batch handling for improved performance. Not only does it convert PDF to markdown, but it also describes the images within the PDF using captions like `[Image: This picture shows 4 people waving]`.<p>In testing with NASA's Apollo 17 flight documents, it successfully converted complex, multi-oriented pages into well-structured Markdown.<p>The project is open-source and available on GitHub. Feedback is welcome.
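For anyone curious what the parallel part typically looks like, here's a minimal sketch of the idea (my own, not taken from the repo; `ocr_fn` stands in for whatever wraps the GPT-4o vision call and the names are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def ocr_pages(pages, ocr_fn, max_workers=4):
    """Run ocr_fn over rendered page images concurrently, preserving page order.

    pages       -- list of per-page inputs (e.g. PNG bytes)
    ocr_fn      -- callable that turns one page into markdown
    max_workers -- cap on concurrent API calls, to respect rate limits
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in input order, even if pages finish out of order
        return list(pool.map(ocr_fn, pages))
```

Order preservation matters here because the per-page markdown gets concatenated back into one document.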
While this is a nice project, parsing documents with LLMs is quite risky.
With traditional OCR you get confidence scores and bounding boxes to check against; with LLMs, you just get black-box output.<p>As others mentioned, consistency is key in document parsing, and consistency is not a feature of LLMs.<p>The output might look plausible, but without proper validation this is just a nice local playground that can't make it to production.
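To make that concrete: even cheap guardrails beat nothing. A sketch of the kind of check I mean (the threshold and spot-check words are made up, and this proves nothing about character-level accuracy, it only rejects obviously broken output):

```python
def sanity_check(markdown: str, spot_words: list[str], min_chars: int = 50) -> bool:
    """Reject obviously broken LLM OCR output.

    Catches empty or truncated pages, and output that drifted from the
    source (missing words we know appear on the page). It cannot prove
    the output is correct -- only that it isn't obviously wrong.
    """
    if len(markdown.strip()) < min_chars:
        return False
    return all(word in markdown for word in spot_words)
```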
I had been using GPT-4o to extract insights from scanned docs, and it was doing fine. But very recently (since they launched the new model, o1), it stopped working: GPT-4o refuses to extract text from images and says it can't do it, even though it handled the same task with the same prompts until last week. I'm not sure whether this is an intentional downgrade tied to the new model launch, but it's really frustrating. I cancelled my GPT-4 premium and moved to Claude. It works well.
Parsing docs with large vision models is the way forward (see also the OCR-2.0 paper released last week; people are having a lot of success parsing with a fine-tuned Qwen2).<p>The hard part is preventing the model from ignoring parts of the page and from hallucinating (see some of the GPT-4o samples here, like the Xanax notice: <a href="https://www.llamaindex.ai/blog/introducing-llamaparse-premium" rel="nofollow">https://www.llamaindex.ai/blog/introducing-llamaparse-premiu...</a>)<p>These models will keep getting better, though, and we may soon have a good PDF-to-Markdown model.
There is also LLMWhisperer, a document pre-processor built specifically for LLM consumption.<p>As others mentioned, accuracy is only one part of the selection criteria; others include how the preprocessing engine performs at large scale, and how it handles very complex documents like bank loan forms with checkboxes, IRS tax forms with multi-layered nested tables, etc.<p><a href="https://unstract.com/llmwhisperer/" rel="nofollow">https://unstract.com/llmwhisperer/</a><p>LLMWhisperer is part of Unstract, an open-source tool for unstructured document ETL.<p><a href="https://github.com/Zipstack/unstract">https://github.com/Zipstack/unstract</a>
Zerox [0] was featured on here recently and does the exact same thing<p>[0] <a href="https://github.com/getomni-ai/zerox">https://github.com/getomni-ai/zerox</a>
I have not found any mention of accuracy. Since it's using an LLM, how accurate is the conversion? Does the converted NASA document match the PDF 100%, or did it introduce made-up content (hallucinations)?<p>That converted NASA doc should be included in the repo and linked in the readme if it isn't already.
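One cheap way to put a number on this, assuming you have ground-truth text for a page (from a born-digital PDF or a hand-checked sample): a sequence-similarity score. It's coarse and won't localize hallucinations, but it flags pages worth eyeballing:

```python
import difflib

def fidelity(reference: str, llm_output: str) -> float:
    """0.0-1.0 similarity between known-good text and the LLM conversion.

    SequenceMatcher's ratio is order-sensitive, so both hallucinated
    insertions and dropped passages pull the score down.
    """
    return difflib.SequenceMatcher(None, reference, llm_output).ratio()
```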
I used GPT-4o to convert heavily convoluted PDFs into CSV files. The files were Florida Lottery Pick(n) histories, which they deliberately convolute to prevent automated searching; Ctrl-F does nothing, and a fsck-ton of special characters embellish the whole file.<p>I had previously done this manually with regex, and was surprised by the quality of GPT's end results, despite many failed iterations along the way. The work was done in two steps: first with pdf2text, then Python.<p>I'm still trying to create a script that extracts the latest numbers from the FL website and appends them to a CSV list, without re-running the stripping script on the whole PDF every time. Why? I want people to be able to freely search the entire history of winning numbers, which the web-hosted search function limits to only two of 30+ years.<p>I know there's a more efficient method, but I don't know more than that.
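If it helps, the pdf2text-then-Python step can be as small as one regex over the extracted text. The line format below is hypothetical (I don't have the FL PDF in front of me), so the pattern would need adjusting to the real layout:

```python
import csv
import io
import re

# Hypothetical post-pdf2text line, e.g. "09/28/24  12-34-56"
ROW = re.compile(r"(\d{2}/\d{2}/\d{2})\s+(\d{2}(?:-\d{2})+)")

def draws_to_csv(text: str) -> str:
    """Pull (date, numbers) pairs out of pdf2text output and emit CSV."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["date", "numbers"])
    for match in ROW.finditer(text):
        writer.writerow([match.group(1), match.group(2)])
    return buf.getvalue()
```

For the append-only update, the same function run over just the newest page's text, with rows de-duplicated by date before appending, avoids re-stripping the whole PDF.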
GPT-4o doesn't do <i>actual</i> OCR, and there are much smaller and more effective models for this specific problem.<p>I appreciate your work, your intent, and your sharing it. It's very important to understand what you're building and its context when you share it.<p>At that point, you are responsible for it, and the choices you make when communicating about it reflect on you.
This is handy. One thing I've noticed using 3.5 Sonnet: tables that aren't in the correct orientation are more prone to incorrect output.<p>I know this was an issue when GPT-4 Vision initially came out, due to training data; not sure if it's a solved problem now or if your tool handles it.
Was just looking for something like this. Does it convert equations to LaTeX or similar? How about rotated tables, i.e. landscape-mode tables on a page that is still portrait?