I've developed a Python API service that uses GPT-4o for OCR on PDFs. It features parallel processing and batch handling for improved performance. Not only does it convert PDF to markdown, but it also describes the images within the PDF using captions like `[Image: This picture shows 4 people waving]`.<p>In testing with NASA's Apollo 17 flight documents, it successfully converted complex, multi-oriented pages into well-structured Markdown.<p>The project is open-source and available on GitHub. Feedback is welcome.
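For anyone curious what the parallel part typically looks like, here's a minimal sketch of the idea (my own, not taken from the repo; `ocr_fn` stands in for whatever wraps the GPT-4o vision call and the names are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def ocr_pages(pages, ocr_fn, max_workers=4):
    """Run ocr_fn over rendered page images concurrently, preserving page order.

    pages       -- list of per-page inputs (e.g. PNG bytes)
    ocr_fn      -- callable that turns one page into markdown
    max_workers -- cap on concurrent API calls, to respect rate limits
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in input order, even if pages finish out of order
        return list(pool.map(ocr_fn, pages))
```

Order preservation matters here because the per-page markdown gets concatenated back into one document.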
While this is a nice project, parsing documents with LLMs is quite risky.
With traditional OCR you get confidence scores and bounding boxes to check against; with LLMs, you just get black-box output.<p>As others mentioned, consistency is key in document parsing, and consistency is not a feature of LLMs.<p>The output might look plausible, but without proper validation this is just a nice local playground that can't make it to production.
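To make that concrete: even cheap guardrails beat nothing. A sketch of the kind of check I mean (the threshold and spot-check words are made up, and this proves nothing about character-level accuracy, it only rejects obviously broken output):

```python
def sanity_check(markdown: str, spot_words: list[str], min_chars: int = 50) -> bool:
    """Reject obviously broken LLM OCR output.

    Catches empty or truncated pages, and output that drifted from the
    source (missing words we know appear on the page). It cannot prove
    the output is correct -- only that it isn't obviously wrong.
    """
    if len(markdown.strip()) < min_chars:
        return False
    return all(word in markdown for word in spot_words)
```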
I had been using GPT-4o to extract insights from scanned docs, and it was doing fine. But very recently (since they launched the new model, o1), it stopped working: GPT-4o refuses to extract text from images and says it can't do it, even though it handled the same task with the same prompts until last week. I'm not sure whether this is an intentional downgrade tied to the new model launch, but it's really frustrating. I cancelled my GPT-4 premium and moved to Claude. It works well.
Parsing docs with large vision models is the way forward (see also the OCR-2.0 paper released last week; people are having a lot of success parsing with a fine-tuned Qwen2).<p>The hard part is preventing the model from ignoring parts of the page and from hallucinating (see some of the GPT-4o samples here, like the Xanax notice: <a href="https://www.llamaindex.ai/blog/introducing-llamaparse-premium" rel="nofollow">https://www.llamaindex.ai/blog/introducing-llamaparse-premiu...</a>)<p>These models will keep getting better, though, and we may soon have a good PDF-to-Markdown model.
There is also LLMWhisperer, a document pre-processor built specifically for LLM consumption.<p>As others mentioned, accuracy is only one part of the selection criteria; others include how the preprocessing engine performs at large scale, and how it handles very complex documents like bank loan forms with checkboxes, IRS tax forms with multi-layered nested tables, etc.<p><a href="https://unstract.com/llmwhisperer/" rel="nofollow">https://unstract.com/llmwhisperer/</a><p>LLMWhisperer is part of Unstract, an open-source tool for unstructured document ETL.<p><a href="https://github.com/Zipstack/unstract">https://github.com/Zipstack/unstract</a>
Zerox [0] was featured on here recently and does the exact same thing<p>[0] <a href="https://github.com/getomni-ai/zerox">https://github.com/getomni-ai/zerox</a>
I have not found any mention of accuracy. Since it's using an LLM, how accurate is the conversion? Does the converted NASA document match the PDF 100%, or did it introduce made-up content (hallucinations)?<p>That converted NASA doc should be included in the repo and linked in the readme if it isn't already.
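One cheap way to put a number on this, assuming you have ground-truth text for a page (from a born-digital PDF or a hand-checked sample): a sequence-similarity score. It's coarse and won't localize hallucinations, but it flags pages worth eyeballing:

```python
import difflib

def fidelity(reference: str, llm_output: str) -> float:
    """0.0-1.0 similarity between known-good text and the LLM conversion.

    SequenceMatcher's ratio is order-sensitive, so both hallucinated
    insertions and dropped passages pull the score down.
    """
    return difflib.SequenceMatcher(None, reference, llm_output).ratio()
```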
I used GPT-4o to convert heavily convoluted PDFs into CSV files. The files were Florida Lottery Pick(n) histories, which they deliberately convolute to prevent automated searching; Ctrl-F does nothing, and a fsck-ton of special characters embellish the whole file.<p>I had previously done this manually with regex, and was surprised by the quality of GPT's end results, despite many failed iterations along the way. The work was done in two steps: first with pdf2text, then Python.<p>I'm still trying to create a script that extracts the latest numbers from the FL website and appends them to a CSV list, without re-running the stripping script on the whole PDF every time. Why? I want people to be able to freely search the entire history of winning numbers, which the web-hosted search function limits to only two of 30+ years.<p>I know there's a more efficient method, but I don't know more than that.
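If it helps, the pdf2text-then-Python step can be as small as one regex over the extracted text. The line format below is hypothetical (I don't have the FL PDF in front of me), so the pattern would need adjusting to the real layout:

```python
import csv
import io
import re

# Hypothetical post-pdf2text line, e.g. "09/28/24  12-34-56"
ROW = re.compile(r"(\d{2}/\d{2}/\d{2})\s+(\d{2}(?:-\d{2})+)")

def draws_to_csv(text: str) -> str:
    """Pull (date, numbers) pairs out of pdf2text output and emit CSV."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["date", "numbers"])
    for match in ROW.finditer(text):
        writer.writerow([match.group(1), match.group(2)])
    return buf.getvalue()
```

For the append-only update, the same function run over just the newest page's text, with rows de-duplicated by date before appending, avoids re-stripping the whole PDF.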
GPT-4o doesn't do <i>actual</i> OCR, and there are much smaller and more effective models for this specific problem.<p>I appreciate your work, your intent, and your sharing it. It's very important to understand what you're building and its context when you share it.<p>At that point, you are responsible for it, and the choices you make when communicating about it reflect on you.
This is handy. One thing I've noticed using 3.5 Sonnet: tables that aren't in the correct orientation are more prone to incorrect output.<p>I know this was an issue when GPT-4 Vision initially came out, due to training data; not sure if it's a solved problem now or if your tool handles it.
Was just looking for something like this. Does it convert equations to LaTeX or similar? How about rotated tables, i.e. landscape-mode tables on a page that is still portrait?