
Ingesting PDFs and why Gemini 2.0 changes everything

1303 points by serjester, 3 months ago

97 comments

lazypenguin 3 months ago
I work in fintech and we replaced an OCR vendor with Gemini at work for ingesting some PDFs. After trial and error with different models, Gemini won because it was so darn easy to use and it worked with minimal effort. I think one shouldn't underestimate a multi-modal, large-context-window model in terms of ease of use. Ironically, this vendor is the best-known and most successful vendor for OCR'ing this specific type of PDF, but many of our requests failed over to their human-in-the-loop process. Despite it not being their specialization, switching to Gemini was a no-brainer after our testing. Processing time went from something like 12 minutes on average to 6 seconds on average, accuracy was about 96% of the vendor's, and the price was significantly cheaper. A lot of the 4% inaccuracies are things like the text "LLC", handwritten, getting OCR'd as "IIC", which I would say is somewhat "fair". We could probably improve our prompt to clean up this data even further. Our prompt is currently very simple: "OCR this PDF into this format as specified by this JSON schema", and it didn't require any fancy "prompt engineering" to contort out a result.

Gemini's developer experience was stupidly easy. Easy to add a file "part" to a prompt. Easy to focus on the main problem thanks to the weirdly high context window. Multi-modal, so it handles a lot of issues for you (PDF as image vs. PDF with data), etc. I can recommend it for the use case presented in this blog (ignoring the bounding-boxes part)!
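
A minimal sketch of what this kind of call can look like with the google-generativeai Python SDK (the model name, file name, and schema here are illustrative assumptions, not the commenter's actual setup):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash")

# Illustrative schema; a real one would mirror the target record format.
schema = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string"},
        "total_amount": {"type": "number"},
        "line_items": {"type": "array", "items": {"type": "string"}},
    },
}

pdf = genai.upload_file("document.pdf")  # the file becomes a "part" of the prompt
response = model.generate_content(
    [pdf, f"OCR this PDF into this format as specified by this JSON schema: {schema}"]
)
print(response.text)
```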

llm_trw 3 months ago
This is using exactly the wrong tools at every stage of the OCR pipeline, and the cost is astronomical as a result.

You don't use multimodal models to extract a wall of text from an image. They hallucinate constantly the second you get past perfect, 100% high-fidelity images.

You use an object detection model trained on documents to find the bounding boxes of each document section as _images_; each bounding box comes with a confidence score for free.

You then feed each box of text to a regular OCR model, which also gives you a confidence score along with each prediction it makes.

You feed each image box into a multimodal model to describe what the image is about.

For tables, use a specialist model that does nothing but extract tables, like GridFormer, that isn't hyped to hell and back.

You then stitch everything together in an XML file, because Markdown is for human consumption.

You now have everything extracted with flat XML markup for each category the object detection model knows about, along with multiple types of probability metadata for each bounding box, each letter, and each table cell.

You can now start feeding this data programmatically into an LLM to do _text_ processing, where you use the XML to control what parts of the document you send to the LLM.

You then get chunking with location data and confidence scores for every part of the document to put as metadata into the RAG store.

I've built a system that reads 500k pages _per day_ using the above, completely locally, on a machine that cost $20k.
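
As a rough sketch, the staged pipeline described above might be stitched together like this (detect_layout, run_ocr, describe_image, and extract_table are hypothetical placeholders for the dedicated models named in the comment, not a real library's API):

```python
from xml.etree import ElementTree as ET

def process_page(page_image):
    root = ET.Element("page")
    # Layout detection returns typed boxes, each with a confidence score.
    for box in detect_layout(page_image):              # hypothetical document detector
        crop = page_image.crop(box.xyxy)
        if box.label == "text":
            text, conf = run_ocr(crop)                 # hypothetical OCR model
            el = ET.SubElement(root, "text", bbox=str(box.xyxy), conf=f"{conf:.2f}")
            el.text = text
        elif box.label == "table":
            el = ET.SubElement(root, "table", bbox=str(box.xyxy))
            el.append(extract_table(crop))             # hypothetical GridFormer-style model
        elif box.label == "figure":
            el = ET.SubElement(root, "figure", bbox=str(box.xyxy))
            el.text = describe_image(crop)             # hypothetical multimodal captioner
    return root  # flat XML with per-box metadata, ready for downstream LLM text processing
```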

twelve40 3 months ago
Isn't it amazing how one company invented a universally spread format that takes structured data from an editor (except images, obviously) and converts it into a completely fucked-up unstructured form that then requires expensive voodoo magic to convert back into structured data.

silverliver 3 months ago
We are driving full speed into a Xerox 2.0 moment, and this time we are doing so knowingly. At least with Xerox, the errors were out of place and easy for a human to detect. I wonder how many innocent people will lose their lives or be falsely incarcerated because of this. I wonder if we will adapt our systems and procedures to account for hallucinations and "85%" accuracy.

And no, outlawing the use of AI or increasing liability for its use will do next to nothing to deter its misuse, and everyone knows it. My heart goes out to the remaining 15%.

freezed8 3 months ago
(Disclaimer: I am CEO of LlamaIndex, which includes LlamaParse.)

Nice article! We're actively benchmarking Gemini 2.0 right now, and if the results are as good as implied by this article, heck, we'll adapt and improve upon it. Our goal (and in fact the reason our parser works so well) is to always use and stay on top of the latest SOTA models and tech :) - we blend LLM/VLM tech with best-in-class heuristic techniques.

Some quick notes:

1. I'm glad that LlamaParse is mentioned in the article, but it's not mentioned in the performance benchmarks. I'm pretty confident that our most accurate modes are at the top of the benchmark table - our stuff is pretty good.

2. There's a *long* tail of issues beyond just tables - this includes fonts, headers/footers, the ability to recognize charts/images/form fields, and, as other posters said, the ability to have fine-grained bounding boxes on the source elements. We've optimized our parser to tackle all of these modes, and we need proper benchmarks for that.

3. DIY'ing your own pipeline to run a VLM at scale to parse docs is surprisingly challenging. You need to orchestrate a robust system that can screenshot a bunch of pages at the right resolution (which can be quite slow), tune the prompts, and make sure you're obeying rate limits and can retry on failure.
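
A bare-bones version of the orchestration loop described in point 3 might look like this (a sketch; the DPI, retry policy, and use of pdf2image are assumptions, not LlamaParse's implementation):

```python
import time
from pdf2image import convert_from_path

def parse_pdf(path, model, prompt, dpi=200, max_retries=5):
    pages = convert_from_path(path, dpi=dpi)   # rendering pages is often the slow step
    results = []
    for page in pages:
        for attempt in range(max_retries):
            try:
                results.append(model.generate_content([page, prompt]).text)
                break
            except Exception:                  # e.g. 429 rate-limit errors
                time.sleep(2 ** attempt)       # exponential backoff, then retry
    return results
```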

rjurney 3 months ago
I've been using NotebookLM, powered by Gemini 2.0, for three projects, and it is _really powerful_ for comprehending large corpuses you can't possibly read and for thinking informed by all your sources. It has solid Q&A. When you ask a question or get a summary you like (which often happens), you can save it as a new note, putting it into the corpus for analysis. In this way your conclusions snowball. Yes, this experience actually happens, and it is beautiful.

I've tried Adobe Acrobat AI for this and it doesn't work yet. NotebookLM is it. The grounding is the reason it works: you can easily click on anything and it will take you to the source to verify it. My only gripe is that the visual display of the source material is _dogshit ugly_, like exceptionally so. Big pink background letters in lines of 24 characters! :) It has trouble displaying PDF columns, but at least it parses them. The ugly will change, I'm sure :)

My projects are set up to let me bridge the gaps between the various sources and synthesize something more. It helps to have a goal and organize your sources around that. If you aren't focused, it gets confused. You lay the groundwork in sources and it helps you reason. It works so well I feel _tender_ towards it :) Survey papers provide background, then you add specific sources in your area of focus. You can write a profile for how you would like NotebookLM to think, which REALLY helps out.

They are:

* The Stratigrapher - a Lovecraftian short story about the world's first city. Sources: all of Seton Lloyd/Fuad Safar's work on Eridu; various sources on Sumerian culture and religion; all of Lovecraft's work and letters; various sources about opium; some articles about nonlinear geometries.

* FPGA Accelerated Graph Analytics. Sources: an introduction to Verilog; papers on FPGAs and graph analytics; papers on Apache Spark architecture; papers on GraphFrames and a related rant I wrote about it and graph DBs; a source on Spark-RAPIDS; papers on subgraph matching, graphlets, and network motifs; papers on random graph models.

* A graph machine learning notebook without a specific goal, which has been less successful. It helps to have a goal for the project. It got confused by how broad my sources were.

I would LOVE to share my projects with you all, but you can only share within a Google Workspaces domain. It will be AWESOME when they open this thing up :)

anirudhb99 3 months ago
Thanks a ton for all the amazing feedback on this thread! If

(a) you have document understanding use cases that you'd like to use Gemini for (the more aspirational the better), and/or

(b) there are loss cases for which Gemini doesn't work well today,

please feel free to email anirudhbaddepu@google.com and we'd love to help get your use case working and improve quality for our next series of model updates!

rudolph9 3 months ago
We parse millions of PDFs using Apache Tika and process about 30,000 per dollar of compute cost. However, the structured output leaves something to be desired, and there are a significant number of pages that Tika is unable to parse.

https://tika.apache.org/

gapeslape 3 months ago
In my mind, Gemini 2.0 changes everything because of the incredibly long context (2M tokens on some models) combined with strong reasoning capabilities.

We are working on a compliance solution (https://fx-lex.com) and RAG just doesn't cut it for our use case. Legislation cannot be chunked if you want the model to reason well about it.

It's magical to be able to just throw everything into the model. And the best thing is that we automatically benefit from future model improvements along all performance axes.

galvin 3 months ago
Somewhat tangential, but the EU has a directive mandating electronic invoicing for public procurement.

One of the standards that has come out of that is EN 16931, also known as ZUGFeRD and Factur-X, which basically involves embedding an XML file with the invoice details inside a PDF/A. It allows the PDF to be used like a regular PDF, but it also allows government procurement platforms to reliably parse the contents without any kind of intelligence.

It seems like a nice solution that would solve a lot of issues with ingesting PDFs for accounting, if everyone somehow managed to agree on a standard. Maybe if EN 16931 becomes more broadly available, it might start getting used in the private sector too.

jbarrow 3 months ago
> Unfortunately Gemini really seems to struggle on this, and no matter how we tried prompting it, it would generate wildly inaccurate bounding boxes

Qwen2.5 VL was trained on a special HTML format for doing OCR with bounding boxes. [1] The resulting boxes aren't quite as accurate as something like Textract/Surya, but I've found they're much more accurate than Gemini or any other LLM.

[1] https://qwenlm.github.io/blog/qwen2.5-vl/

fngjdflmdflg 3 months ago
> Unfortunately Gemini really seems to struggle on this, and no matter how we tried prompting it, it would generate wildly inaccurate bounding boxes

This is what I have found as well. From what I've read, LLMs do not work well with images for specific details due to image encoders that are too lossy. (No idea if this is actually correct.) For now, I guess you can use regular OCR to get bounding boxes.

kbyatnal 3 months ago
It's clear that OCR & document parsing are going to be swallowed up by these multimodal models. The best representation of a document at the end of the day is an image.

I founded a doc processing company [1], and in our experience, a lot of the difficulty with deploying document processing into production comes when accuracy requirements are high (> 97%). This is because OCR and parsing are only one part of the problem, and real-world use cases need to bridge the gap between raw outputs and production-ready data.

This requires things like:

- state-of-the-art parsing powered by VLMs and OCR

- multi-step extraction powered by semantic chunking, bounding boxes, and citations

- processing modes for document parsing, classification, extraction, and splitting (e.g. long documents, or multi-document packages)

- tooling that lets nontechnical members quickly iterate, review results, and improve accuracy

- evaluation and benchmarking tools

- fine-tuning pipelines that turn reviewed corrections into custom models

Very excited to test and benchmark Gemini 2.0 in our product, and very excited about the progress here.

[1] https://extend.app/

__jl__ 3 months ago
The numbers in the blog post seem VERY inaccurate.

Quick calculation. Input pricing: image input in 2.0 Flash is $0.0001935 (let's ignore the prompt). Output pricing: let's assume 500 tokens per page, which is $0.0003.

Cost per page: $0.0004935.

That means 2,026 pages per dollar. Not 6,000!

Might still be cheaper than many solutions, but I don't see where these numbers are coming from.

By the way, image input is much more expensive in Gemini 2.0, even for 2.0 Flash Lite.

Edit: The post says batch pricing, which would be about 4k pages based on my calculation. Using batch pricing is pretty different, though. Great if feasible, but not practical in many contexts.
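
The arithmetic, spelled out (prices as assumed in the comment; the 50% batch discount is an additional assumption):

```python
input_per_page = 0.0001935           # assumed image-input price per page, Gemini 2.0 Flash
output_per_page = 500 * 0.0000006    # ~500 output tokens/page at an assumed $0.60 per 1M tokens
cost_per_page = input_per_page + output_per_page   # $0.0004935

print(round(1 / cost_per_page))   # ~2026 pages per dollar at list price
print(round(2 / cost_per_page))   # ~4052 pages per dollar if batch pricing halves the cost
```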

GGByron 3 months ago
I've not followed the literature very closely for some time - what problem are they trying to solve in the first place? They write "for documents to be effectively used in RAG pipelines, they must be split into smaller, semantically meaningful chunks". Segmenting each page by paragraphs doesn't seem like a particularly hard vision problem, nor do I see why an OCR system would need to incorporate an LLM (which seems more like a demonstration of overfitting than a "language model" in any literal sense, going by ChatGPT). Perhaps I'm just out of the loop.

Finally, I must point out that statements in the vein of "Why [product] 2.0 Changes Everything" are more often than not a load of humbug.

beklein 3 months ago
Great article. I couldn't find any details about the prompt, only the snippets of the `CHUNKING_PROMPT` and the `GET_NODE_BOUNDING_BOXES_PROMPT`.

Is there any code example with a full prompt available from OP, or are there any references (such as similar GitHub repos) for those looking to get started with this topic?

Your insights would be highly appreciated.

pmarreck 3 months ago
I have some out-of-print books that I want to convert into nice, reference-quality PDFs/EPUBs.

1) I don't mind destroying the binding to get the best quality. How do I do so?

2) I have a multipage double-sided scanner (Fujitsu ScanSnap). Would this be sufficient for the scanning portion?

3) Is there anything that determines the font of the book text and reproduces it somehow, and that deals with things like bold and italic and applies that as, say, Markdown output?

4) How do you de-paginate the raw text to reflow into (say) an EPUB format that will paginate based on the output device specification?

roywashere 3 months ago
I think it is very ironic that we chose PDF in many fields to archive data because it is a standard and because we would be able to open our PDF documents in 50 or 100 years' time. So here we are, just a couple of years later, already facing the challenge of getting the data out of our stupid PDF documents!

ChrisArchitect 3 months ago
Related: *Gemini 2.0 is now available to everyone*

https://news.ycombinator.com/item?id=42950454

rjcrystal 3 months ago
I work in the healthcare domain. We've had great success converting printed lab reports (95%) to JSON format using the 1.5 Flash model. This post is really exciting for me; I will definitely try out the 2.0 models.

The struggle which almost every OCR use case faces is with handwritten documents (doctor prescriptions with bad handwriting). With Gemini 1.5 Flash we've had ~75-80% accuracy (based on random sampling by pharmacists). We're planning to improve this further by fine-tuning Gemini models with medical data.

What could be other alternative services/models for accurate handwriting OCR?

erulabs 3 months ago
Hrm, I've been using a combo of Textract (for bounding boxes) and AI for understanding the contents of the document. Textract is excellent at bounding boxes and exact-text capture, but LLMs are excellent at understanding when a messy/ugly bit of a form is actually one question, or whether there are duplicate questions, etc.

Correlating the two outputs (Textract <-> AI) is difficult, but another round of AI is usually good at that. Combined with some text-difference scoring and logic, I can get pretty good full-document understanding of questions and answer locations. I've spent a pretty absurd amount of time on this and as of yet have not launched a product with it, but if anyone is interested I'd love to chat about the pipeline!

Havoc 3 months ago
Been toying with the Flash model. Not the top model, but I think it'll see plenty of use due to the details; it wins on things other than topping the benchmark leaderboards:

* Generous free tier

* Huge context window

* Lite version feels basically instant

However:

* The Lite model seems more prone to repeating itself / looping

* Very confusing naming, e.g. {model}-latest worked for 1.5, but now it's {model}-001? The Lite has a date appended, the non-Lite does not. Then there is exp and thinking-exp... which has a date. Wut?

anonu 3 months ago
Ingesting PDFs accurately is a noble goal, which will no doubt be solved as LLMs get better. However, I need to point out that the financial statement example used in the article already has a solution: iXBRL.

Many financial regulators require you to publish heavily marked-up statements with iXBRL. These markups reveal nuances in the numbers that OCR'ing a post-processed table will not understand.

Of course, financial documents are a narrow subset of the problem.

Maybe the problem is with PDF as a format: unfortunately, PDFs lose that meta information when they are built from source documents.

I can't help but feel that PDFs could probably be more portable, as their acronym indicates.

xnx 3 months ago
Glad Gemini is getting some attention. Using it is like a superpower. There are so many discussions about ChatGPT, Claude, DeepSeek, Llama, etc. that don't even mention Gemini.

tomasello77 3 months ago
I tried using Gemini 2.0 Flash for PDF-to-Markdown parsing of scientific papers after having good results with GPT-4o, but the experience was terrible.

When I sent images of PDF pages with extracted text, Gemini mixed headlines with body text, parsed tables incorrectly, and sometimes split tables, placing one part at the top of the page and the rest at the bottom. It also added random numbers (like inserting an "8" for no reason).

When using the Gemini SDK to process full PDFs, Gemini 1.5 could handle them, but Gemini 2.0 only processed the first page. Worse, both versions completely ignored tables.

Among the Gemini models, 1.5 Pro performed the best, reaching about 80% of GPT-4o's accuracy with image parsing, but it still introduced numerous small errors.

In conclusion, no Gemini model is reliable for PDF-to-Markdown parsing, and beyond the hype, I still need to use GPT-4o.

oedemis 3 months ago
There is also https://ds4sd.github.io/docling/ from IBM Research, which is MIT-licensed and tracks bounding boxes in a rich JSON format.

nsmurali 3 months ago
I have seen no decent program that can read, OCR, analyze, and tabulate data correctly from very large PDF files with a lot of scanned information from different sources. I run my practice with PDF files, one for each patient. It is a treasure trove of actionable data. PDF filing in this manner allows me to finish my daily tasks in 4 hours instead of 12! For sick patients who need information at the point of care, PDF has numerous advantages over the usual hospital EHR portals, etc. If any smart engineer(s) are interested in working with me, please connect with me.

bt3 3 months ago
One major takeaway that matches my own investigation is that Gemini 2.0 still materially struggles with bounding boxes on digital content. Google has published [1] some great material on spatial understanding and bounding boxes in photography, but identifying sections of text or digital graphics like icons in a presentation is still very hit-and-miss.

[1] https://github.com/google-gemini/cookbook/blob/a916686f95f43aaef200875ac7174082dbdc4e76/quickstarts/Spatial_understanding.ipynb

eviks 3 months ago
What would change "everything" is if we managed to switch to "real" digital parseable formats instead of this dead-tree emulation that buries all data created before the arrival of AI...

scottydelta 3 months ago
This is what I am trying to figure out how to solve. My problem statement is:

- Ingest PDFs, summarize, and extract important information.

- Have some way to overlay the extracted information on the PDF in the UI.

- Users can provide feedback on the overlaid info by accepting or rejecting the highlights as useful or not.

- This info goes back into the model for reinforcement learning.

Hoping to find something that can make this more manageable.

memhole 3 months ago
I've been very reluctant to use closed-source LLMs. This might actually convince me to use one. I've made so many attempts at PDF parsing over the years; it's awful to deal with. Two-column format, omg. Most don't realize that PDFs contain instructions for displaying the document, and the content is buried in there. It's just always been a problematic format.

So if it works, I'd be a fool not to use it.

minimalengineer 3 months ago
Two years ago, I worked for a company that had its own proprietary AI system for processing PDFs. While the system handled document ingestion, its real value was in extracting and analyzing data to provide various insights. However, one key requirement was rendering documents in HTML with as close to a 1:1 likeness as possible.

At the time, I evaluated multiple SDKs for both OCR and non-OCR PDF conversions, but none matched the accuracy of Adobe Acrobat's built-in solution. In fact, at one point (don't laugh), the company resorted to running Adobe Acrobat on a Windows machine with automation tools to handle the conversion. Using Adobe's cloud service for conversion was not an option due to the proprietary nature of the PDFs. Additionally, its results were inconsistent and often worse than those of the desktop version of Adobe Acrobat!

Given that experience, I see this primarily as an HTML/text conversion challenge. If Gemini 2.0 truly improves upon existing solutions, it would be interesting to see a direct comparison against popular proprietary tools in terms of accuracy.

diptanu 3 months ago
We started with using LLMs for parsing at Tensorlake (https://docs.tensorlake.ai), and tried Qwen, Gemini, OpenAI, pretty much everything under the sun. My thought was that we could skip the 5-6 years of development IDP companies have put into specialized models by going to LLMs.

On information-dense pages, LLMs often hallucinate half the time; they have trouble understanding empty cells in tables, don't understand checkboxes, etc.

We had to invest heavily in building a state-of-the-art layout understanding model, and finally a table structure understanding model, for reliability. LLMs will get there, but there is some way to go.

Where they do well is in VQA-type use cases: ask a question, very narrowly scoped, and they will work much better than OCR + layout models, because they are much more generalizable and flexible to use.

mehulashah 3 months ago
(Disclosure: CEO of Aryn (https://aryn.ai/) here.)

Good post. VLM models are improving, and Gemini 2.0 definitely changes the doc prep and ingestion pipeline across the board.

What we're finding as we work with enterprise customers:

1. Attribution is super important, and VLMs aren't there yet. Combining them with layout analysis makes for a winning combo.

2. VLMs are great at prompt-based extraction, but if you're doing document automation and you don't know where in tables you'll be searching, or you need to reproduce tables faithfully, then precise table extraction is important.

3. VLMs will continue to get better, but their price points are a result of economies of scale that document parsing vendors don't get. On the flip side, document parsing vendors have deployment models that Gemini can't reach.

cccybernetic 3 months ago
Shameless plug: I'm working on a startup in this space.

But the bounding box problem hits close to home. We've found Unstructured's API gives pretty accurate box coordinates, and with some tweaks you can make them even better. The tricky part is implementing those tweaks without burning a hole in your wallet.

amai 3 months ago
Better have a look at:

- https://mathpix.com/

- Docling: https://ds4sd.github.io/docling/

ThinkBeat 3 months ago
Hmm, I have been doing a bit of this manually lately for a personal project. I am working with some old books that are far past any copyright, but they are not available anywhere on the net. (Being in Norwegian makes a book a lot more obscure.) So I have been creating ebooks out of them.

I have a scanner and some OCR processes I run things through. I am close to 85% with my automatic process.

The pain of going from 85% to 99%, though, is considerable (and in my case manual; well, Perl helps).

I went to try this AI on one of the short poem manuscripts I have. I told the prompt I wanted PDF to Markdown; it said sure, go ahead, give me the PDF. I uploaded it. It spent a long time spinning, then a quick message came up, something like "Failed to count tokens", but it just flashed and went away.

I guess the PDF is too big? Weird though, it's not a lot of pages.

pbronez 3 months ago
I wonder how this compares to Docling. So far that's been the only tool that really unlocked PDFs for me. It's solid, but really annoying to install.

https://ds4sd.github.io/docling/

matthest 3 months ago
This is completely tangential, but does anyone know if AI is creating any new jobs?

Thinking of the OCR vendors who get replaced. Where might they go?

One thing I can think of is that AI could help the space industry take off. But I'm wondering if there are any concrete examples of new jobs being created.

mansourdaman 3 months ago
I've built a simple OCR tool with Gemini 2 Flash, with several options:

1. Simple OCR: extracts all detected text from uploaded files

2. Advanced OCR: enables rule-based extraction (e.g., table data)

3. Bulk OCR: designed for processing multiple files at once

The project will be open-source next week. You can try the tool here: https://gemini2flashocr.netlify.app

nickandbro 3 months ago
I think very soon a new model will destroy whatever startups and services are built around document ingestion. As in, a model that can take in a PDF page as an image and transcribe it to text with near-perfect accuracy.

BenGosub 3 months ago
They do not test LlamaParse on the accuracy benchmark. In my personal experience, LlamaParse was one of the rare tools that always got the right information. Also, the accuracy is only based on tables, and we had issues with irregular text structures as well. It is also worth noting that using an LLM, a non-deterministic tool, to do something deterministic is a bit risky, and you need to write, modify, and maintain a prompt.

uri_merhav 3 months ago
Gemini 2.0 Flash is impressive, but it hardly captures all of the information in the PDF. It's great for getting the vibes of a document or finding overall information in it. If you ask it to, e.g., enumerate every line item from multiple tables in a long PDF, it still falls flat (dropping some line items or entire sections, etc.). DocuPanda, and to a lesser extent Unstructured, handle this.

dwheeler 3 months ago
I wish more PDFs were generated as *hybrid* PDFs. These are PDFs that *also* include their original source material. Then you have a document whose format is fixed, but if you need more semantic information, there it is!

LibreOffice makes this especially easy to do: https://wiki.documentfoundation.org/Faq/Writer/PDF_Hybrid

daemonologist 3 months ago
I wonder how this compares to open-source models (which might be less accurate but even cheaper if self-hosted?), e.g. Llama 3.2. I'll see if I can run the benchmark.

Also, regarding the failure case in the footnote, I think Gemini actually got that right (or at least outperformed Reducto). The original document seems to have what I call a "3D" table, where the third axis is rows *within* each cell, and having multiple headers is probably the best approximation in Markdown.

rp36 3 months ago
I'm failing to understand the ingesting part of Gemini 2.0. Does Gemini provide a process to convert PDFs to Markdown via an API, or do the LLM APIs handle it with a prompt like "Extract the attached PDF", using this API: https://ai.google.dev/gemini-api/docs/document-processing?lang=node

zoogeny 3 months ago
Orthogonal to this post, but this just highlights the need for a more machine-readable PDF alternative.

I get the inertia of the whole world being on PDF. And perhaps we can just eat the cost and let LLMs bear the burden going forward. But why not use that LLM coding brainpower to create a better overall format?

I mean, do we really see printing things out onto paper as something we need to worry about for the next 100 years? It reminds me of the TTY interface at the heart of Linux. There was a time it all made sense, but can we just deprecate it all now?

siquick 3 months ago
Strange that LlamaParse is mentioned in the pricing table but not the results. We've used them to process a lot of pages and it's been excellent each time.

xena 3 months ago
I really wish that Google made an endpoint that's compatible with the OpenAI API. That'd make trying Gemini in existing flows so much easier.

jonesn11 3 months ago
OCR makes sense, but asking for a summary is another matter. It is not there yet; it gave a lot of incorrect details.

fecal_henge 3 months ago
Is there an AI platform where I can paste a snip of a graph and it will generate an nth-order polynomial regression of the trace for me?

kym6464 3 months ago
Re: the loss of bounding box information.

You can recover word-level bounding boxes and confidence scores by using a traditional OCR engine such as AWS Textract and matching the results to Gemini's output, as sketched below. See https://docless.app for a demo (disclaimer: I am the founder).
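
A minimal sketch of that matching idea, assuming word-level OCR output shaped like {"text", "bbox", "conf"} (this is not docless.app's actual implementation):

```python
import difflib

def attach_boxes(gemini_text, ocr_words):
    """Align LLM output tokens to OCR words so each inherits a box and confidence."""
    ocr_tokens = [w["text"] for w in ocr_words]
    llm_tokens = gemini_text.split()
    matcher = difflib.SequenceMatcher(a=ocr_tokens, b=llm_tokens, autojunk=False)
    aligned = []
    for i0, j0, size in matcher.get_matching_blocks():
        for k in range(size):
            w = ocr_words[i0 + k]
            aligned.append({"word": llm_tokens[j0 + k], "bbox": w["bbox"], "conf": w["conf"]})
    return aligned
```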

eichi 3 months ago
If this is vendor work, you should probably hire people who are competitive in the software engineering space. And do we actually need a significant amount of processing as a solution? If so, commonly used public PDFs, already converted to Markdown, should be open-sourced. We shouldn't repeat others' work.

That said, cheaper is better.

lyjackal 3 months ago
If the end goal is just RAG or search over the PDFs, it seems like ColPali-based embedding search would be a good alternative here. Don't process the PDFs; instead, just search their image embeddings directly. From what I understand, you also get a sort of attention map showing what part of the image is activated by the search.

an_aparallel 3 months ago
Has anyone in the AEC industry who's reading this worked out a good way to get Bluebeam MEP and electrical layouts into Revit (LOD 200-300)?

I've seen MarkupX as a paid option, but it seems some AI in the loop could greatly speed up exception handling, and encode family placement at certain elevations based on building-code docs...

airwaveai 3 months ago
Curious to see how well this works on technical/mechanical documentation (manuals, parts lists, etc.). Has anyone tried? My company, Airwave, had to jump through all sorts of hoops to get accurate information for our use case: getting accurate info to the technicians in the field.

ritvikpandey21 3 months ago
Ritvik here from Pulse. Everyone's pretty much made the right points here, but I wanted to emphasize that due to the LLM architecture, they predict "the most probable text string" that corresponds to the embedding, not necessarily the exact text. This non-determinism is awful for customers deploying in production, and a lot of our customers complained about this to us initially. The best approach is to build a sort-of "agent"-based combination of VLMs and traditional layout segmentation/reading-order algorithms, which is what we've done and are continuing to do.

We have a technical blog on this exact phenomenon coming out in the next couple of days; I'll attach it here when it's out!

Check us out at https://www.runpulse.com

bambax 3 months ago
I'm building a system that does regular OCR and outputs layout-following ASCII; in my admittedly limited tests it works better than most existing offerings.

It will be ready for beta testing this week or the next, and I will be looking for beta testers; if interested, please contact me!

devmor 3 months ago
I think this is one of the few functional applications of LLMs that is really, undeniably useful.

OCR has always been "untrustworthy" (in the sense that you cannot expect it to be 100% correct and must account for that), and we have long used ML algorithms for the process.

sergiotapia 3 months ago
The article mentions OCR, but you're sending a PDF; how is that OCR? Or is this a mistake? What if you send photos of the pages? That would be true OCR. Does the performance and price remain the same?

If so, this unlocks a massive workflow for us.

lacoolj 3 months ago
Anyone know if there are uses of this with PHI? Most doctors still fax reports to each other, and this would do a lot to reduce the load on staff when receiving and categorizing/assigning reports to patients.

andrewshadura 3 months ago
> Crucially, we've seen very few instances where specific numerical values are actually misread.

"Very few" is way too many. This means it cannot be trusted, especially when it comes to financial data.

jgleoj23 3 months ago
Gemini is amazing, but I get a copyright error for some documents, and I have a rate limit of just 10 requests per minute. Same issues with Claude, except the copyright error is called a content warning.

cedws 3 months ago
90% accuracy ± 10%? What could that be useful for? That's awfully low.

mateuszbuda 3 months ago
There's AWS Bedrock Knowledge Base (Amazon's proprietary RAG solution), which can digest PDFs and, as far as I've tested it on real-world documents, works pretty well and is cost-effective.

dasl 3 months ago
How does the Gemini OCR perform on non-English-language text?

jibuai 3 months ago
I've been working on something similar for the past couple of months. A few thoughts:

- A lot of natural chunk boundaries span multiple pages, so you need some 'sliding window' mechanism for the best accuracy (see the sketch after this list).

- Passing the entire document hurts throughput too much due to the quadratic complexity of attention. Outputs are also much worse when you use too much context.

- Bounding boxes can be solved by first generating boxes using traditional OCR / layout recognition, then passing that data to the LLM. The LLM can then link its outputs to the boxes. Unfortunately, getting this reliable required a custom sampler, so proprietary models like Gemini are out of the question.
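
A minimal sketch of the sliding-window idea from the first point, with illustrative window and overlap sizes (extract_chunks stands in for the actual LLM call):

```python
def windows(pages, size=8, overlap=2):
    """Yield overlapping page ranges so chunks straddling a boundary appear whole in one window."""
    step = size - overlap
    for start in range(0, max(len(pages) - overlap, 1), step):
        yield start, pages[start:start + size]

for start, batch in windows(page_images):   # page_images: your rendered pages
    chunks = extract_chunks(batch)          # hypothetical LLM extraction call
    # de-duplicate chunks already seen in the overlap region before persisting
```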

geckel 3 months ago
How is it for image recognition/classification? OCR can be a huge chunk of the image classification pipeline. Presumably, it works just as well in this domain?

hnuser435 3 months ago
Damn, I thought this was about the Gemini protocol.

https://geminiprotocol.net/

cubefox 3 months ago
Why is Gemini Flash so much cheaper than the other models here?

mansourdamanpak 3 months ago
I've built a simple OCR tool with Gemini 2 Flash. You can test it here: gemini2flashocr.netlify.app

jeswin 3 months ago
We've previously tried Sonnet in our PDF extraction pipelines. It was very, very accurate; GPT-4o did not come close. It's more expensive, however.

nottorp 3 months ago
Will 2.0.1 also change everything?

How about 2.0.2?

How about Llama 13.4.0.1?

This is tiring. It's always the end of the world when they release a new version of some LLM.

sensecall 3 months ago
This is super interesting.

Would this be suitable for ingesting and parsing wildly variable unstructured data into a structured schema?

applgo443 3 months ago
Why are traditional OCRs better in terms of hallucination and confidence scores?

Can we use the logprobs of an LLM as confidence scores?

grandimam 3 months ago
Would you recommend using these large models for parsing sensitive data, say bank statements and the like?

jwr 3 months ago
I wish I could do this locally. I don't feel comfortable uploading all of my private documents to Google.

aravart 3 months ago
Does anyone have some fleshed-out source code, prompts and all, to try this on Gemini 2.0?

KoolKat23 3 months ago
Okay, I just checked/tried this out with my own use case at work, and it's insane.

ady9999 3 months ago
We have been building smaller and more efficient VLMs for document extraction since way before this, and we are 10x faster than Unstructured and Reducto (the OCR vendors), with an accuracy of 90%.

P.S. You can find us at unsiloed-ai.com, or you can reach out to me at adnan.abbas@unsiloed-ai.com

iudqnolq 3 months ago
In what contexts is 0.84 ± 0.16 actually "nearly perfect"?

ratedgene 3 months ago
Is this something we can run locally? If so, what's the license?

otabdeveloper4 3 months ago
Well, probably not literally "everything".

seunosewa 3 months ago
He found the one thing that Gemini does better.

throw7381 3 months ago
For data extraction from long documents (100k+ tokens), how do structured outputs via a provided JSON schema compare to asking one question per field (in natural language)?

Also, I've been hearing good things about Gemini 1.5 Pro, 2.0 Flash, and gemini-exp-1206 (the new 2.0 Pro?) for document retrieval. Which is the best Gemini model for data extraction from 100k tokens?

How do they compare against Claude Sonnet 3.5 or the OpenAI models? Has anyone done any real-world tests?
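
For reference, the schema side of that comparison can be tried with the Gemini SDK's constrained JSON output; a sketch follows, where the field names and model choice are illustrative assumptions:

```python
import google.generativeai as genai
import typing_extensions as typing

class Extraction(typing.TypedDict):
    party_name: str
    effective_date: str
    termination_clause: str

model = genai.GenerativeModel("gemini-1.5-pro")
result = model.generate_content(
    [document_text, "Extract these fields from the document."],  # document_text: your 100k+ token input
    generation_config=genai.GenerationConfig(
        response_mime_type="application/json",  # constrain decoding to valid JSON
        response_schema=list[Extraction],
    ),
)
print(result.text)
```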

lifeisstillgood 3 months ago
Imagine there's no PostScript

It's easy if you try

No PDFs below us

Above us only SQL

Imagine all the people livin' for CSV

mchadda_chunkr 3 months ago
Hi all - CEO of chunkr.ai here.

The write-up and ensuing conversation are really exciting. I think out of everything mentioned here, the clear stand-out point is that document layout analysis (DLA) is the crux of the issue for building practical doc ingestion for RAG.

(Note: DLA is the process of identifying and bounding specific segments of a document - like section headers, tables, formulas, footnotes, captions, etc.)

Strap in - this is going to be a longy.

We see a lot of people and products basically sending complete pages to LVLMs for converting to a machine-readable format, and for chunking. We tried this, and it's a possible configuration on chunkr as well. It has never worked for our customers, or during extensive internal testing across documents from a variety of verticals. Here are SOME of the common problems:

- Most documents are dense. The model will not OCR everything and will miss crucial parts.

- A bunch of hallucinated content that's tough to catch.

- Occasionally it will just refuse to give you anything. We've tried a bunch of different prompting techniques and the models return "<image>" or "||..|.." for an ENTIRE PAGE of content.

Despite this, it's obvious that these ginormous neural nets are great for complex conversions like tables and formulas to HTML/Markdown and LaTeX. They also work great for describing images and converting charts to tables. But that's the thing: they can only do this if you can pull out these document features individually as cropped images and have the model focus on small snippets of the document rather than the full page.

If you want knobs for speed, quality, and cost, the best approach is to work at a segment level rather than a page level. This is where DLA really shines - the downstream processing options are vast and can be fit to specific needs. You can choose what to process with simple + fast OCR (text-only segments like headers, paragraphs, captions), and what to send to a large model like Gemini (complex segments like tables, formulas, and images) - all while getting juicy bounding boxes for mapping citations. Combine this with solid reading-order algorithms and you get amazing layout-aware chunking that takes ~10ms.

We made RAG apps ourselves and attempted to index all ~600 million pages of open-access research papers for https://lumina.sh. This is why we built Chunkr - and it needed to be open source. You can self-host our solution and process 4 pages per second, scaling up to 11 million pages per month on a single RTX 4090; renting this hardware on Runpod costs just $249/month ($0.34/hour).

A VLM to do DLA sounds awesome. We've played around with this idea but found that VLMs don't come close to models whose architecture is solely geared toward these specific object detection tasks. While it would simplify the pipeline, VLMs are significantly slower and more resource-hungry - they can't match the speed we achieve on consumer hardware with dedicated models. Nevertheless, the numerous advances in the field are very exciting - big if true!

A note on costs: there are some discrepancies between the API pricing of providers listed in this thread. Assuming 100,000 pages and feature parity:

- Chunkr API - 200 pages for $1, not 100 pages

- AWS Textract - 40 pages for $1, not 1,000 pages (no VLMs)

- LlamaParse - 13 pages for $1, not 300

A note on RD-Bench: we've been using Gemini 1.5 Pro for tables and other complex segments for a while, so the RD-Bench results are very outdated. We ran it again on a few hundred samples and got 0.81 (the repo also includes some notes on the bench itself). To the OP: it would be awesome if you could update your blog post!

https://github.com/lumina-ai-inc/chunkr-table-rdbench/tree/main

killer-Xbox 3 months ago
Hi

sho_hn 3 months ago
Remember all the hyperbole a year ago about how Google was failing and done for?

throwaway31412 3 months ago
fds

ein0p 3 months ago
> Why Gemini 2.0 Changes Everything

Clickbait. It doesn't change "everything". It makes ingestion for RAG much less expensive (and therefore feasible in a lot more scenarios), at the expense of a ~7% reduction in accuracy. Accuracy is rather poor even before this, however, with the top alternative clocking in at 0.9. Gemini 2.0 is at 0.84, although the author seems to suggest that the failure modes are mostly around formatting rather than, e.g., mis-recognition or hallucinations.

TL;DR: Is this exciting? If you do RAG, yes. Does it "change everything"? Nope. There's still a very long way to go. Protip for model designers: accuracy is always in greater demand than performance. A slow model that solves the problem is invariably better than a fast one that fucks everything up.

nothrowaways 3 months ago
Cool

pockmarked19 3 months ago
Now, I could look at this relatively popular post about Google and revise my opinion of HN as an echo chamber, but I'm afraid it's just that the downvote-loving HNers weren't able to make the cognitive leap from Gemini to Google.

raunakchowdhuri 3 months ago
CTO of Reducto here. Love this write-up!

We've generally found that Gemini 2.0 is a great model and have tested this (and nearly every VLM) very extensively.

A big part of our research focus is incorporating the best of what new VLMs offer without losing the benefits and reliability of traditional CV models. A simple example of this: we've found bounding-box-based attribution to be a non-negotiable for many of our current customers. Citing the specific region in a document where an answer came from becomes (in our opinion) even MORE important when using large vision models in the loop, as there is a continued risk of hallucination.

Whether that matters in your product is ultimately use-case dependent, but the more important challenge for us has been reliability of outputs. RD-TableBench currently uses a single table image on a page, but when testing with real-world dense pages we find that VLMs deviate more. Sometimes that involves minor edits (summarizing a sentence but preserving meaning), but sometimes it's a more serious case, such as hallucinating large sets of content.

The more extreme case: internally, we fine-tuned a version of Gemini 1.5, along with base Gemini 2.0, specifically for checkbox extraction. We found that even with a broad distribution of checkbox data we couldn't prevent frequent checkbox hallucination on both the Flash (+17% error rate) and Pro (+8% error rate) models. Our customers in industries like healthcare expect us to get it right, out of the box, deterministically, and our team's directive is to get as close as we can to that ideal state.

We think that the ideal state involves a combination of the two. The flexibility that VLMs provide, for example with cases like handwriting, is what I think will make it possible to go from 80 or 90 percent accuracy to some number very close to 99%. I should note that the Reducto performance for table extraction is with our pre-VLM table parsing pipeline, and we'll have more to share in terms of updates there soon. For now, our focus is entirely on the performance frontier (though we do scale costs down with volume). In the longer term, as inference becomes more efficient, we want to move the needle on cost as well.

Overall, though, I'm very excited about the progress here.

---

One small comment on your footnote: the evaluation script with the Needleman-Wunsch algorithm doesn't actually consider the headers outputted by the models; it looks only at the table structure itself.
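
For readers unfamiliar with it, Needleman-Wunsch is a global sequence alignment algorithm; a compact scoring version looks like this (illustrative match/gap penalties, not the benchmark's actual script):

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score between two sequences (e.g. rows of table cells)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * gap]
        for j, cb in enumerate(b, 1):
            cur.append(max(
                prev[j - 1] + (match if ca == cb else mismatch),  # align ca with cb
                prev[j] + gap,                                    # skip ca (gap in b)
                cur[-1] + gap,                                    # skip cb (gap in a)
            ))
        prev = cur
    return prev[-1]

print(nw_score(["a", "b", "c"], ["a", "c"]))  # 1 = two matches + one gap
```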

resource_waste 3 months ago
Google's models have historically been total disappointments compared to ChatGPT-4: worse quality, and they won't answer medical questions either.

I suppose I'll try it again, for the 4th or 5th time.

This time I'm not excited. I'm expecting it to be a letdown.

coderstartup 3 months ago
Following this post

exabrial 3 months ago
You know what'd be fucking nice? The ability to turn Gemini off.