
PDF to Text, a challenging problem

266 points by ingve about 16 hours ago

41 comments

90s_dev about 12 hours ago
Have any of you ever thought to yourself, this is new and interesting, and then vaguely remembered that you spent months or years becoming an expert at it earlier in life but entirely forgot it? And in fact large chunks of the very interesting things you've done just completely flew out of your mind long ago, to the point where you feel absolutely new at life, like you've accomplished relatively nothing, until something like this jars you out of that forgetfulness?

I definitely vaguely remember doing some incredibly cool things with PDFs and OCR about 6 or 7 years ago. Some project comes to mind... google tells me it was "tesseract" and that sounds familiar.
herodotus about 3 hours ago
This is mostly what I worked on for many years at Apple, with reasonable success. The main secret was to accept that everything was geometry, and to use cluster analysis to distinguish between word gaps and letter gaps. On many PDF documents it works really well, but there are so many different kinds of PDF documents that there are always cases where the results are not that great. If I were to do it today, I would stick with geometry, avoid OCR completely, but use machine learning. One big advantage of machine learning is that I could use existing tools to generate PDFs from known text, so that the training phase could be completely automatic. (Here is Bertrand Serlet announcing the feature at WWDC in 2009: https://youtu.be/FTfChHwGFf0?si=wNCfI9wZj1aj9rY7&t=308)
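[Editor's note: the gap-clustering idea described above can be sketched in a few lines. This is a toy illustration, not Apple's actual method: given the horizontal gaps between consecutive glyphs on a line, split them into two clusters (letter gaps vs. word gaps) with a simple 1-D 2-means, then insert spaces at gaps in the larger cluster. All names and the glyph input shape are assumptions for the sketch.]

```python
def split_gaps(gaps, iterations=20):
    """Cluster 1-D gap widths into two groups; return
    (letter_gap_centroid, word_gap_centroid, threshold)."""
    lo, hi = min(gaps), max(gaps)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        small = [g for g in gaps if g <= mid]
        large = [g for g in gaps if g > mid]
        if not small or not large:
            break  # all gaps fell in one cluster; keep current centroids
        lo = sum(small) / len(small)
        hi = sum(large) / len(large)
    return lo, hi, (lo + hi) / 2

def glyphs_to_words(glyphs):
    """glyphs: list of (x_start, x_end, char) in reading order.
    Rebuild the line, inserting spaces at word-sized gaps."""
    if len(glyphs) < 2:
        return "".join(g[2] for g in glyphs)
    gaps = [b[0] - a[1] for a, b in zip(glyphs, glyphs[1:])]
    _, _, threshold = split_gaps(gaps)
    out = [glyphs[0][2]]
    for gap, glyph in zip(gaps, glyphs[1:]):
        if gap > threshold:
            out.append(" ")
        out.append(glyph[2])
    return "".join(out)
```

On clean single-line input the two centroids separate quickly; real documents need per-font and per-size clustering, which is where the "so many kinds of PDFs" pain comes in.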
svat about 15 hours ago
One thing I wish someone would write is something like the browser's developer tools ("inspect element") for PDF — it would be great to be able to "view source" on a PDF's content streams (the BT … ET operators that enclose text, each Tj operator for setting down text in the currently chosen font, etc.), to see how every "pixel" of the PDF is being specified/generated. I know this goes against the current trend / state of the art of using vision models to basically "see" the PDF like a human and "read" the text, but it would be really nice to be able to actually understand what a PDF file contains.

There are a few tools that allow inspecting a PDF's contents (https://news.ycombinator.com/item?id=41379101) but they stop at the level of the PDF's objects, so entire content streams are single objects. For example, to use one of the PDFs mentioned in this post, the file https://bfi.uchicago.edu/wp-content/uploads/2022/06/BFI_WP_2022-68-1.pdf has, corresponding to page number 6 (PDF page 8), a content stream that starts like (some newlines added by me):

    0 g 0 G
    0 g 0 G
    BT
    /F19 10.9091 Tf 88.936 709.041 Td
    [(Subsequen)28(t)-374(to)-373(the)-373(p)-28(erio)-28(d)-373(analyzed)-373(in)-374(our)-373(study)83(,)-383(Bridge's)-373(paren)27(t)-373(compan)28(y)-373(Ne)-1(wGlob)-27(e)-374(reduced)]TJ
    -16.936 -21.922 Td
    [(the)-438(n)28(um)28(b)-28(er)-437(of)-438(priv)56(ate)-438(sc)28(ho)-28(ols)-438(op)-27(erated)-438(b)28(y)-438(Bridge)-437(from)-438(405)-437(to)-438(112,)-464(and)-437(launc)28(hed)-438(a)-437(new)-438(mo)-28(del)]TJ
    0 -21.923 Td

and it would be really cool to be able to see the above "source" and the rendered PDF side by side, hover over one to see the corresponding region of the other, etc., the way we can for an HTML page.
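[Editor's note: to make the quoted stream concrete, here is a toy decoder for those `[(...)kern(...)]TJ` show-text arrays. It is a sketch only: real extraction must map string bytes through the font's encoding/CMap, while this assumes plain ASCII literals and treats any kern more negative than -100 thousandths of an em as an inter-word space — both simplifying assumptions.]

```python
import re

# Match either a parenthesized PDF string literal (with escapes) or a kern number.
TOKEN = re.compile(r'\(((?:[^()\\]|\\.)*)\)|(-?\d+)')

def decode_tj(array_body, space_kern=-100):
    """Decode one [...]TJ array into plain text, inserting spaces
    at large negative kerns (TeX-produced PDFs encode word gaps this way)."""
    out = []
    for string, kern in TOKEN.findall(array_body):
        if string:
            out.append(string)
        elif kern and int(kern) < space_kern:
            out.append(" ")
    return "".join(out)
```

Run against the first array above, this yields "Subsequent to the period analyzed in our study, …" — which is exactly why naive extraction breaks: the word "period" exists nowhere as a contiguous string in the file.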
kbyatnal about 13 hours ago
"PDF to Text" is a bit simplified, IMO. There are actually a few classes of problems within this category:

1. reliable OCR from documents (to index for search, feed into a vector DB, etc.)

2. structured data extraction (pull out targeted values)

3. end-to-end document pipelines (e.g. automate mortgage applications)

Marginalia needs to solve problem #1 (OCR), which is luckily getting commoditized by the day thanks to models like Gemini Flash. I've now seen multiple companies replace their OCR pipelines with Flash for a fraction of the cost of previous solutions; it's really quite remarkable.

Problems #2 and #3 are much more tricky. There's still a large gap for businesses in going from raw OCR outputs -> document pipelines deployed in prod for mission-critical use cases. LLMs and VLMs aren't magic, and anyone who goes in expecting 100% automation is in for a surprise.

You still need to build and label datasets, orchestrate pipelines (classify -> split -> extract), detect uncertainty and correct with human-in-the-loop, fine-tune, and a lot more. You can certainly get close to full automation over time, but it's going to take time and effort. The future is definitely moving in this direction though.

Disclaimer: I started an LLM doc processing company to help companies solve problems in this space (https://extend.ai)
dwheeler about 14 hours ago
The better solution is to embed, in the PDF, the editable source document. This is easily done by LibreOffice. Embedding it takes very little space in general (because it compresses well), and then you have MUCH better information on what the text is and its meaning. It works just fine with existing PDF readers.
trevor-e about 5 hours ago
Having built some toy parsers for PDF files in the past, it was a huge wtf moment for me when I realized how the format works. With that in mind, it's even more puzzling how often it's used in text-heavy cases.

I always think about the invoicing use case: digital systems should be able to easily extract data from the file while it is still formatted visually for humans. It seems like the tech world would be much better off if we migrated to a better format.
1vuio0pswjnm7 about 12 hours ago
Below is a PDF. It is a .txt file. I can save it with a .pdf extension and open it in a PDF viewer. I can make changes in a text editor. For example, by editing this text file, I can change the text displayed on the screen when the PDF is opened, the font, font size, line spacing, the maximum characters per line, number of lines per page, the paper width and height, as well as portrait versus landscape mode.

    %PDF-1.4
    1 0 obj << /CreationDate (D:2025) /Producer >> endobj
    2 0 obj << /Type /Catalog /Pages 3 0 R >> endobj
    4 0 obj << /Type /Font /Subtype /Type1 /Name /F1 /BaseFont /Times-Roman >> endobj
    5 0 obj << /Font << /F1 4 0 R >> /ProcSet [ /PDF /Text ] >> endobj
    6 0 obj << /Type /Page /Parent 3 0 R /Resources 5 0 R /Contents 7 0 R >> endobj
    7 0 obj << /Length 8 0 R >> stream
    BT /F1 50 Tf 1 0 0 1 50 752 Tm 54 TL
    (PDF is)'
    ((a) a text format)'
    ((b) a graphics format)'
    ((c) (a) and (b).)'
    ()'
    ET
    endstream endobj
    8 0 obj 53 endobj
    3 0 obj << /Type /Pages /Count 1 /MediaBox [ 0 0 612 792 ] /Kids [ 6 0 R ] >> endobj
    xref
    0 9
    0000000000 65535 f
    0000000009 00000 n
    0000000113 00000 n
    0000000514 00000 n
    0000000162 00000 n
    0000000240 00000 n
    0000000311 00000 n
    0000000391 00000 n
    0000000496 00000 n
    trailer << /Size 9 /Root 2 0 R /Info 1 0 R >> startxref 599 %%EOF
gerdesj about 6 hours ago
PDF is a display format. It is optimised for eyeballs and printers. There has been some feature creep. It is a rubbish machine data transfer mechanism but really good for humans and, say, storing a page of A4 (letter for the US).

So, you start off with the premise that a .pdf stores text and you want that text. Well that's nice: grow some eyes!

Otherwise, you are going to have to get to grips with some really complicated stuff. For starters, is the text ... text or is it an image? Your eyes don't care and will just work (especially when you pop your specs back on) but your parser is probably seg faulting madly. It just gets worse.

PDF is for humans to read. Emulate a human to read a PDF.
bartread about 15 hours ago
Yeah, getting text - even structured text - out of PDFs is no picnic. Scraping a table out of an HTML document is often straightforward even on sites that use the "everything's a <div>" (anti-)pattern, and especially on sites that use more semantically useful elements, like <table>.

Not so PDFs.

I'm far from an expert on the format, so maybe there is some semantic support in there, but I've seen plenty of PDFs where tables are simply a loose assemblage of graphical and text elements that, only when rendered, are easily discernible as a table because they're positioned in such a way that they render as a table.

I've actually had decent luck extracting tabular data from PDFs by converting the PDFs to HTML using the Poppler PDF utils, then finding the expected table header, and then using the x-coordinate of the HTML elements for each value within the table to work out columns, and extract values for each row.

It's kind of grotty but it seems reliable for what I need. Certainly much more so than going via formatted plaintext, which has issues with inconsistent spacing and the insertion of newlines into the middle of rows.
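[Editor's note: the column-recovery step described above can be sketched as follows. This is a hedged illustration, assuming you have already pulled each cell's x-coordinate out of the Poppler HTML output; the input shape and function names are invented for the sketch. Each value is assigned to the column whose header x-position is nearest.]

```python
import bisect

def assign_columns(header_xs, cells):
    """header_xs: sorted x positions of the table-header cells.
    cells: list of (x, y, text) extracted values.
    Returns rows (sorted by y), each a list aligned to the headers."""
    rows = {}
    for x, y, text in cells:
        i = bisect.bisect_left(header_xs, x)
        # candidate columns are the headers on either side of x
        candidates = [c for c in (i - 1, i) if 0 <= c < len(header_xs)]
        col = min(candidates, key=lambda c: abs(header_xs[c] - x))
        rows.setdefault(y, {})[col] = text
    return [[row.get(c, "") for c in range(len(header_xs))]
            for _, row in sorted(rows.items())]
```

Grouping on the raw y value works only when rows share an exact baseline; real extractions usually need a small y-tolerance when bucketing rows.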
ted_dunning about 13 hours ago
One of my favorite documents for highlighting the challenges described here is the PDF for this article:

https://academic.oup.com/auk/article/126/4/717/5148354

The first page is classic, with two columns of text, centered headings, and a text inclusion that sits between the columns and changes the line lengths and indentations for the columns. Then we get the fun of page headers that change between odd and even pages, and section header conventions that vary drastically.

Oh... to make things even better, paragraphs don't get extra spacing and don't always have an indented first line.

Some of everything.
patrick41638265 about 10 hours ago
Good old https://linux.die.net/man/1/pdftotext and a little Python on top of its output will get you a long way if your documents are not too crazy. I use it to parse all my bank statements into an sqlite database for analysis.
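[Editor's note: that workflow — `pdftotext -layout`, then a regex per transaction line, then sqlite — can be sketched like this. The line format (date, description, amount) is a hypothetical example; every bank's statement layout differs, so the regex is the part you would adapt.]

```python
import re
import sqlite3

# One transaction per line: DD/MM/YYYY  DESCRIPTION  AMOUNT  (hypothetical format)
LINE = re.compile(r"^(\d{2}/\d{2}/\d{4})\s+(.+?)\s+(-?\d+\.\d{2})$")

def load_statement(text, db=":memory:"):
    """Parse `pdftotext -layout` output and load matching lines into sqlite."""
    conn = sqlite3.connect(db)
    conn.execute("CREATE TABLE IF NOT EXISTS tx (date TEXT, descr TEXT, amount REAL)")
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:  # non-matching lines (headers, page footers) are simply skipped
            date, descr, amount = m.groups()
            conn.execute("INSERT INTO tx VALUES (?, ?, ?)", (date, descr, float(amount)))
    conn.commit()
    return conn
```

The `-layout` flag matters: it preserves column alignment, so a line-oriented regex sees date, description, and amount in stable positions.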
gibsonf1 about 13 hours ago
We[1] create "Units of Thought" from PDFs and then work with those for further discovery, where a "Unit of Thought" is any paragraph, title, or note heading - something that stands on its own semantically. We then create a hierarchy of objects from that PDF in the database for search and conceptual search - all at scale.

[1] https://graphmetrix.com/trinpod-server https://trinapp.com
smcleod about 11 hours ago
Definitely recommend docling for this. https://docling-project.github.io/docling/
incanus77 about 7 hours ago
I did some contract work some years back with a company that had a desktop product (for Mac) that would apply some smarts to strip out extraneous things on pages while printing (such as ads on webpages), as well as try to avoid the case where only a line or two was printed on a page, wasting paper. It initially got into things at the PostScript layer, which unsurprisingly was horrifying, but eventually worked on PDFs. This required finding and interpreting various textual parts of the passed documents and was a pretty big technical challenge.

While I'm not convinced it was viable at the business level, it feels like something platform/OS companies could focus on to have a measurable environmental and cost impact.
bob1029 about 14 hours ago
When accommodating the general case, solving PDF-to-text is approximately equivalent to solving JPEG-to-text.<p>The only PDF parsing scenario I would consider putting my name on is scraping AcroForm field values from standardized documents.
xnx about 15 hours ago
Weird that there's no mention of LLMs in this article even though the article is very recent. LLMs haven't solved every OCR/document data extraction problem, but they've dramatically improved the situation.
wrs about 15 hours ago
Since these are statistical classification problems, it seems like it would be worth trying some old-school machine learning (not an LLM, just an NN) to see how it compares with these manual heuristics.
rekoros about 8 hours ago
I've been using Azure's "Document Intelligence" thingy (prebuilt "read" model) to extract text from PDFs with pretty good results [1]. Their terminology is so bad, it's easy to dismiss the whole thing as another Microsoft pile, but it actually, like, for real, works.

[1] https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/prebuilt/read?view=doc-intel-4.0.0&tabs=sample-code
remram about 6 hours ago
I built a simple OSS tool for qualitative data analysis, which needs to turn uploaded documents into text (stripped HTML). PDFs have been a huge problem from day one.

I have investigated many tools, but two-column layouts, footers, etc. often still mess up the content.

It's hard to convince my (often non-technical) users that this is a difficult problem.
Sharlin about 10 hours ago
Some of the unsung heroes of the modern age are the programmers who, through what must have involved a lot of weeping and gnashing of teeth, have managed to implement the find, select, and copy operations in PDF readers.
noosphr about 10 hours ago
I've worked on this in my day job: extracting _all_ relevant information from a financial-services PDF for a BERT-based search engine.

The only way to solve that is with a segmentation model followed by a regular OCR model, plus whatever other specialized models you need to extract other types of data. VLMs aren't ready for prime time and won't be for a decade or more.

What worked was using DocLayNet-trained YOLO models to get the areas of the document that were text, images, tables or formulas: https://github.com/DS4SD/DocLayNet If you don't care about anything but text, you can feed the results into tesseract directly (but for the love of god read the manual). Congratulations, you're done.

Here are some pre-trained models that work OK out of the box: https://github.com/ppaanngggg/yolo-doclaynet I found that we needed to increase the resolution from ~700px to ~2100px horizontal for financial data segmentation.

VLMs, on the other hand, still choke on long text and hallucinate unpredictably. Worse, they can't understand nested data. If you give _any_ current model nothing harder than three nested rectangles with text under each, they will not extract the text correctly. Given that nested rectangles describe every table, no VLM can currently extract data from anything but the most straightforward of tables. But it will happily lie to you that it did - after all, a mining company should own a dozen bulldozers, right? And if they each cost $35,000 it must be an amazing deal they got, right?
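[Editor's note: one step glossed over in the pipeline above is ordering the detected regions before OCR — a layout model returns boxes, not a reading sequence. A minimal column-aware sort is sketched below; it assumes boxes as (x0, y0, x1, y1) with the origin at the top-left and a known column count, which is a strong simplification of real multi-column logic.]

```python
def reading_order(boxes, page_width, n_columns=2):
    """Sort layout boxes into column-major reading order:
    left column top-to-bottom, then the next column, and so on."""
    col_width = page_width / n_columns
    def key(box):
        x0, y0, x1, y1 = box
        # a box belongs to the column containing its horizontal centre
        col = min(int(((x0 + x1) / 2) // col_width), n_columns - 1)
        return (col, y0, x0)
    return sorted(boxes, key=key)
```

Feeding regions to tesseract in this order keeps paragraphs from interleaving across columns, which is one of the classic failure modes of page-level OCR.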
EmilStenstrom about 14 hours ago
I think using Gemma3 in vision mode could be a good use-case for converting PDF to text. It’s downloadable and runnable on a local computer, with decent memory requirements depending on which size you pick. Did anyone try it?
coolcase about 8 hours ago
Tried extracting data from a newspaper. It is really hard. What is a headline and which headline belongs to which paragraphs? Harder than you think! And chucking it as is into OpenAI was no good at all. Manually dealing with coordinates from OCR was better but not perfect.
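[Editor's note: the headline-association problem described above can be framed as a geometric heuristic. The sketch below assumes OCR blocks carrying a bounding box and an average glyph height — call a block a headline when its glyphs are much taller than the page median, then attach each body block to the closest headline above it that overlaps it horizontally. The 1.5x threshold and input shape are assumptions, not a general solution, and real newspapers (spanning headlines, jumps to other pages) break it quickly.]

```python
from statistics import median

def attach_headlines(blocks, ratio=1.5):
    """blocks: list of (x0, y0, x1, y1, glyph_height, text), top-left origin.
    Returns {headline_text: [paragraph_text, ...]}."""
    med = median(b[4] for b in blocks)
    heads = [b for b in blocks if b[4] > ratio * med]
    body = [b for b in blocks if b[4] <= ratio * med]
    result = {h[5]: [] for h in heads}
    for b in body:
        # headlines strictly above this block that overlap it horizontally
        above = [h for h in heads
                 if h[1] < b[1] and h[0] < b[2] and b[0] < h[2]]
        if above:
            best = max(above, key=lambda h: h[1])  # the nearest one above
            result[best[5]].append(b[5])
    return result
```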
rad_gruchalski about 15 hours ago
So many of these problems have been solved by mozilla pdf.js together with its viewer implementation: https://mozilla.github.io/pdf.js/
elpalek about 11 hours ago
Recently tested a (non-English) PDF OCR with Gemini 2.5 Pro. First, I directly asked it to extract text from the PDF. Result: a random text blob, not useable.

Second, I converted the PDF into pages of jpg. Gemini performed exceptionally. Near-perfect text extraction with intact format in markdown.

Maybe there's an internal difference between processing pdf vs jpg inside the model.
PeterStuer about 12 hours ago
I guess I'm lucky the PDFs I need to process mostly have rather dull, unadventurous layouts. So far I've had great success using docling.
nicodjimenez about 12 hours ago
Check out mathpix.com. We handle complex tables, complex math, diagrams, rotated tables, and much more, extremely accurately.

Disclaimer: I'm the founder.
viking2917 about 7 hours ago
Coincidentally, I posted this over on Show HN today: OCR Workbench, AI OCR & editing tools for OCRing old / hard documents. https://news.ycombinator.com/item?id=43976450. Tesseract works fine for modern text documents, but it fails badly on older docs (e.g. colonial American, etc.)
fracus about 6 hours ago
Why hasn't the PDF standard been replaced or revised to require the text in meta form? Seems like a no-brainer.
andrethegiant about 14 hours ago
Cloudflare's ai.toMarkdown() function available in Workers AI can handle PDFs pretty easily. Judging from speed alone, it seems they're parsing the actual content rather than shoving it into OCR/LLM.

Shameless plug: I use this under the hood when you prefix any PDF URL with https://pure.md/ to convert to raw text.
bickfordb about 11 hours ago
Maybe it's time for new document formats and browsers that neatly separate content, presentation and UI layers? PDF and HTML are 20+ years old and it's often difficult to extract information from either, let alone author a browser.
ljlolel about 12 hours ago
Mistral OCR is best in class at document understanding.

https://mistral.ai/news/mistral-ocr
devrandoom about 13 hours ago
I currently use ocrmypdf for my private library. Then Recoll to index and search. Is there a better solution I&#x27;m missing?
anonu about 12 hours ago
They should have called it NDF - Non-Portable Document Format.
dobraczekolada about 12 hours ago
Reminds me of github.com/docwire/docwire
TZubiri about 6 hours ago
As someone who has worked on this FT (S&P, parsing of financial disclosures):

The solution is OCR. Don't fuck with the internal file format. PDF is designed to print/display stuff, not to be parseable by machines.
constantinum about 13 hours ago
PDF parsing is hell indeed, with all sorts of edge cases that break business workflows; more on that here: https://unstract.com/blog/pdf-hell-and-practical-rag-applications/
keybored about 12 hours ago
People who want people to read their documents[1] should have their PDF point to a more digital-friendly format, an alt document.

"Looks like you've found my PDF. You might want this version instead."

PDFs are often subpar. Just see the first example: standard LaTeX serif section title. I mean, PDFs often aren't even well typeset for what they are (dead-tree simulations).

[1] No sarcasm or truism. Some may just want to submit a paper to whatever publisher and go through their whole laundry list of what a paper ought to be. Wide dissemination is not the point.
j45 about 15 hours ago
Part of a problem being challenging is recognizing whether it's new, or just new to us.

We get to learn a lot when something is new to us... at the same time, the untouchable parts of PDF-to-text are largely being solved with the help of LLMs.

I built a tool to extract information from PDFs a long time ago, and the breakthrough was having no ego or attachment to any one way of doing it.

Different solutions and approaches offered different depth or quality of results, and organizing them to work together, in addition to anything I built myself, provided what was needed - one place where more things work than not.
Obscurity4340 about 14 hours ago
Is this what GoodPDF does?
reify about 14 hours ago
https://github.com/jalan/pdftotext

    pdftotext -layout input.pdf output.txt

    pip install pdftotext