PDF is a pretty interesting format. The spec is actually a great read. It's amazing how many features they've needed to add over the years to support everyone's use cases.<p>It's a display format that doesn't have a whole lot of semantic meaning for the most part. Often every character is individually placed so even extracting words is a pain. It's insane that OCR (which it sounds like this uses) is the easiest way to deal with extraction.<p>I highly recommend having a look inside a couple of pdfs to see how they look. I've posted about this before but the trick is to expand the streams.<p><pre><code> mutool clean -d in.pdf out.pdf</code></pre>
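If you want to see the per-character placement without reading the raw content streams yourself, here is a minimal sketch (assuming PyMuPDF is installed; the file name just matches the command above) that dumps every glyph with its own bounding box, which shows how little word or line structure the file actually stores:<p><pre><code>import fitz  # PyMuPDF: pip install pymupdf

doc = fitz.open("in.pdf")
page = doc[0]
# "rawdict" extraction returns individual glyphs with their own boxes,
# making it clear that words are not stored as words at all.
for block in page.get_text("rawdict")["blocks"]:
    for line in block.get("lines", []):   # image blocks have no "lines"
        for span in line["spans"]:
            for ch in span["chars"]:
                print(ch["c"], ch["bbox"])
</code></pre>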
Some time ago I came to a similar conclusion: In most cases, the only way to properly process PDF files is to render them and work on the raster images.<p>I was involved in a project where we needed to determine the final size of an image in a PDF document.<p>This seemed simple: Just keep track of all transformation matrices applied to the image, then calculate the final size.<p>But we underestimated the absurd complexity of PDF: The image could be a real image or an embedded EPS, which are completely different cases. The image could have inner transparency, but could also have an outer alpha mask applied by the PDF document. Then there are clipping paths, but be aware of the always implicitly present clipping path that is the page boundary. Oh, and an image may be overlapped by text, or even another image, in which case you need to do the same processing for that one, too. And so on.<p>After wasting lots of time accidentally almost rebuilding a PDF renderer, we decided to use an existing renderer instead.<p>It turned out the only feasible solution was to render the PDF twice, once with and once without the image, and compare the results pixel by pixel.<p>I'm afraid the modern web might develop in a similar direction.
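For what it's worth, the final render-and-compare step can be sketched with an off-the-shelf renderer such as PyMuPDF. This is only an illustration of the idea: the file names and the helper are hypothetical, and producing the copy of the document without the image is assumed to happen elsewhere.<p><pre><code>import fitz  # PyMuPDF: pip install pymupdf

def changed_bbox(pdf_with, pdf_without, page_no=0, dpi=150):
    """Render the same page from both files and return the pixel bounding
    box of everything that differs, i.e. where the image ended up."""
    pix_a = fitz.open(pdf_with)[page_no].get_pixmap(dpi=dpi)
    pix_b = fitz.open(pdf_without)[page_no].get_pixmap(dpi=dpi)
    assert (pix_a.width, pix_a.height, pix_a.n) == (pix_b.width, pix_b.height, pix_b.n)
    a, b = pix_a.samples, pix_b.samples
    w, h, n = pix_a.width, pix_a.height, pix_a.n
    xs, ys = [], []
    for y in range(h):
        row = y * w * n
        for x in range(w):
            off = row + x * n
            if a[off:off + n] != b[off:off + n]:   # any channel differs
                xs.append(x)
                ys.append(y)
    return (min(xs), min(ys), max(xs), max(ys)) if xs else None

print(changed_bbox("with_image.pdf", "without_image.pdf"))
</code></pre>Dividing the resulting box by dpi/72 converts it back into PDF points, i.e. the page's user-space units.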
This looks really cool and is badly needed. Our company would kill for a PDF-to-semantic-HTML algorithm (or service) too, using machine learning based on computer vision. Existing options just vomit enough CSS to match the PDF output, rather than marking the content up into headings, tables and the like.
Good stuff.<p>What I think would be a really nice killer app would be using OCR to extract formulas directly into Matlab code. Would be awesome for reproducibility studies or just people trying to implement algorithms for whatever reason.<p>Anyone know if there's an app for that already?
How do you address older PDFs that are plain scans and have no actual textual data at all, just embedded images?<p>In my experience, this is true for every PDF version of articles originally published before about 1990.
I haven't had a chance to read through this completely yet, but I'm curious whether this method is agnostic to how the PDF was created originally (LaTeX, Adobe tools, scanned images). It reads like that doesn't matter (since it treats the PDF as an image), but I wanted to make sure.
Interesting.
You can also try OCR and document layout analysis to do the same thing (without GPUs).<p>Shameless plug: if you're interested in that sort of stuff, drop me a line, I might be able to help.
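To make that concrete, here is a minimal CPU-only sketch, assuming PyMuPDF plus pytesseract with a local Tesseract install (the file name is just the one from the earlier example): rasterize a page, then let Tesseract return word boxes together with its block/line segmentation, which is a crude form of layout analysis.<p><pre><code>import fitz                      # PyMuPDF, used here only to rasterize the page
import pytesseract               # needs the tesseract binary on PATH
from PIL import Image

page = fitz.open("in.pdf")[0]
pix = page.get_pixmap(dpi=300)
img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)

# image_to_data returns word-level boxes plus block/paragraph/line ids.
data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
for i, word in enumerate(data["text"]):
    if word.strip():
        print(data["block_num"][i], data["line_num"][i], word,
              (data["left"][i], data["top"][i], data["width"][i], data["height"][i]))
</code></pre>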