ScienceBeam – using computer vision to extract PDF data

103 points by kaplun almost 8 years ago

8 comments

aidos almost 8 years ago
PDF is a pretty interesting format. The spec is actually a great read. It's amazing how many features they've needed to add over the years to support everyone's use cases.

It's a display format that doesn't have a whole lot of semantic meaning for the most part. Often every character is individually placed, so even extracting words is a pain. It's insane that OCR (which it sounds like this uses) is the easiest way to deal with extraction.

I highly recommend having a look inside a couple of PDFs to see how they look. I've posted about this before, but the trick is to expand the streams:

    mutool clean -d in.pdf out.pdf
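If you'd rather inspect the operators programmatically than eyeball the expanded file, here is a rough sketch assuming the pikepdf library and a file named out.pdf (the operator filter is just illustrative). The Tf/Td/Tj/TJ operators are where you can see each run of glyphs being positioned individually:

    import pikepdf

    # Dump the text-related operators of the first page. TJ arrays carry the
    # glyphs together with per-glyph spacing adjustments, which is why naive
    # text extraction struggles to even reassemble words.
    with pikepdf.open("out.pdf") as pdf:
        page = pdf.pages[0]
        for operands, operator in pikepdf.parse_content_stream(page):
            if str(operator) in ("Tf", "Td", "Tj", "TJ"):
                print(operator, operands)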
vog almost 8 years ago
Some time ago I came to a similar conclusion: in most cases, the only way to properly process PDF files is to render them and work on the raster images.

I was involved in a project where we needed to determine the final size of an image in a PDF document.

This seemed simple: just keep track of all transformation matrices applied to the image, then calculate the final size.

But we underestimated the nonsense complexity of PDF: the image could be a real image or an embedded EPS, which are completely different cases. The image could have inner transparency, but could also have an outer alpha mask applied by the PDF document. Then there are clipping paths, but be aware of the always implicitly present clipping path that is the page boundary. Oh, and an image may be overlapped by text, or even another image, in which case you need to do the same processing for that one, too. And so on.

After wasting lots of time almost accidentally rebuilding a PDF renderer, we decided to use an existing renderer instead.

It turned out the only feasible solution was to render the PDF twice, with and without the image, and to compare the results pixel by pixel.

I'm afraid the modern web might develop in a similar direction.
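For what it's worth, that render-and-diff approach is easy to prototype. A minimal sketch, assuming poppler's pdftoppm and Pillow are installed; the file names are illustrative and assume a single-page document:

    import subprocess
    from PIL import Image, ImageChops

    def render(pdf_path, prefix, dpi=150):
        # Rasterize the pages to PNG at a fixed resolution so the two
        # renders are pixel-aligned.
        subprocess.run(["pdftoppm", "-r", str(dpi), "-png", pdf_path, prefix],
                       check=True)

    render("with_image.pdf", "a")
    render("without_image.pdf", "b")

    # Compare page 1 pixel by pixel; the bounding box of the difference is
    # the area the image actually occupies after transforms, masks and clipping.
    diff = ImageChops.difference(Image.open("a-1.png"), Image.open("b-1.png"))
    print(diff.getbbox())  # (left, upper, right, lower), or None if identical

Mapping the pixel bounding box back to PDF user space is then just a matter of scaling by 72/dpi.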
Lxr almost 8 years ago
This looks really cool and is badly needed. Our company would kill for a PDF-to-semantic-HTML algorithm (or service) too, using machine learning based on computer vision. Existing options just vomit enough CSS to match the PDF output, rather than marking the content up into headings, tables and the like.
davedx almost 8 years ago
Good stuff.

What I think would be a really nice killer app would be using OCR to extract formulas directly into MATLAB code. Would be awesome for reproducibility studies or just people trying to implement algorithms for whatever reason.

Anyone know if there's an app for that already?
hprotagonist almost 8 years ago
How do you address older PDFs that are just scans and have no actual textual data at all, only embedded images?

In my experience, this is true for every PDF version of articles originally published before about 1990.
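Not speaking for the authors, but a quick way to triage such files is to check whether any text layer exists before deciding to OCR. A small sketch, assuming poppler's pdftotext and the ocrmypdf tool are installed and the file names are made up:

    import subprocess

    def has_text_layer(pdf_path):
        # pdftotext writes extracted text to stdout when given "-";
        # an empty result almost always means the pages are bare scans.
        out = subprocess.run(["pdftotext", pdf_path, "-"],
                             capture_output=True, text=True, check=True).stdout
        return bool(out.strip())

    if not has_text_layer("old_article.pdf"):
        # Add a searchable text layer first, e.g. with ocrmypdf.
        subprocess.run(["ocrmypdf", "old_article.pdf", "ocr_article.pdf"],
                       check=True)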
misiti3780 almost 8 years ago
I haven't had a chance to read through this completely yet, but I'm curious whether this method is agnostic to how the PDF was created originally (LaTeX, Adobe, scanned images). It reads like that doesn't matter (since it treats the PDF as an image), but I wanted to make sure.
ocrcustomserver over 7 years ago
Interesting. You can also try OCR and document layout analysis to do the same thing (without GPUs).

Shameless plug: if you're interested in that sort of stuff, drop me a line, I might be able to help.
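As an illustration of what off-the-shelf OCR plus layout analysis already gives you, Tesseract emits a coarse block/paragraph/line/word hierarchy with coordinates. A small sketch, assuming pytesseract, Pillow and a pre-rendered page image named page-1.png:

    import pytesseract
    from PIL import Image

    # image_to_data returns, for each recognized word, its position plus the
    # block/paragraph/line it belongs to - a crude layout analysis for free.
    data = pytesseract.image_to_data(Image.open("page-1.png"),
                                     output_type=pytesseract.Output.DICT)
    for i, word in enumerate(data["text"]):
        if word.strip():
            print(data["block_num"][i], data["par_num"][i], data["line_num"][i],
                  (data["left"][i], data["top"][i]), word)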
sharemywin almost 8 years ago
Couldn't you use a PDF converter to convert to HTML or something else, and then translate that to your XML format?
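That pipeline more or less exists; poppler's pdftohtml can emit positional XML directly, and the hard part is exactly what the article targets: turning coordinates and font ids into semantic structure. A rough sketch, assuming pdftohtml is installed and paper.pdf is a placeholder:

    import subprocess
    import xml.etree.ElementTree as ET

    # pdftohtml -xml writes <page> elements containing <text> nodes with
    # top/left/width/height/font attributes - coordinates, not semantics.
    subprocess.run(["pdftohtml", "-xml", "paper.pdf", "paper"], check=True)
    root = ET.parse("paper.xml").getroot()
    for node in root.iter("text"):
        print(node.get("top"), node.get("left"), "".join(node.itertext()))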