Show HN: an API to extract text from a PDF

51 points by trez almost 12 years ago

12 comments

hnriot almost 12 years ago
Why not just system(pdf2html) - I don't see the point since this level of functionality is trivially achieved. If it did something over and above that it might be useful, like OCR, but even that's not hard to add.
zdw almost 12 years ago
If you're doing this local/cli:

`pdftotext`, from http://www.foolabs.com/xpdf/

For OCR, `pdfimages` (also from xpdf), combined with ImageMagick's `convert`, and `tesseract` (http://code.google.com/p/tesseract-ocr/) works passably well.
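A rough sketch of the local pipeline described above, in Python. The helper names (`pdftotext_cmd`, `pdfimages_cmd`, `ocr_cmds`, `extract_text`) are made up for illustration, and the sketch assumes the xpdf/Poppler tools, ImageMagick and tesseract are installed and on the PATH. It only builds the command lines and shows one way to run the first of them; it is not what the submitted API does internally.

```python
import shutil
import subprocess
from pathlib import Path


def pdftotext_cmd(pdf: str) -> list[str]:
    """argv for plain text extraction; "-" asks pdftotext to write to stdout."""
    return ["pdftotext", pdf, "-"]


def pdfimages_cmd(pdf: str, image_root: str) -> list[str]:
    """argv that dumps embedded images as image_root-000.ppm, image_root-001.ppm, ..."""
    return ["pdfimages", pdf, image_root]


def ocr_cmds(image: Path) -> list[list[str]]:
    """argv pair for one extracted image: convert it to TIFF, then OCR it."""
    tiff = image.with_suffix(".tif")
    out_base = image.with_suffix("")  # tesseract appends ".txt" to this base name
    return [
        ["convert", str(image), str(tiff)],
        ["tesseract", str(tiff), str(out_base)],
    ]


def extract_text(pdf: str) -> str:
    """Run pdftotext and return its output (assumes the binary is installed)."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        pdftotext_cmd(pdf), capture_output=True, text=True, check=True
    )
    return result.stdout
```

For scanned PDFs with no text layer, `pdftotext` returns little or nothing, which is when the `pdfimages` plus `convert` plus `tesseract` route becomes relevant.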
kijin almost 12 years ago
I have some questions:

1. Why return an array of texts? Where do the texts get split up? At page boundaries? Column boundaries? At the end of each line? If a line is interrupted by a corner of an image and continues a couple of inches afterward, does it get treated as a separate text? (I once used a PDF->text extractor program that spit out every word separately, often in an incorrect order. That probably had to do with how the PDF was organized internally.)

2. "The PDF file should be smaller than 1 Mbit" -> You mean 1 megabyte, right? Because 1 megabit is only 125-128 kilobytes.
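The 125-128 kilobyte range in point 2 comes from the two common readings of "mega" (10^6 versus 2^20). A quick check in Python, where `bits_to_bytes` is just an illustrative helper:

```python
BITS_PER_BYTE = 8


def bits_to_bytes(bits: int) -> int:
    """Convert a bit count to whole bytes."""
    return bits // BITS_PER_BYTE


decimal_megabit = 10**6  # SI megabit
binary_megabit = 2**20   # "mebibit", the power-of-two reading

print(bits_to_bytes(decimal_megabit) / 1000)  # 125.0 (kilobytes, 1 kB = 1000 bytes)
print(bits_to_bytes(binary_megabit) / 1024)   # 128.0 (kibibytes, 1 KiB = 1024 bytes)
```

Either way the limit would be roughly an eighth of a megabyte, which is why the quoted wording looks like a unit mix-up.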
midas almost 12 years ago
Going from PDF to nicely formatted word doc would be huge for lawyers and people who do a lot of contract negotiations. It's hard to do well though.
rcfox almost 12 years ago
I've recently been working on extracting text from PDFs myself. I've found that `pdftohtml -xml` from the Poppler utils does a decent job of it, and includes a bounding box for each piece of text. I've submitted a few patches to their Bugzilla to also include the transformation matrix as well as some extra styling information.
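For anyone wanting to consume that XML, here is a minimal Python sketch that pulls the bounding boxes out. The `SAMPLE` document is hand-written to resemble `pdftohtml -xml` output (`<page>` elements containing `<text>` elements with `top`, `left`, `width` and `height` attributes); treat the exact layout as an assumption and check it against your own output. `text_boxes` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written sample, assumed to mirror the shape of `pdftohtml -xml` output.
SAMPLE = """<pdf2xml>
  <page number="1" height="1188" width="918">
    <text top="100" left="50" width="200" height="16" font="0">Hello <b>PDF</b></text>
    <text top="130" left="50" width="120" height="16" font="0">second line</text>
  </page>
</pdf2xml>"""


def text_boxes(xml_source: str):
    """Return (page number, (left, top, width, height), text) for each text element."""
    root = ET.fromstring(xml_source)
    boxes = []
    for page in root.iter("page"):
        page_no = int(page.get("number"))
        for el in page.iter("text"):
            box = tuple(int(el.get(k)) for k in ("left", "top", "width", "height"))
            # itertext() also collects text nested inside inline tags such as <b>.
            boxes.append((page_no, box, "".join(el.itertext())))
    return boxes


for page_no, box, text in text_boxes(SAMPLE):
    print(page_no, box, text)
```

Having a box per piece of text is what makes it possible to re-sort fragments into reading order or group them into columns afterwards.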
chenster almost 12 years ago
I googled "converting PDF to text" and "converting PDF to html". Tons of services already exist out there. Apparently, it's not something new. How do you plan to compete? Are you planning to focus on data extraction rather than conversion?
TillE almost 12 years ago
Neat, but practically who would want to do this with an API rather than installable software?
ismaelc almost 12 years ago
Hey I've documented this in Mashape - https://www.mashape.com/ismaelc/extract-text-from-pdfs#!documentation
surapaneni almost 12 years ago
This is similar to what we do at http://searchtower.com, where you can store, view, index and search the data.
architgupta almost 12 years ago
Do you do OCR for text extraction?
ra almost 12 years ago
Nice. Why no paid options? I'm guessing because this was a weekend project.

If so, nice work!
alkou almost 12 years ago
Do you use pdftotext internally or something else?