Show HN: an API to extract text from a PDF

51 points by trez, almost 12 years ago

12 comments

hnriot, almost 12 years ago
Why not just system(pdf2html) - I don't see the point since this level of functionality is trivially achieved. If it did something over and above that it might be useful, like OCR, but even that's not hard to add.
zdw, almost 12 years ago
If you're doing this locally from the CLI:

`pdftotext`, from http://www.foolabs.com/xpdf/

For OCR, `pdfimages` (also from xpdf), combined with ImageMagick's `convert` and `tesseract` (http://code.google.com/p/tesseract-ocr/), works passably well.
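The local pipeline zdw describes could be sketched roughly as below. This is a minimal sketch, not zdw's actual setup: the function names, the `img` prefix, and the choice to fall back to OCR only when no text layer is found are assumptions made for illustration. It uses `pdftotext`, `pdfimages`, `convert` and `tesseract` in their basic positional forms and needs all four tools on the PATH.

```python
import shutil
import subprocess
from pathlib import Path

TOOLS = ("pdftotext", "pdfimages", "convert", "tesseract")


def text_layer_cmd(pdf):
    # "-" asks pdftotext to write the text to stdout instead of a file.
    return ["pdftotext", pdf, "-"]


def extract_images_cmd(pdf, prefix):
    # pdfimages writes one image file per embedded image, named from prefix.
    return ["pdfimages", pdf, prefix]


def ocr_cmds(image, out_base):
    # Convert the extracted image to TIFF, then OCR it.
    # tesseract writes its result to out_base + ".txt".
    tif = out_base + ".tif"
    return [["convert", image, tif], ["tesseract", tif, out_base]]


def extract_text(pdf, workdir):
    """Return the PDF's text layer, or OCR its images if the layer is empty.

    The fallback policy is an assumption of this sketch.
    """
    missing = [tool for tool in TOOLS if shutil.which(tool) is None]
    if missing:
        raise RuntimeError("missing tools: " + ", ".join(missing))

    result = subprocess.run(
        text_layer_cmd(pdf), capture_output=True, text=True, check=True
    )
    if result.stdout.strip():
        return result.stdout

    prefix = str(Path(workdir) / "img")
    subprocess.run(extract_images_cmd(pdf, prefix), check=True)

    chunks = []
    # The image list is fixed before OCR starts, so generated files are skipped.
    for image in sorted(Path(workdir).glob("img-*")):
        base = str(image.with_suffix(""))
        for cmd in ocr_cmds(str(image), base):
            subprocess.run(cmd, check=True)
        chunks.append(Path(base + ".txt").read_text())
    return "\n".join(chunks)
```

Calling `extract_text("input.pdf", "/tmp/work")` (both paths hypothetical) would return the text layer when one exists and the concatenated OCR output otherwise.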
kijin, almost 12 years ago
I have some questions:

1. Why return an array of texts? Where do the texts get split up? At page boundaries? Column boundaries? At the end of each line? If a line is interrupted by a corner of an image and continues a couple of inches afterward, does it get treated as a separate text? (I once used a PDF->text extractor program that spit out every word separately, often in an incorrect order. That probably had to do with how the PDF was organized internally.)

2. "The PDF file should be smaller than 1 Mbit" -> You mean 1 megabyte, right? Because 1 megabit is only 125-128 kilobytes.
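The 125-128 kilobyte range in kijin's second question comes from the two common readings of the "mega" prefix. A small worked check, with a hypothetical helper name:

```python
def megabits_to_kilobytes(mbit, binary=False):
    """Convert megabits to kilobytes using decimal (1000) or binary (1024) prefixes."""
    base = 1024 if binary else 1000
    bits = mbit * base * base  # mega = base squared
    return bits / 8 / base     # 8 bits per byte, then bytes to kilobytes


print(megabits_to_kilobytes(1))               # 125.0
print(megabits_to_kilobytes(1, binary=True))  # 128.0
```

Either way, 1 Mbit is roughly an eighth of a megabyte, which is why the stated limit looks like a typo for 1 MB.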
midas, almost 12 years ago
Going from PDF to a nicely formatted Word doc would be huge for lawyers and people who do a lot of contract negotiations. It's hard to do well though.
rcfox, almost 12 years ago
I've recently been working on extracting text from PDFs myself. I've found that `pdftohtml -xml` from the Poppler utils does a decent job of it, and includes a bounding box for each piece of text. I've submitted a few patches to their Bugzilla to also include the transformation matrix as well as some extra styling information.
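To illustrate the bounding boxes rcfox mentions, here is a minimal sketch that reads them from `pdftohtml -xml` style output. The embedded sample is hand-written and assumes the usual layout of `<page>` elements containing `<text>` elements with `top`, `left`, `width` and `height` attributes; real output may carry extra attributes or nodes, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the assumed shape of `pdftohtml -xml` output.
SAMPLE = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="72" width="200" height="14" font="0">Hello</text>
<text top="120" left="72" width="180" height="14" font="0"><b>World</b></text>
</page>
</pdf2xml>"""


def text_boxes(xml_source):
    """Return one dict per <text> element: page number, bounding box, and text."""
    root = ET.fromstring(xml_source)
    boxes = []
    for page in root.iter("page"):
        for node in page.iter("text"):
            boxes.append({
                "page": int(page.get("number")),
                "left": int(node.get("left")),
                "top": int(node.get("top")),
                "width": int(node.get("width")),
                "height": int(node.get("height")),
                # itertext() also collects text nested in <b> or <i> children.
                "text": "".join(node.itertext()),
            })
    return boxes


for box in text_boxes(SAMPLE):
    print(box)
```

Keeping the boxes alongside the text makes it possible to sort fragments into reading order, which relates to the word-order problem raised earlier in the thread.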
chenster, almost 12 years ago
I googled "converting PDF to text" and "converting PDF to html". Tons of services already exist out there. Apparently, it's not something new. How do you plan to compete? Are you planning to focus on data extraction rather than conversion?
TillE, almost 12 years ago
Neat, but in practice, who would want to do this with an API rather than installable software?
ismaelc, almost 12 years ago
Hey, I've documented this on Mashape: https://www.mashape.com/ismaelc/extract-text-from-pdfs#!documentation
surapaneni, almost 12 years ago
This is similar to what we do at http://searchtower.com, where you can store, view, index and search the data.
architgupta, almost 12 years ago
Do you do OCR for text extraction?
ra, almost 12 years ago
Nice. Why no paid options? I'm guessing because this was a weekend project.

If so, nice work!
alkou, almost 12 years ago
Do you use pdftotext internally or something else?