Ask HN: Good tools for text extraction from PDF

3 points by lucasrp, about 11 years ago

Hi guys,

I need a tool that converts PDF to HTML files. Since I work with public documents, the layout of the PDFs can sometimes be pretty nasty (I've attached some links at the end of this post).

We have an in-house solution that was forked several years ago from Apache PDFBox. After a while we realized that forking an open source solution isn't the best answer, but we kept going because it worked.

Does anyone have suggestions? We are willing to contribute to the open source project we choose :)

Many thanks!

https://www.evernote.com/shard/s226/sh/17b87c1f-8f18-4b23-96ac-a9fbc2ac8502/ea5618043f3a9c818071bd93df9f74c3

https://www.evernote.com/shard/s226/sh/17b87c1f-8f18-4b23-96ac-a9fbc2ac8502/ea5618043f3a9c818071bd93df9f74c3

2 comments

maxerickson, about 11 years ago

I've had good luck with the tools that come with xpdf:

http://www.foolabs.com/xpdf/about.html

But some of that is because the source I was pulling text from didn't change the document format much from month to month.

I guess it is the library underneath jeffmould's link.

jeffmould, about 11 years ago

I have used the following with some success:

http://pdftohtml.sourceforge.net/

Not sure how well maintained it still is, but it did a good job of converting basic PDF files to HTML.

There is also a Google Code product for going from HTML to PDF which works pretty well.
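
For anyone who wants to try the xpdf tools mentioned above from a script, here is a rough sketch of calling `pdftotext` from Python. It assumes `pdftotext` is installed and on the PATH. The function names `build_pdftotext_cmd` and `extract_text` are made up for illustration; `-layout`, `-enc UTF-8`, and `-` (write to stdout) are regular `pdftotext` options. It produces plain text rather than HTML, so treat it as a starting point for checking how a given document extracts, not as a full answer to the original question.

```python
import shutil
import subprocess


def build_pdftotext_cmd(pdf_path, keep_layout=True):
    """Build the argument list for pdftotext, sending the text to stdout."""
    cmd = ["pdftotext"]
    if keep_layout:
        # Ask pdftotext to keep the physical page layout as far as it can.
        cmd.append("-layout")
    # Request UTF-8 output so the bytes can be decoded predictably.
    cmd += ["-enc", "UTF-8"]
    # A "-" in place of the output file name means "write to stdout".
    cmd += [str(pdf_path), "-"]
    return cmd


def extract_text(pdf_path, keep_layout=True):
    """Run pdftotext on one file and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, keep_layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `extract_text("sample.pdf")` (a hypothetical file name) returns the text of the document as one string. Running it with and without `keep_layout` on a document with a nasty layout is a quick way to see which mode gives more usable output.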