Ask HN: Is there a ready-to-go solution to parse documents content?

27 points by fpd4444, about 8 years ago
Hey there,

I'm looking for a ready-to-go solution to parse documents (such as PDF, DOCX, PPTX and others). By 'parse' I mean text extraction, including OCR if needed. I know about Tika and have already tried it, but are there any more reliable alternatives, maybe based on Tika? I'd like to interact with it via a REST API.

Thanks

10 comments

programd, about 8 years ago
Tika has a REST server built in: https://wiki.apache.org/tika/TikaJAXRS

Is there some functionality that you need that is not covered there? Some specific document extraction feature?
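As an illustration of talking to that built-in server, here is a minimal Python sketch. It assumes a Tika server is already running locally on its default port 9998, and that its `/tika` endpoint accepts the raw document via PUT and returns plain text when asked with `Accept: text/plain`. The helper names `build_tika_request` and `extract_text` are made up for this example.

```python
import urllib.request

# Assumption: a Tika server running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(document: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build (but do not send) a PUT request carrying the raw document bytes."""
    return urllib.request.Request(
        url,
        data=document,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path: str, url: str = TIKA_URL) -> str:
    """Send a file to the Tika server and return the extracted text."""
    with open(path, "rb") as fh:
        request = build_tika_request(fh.read(), url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `extract_text("report.pdf")` (a hypothetical file) would then return whatever text the server extracts. Whether OCR is applied depends on how the server is set up.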
awinder, about 8 years ago
Elasticsearch supports ingesting a number of document formats, if you're already using it or considering it for your stack:

https://www.elastic.co/guide/en/elasticsearch/plugins/5.x/ingest-attachment.html
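A rough sketch of what the ingest attachment plugin expects, assuming the plugin is installed: a pipeline containing an `attachment` processor that reads a base64-encoded field, and documents that carry the file in that field. The field name `data` and the function names are choices made for this example.

```python
import base64
import json


def attachment_pipeline() -> dict:
    """Body for creating an ingest pipeline that runs the attachment processor on `data`."""
    return {
        "description": "Extract attachment information",
        "processors": [{"attachment": {"field": "data"}}],
    }


def document_body(raw: bytes) -> dict:
    """Body for indexing a file through that pipeline; the file must be base64-encoded."""
    return {"data": base64.b64encode(raw).decode("ascii")}


if __name__ == "__main__":
    print(json.dumps(attachment_pipeline(), indent=2))
    print(json.dumps(document_body(b"hello world")))
```

The first body would be sent with `PUT _ingest/pipeline/<name>` and the second indexed with `?pipeline=<name>`; the extracted text should then appear under the document's `attachment` field.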
kognate, about 8 years ago
Yes, the Watson Document Conversion service (https://www.ibm.com/watson/developercloud/document-conversion.html) meets those requirements. It's not free, and it's not popular, but it's reliable.
zmix, about 8 years ago
I use 'poppler' or Apache's PDFBox for text extraction from PDF. They both can write HTML or their own XML format. In addition, they keep the absolute positioning of the layout.

For XML files, there is XSLT. A simple run with the default template will give you all strings in the document; if you really want just the paragraph text, you will need to find or create an XSL transform.

None of these is ready to go, but they are very close to it, especially poppler and PDFBox.
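To show the difference between "all strings" and "just the paragraph text", here is a small Python sketch. It does not run XSLT; it uses the standard library's ElementTree to approximate what the built-in template rules produce (every text node, in document order) and what a targeted transform would keep. The tag name `p` is a hypothetical paragraph element.

```python
import xml.etree.ElementTree as ET
from typing import List


def all_text(xml_source: str) -> str:
    """Concatenate every text node in document order, ignoring tags and attributes."""
    root = ET.fromstring(xml_source)
    return "".join(root.itertext())


def paragraph_text(xml_source: str, tag: str = "p") -> List[str]:
    """Keep only the text inside elements named `tag` (hypothetical paragraph tag)."""
    root = ET.fromstring(xml_source)
    return ["".join(el.itertext()) for el in root.iter(tag)]


if __name__ == "__main__":
    sample = "<doc><title>Hi</title><p>a <b>b</b> c</p></doc>"
    print(all_text(sample))        # title and paragraph text run together
    print(paragraph_text(sample))  # only the paragraph
```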
fpd4444, about 8 years ago
Thanks guys. But none of your suggestions solves the whole problem (some don't include OCR, some support only a limited set of file types, and so on). I'd like to have a black box that does everything for me (does OCR if needed, extracts text from PDFs, DOCs, TXTs and others).

But I'm afraid there's no such solution...
rakoo, about 8 years ago
I'm afraid I may be late to the party, but I've seen https://github.com/openpaperwork/paperwork before and it looked like a good solution for this. I've never tried it, though.
derwiki, about 8 years ago
We've been using Google Cloud Vision's OCR service with pretty good accuracy (varies from ~80-99%).

https://cloud.google.com/vision/docs/
hbcondo714, about 8 years ago
We've used Aspose for manipulating PDFs, but it works with "over 100 file formats". It offers both SDKs and RESTful APIs.

https://www.aspose.com
sochix, about 8 years ago
Maybe this: https://rawtext.ambar.cloud/ ?
assafmo, about 8 years ago
Tika, pdftotext, lynx (HTML), tesseract (OCR)
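One way to glue those four tools together is a small dispatcher keyed on file extension. The Python sketch below assumes `pdftotext`, `lynx`, `tesseract` and `java` are on the PATH; the jar path `tika-app.jar` and the extension lists are placeholders chosen for this example, not part of any of these tools.

```python
import os
import subprocess
from typing import List

IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}
TIKA_APP_JAR = "tika-app.jar"  # hypothetical path to a downloaded tika-app jar


def command_for(path: str) -> List[str]:
    """Pick a command line that prints the file's text to stdout."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".pdf":
        # "-" as the output file makes pdftotext write to stdout.
        return ["pdftotext", "-layout", path, "-"]
    if ext in {".html", ".htm"}:
        # -dump renders the page as plain text; -nolist drops the link list.
        return ["lynx", "-dump", "-nolist", path]
    if ext in IMAGE_EXTENSIONS:
        # "stdout" as the output base makes tesseract print the OCR result.
        return ["tesseract", path, "stdout"]
    # Everything else (docx, pptx, ...) falls through to Tika.
    return ["java", "-jar", TIKA_APP_JAR, "--text", path]


def extract_text(path: str) -> str:
    """Run the chosen tool and return whatever it printed."""
    result = subprocess.run(
        command_for(path), capture_output=True, text=True, check=True
    )
    return result.stdout
```

This does not cover the "OCR if needed" case for scanned PDFs: those would first have to be rendered to images before being passed to tesseract.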