科技回声

10 comments

programd, about 8 years ago

Tika has a REST server built in: <a href="https://wiki.apache.org/tika/TikaJAXRS" rel="nofollow">https://wiki.apache.org/tika/TikaJAXRS</a>. Is there some functionality that you need that is not covered there? Some specific document extraction feature?
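For anyone who wants to try the REST route, here is a minimal sketch of a client, assuming a Tika server is already running locally on its default port (9998). The `/tika` endpoint accepts the raw document in a PUT request and returns the extracted text; the helper names `build_tika_request` and `extract_text` are made up for this illustration.

```python
import urllib.request

# Assumption: a Tika server is listening locally on its default port, 9998.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request asking Tika for the plain-text body of a document."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path: str, url: str = TIKA_URL) -> str:
    """Send the file at `path` to the Tika server and return the extracted text."""
    with open(path, "rb") as f:
        request = build_tika_request(f.read(), url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `extract_text("report.pdf")` needs the server to be up. Whether OCR runs on scanned input depends on how the server is set up, so treat this as a starting point only.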

awinder, about 8 years ago

Elasticsearch has support for ingesting a bunch of document formats, if you're already using it or looking at using it in your stack: <a href="https://www.elastic.co/guide/en/elasticsearch/plugins/5.x/ingest-attachment.html" rel="nofollow">https://www.elastic.co/guide/en/elasticsearch/plugins/5.x/in...</a>
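As a rough sketch of what the ingest attachment plugin expects: you define an ingest pipeline with an `attachment` processor, then index documents whose source field holds the base64-encoded file. The code below only builds the two JSON bodies; the field name `data` and the helper names are assumptions chosen for illustration, and sending the requests (for example `PUT _ingest/pipeline/attachment`, then indexing with `?pipeline=attachment`) is left to whatever HTTP client you already use.

```python
import base64
import json


def pipeline_body(field: str = "data") -> str:
    """JSON body for an ingest pipeline that runs the attachment processor."""
    return json.dumps({
        "description": "Extract attachment information",
        "processors": [{"attachment": {"field": field}}],
    })


def document_body(raw: bytes, field: str = "data") -> str:
    """JSON body for a document: the raw file bytes, base64-encoded."""
    return json.dumps({field: base64.b64encode(raw).decode("ascii")})
```

After indexing through the pipeline, the extracted text should appear under the `attachment.content` field of the stored document.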


kognate, about 8 years ago

Yes, the Watson Document Conversion service (<a href="https://www.ibm.com/watson/developercloud/document-conversion.html" rel="nofollow">https://www.ibm.com/watson/developercloud/document-conversio...</a>) meets those requirements. It's not free, and it's not popular, but it's reliable.


zmix, about 8 years ago

I use 'poppler' or Apache's PDFBox for text extraction from PDF. Both can write HTML or their own XML format, and both keep the absolute positioning of the layout. For XML files, there is XSLT. A simple run with the default template will give you all the strings in the document; if you really want just the paragraph text, you will need to find or create an XSL transform. None of these is ready to go, but they are very close to it, especially poppler and PDFBox.
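To illustrate the difference between "all strings" and "just the paragraph text", here is a small sketch. It does not use XSLT: it swaps in Python's standard `xml.etree.ElementTree`, whose `itertext()` yields the same concatenated text nodes that the XSLT built-in template rules would output. The tag name `p` and both function names are assumptions for the example; a real document format will use its own element names.

```python
import xml.etree.ElementTree as ET


def all_text(xml_source: str) -> str:
    """Concatenate every text node in document order (like the XSLT defaults)."""
    root = ET.fromstring(xml_source)
    return "".join(root.itertext())


def paragraph_text(xml_source: str, tag: str = "p") -> list[str]:
    """Return the text of each element named `tag`, including nested inline markup."""
    root = ET.fromstring(xml_source)
    return ["".join(p.itertext()).strip() for p in root.iter(tag)]


doc = "<doc><title>T</title><p>Hello <b>world</b></p><p>Bye</p></doc>"
print(all_text(doc))        # THello worldBye
print(paragraph_text(doc))  # ['Hello world', 'Bye']
```

The first call shows why the default output is rarely what you want: the title runs straight into the body text. The second is the hand-written equivalent of a targeted transform.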

fpd4444, about 8 years ago

Thanks guys. But none of your suggestions solves the whole problem (some don't include OCR, some support only a limited set of file types, and so on). I'd like to have a black box that does everything for me (runs OCR if needed, extracts text from PDFs, DOCs, TXTs, and others). But I'm afraid there's no such solution...

rakoo, about 8 years ago

I'm afraid I may be late to the party, but I've seen <a href="https://github.com/openpaperwork/paperwork" rel="nofollow">https://github.com/openpaperwork/paperwork</a> before and it looked like a good solution for this. I've never tried it, though.


derwiki, about 8 years ago

We've been using Google Cloud Vision's OCR service with pretty good accuracy (varies from ~80-99%): <a href="https://cloud.google.com/vision/docs/" rel="nofollow">https://cloud.google.com/vision/docs/</a>

hbcondo714, about 8 years ago

We've used Aspose for manipulating PDFs, but they work with "over 100 file formats". They offer both SDKs and RESTful APIs: <a href="https://www.aspose.com" rel="nofollow">https://www.aspose.com</a>

sochix, about 8 years ago

Maybe this <a href="https://rawtext.ambar.cloud/" rel="nofollow">https://rawtext.ambar.cloud/</a> ?

assafmo, about 8 years ago

Tika, pdftotext, lynx (html), tesseract (ocr)
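A minimal sketch of how those four tools could be glued into something closer to a single black box: pick a command line based on the file extension and capture its standard output. It assumes `pdftotext`, `lynx`, `tesseract`, and `java` are on the PATH; the jar location `TIKA_APP_JAR` is a hypothetical path, and the function names are made up for this example.

```python
import os
import subprocess

# Hypothetical location of the tika-app jar; adjust to your installation.
TIKA_APP_JAR = "/opt/tika/tika-app.jar"

IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".tif", ".tiff")


def build_command(path: str) -> list[str]:
    """Choose an extraction command for `path` based on its extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".pdf":
        return ["pdftotext", path, "-"]           # "-" writes text to stdout
    if ext in (".html", ".htm"):
        return ["lynx", "-dump", "-nolist", path]  # rendered text, no link list
    if ext in IMAGE_EXTENSIONS:
        return ["tesseract", path, "stdout"]      # OCR, result on stdout
    # Everything else (doc, docx, odt, ...) falls through to Tika.
    return ["java", "-jar", TIKA_APP_JAR, "--text", path]


def extract_text(path: str) -> str:
    """Run the chosen tool and return whatever text it prints."""
    result = subprocess.run(
        build_command(path), capture_output=True, text=True, check=True
    )
    return result.stdout
```

This does not cover scanned PDFs: `pdftotext` returns little or nothing for image-only pages, so a fuller version would need to detect that case and hand the pages to `tesseract` instead.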


Ask HN: Is there a ready-to-go solution to parse documents content?