Hey there<p>I'm looking for a ready-to-go solution to parse documents (such as pdf, docx, pptx and others). By 'parse' I mean text extraction, including OCR if needed.
I know about Tika and tried it already, but are there any more reliable alternatives, maybe based on Tika?
I'd like to interact with it via a REST API.<p>Thanks
Tika has a REST server built in <a href="https://wiki.apache.org/tika/TikaJAXRS" rel="nofollow">https://wiki.apache.org/tika/TikaJAXRS</a><p>Is there some functionality that you need and is not covered there? Some specific document extraction feature?
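For anyone who wants to try the Tika server route, here is a minimal sketch of a Python client. It assumes a tika-server instance is running locally on its default port 9998; the helper names `build_tika_request` and `extract_text` are made up for this example. As far as I know, OCR only kicks in when Tesseract is installed on the machine running Tika.

```python
import urllib.request

# Assumption: a local tika-server listening on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request asking Tika for the plain-text content of a document."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={
            "Accept": "text/plain",
            # Generic content type, intended to leave type detection to Tika.
            "Content-Type": "application/octet-stream",
        },
    )


def extract_text(path: str, url: str = TIKA_URL) -> str:
    """Send the file at `path` to the Tika server and return the extracted text."""
    with open(path, "rb") as fh:
        request = build_tika_request(fh.read(), url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `extract_text("report.pdf")` (a hypothetical file) should return the document body as a string if the server is reachable.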
ElasticSearch has support for ingesting a bunch of document formats if you're already using it / looking at using it in your stack:<p><a href="https://www.elastic.co/guide/en/elasticsearch/plugins/5.x/ingest-attachment.html" rel="nofollow">https://www.elastic.co/guide/en/elasticsearch/plugins/5.x/in...</a>
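A rough sketch of what the ingest attachment plugin expects, in case it helps: a pipeline with an `attachment` processor, and documents that carry the file as base64 in the field the processor reads. The function names below are hypothetical, and the field name `data` is just the one I picked for the example.

```python
import base64
import json


def attachment_pipeline(field: str = "data") -> dict:
    """Pipeline body to PUT to _ingest/pipeline/<pipeline-name>."""
    return {
        "description": "Extract attachment information",
        "processors": [{"attachment": {"field": field}}],
    }


def attachment_document(raw: bytes, field: str = "data") -> dict:
    """Document body: the raw file bytes, base64-encoded, in the processor's field."""
    return {field: base64.b64encode(raw).decode("ascii")}


if __name__ == "__main__":
    print(json.dumps(attachment_pipeline(), indent=2))
    print(json.dumps(attachment_document(b"hello"), indent=2))
```

You would index the document with `?pipeline=<pipeline-name>` on the request; the extracted text should then show up under `attachment.content` in the stored document.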
Yes, the Watson Document Conversion service (<a href="https://www.ibm.com/watson/developercloud/document-conversion.html" rel="nofollow">https://www.ibm.com/watson/developercloud/document-conversio...</a>) meets those requirements. It's not free, and it's not popular, but it's reliable.
I use 'poppler' or Apache's PDFBox for text extraction from PDF. They both can write HTML or their own XML format. In addition, they keep the absolute positioning of the layout.<p>For XML files, there is XSLT. A simple run with the default template will give you all the strings in the document; if you really want just the paragraph text, you will need to find or create an XSL transform.<p>None of these is ready to go, but they are very close to it, especially poppler and PDFBox.
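To make the two steps above concrete, a small sketch in Python. It assumes poppler's `pdftohtml` tool is on the PATH, and it uses the standard library's ElementTree instead of a real XSLT processor: `itertext()` collects every text node, which is roughly what the default XSLT templates would give you. The function names are made up for the example.

```python
import subprocess
import xml.etree.ElementTree as ET


def pdf_to_xml(path: str) -> str:
    """Run poppler's pdftohtml in XML mode and return what it writes to stdout."""
    result = subprocess.run(
        # -xml: XML output with positions, -i: ignore images, -stdout: no output file
        ["pdftohtml", "-xml", "-i", "-stdout", path],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def all_strings(xml_text: str) -> str:
    """Concatenate every text node of a well-formed XML document, in document order."""
    root = ET.fromstring(xml_text)
    return "".join(root.itertext())


if __name__ == "__main__":
    sample = "<doc><p>Hello <b>world</b></p><p>!</p></doc>"  # hand-written sample
    print(all_strings(sample))
```

The output of `pdf_to_xml` may need some cleanup before it parses as well-formed XML, so treat the two functions as separate steps and not as a guaranteed pipeline.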
Thanks guys. But none of your suggestions solves the whole problem (some don't include OCR, some support only a limited set of file types, and so on). I'd like to have a black box that does everything for me (does OCR if needed, extracts text from PDFs, DOCs, TXTs and others).<p>But I'm afraid there's no such solution...
I'm afraid I may be late to the party, but I've seen <a href="https://github.com/openpaperwork/paperwork" rel="nofollow">https://github.com/openpaperwork/paperwork</a> before and it looked like a good solution for this. Never tried it, though.
We've been using Google Cloud Vision's OCR service with pretty good accuracy (varies from ~80-99%).<p><a href="https://cloud.google.com/vision/docs/" rel="nofollow">https://cloud.google.com/vision/docs/</a>
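For context, a sketch of the JSON body that the Vision REST endpoint `images:annotate` takes for OCR. The helper name `build_ocr_request` is hypothetical, and authentication (API key or OAuth token) is left out on purpose.

```python
import base64

VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"


def build_ocr_request(image_bytes: bytes, dense_text: bool = True) -> dict:
    """Build the request body for one image.

    DOCUMENT_TEXT_DETECTION targets dense text such as scanned pages;
    TEXT_DETECTION targets sparse text in photos.
    """
    feature = "DOCUMENT_TEXT_DETECTION" if dense_text else "TEXT_DETECTION"
    return {
        "requests": [
            {
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                "features": [{"type": feature}],
            }
        ]
    }
```

The body is POSTed as JSON to `VISION_ENDPOINT`. Note that this works on images, so PDFs and office documents would have to be rendered to images first.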
We've used Aspose for manipulating PDFs, but they work with "over 100 file formats". They offer both SDKs and RESTful APIs.<p><a href="https://www.aspose.com" rel="nofollow">https://www.aspose.com</a>