Ask HN: Is there anything similar to DocumentCloud for non-journalists?

1 point by bglenn09, over 13 years ago
I'm looking for a way to store PDF documents in the cloud and search them. DocumentCloud looks perfect, but I'm not in journalism, and I'm having trouble finding an obvious alternative. Do you guys like any other services, or know of a simple way to do this with NoSQL? I was looking into using a hosted MongoDB service, but I can't find any information on searching binary data. Thanks for any pointers.

1 comment

Skywing, over 13 years ago
You're not going to be able to simply upload a PDF and search for text in the raw file data, because the raw bytes aren't readable as text. You'll have to either use a tool to extract the embedded text, or run OCR on the document if it's image-only. A really good tool that I have used before is Aspose. If you are allowing users to upload these PDFs, you'd also need some sort of distributed task queue, because the PDF processing is not something you want the user to wait on. I've used RabbitMQ for this and haven't had many issues. Once you have extracted the text (by OCR if necessary), you can store it alongside the native document in a database like MongoDB. You might even benefit from a full-text search engine, like Elasticsearch.
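The pipeline described above (queue the upload, extract the text in the background, store the text next to the original file, then search the text) can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not a production design: `extract_text`, `DocumentStore`, `worker`, and `process_uploads` are hypothetical names, an in-process `queue.Queue` stands in for a real broker such as RabbitMQ, a dictionary-based inverted index stands in for MongoDB or a full-text search engine, and the "PDFs" are plain UTF-8 bytes so the sketch runs without a real extractor or OCR tool.

```python
import queue
import re
import threading
from collections import defaultdict


def extract_text(pdf_bytes: bytes) -> str:
    """Hypothetical stand-in for a real text extractor or OCR step.

    The input is treated as UTF-8 text so the sketch has no dependencies.
    """
    return pdf_bytes.decode("utf-8", errors="ignore")


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class DocumentStore:
    """Keeps the original bytes, the extracted text, and an inverted index."""

    def __init__(self):
        self.originals = {}             # doc id -> original file bytes
        self.texts = {}                 # doc id -> extracted text
        self.index = defaultdict(set)   # word -> ids of docs containing it
        self._lock = threading.Lock()

    def add(self, doc_id, pdf_bytes, text):
        with self._lock:
            self.originals[doc_id] = pdf_bytes
            self.texts[doc_id] = text
            for word in tokenize(text):
                self.index[word].add(doc_id)

    def search(self, query):
        """Return the sorted ids of documents containing every query word."""
        words = tokenize(query)
        if not words:
            return []
        with self._lock:
            hits = set.intersection(
                *(set(self.index.get(word, set())) for word in words)
            )
        return sorted(hits)


def worker(jobs, store):
    """Background worker: pull uploads off the queue and index them."""
    while True:
        job = jobs.get()
        if job is None:                 # sentinel: no more work
            jobs.task_done()
            break
        doc_id, pdf_bytes = job
        store.add(doc_id, pdf_bytes, extract_text(pdf_bytes))
        jobs.task_done()


def process_uploads(uploads):
    """Queue every upload, wait for the worker to finish, return the store."""
    store = DocumentStore()
    jobs = queue.Queue()
    threading.Thread(target=worker, args=(jobs, store), daemon=True).start()
    for doc_id, data in uploads.items():
        jobs.put((doc_id, data))        # the uploader returns immediately
    jobs.put(None)
    jobs.join()                         # only the demo waits; a web app would not
    return store
```

In a real deployment the upload handler would only enqueue the job and return, and a separate worker process would do the extraction; the search step would query the stored text rather than the binary file, which is the point made in the comment above.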