科技回声

Ask HN: Is there anything similar to DocumentCloud for non-journalists?

1 point by bglenn09, over 13 years ago
I'm looking for a way to store PDF documents in the cloud and search them. DocumentCloud looks perfect, but I'm not in journalism, and I'm having trouble finding an obvious alternative. Do you guys like any other services, or know of a simple way to do this with NoSQL? I was looking into using a hosted MongoDB service, but I can't find any information on searching binary data. Thanks for any pointers.

1 comment

Skywing, over 13 years ago
You're not going to be able to simply upload a PDF and search for text in the raw file data. The text inside a PDF is typically compressed or encoded, so it isn't readable as-is. You'll have to either use a tool to extract the embedded text, or perform OCR on the document if it's image-only. A really good tool that I have used before is Aspose. If you are allowing users to upload these PDFs, you'd also need some sort of distributed task queue, because the PDF processing is not something you want the user to have to wait on. I've used RabbitMQ for this and haven't had many issues. Once you have extracted the text (or OCR'd the document), you can store the text alongside the original document in a database like MongoDB. You would probably also benefit from a full-text search engine, like ElasticSearch.
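
To make the flow above concrete, here is a minimal sketch in Python using only the standard library. Everything in it is a stand-in, not the real tools: `queue.Queue` plus a worker thread plays the role of a task queue like RabbitMQ, a dict plays the role of MongoDB, and a tiny inverted index plays the role of ElasticSearch. `extract_text`, `DocumentStore`, `worker` and `run_demo` are hypothetical names, and `extract_text` just decodes bytes; a real version would call a PDF text extractor or an OCR tool at that point.

```python
import queue
import re
import threading
from collections import defaultdict


def extract_text(raw: bytes) -> str:
    # Hypothetical placeholder: a real system would run a PDF text
    # extractor or OCR here. The demo "documents" are plain bytes.
    return raw.decode("utf-8", errors="ignore")


class DocumentStore:
    """Keeps the original bytes, the extracted text, and an inverted index."""

    def __init__(self):
        self.docs = {}                 # doc_id -> {"raw": bytes, "text": str}
        self.index = defaultdict(set)  # token -> set of doc_ids
        self.lock = threading.Lock()

    def add(self, doc_id, raw, text):
        tokens = set(re.findall(r"\w+", text.lower()))
        with self.lock:
            self.docs[doc_id] = {"raw": raw, "text": text}
            for token in tokens:
                self.index[token].add(doc_id)

    def search(self, query):
        """Return the ids of documents containing every word in the query."""
        tokens = re.findall(r"\w+", query.lower())
        if not tokens:
            return set()
        with self.lock:
            hits = [self.index.get(token, set()) for token in tokens]
            return set.intersection(*hits)


def worker(jobs, store):
    """Process upload jobs in the background so the uploader never waits."""
    while True:
        job = jobs.get()
        if job is None:  # sentinel value: stop the worker
            jobs.task_done()
            break
        doc_id, raw = job
        store.add(doc_id, raw, extract_text(raw))
        jobs.task_done()


def run_demo():
    jobs, store = queue.Queue(), DocumentStore()
    threading.Thread(target=worker, args=(jobs, store), daemon=True).start()
    # "Uploads" return immediately; extraction happens on the worker thread.
    jobs.put(("lease.pdf", b"Lease agreement for the office on Main Street"))
    jobs.put(("invoice.pdf", b"Invoice for office chairs"))
    jobs.put(None)
    jobs.join()  # wait until every queued job has been processed
    return store
```

Calling `run_demo().search("office lease")` looks up only the extracted text, never the raw bytes, which is the whole point: the original file is kept for download, and the search runs against text produced by the background step.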