Ask HN: Storing and processing less than 1TB of unstructured data?

2 points by johnnycarcin over 5 years ago

Many HN folks deal with data problems, so I thought I'd ask: if you had to store and index less than 1TB of unstructured (plain text) data, what would you use?

I have a bunch of text files and HTML pages that I'd like to dump into something and then be able to search over it, maybe even be able to find relationships (common terms, phrases, etc.) between the various docs. I've heard of things like Hadoop, but that seems to be overkill for the amount of data I have. I'd also like to keep things as low-cost as possible as this is just for personal use. I've looked at a few of the cloud providers but am honestly not sure what I'm looking for, so I find myself walking away more confused than when I started.

This seems like an easy problem, but for whatever reason I'm getting wrapped around the axle on it.

3 comments

dekhn over 5 years ago
I recommend the book "Managing Gigabytes", which, while dated, is still relevant. The title doesn't indicate this, but it's heavily focused on data structures for indexing text documents.

But Elasticsearch running on a cloud VM with an attached EBS volume would be a fast way to get work done.
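
For a feel of the kind of data structure that comment is pointing at, here is a minimal sketch of an inverted index in plain Python. It is not taken from the book or from Elasticsearch; the names `tokenize`, `build_index`, and `search`, and the three sample documents, are made up for illustration, and a real engine adds ranking, compression, and on-disk storage on top of this idea.

```python
import re
from collections import defaultdict

# Hypothetical tokenizer: lowercase, keep runs of letters and digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return TOKEN_RE.findall(text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term in the query (AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Made-up sample documents, keyed by file name.
docs = {
    "a.txt": "Hadoop is overkill for small data",
    "b.html": "Elasticsearch indexes plain text data",
    "c.txt": "Postgres can store JSON and plain text",
}
index = build_index(docs)
```

With this in place, `search(index, "plain text")` returns the ids of the documents that contain both terms, and the posting sets themselves (`index["data"]`, for example) are a crude way to see which documents share a term.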
1e10 over 5 years ago
1 TB is nothing these days. If you insist on cloud, Hetzner could be the best bang for the buck. Otherwise, a similar desktop system can be acquired for less than 1000 USD.

I'd start with Solr or Elasticsearch and a simple indexing script (a home-rolled Python script).

Then you can use the Solr admin UI or something like Jupyter for iterative querying.

I'm not an expert on index tuning, but you might even be able to dump it all into Postgres with JSON types.

Best of luck!
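
A home-rolled indexing script along those lines could look roughly like the sketch below. It only uses the standard library: it walks a directory, strips markup from HTML files, and produces newline-delimited JSON in the action-line-plus-document-line shape that the Elasticsearch bulk API accepts. The index name `docs`, the field names `path` and `body`, the file extensions, and all function names are assumptions picked for the example, not anything from the thread.

```python
import json
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html):
    """Strip tags and collapse whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def read_document(path):
    """Read a file as text; HTML files are reduced to their visible text."""
    path = Path(path)
    raw = path.read_text(encoding="utf-8", errors="replace")
    if path.suffix.lower() in (".html", ".htm"):
        return html_to_text(raw)
    return raw


def bulk_lines(root, index_name="docs"):
    """Yield NDJSON lines: one action line, then one document line, per file."""
    root = Path(root)
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in (".txt", ".html", ".htm"):
            rel = path.relative_to(root).as_posix()
            # Assumed layout: the relative path doubles as the document id.
            yield json.dumps({"index": {"_index": index_name, "_id": rel}})
            yield json.dumps({"path": rel, "body": read_document(path)})
```

To use it, join the lines with `"\n"`, add a trailing newline, and POST the result to the `_bulk` endpoint with `Content-Type: application/x-ndjson`. For anything close to 1 TB you would send it in batches rather than as one request.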
johnnycarcin over 5 years ago
Coming back, I stumbled over this while looking at options: https://docs.alephdata.org/. It is a bit more heavyweight than plain Elasticsearch, but it has some nice additions that might make it worth it depending on your situation.