Ask HN: What would you use to query large (2-25TB) sets of read-only data?

2 points | by hectcastro | almost 15 years ago
The data set is written once a month in bulk, and read many times by different users. The last month of data (~2TB) is the hotspot. A year's worth is ~25TB.

The attributes of one record are as follows:

* 1-4 character string
* float
* float
* integer
* integer
* integer
* integer
* integer
* 1 character

For each 1-4 character string, there are many records -- sometimes several per second. As an example, in the span of a month, one of these strings can be associated with 18 million records. There are about 10,000 unique 1-4 character strings, but not all are as active as the previous example. The data is queried by two attributes: the 1-4 character string and a timestamp.

Potential solutions I've come up with (feel free to debate any of these):

* Put everything (or just the hotspot) in a MyISAM compressed database.
* Put everything (or just the hotspot) in an InnoDB database with a proper clustered index.
* Put everything (or just the hotspot) into CouchDB with proper views.
* Put everything (or just the hotspot) into MongoDB with proper indexes.
* Put everything (or just the hotspot) into Redis ZSETs with timestamp as SCORE, distributed across nodes (sketched after this list).
* Load all of the data into a long-running Hadoop job.

Feel free to ask any questions too.
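Of the listed options, the Redis one maps most directly onto the stated access path (lookup by 1-4 character string plus a timestamp range). Below is a minimal sketch of that option, assuming the redis-py client; the key naming scheme (records:<symbol>) and the fixed-width struct layout are illustrative choices, not from the post:

    import struct

    import redis

    r = redis.Redis(host="localhost", port=6379)

    # Hypothetical fixed-width layout: timestamp, two floats, five ints, one char.
    # Embedding the timestamp in the member keeps members distinct even when
    # the other attribute values repeat (ZSET members must be unique).
    RECORD = struct.Struct("<dffiiiiic")

    def write_record(symbol, ts, f1, f2, i1, i2, i3, i4, i5, flag):
        # One sorted set per 1-4 character string; SCORE = timestamp.
        member = RECORD.pack(ts, f1, f2, i1, i2, i3, i4, i5, flag.encode())
        r.zadd(f"records:{symbol}", {member: ts})

    def query_range(symbol, t_start, t_end):
        # ZRANGEBYSCORE returns every member whose score (timestamp) falls in
        # [t_start, t_end] -- exactly the string + timestamp query pattern.
        return [RECORD.unpack(m)
                for m in r.zrangebyscore(f"records:{symbol}", t_start, t_end)]

As the option itself notes, this only works distributed: at ~37 bytes of payload per record plus Redis's per-member overhead, an 18-million-record string alone is sizable, and the ~2TB monthly hotspot would have to be sharded by string across many nodes.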

1 comment

nolite | almost 15 years ago
It's kinda lower level, but look into FastBit: https://sdm.lbl.gov/fastbit/
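FastBit builds compressed bitmap indexes, which suit this workload: the data is write-once, and the 1-4 character string is a low-cardinality column (~10,000 distinct values). A toy Python sketch of the underlying idea follows -- it is not FastBit's actual API (FastBit itself is a C++ library) -- showing one bitmap per distinct string value, so a point query reads a bitmap instead of scanning rows:

    from collections import defaultdict

    class BitmapIndex:
        """Toy equality-encoded bitmap index over one column. FastBit layers
        compression (WAH) and further encodings on top of this basic idea."""

        def __init__(self, values):
            # One bit vector per distinct value; a Python int serves as the
            # bitmap, with bit i set when row i holds that value.
            self.bitmaps = defaultdict(int)
            for row, value in enumerate(values):
                self.bitmaps[value] |= 1 << row

        def rows_equal(self, value):
            # Yield the row ids whose column equals `value`.
            bits, row = self.bitmaps.get(value, 0), 0
            while bits:
                if bits & 1:
                    yield row
                bits >>= 1
                row += 1

    # Usage: index the string column once at load time, then answer point
    # queries without touching the other eight attributes.
    idx = BitmapIndex(["AAPL", "GOOG", "AAPL", "MSFT"])
    print(list(idx.rows_equal("AAPL")))  # -> [0, 2]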