I'm trying to figure out the best solution for the following:

* Daily import of ~500+ million rows of data, with ~250 million unique IDs.
* I need to keep only the latest X entries per unique ID; once an ID has X entries, older ones are discarded.
* Monthly, the entire dataset is read out for processing.

X can be anywhere from 1000 to 3000; it is static across the entire DB, and the exact value just depends on what we determine to be the best setting. Since I don't access the data more than once a day, or at the end of the month, I would prefer not to pay for storage. There are over a billion unique IDs, which I can partition by prefix or by range. Each individual entry per ID is fairly small: just an integer and two decimals.

What would you recommend as a data store for this?

Thanks!
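For scale, a rough back-of-envelope on the worst-case retained footprint (the bytes-per-row figure is my own guess, not a measurement):

    ids = 1_000_000_000       # "over a billion unique IDs"
    bytes_per_row = 20        # guess: an integer, two decimals, plus the key
    for x in (1000, 3000):    # the stated range for X
        terabytes = ids * x * bytes_per_row / 1e12
        print(f"X={x}: ~{terabytes:,.0f} TB retained")  # ~20 TB / ~60 TB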
Have a job queue running lower-priority DB queries. The queue workers select all records for a given ID, then prune off any records older than the latest X.

Insert a fresh record immediately, since you know it’s recent. Upon successful insert, fire off a queue request to go check on that ID.
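A minimal sketch of that pattern, with sqlite3 and an in-process queue.Queue as stand-ins for the real database and job queue (the schema and column names are made up):

    import queue
    import sqlite3

    X = 1000  # retention depth per ID; 1000-3000 per the question

    conn = sqlite3.connect(":memory:")  # stand-in for the real database
    conn.execute("CREATE TABLE entries (uid TEXT, seq INTEGER, a INTEGER, b REAL, c REAL)")

    prune_queue = queue.Queue()  # stand-in for the real job queue

    def insert(uid, seq, a, b, c):
        # Insert immediately -- a fresh row is by definition among the latest X.
        conn.execute("INSERT INTO entries VALUES (?, ?, ?, ?, ?)", (uid, seq, a, b, c))
        conn.commit()
        prune_queue.put(uid)  # fire off a low-priority check on this ID

    def prune(uid):
        # Keep only the latest X rows for this ID and delete the rest.
        conn.execute(
            "DELETE FROM entries WHERE uid = ? AND seq NOT IN "
            "(SELECT seq FROM entries WHERE uid = ? ORDER BY seq DESC LIMIT ?)",
            (uid, uid, X),
        )
        conn.commit()

    # A worker drains the queue at lower priority than the inserts:
    #   while True: prune(prune_queue.get())

One nice property of doing the prune out-of-band: the workers can deduplicate queued IDs, so a hot ID that gets many inserts in a day is only pruned once per pass.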
Storage is cheap and analysis is expensive; it sounds like you might be optimizing for the wrong variable. The data store of choice will likely depend on your access patterns during analysis.

Flat files are my best guess for your data? HDFS.
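A sketch of what prefix-partitioned flat files could look like, since the question mentions partitioning by prefix (local paths and CSV here purely for illustration; on HDFS the same layout would live under an hdfs:// root, probably in a columnar format):

    import csv
    import os
    from collections import defaultdict

    DATA_DIR = "data"  # hypothetical root directory

    def partition_path(uid: str) -> str:
        # One directory per two-character ID prefix.
        return os.path.join(DATA_DIR, uid[:2], "entries.csv")

    def append_rows(rows):
        # rows: iterable of (uid, int_value, decimal_1, decimal_2)
        by_partition = defaultdict(list)
        for row in rows:
            by_partition[partition_path(row[0])].append(row)
        for path, part_rows in by_partition.items():
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "a", newline="") as f:
                csv.writer(f).writerows(part_rows)

    append_rows([("ab12", 7, 1.25, 3.50), ("cd34", 9, 2.00, 0.75)])

The monthly job then just streams every partition sequentially, which is about the cheapest full-dataset read there is.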