Does anyone know of a key-value store database/service with the ability to look up a value at a particular timestamp in the past? Something similar to Redis (or even simpler), but with a time dimension added.<p>For example:<p>- At time 1: DB.set('foo', 'bar')<p>- At time 5: DB.set('foo', 'club')<p>- Then: DB.get('foo', 4) should return 'bar' (4 refers to the timestamp)<p>- DB.get('foo') should return 'club'<p>This database/service would help us with a particular problem we're facing at work. We looked around but have yet to find anything similar.<p>We're thinking of writing it ourselves (a service on top of an existing K-V NoSQL database like Cassandra/Redis/LevelDB/etc.), but we'd much prefer to use an existing solution.
It's not exactly a key-value store, but have a look at this:<p><a href="https://en.wikipedia.org/wiki/Temporal_database" rel="nofollow">https://en.wikipedia.org/wiki/Temporal_database</a><p>Though, if I personally had this requirement, I would just roll my own, as it sounds very basic and easy to represent in a plain RDBMS. No way would I base my DB choice/design or infrastructure implementation on something this small.
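To sketch the plain-RDBMS idea concretely, assuming nothing more than a (key, ts, value) table (sqlite3 here for illustration; the table name is made up):<p><pre><code>import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE kv_history (
    key   TEXT    NOT NULL,
    ts    INTEGER NOT NULL,
    value TEXT    NOT NULL,
    PRIMARY KEY (key, ts))""")

def set_value(key, value, ts):
    conn.execute("INSERT OR REPLACE INTO kv_history VALUES (?, ?, ?)",
                 (key, ts, value))

def get_value(key, ts=None):
    # Newest value at or before ts (current value when ts is None).
    q, args = "SELECT value FROM kv_history WHERE key = ?", [key]
    if ts is not None:
        q, args = q + " AND ts <= ?", args + [ts]
    row = conn.execute(q + " ORDER BY ts DESC LIMIT 1", args).fetchone()
    return row[0] if row else None

set_value('foo', 'bar', 1)
set_value('foo', 'club', 5)
assert get_value('foo', 4) == 'bar'
assert get_value('foo') == 'club'</code></pre>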
Here's an implementation in Python built on top of Redis using sorted sets:<p><pre><code>import datetime

class TimeStore:
    def __init__(self, redis_client):
        self.redis_client = redis_client

    @staticmethod
    def timestamp(time):
        # Seconds since the Unix epoch; expects naive UTC datetimes.
        return (time - datetime.datetime(1970, 1, 1)).total_seconds()

    def set_value(self, key, value, time=None):
        if time is None:
            time = datetime.datetime.utcnow()
        # redis-py 3.x takes a {member: score} mapping (older versions took
        # positional score/member arguments). Caveat: sorted-set members are
        # unique, so setting a key back to a value it already held moves that
        # member's score instead of recording a second version.
        self.redis_client.zadd(key, {value: self.timestamp(time)})

    def get_value(self, key, time=None):
        if time is None:
            time = datetime.datetime.utcnow()
        # Highest-scored member at or below the given time, if any.
        items = self.redis_client.zrevrangebyscore(
            key, self.timestamp(time), 0, start=0, num=1)
        return items[0] if items else None

    def cleanup_before(self, key, time):
        self.redis_client.zremrangebyscore(key, 0, self.timestamp(time))</code></pre>
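Usage would look something like this (assuming a redis-py client; note redis returns bytes unless you pass decode_responses=True):<p><pre><code>import datetime
import redis

store = TimeStore(redis.StrictRedis())
store.set_value('foo', 'bar', datetime.datetime(2014, 1, 1))
store.set_value('foo', 'club', datetime.datetime(2014, 1, 5))
store.get_value('foo', datetime.datetime(2014, 1, 4))  # b'bar'
store.get_value('foo')                                 # b'club'</code></pre>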
Cassandra is very good at storing ordered sets of data, from which you can then pull out the value nearest to a given key. There are a number of layers on top of it that are more specifically time-series oriented, and it probably makes sense to use one of those rather than building your own:<p>- <a href="https://code.google.com/p/kairosdb/" rel="nofollow">https://code.google.com/p/kairosdb/</a><p>- <a href="https://github.com/OpenNMS/newts" rel="nofollow">https://github.com/OpenNMS/newts</a><p>- <a href="https://github.com/rackerlabs/blueflood" rel="nofollow">https://github.com/rackerlabs/blueflood</a><p>- <a href="https://github.com/pyr/cyanite" rel="nofollow">https://github.com/pyr/cyanite</a>
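If you do roll your own on raw Cassandra, the pattern is a wide row per key, clustered by timestamp. A minimal sketch with the DataStax Python driver (keyspace/table names are made up):<p><pre><code>import datetime
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('mykeyspace')
session.execute("""CREATE TABLE IF NOT EXISTS kv_history (
    key text, ts timestamp, value text,
    PRIMARY KEY (key, ts)) WITH CLUSTERING ORDER BY (ts DESC)""")

# Every write is a new row, clustered by timestamp inside the key's partition.
session.execute("INSERT INTO kv_history (key, ts, value) VALUES (%s, %s, %s)",
                ('foo', datetime.datetime(2014, 1, 1), 'bar'))

# Newest value at or before a given time: one partition, one row read.
rows = session.execute(
    "SELECT value FROM kv_history WHERE key = %s AND ts <= %s LIMIT 1",
    ('foo', datetime.datetime(2014, 1, 4)))
for row in rows:
    print(row.value)  # 'bar'</code></pre>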
You may also want to consider the Aerospike database.
Aerospike is an open-source database known for its high performance. If you can represent your timestamp as an integer, you can build a secondary index on it and do a range query.<p>The range definition looks something like this:
Filter.range(name, begin, end)
More details: <a href="http://www.aerospike.com/docs/client/java/usage/query/query.html" rel="nofollow">http://www.aerospike.com/docs/client/java/usage/query/query....</a><p>The advantage of Aerospike is that it's going to be super fast, as the index is co-located with the data, and the secondary indexes are always updated inline with writes. So, the indexes always reflect the current state of truth. A secondary index query follows a scatter-gather approach, achieving distributed parallelism. More details: <a href="http://www.aerospike.com/docs/architecture/secondary-index.html" rel="nofollow">http://www.aerospike.com/docs/architecture/secondary-index.h...</a><p>On top of this, if you need to perform an aggregation on the secondary query result, you can use Aerospike's Distributed Streaming Aggregation framework (<a href="http://www.aerospike.com/docs/guide/aggregation.html" rel="nofollow">http://www.aerospike.com/docs/guide/aggregation.html</a>), where you can do the aggregation on the server side itself without pulling all the data to the client. Only the final aggregation happens on the client.
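For the OP's use case that would look roughly like this with the Aerospike Python client (namespace/set/bin names are made up; note that a query takes a single predicate, so the logical key is filtered client-side here):<p><pre><code>import aerospike
from aerospike import predicates as p

client = aerospike.client({'hosts': [('127.0.0.1', 3000)]}).connect()
client.index_integer_create('test', 'kv_history', 'ts', 'ts_idx')

# One record per (key, timestamp) version; 'ts' is the indexed integer bin.
client.put(('test', 'kv_history', 'foo:1'),
           {'k': 'foo', 'ts': 1, 'value': 'bar'})

# Range query on the secondary index, then filter on the logical key
# (or push that into a stream UDF / aggregation on the server side).
query = client.query('test', 'kv_history')
query.where(p.between('ts', 0, 4))
versions = [rec for _, _, rec in query.results() if rec['k'] == 'foo']</code></pre>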
Timestamping was described in the original BigTable paper and is included in the new open-source CockroachDB [1] (see Versioned Values in the design doc [2]).<p>A big question is whether you will be storing all mutations permanently or whether the database needs to provide either a time-to-live (TTL) or a maximum number of maintained mutations.<p>If you are just keeping everything, or using a key-value store with TTL, this can be accomplished easily with any key-value store by using fixed key sizes and appending the timestamp. In an ordered key-value store, you can then do an efficient lookup over a time span. Some DBs like FoundationDB provide a framework to manage the key space using tuples, so it's even easier to tailor and manage different key types to get the desired access speeds.<p>RocksDB, a LevelDB variant which is the engine beneath CockroachDB, does have TTL [3].<p>[1] <a href="http://cockroachdb.org" rel="nofollow">http://cockroachdb.org</a><p>[2] <a href="https://docs.google.com/document/d/11k2EmhLGSbViBvi6_zFEiKzuXxYF49ZuuDJLe6O8gBU" rel="nofollow">https://docs.google.com/document/d/11k2EmhLGSbViBvi6_zFEiKzu...</a><p>[3] <a href="https://github.com/facebook/rocksdb/wiki/Time-to-Live" rel="nofollow">https://github.com/facebook/rocksdb/wiki/Time-to-Live</a>
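To make the key-encoding trick concrete, here's a pure-Python sketch, with a sorted list standing in for the ordered key-value store (and assuming keys of at most 32 bytes):<p><pre><code>import bisect
import struct

keys = []    # sorted encoded keys; stands in for the ordered key space
values = {}  # encoded key -> value

def encode(key, ts):
    # Fixed-width key plus big-endian timestamp sorts chronologically per key.
    return key.encode().ljust(32, b'\x00') + struct.pack('>Q', ts)

def set_value(key, value, ts):
    enc = encode(key, ts)
    bisect.insort(keys, enc)
    values[enc] = value

def get_value(key, ts):
    # Greatest encoded key <= encode(key, ts): seek just past it, step back one.
    i = bisect.bisect_right(keys, encode(key, ts))
    if i and keys[i - 1].startswith(key.encode().ljust(32, b'\x00')):
        return values[keys[i - 1]]
    return None

set_value('foo', 'bar', 1)
set_value('foo', 'club', 5)
assert get_value('foo', 4) == 'bar'
assert get_value('foo', 9) == 'club'</code></pre>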
AWS DynamoDB is actually quite nice for this. You set a composite (id, timestamp) index, and then you query on id and a timestamp range or "first match before or at timestamp x".<p>As `zo1` mentions, though, this doesn't really seem like the kind of requirement that would necessitate obscure database technologies, especially because even in a relational database these are easily optimizable queries.
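A rough boto3 sketch of that query (table/attribute names are made up; assumes a table with hash key 'id' and range key 'ts'):<p><pre><code>import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('kv_history')

# Newest item at or before the given timestamp for this id.
resp = table.query(
    KeyConditionExpression=Key('id').eq('foo') & Key('ts').lte(4),
    ScanIndexForward=False,  # descending by range key
    Limit=1,
)
items = resp['Items']  # e.g. [{'id': 'foo', 'ts': 1, 'value': 'bar'}]</code></pre>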
OP here, thanks a lot for the overwhelming number of suggestions; I'm slowly digesting all of them.<p>To give a little more context about the problem: this is related to the data analytics infrastructure we're building (we did a technical writeup of it here - <a href="http://engineering.viki.com/blog/2014/data-warehouse-and-analytics-infrastructure-at-viki/" rel="nofollow">http://engineering.viki.com/blog/2014/data-warehouse-and-ana...</a> )<p>If you look at our data infrastructure diagram (in the posted link), the thing we're trying to improve is the hydration system (which takes in a record in real-time and tries to inject more time-sensitive information into it).<p>E.g. when a user watches a video (and thus a video_play event is sent), we want to know if it's a free user or a paid user. Since the user could be a free user today and upgrade to paid tomorrow, the only way to correctly attribute the play event to the free/paid bucket is to inject that status right into the message when it's received.<p>Building the system this way (using the hydration service) makes it very error-prone and nondeterministic (since you only have a short window to hydrate the message, and you can't replay a hydration).<p>That's why we're looking at building a historical lookup service that remembers all the different changes of a data object over time, so that replaying a hydration becomes deterministic.<p>At the moment we're processing around 100M records a day and growing, which translates to about 100M read requests. The key space should be around 1M keys. Not a lot, but still at a scale that puts us in a position where we have to think about scalability and performance.<p>Before asking this question, we did look around and also looked into how we'd build it with existing database technology. We thought of 3 different approaches: a relational DB (PostgreSQL), a simple K-V store (Redis/Riak), and Cassandra (with column families). But we want to get more opinions from you guys :)
Have a look at FoundationDB (<a href="https://foundationdb.com" rel="nofollow">https://foundationdb.com</a>) and its time-oriented data structure[0]. If you want to go a step farther, you can use their SQL Layer to run all sorts of queries on the Key-Value Store[1].<p>[0] <a href="http://blog.foundationdb.com/designing-a-schema-for-time-series-data-using-fdb" rel="nofollow">http://blog.foundationdb.com/designing-a-schema-for-time-ser...</a><p>[1] <a href="https://foundationdb.com/layers/sql" rel="nofollow">https://foundationdb.com/layers/sql</a>
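A rough sketch of what that looks like with the FoundationDB Python bindings and the tuple layer (the 'kv' subspace prefix is made up; values must be bytes):<p><pre><code>import fdb
fdb.api_version(300)
db = fdb.open()

@fdb.transactional
def set_value(tr, key, ts, value):
    # Tuple-encoded keys sort by (key, ts).
    tr[fdb.tuple.pack(('kv', key, ts))] = value

@fdb.transactional
def get_value(tr, key, ts):
    # Scan backwards to find the newest version at or before ts.
    start = fdb.tuple.pack(('kv', key))
    end = fdb.tuple.pack(('kv', key, ts)) + b'\x00'  # just past the exact key
    for kv in tr.get_range(start, end, limit=1, reverse=True):
        return kv.value
    return None

set_value(db, 'foo', 1, b'bar')
set_value(db, 'foo', 5, b'club')
get_value(db, 'foo', 4)  # b'bar'</code></pre>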
I used event sourcing for this. We were recording live interactions and wanted to know the state at any given time.<p>So the "current" state of any item was the result of all events up to now applied in order; the state at any given time was computed the same way, just with a different time boundary.<p>Don't fear the computation: we were doing all this in CouchDB with list functions and never had huge performance problems. Range queries are present in all databases.<p>Constructing the events properly so that they apply conveniently to any kind of state is a different kind of story.
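In sketch form (the event shape here is made up):<p><pre><code>def state_at(events, t=None):
    # Fold time-ordered (ts, key, value) events into a state dict,
    # stopping at the time boundary t (None means "now").
    state = {}
    for ts, key, value in events:
        if t is not None and ts > t:
            break
        state[key] = value
    return state

events = [(1, 'foo', 'bar'), (5, 'foo', 'club')]
assert state_at(events, 4) == {'foo': 'bar'}
assert state_at(events) == {'foo': 'club'}</code></pre>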
HBase will fit nicely for this use case. Columns can be versioned: <a href="http://hbase.apache.org/book/versions.html" rel="nofollow">http://hbase.apache.org/book/versions.html</a>
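A quick happybase sketch of what that buys you (assumes the table was created with a column family allowing many versions, e.g. VERSIONS => 1000; check whether the timestamp bound is inclusive or exclusive for your HBase/Thrift version):<p><pre><code>import happybase

connection = happybase.Connection('localhost')
table = connection.table('kv')

table.put(b'foo', {b'd:value': b'bar'}, timestamp=1)
table.put(b'foo', {b'd:value': b'club'}, timestamp=5)

# happybase's timestamp argument is (I believe) an exclusive upper bound,
# so ask for t + 1 to get the newest cell version at or before t.
table.row(b'foo', timestamp=5)  # {b'd:value': b'bar'}
table.row(b'foo')               # {b'd:value': b'club'}</code></pre>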
I don't know of any dedicated package that does this, but at my previous company we had a similar requirement, for which we used DynamoDB with a range key, where the range key was the timestamp. So, if the value of a key 'x' was 'y' at time '4', you could do get(x, 4) to get the value at 4, or even get(x, <5) or get(x, <6) to get the value at times where no specific entry existed, and it would return the latest value entered before that time. And getting everything back sorted by time was a nice bonus :)
If you can't find a product here that suits you, MIT OCW has a course on advanced data structures which covers this; they're known as persistent data structures, and the lectures on them are very informative.<p>Edit: The lectures: <a href="http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-851-advanced-data-structures-spring-2012/lecture-videos/" rel="nofollow">http://ocw.mit.edu/courses/electrical-engineering-and-comput...</a>
I'm just in the middle of building a similar system. As another commenter mentioned, you can use a log / event sourcing for this.<p>A log is basically a stream of changes to a DB. I found this to be the best introduction to it so far: <a href="http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying" rel="nofollow">http://engineering.linkedin.com/distributed-systems/log-what...</a>
Treode <a href="https://github.com/Treode/store" rel="nofollow">https://github.com/Treode/store</a> has this feature.<p>Speaking as someone who's used Apache Cassandra a ton, I'd stay the hell away. They break shit all the time to drive sales at Datastax and personally I'm over it.<p>Also hyperdex (hyperdex.org) is one of the better (imo) 2nd gen nosql datastores that might be worth building this on top of.
Have you considered a round robin time-series database?<p>SQL example: <a href="http://techblog.tilllate.com/2008/06/22/round-robin-data-storage-in-mysql/" rel="nofollow">http://techblog.tilllate.com/2008/06/22/round-robin-data-sto...</a><p>Research: <a href="http://pam2012.ftw.at/TMA/papers/TMA2012paper13.pdf" rel="nofollow">http://pam2012.ftw.at/TMA/papers/TMA2012paper13.pdf</a>
An interesting requirement. You might like to take a look at Irmin, which is a git-like distributed storage system. It's still being actively developed but it's far along enough to be worth kicking the tires.<p><a href="http://openmirage.org/blog/introducing-irmin" rel="nofollow">http://openmirage.org/blog/introducing-irmin</a>
My startup www.loggur.com is essentially a really advanced service that does this. Please pardon the placeholder text for now; I recently replaced the old landing page to prevent any confusion, as things have taken a bit of a turn (for the best). There are only about 3 things left to do before launching!
huy, we're in the middle of building this right now - <a href="http://gunDB.io" rel="nofollow">http://gunDB.io</a> (talks about other nifty aspects of it, too). We just got into an incubator program in SF, so if you are around that area we should meet up, send me a message (see my profile).