I don't know Riak, other than that it's a distributed NoSQL key-value data store.<p>Time series data has been prevalent in fintech, quantitative finance, and other disciplines for decades. I read a book in the early 1990s on music as time series data, financial tickers, and so on.<p>How is Riak different, or better suited to this use, than Kdb + q[1], J with JDB (free), Jd (a commercial J database like Kdb/q)[2], or the new Kerf lang/db being developed by Kevin Lawler[3]?<p>Kevin also wrote kona, an open-source version of the "K programming language"[4].<p>Kdb is very fast at time series analysis on large datasets and has many years of proven value in the financial industry.<p>[1] <a href="https://kx.com/" rel="nofollow">https://kx.com/</a>
[2] <a href="http://www.jsoftware.com/jdhelp/overview.html" rel="nofollow">http://www.jsoftware.com/jdhelp/overview.html</a>
[3] <a href="https://github.com/kevinlawler/kerf" rel="nofollow">https://github.com/kevinlawler/kerf</a>
[4] <a href="https://github.com/kevinlawler/kona" rel="nofollow">https://github.com/kevinlawler/kona</a>
> Riak uses the SHA hash as its distribution mechanism and divides the output range of the SHA hash evenly amongst participating nodes in the cluster.<p>Wait, Riak uses SHA as its distribution hash?
Why use a cryptographic hash for distribution rather than something like Murmur3, if you're after high performance[0]?<p>[0] <a href="http://blog.reverberate.org/2012/01/state-of-hash-functions-2012.html" rel="nofollow">http://blog.reverberate.org/2012/01/state-of-hash-functions-...</a>
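For context, here is a minimal sketch (in Erlang, which Riak is written in) of the general idea the quoted text describes - hash the key, then slice the fixed-size hash output range into N partitions. This is illustrative only, not Riak's actual code; crypto:hash/2 is the standard OTP function:<p><pre><code>  %% Map a key onto one of NumPartitions ring positions using SHA-1.
  %% Assumes NumPartitions is a power of two (as Riak's ring size is),
  %% so the 160-bit output range divides evenly.
  partition_for(Key, NumPartitions) ->
      <<HashInt:160/unsigned-integer>> = crypto:hash(sha, Key),
      RingTop = 1 bsl 160,
      HashInt div (RingTop div NumPartitions).
</code></pre>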
I've always found it quite curious that human-computer interfaces have always focused on the noun/verb proposition of describing data, and not the time/place. Time is the only true constant in the universe, and yet computers are set up to track and control it, seemingly, as an afterthought.<p>Imagine if instead of having files/folders to (teach,confuse) Grandma, we simply had a time-based system of references. If Time were a principal unit of information that a user was required to understand as an abstract concept, I feel that it would result in far better user interfaces.<p>We can see this in the music-making world, where Time is the most significant domain over which a musician exerts control. A DAW-like interface for managing events seems to me so intuitive - for so many other non-musical applications - that it's almost extraordinary that someone hasn't built an email system, or accounting system, or graphical-design system oriented around this aspect. (Of course, they are out there - but it seems that Time management makes the dividing line between "professional" and "dilettante" users rather thick...)
Are there performance numbers available?<p>We're on the lookout for suitable remote storage for prometheus.io, and would want to know the hardware that'd be required to handle 1M samples/s and how many bytes a sample takes up.<p>It doesn't support full float64, which we need, but we could work around that by packing the bits into a 64-bit unsigned integer (see the sketch below).
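For what it's worth, that workaround is just a bit-reinterpretation; here is a sketch in Erlang (purely illustrative - any language with a 64-bit reinterpret cast works the same way):<p><pre><code>  %% Round-trip a float64 through a 64-bit unsigned integer, losslessly.
  F = 123.456,
  <<U:64/unsigned>> = <<F:64/float>>,     %% encode: reinterpret the IEEE-754 bits
  <<F2:64/float>> = <<U:64/unsigned>>,    %% decode: reinterpret back
  true = (F =:= F2).
  %% Caveat: raw-bit integer order differs from numeric order for negative
  %% floats, so range scans over U need extra care.
</code></pre>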
Might be worth looking into dalmatiner.io (DalmatinerDB) as an alternative to this. It's also built on riak_core, which manages cluster membership and provides the top-level framework for routing and rebalancing.<p>I waited a long time for Riak TS to come out. I tried KairosDB & Cyanite, but the operational overhead of Cassandra wasn't something I wanted to buy into for such a narrow use case (an infrastructure metrics store), and then suddenly, out of nowhere, DalmatinerDB was released. The code is clean, the architecture is solid, and the ops story is simple.<p>I don't have any affiliation of any kind with the Dataloop folks. I am, however, a happy end-user. We do currently use Riak KV due to its CRDT support though.
I see SQL support; that is interesting. Isn't Riak the premier NoSQL database? I guess it is a NoNoSQL db now ;-)<p>The implementation of the SQL part is very neat. Great work, whoever did that. It uses the yecc and leex parser tools that come with Erlang, and rebar even knows how to compile those. Very cool!<p><a href="https://github.com/basho/riak_ql" rel="nofollow">https://github.com/basho/riak_ql</a>
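For anyone curious what that looks like, here is a toy yecc grammar (hypothetical and far simpler than riak_ql's actual rules) that parses a bare SELECT of one identifier:<p><pre><code>  %% toy_select.yrl -- a hypothetical minimal grammar, not riak_ql's.
  Nonterminals query.
  Terminals select identifier.
  Rootsymbol query.

  query -> select identifier : {select, element(3, '$2')}.
  %% '$2' is the {identifier, Line, Name} token produced by the leex
  %% tokenizer; element(3, ...) extracts the name.
</code></pre>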
Poses the question<p><pre><code> So what’s the big deal? People have been recording
temporally oriented data since we could chisel on tablets.
</code></pre>
Never answers it, but instead explains how Riak handles large time series. Certainly interesting, but I would like an answer to this question, as I don't understand the big deal.
As someone who deals with sensor data, the tricky part is really not the write rate, but rather dealing with messy data. There's a lot of parallelism in sensor network streams, and for many domains you never look at the sensors of one device against the sensors of another, so you can put them in entirely different databases and it doesn't matter. (It's not true in every case, of course, but if you're doing time series/streaming, ask yourself if it's true for you before picking a system.)<p>The real pain is handling data that arrives out of order or otherwise very late, or handling data that never arrives at all, or handling data that's clearly wrong. Worse, you may have streams that are defined/calculated from other streams via some algebra on series, e.g. series C is series A plus series B - so handling new data on A means you need to recalculate/update the view for C.<p>Oh, and you'd like this all to be mostly declarative so you have some way to migrate between systems if you need to switch for whatever reason.<p>Apache Beam/Google Dataflow gets a lot of this stuff right: it's not quite as declarative as I'd like, but it gets the windowing flexibility right and handles restatements at a data-model level.
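The series-C-from-A-and-B case is easy to state and annoying to implement. A minimal sketch in Erlang, with each series as a timestamp=>value map (the representation and names are mine, not any particular product's API):<p><pre><code>  %% Recompute derived series C = A + B, e.g. after a late point lands on A.
  %% Timestamps with no matching B point yet are simply left out of C.
  recompute_c(A, B) ->
      maps:fold(
          fun(Ts, Va, Acc) ->
              case maps:find(Ts, B) of
                  {ok, Vb} -> Acc#{Ts => Va + Vb};
                  error -> Acc
              end
          end, #{}, A).
</code></pre>Note that every late arrival on A forces a restatement of C; systems that model restatements natively (as Beam/Dataflow does, per the parent) save you from wiring this up yourself.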
For the TS experts out there, any real-world experience with Influx? (<a href="https://influxdata.com/" rel="nofollow">https://influxdata.com/</a>)
Time series does not necessarily have to be about 'huge' data either; it can simply be about a much greater level of historical precision. Example:<p>An ISP sells a circuit with 95th-percentile billing to a customer.<p>If you poll SNMP data from a router interface on 60-second intervals and store it in an RRA file, you will lose a great deal of precision over time (because RRAs are highly compressed over time). You'll have no ability to go back and pull a query like "We want to see traffic stats for the DDoS this customer took at 9am on February 26th of last year".<p>An implementation such as OpenTSDB that grabs the traffic stats for a particular SNMP OID and stores them will allow you to keep all traffic data forever and retrieve it as needed later on. The amount of data written per 60-second interval is minuscule; a server with a few hundred GB of SSD storage will be sufficient to store all traffic stats for relevant interfaces on core/agg routers for a fairly large ISP for several years.<p>With time series statistics you can then feed the data into tools such as Grafana for visualization.
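To make the precision point concrete, 95th-percentile billing (nearest-rank flavor here; real billing systems vary in the details) needs the full set of raw per-interval samples for the month - exactly what a compacted RRA can no longer give you:<p><pre><code>  %% 95th percentile of raw per-interval throughput samples (nearest rank).
  %% Averaged/compacted data would understate the result.
  p95(Samples) ->
      Sorted = lists:sort(Samples),
      Idx = max(1, round(length(Sorted) * 0.95)),
      lists:nth(Idx, Sorted).
</code></pre>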
Looks really nice, although I am a bit sad to see that it requires a structured schema. I have been on the lookout for a metric collection system (like InfluxDB), and this would fit very well - except for the schema part.