Time-Series Database Requirements

94 points by eaxitect about 10 years ago

14 comments

damian2000 about 10 years ago

Anyone heard of Operational Data Historians? Like OSIsoft's PI, Honeywell's PHD, GE's Proficy. They're expensive software suites optimised for real-time use, plus historical access. They usually work in process control/operations of factories or manufacturing plants. Each item being measured is called a tag.

Just thought I'd throw this out there since it's a specialised area that not many people know about. I've done some work with them in terms of writing adaptors to a time series data visualisation product.

<a href="http://en.wikipedia.org/wiki/Operational_historian" rel="nofollow">http://en.wikipedia.org/wiki/Operational_historian</a>
<a href="http://en.wikipedia.org/wiki/OSIsoft" rel="nofollow">http://en.wikipedia.org/wiki/OSIsoft</a>
<a href="https://www.honeywellprocess.com/en-US/training/programs/advanced-applications/Pages/uniformance-phd.aspx" rel="nofollow">https://www.honeywellprocess.com/en-US/training/programs/adv...</a>
<a href="http://www.geautomation.com/products/proficy-historian" rel="nofollow">http://www.geautomation.com/products/proficy-historian</a>

On the topic of Historians vs Relational Databases, there's a blog post about it here:

<a href="https://osipi.wordpress.com/2010/07/05/relational-database-vs-process-historian-for-process-data-use-both/" rel="nofollow">https://osipi.wordpress.com/2010/07/05/relational-database-v...</a>

Admittedly this is by the developer, OSIsoft, so it may be biased, but their points seem valid, especially the swinging door algorithm reference and the fact that historians are far more efficient in storage.
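
For readers who haven't met the swinging door idea, here is a rough sketch of one common textbook formulation. It is not OSIsoft's implementation; the function name `swinging_door` and the `deviation` parameter are made up for illustration. A point is only archived when the stream can no longer be covered, within the allowed deviation, by a single straight line from the last archived point.

```python
from math import inf

def swinging_door(points, deviation):
    """Compress (time, value) points; illustrative sketch, names are hypothetical.

    Two "doors" pivot at the last archived value plus/minus `deviation`.
    Each new point swings the doors open; once they open past parallel,
    the previous point is archived and the doors restart from it.
    Assumes strictly increasing timestamps.
    """
    if not points:
        return []
    archived = [points[0]]
    t0, v0 = points[0]
    slope_up, slope_lo = -inf, inf   # upper door only rises, lower only falls
    prev = points[0]
    for t, v in points[1:]:
        slope_up = max(slope_up, (v - (v0 + deviation)) / (t - t0))
        slope_lo = min(slope_lo, (v - (v0 - deviation)) / (t - t0))
        if slope_up > slope_lo:
            # Doors no longer overlap: keep the previous point, restart there.
            archived.append(prev)
            t0, v0 = prev
            slope_up = (v - (v0 + deviation)) / (t - t0)
            slope_lo = (v - (v0 - deviation)) / (t - t0)
        prev = (t, v)
    if archived[-1] != prev:
        archived.append(prev)        # always keep the most recent point
    return archived

ramp = swinging_door([(0, 0.0), (1, 1.0), (2, 2.0), (3, 3.0)], 0.1)
step = swinging_door([(0, 0.0), (1, 0.0), (2, 0.0), (3, 5.0), (4, 5.0)], 0.5)
```

A straight ramp collapses to its two endpoints, while a step change keeps the points on either side of the jump, which is why this style of compression suits slowly varying process data.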

digitalzombie about 10 years ago

Cassandra seems like a good fit.

Writes are faster than reads, it's an AP system, and you really shouldn't update it frequently unless you want tombstone hell. There's TTL too.

Are there any cons to using Cassandra as a time-series database? I'd like to hear them.

The biggest thing with Cassandra is that you should know your queries beforehand, before you design your data model.

obstinate about 10 years ago

Multi-dimensionality optional/drawback? Strongly disagree.

This may be my experience as a Googler talking, but I also somewhat disagree with the notion that the data can't fit in memory. I operate an X0,000-task service, and our monitoring data could fit into memory on a large server, if need be.

Of course, we don't keep per-task data permanently. That would be prohibitive at a 5s monitoring interval like the one we use, even if it were put on disk. Instead, we accomplish what I described by aggregating away some dimensions, mostly task number, and then holding the aggregated series in memory for fast queries. There are some nuances, particularly around having foresight over the cases where you do want to see individual tasks.

But suffice it to say that this person's experience does not match mine in terms of what I need from a TSDB. Perhaps his ops background comes from a different set of needs than mine, but if you're building a TSDB for many customers, I wouldn't take this list as gospel.
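
To make "aggregating away some dimensions" concrete, here is a minimal sketch. The label names (`job`, `task`), the choice of summation, and the function `aggregate_away` are assumptions for illustration, not a description of any real monitoring system.

```python
from collections import defaultdict

def aggregate_away(samples, drop):
    """Sum samples that differ only in the dropped label dimensions.

    `samples` is an iterable of (labels_dict, timestamp, value).
    Returns {(remaining_labels_as_sorted_tuple, timestamp): summed_value}.
    """
    out = defaultdict(float)
    for labels, ts, value in samples:
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        out[(kept, ts)] += value
    return dict(out)

# Hypothetical per-task samples taken at the same timestamp.
samples = [
    ({"job": "web", "task": "0"}, 5, 1.0),
    ({"job": "web", "task": "1"}, 5, 2.0),
    ({"job": "db", "task": "0"}, 5, 4.0),
]
per_job = aggregate_away(samples, drop={"task"})
```

The per-task rows can then be discarded after a short window, while the much smaller per-job series stay in memory.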

k1w1 about 10 years ago

I built exactly the database that you are describing here. Unfortunately it is not open source. However, it is no secret that we used SQLite for the storage. Inserts and queries in SQLite are fast, even for databases with billions of rows. Deletes are very, very slow, so we created a new SQLite database for each week, and simply deleted an entire database file when the retention period expired. SQLite is ACID, supports all of the complex queries that could be imagined, and is easy to embed into the overall engine code base. We used aggregate tables to more efficiently display common aggregations of the data (e.g. graph metric over the last day, or average metric over the last date). The aggregates were updated as data was inserted, so that real-time views were always available.
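
A rough sketch of that layout using Python's standard `sqlite3` module follows. The schema, the file naming, the hourly rollup, and the `WeeklyStore` class are all assumptions made up for illustration, since the original system is not public.

```python
import os
import sqlite3
import tempfile

WEEK = 7 * 24 * 3600  # seconds per week

class WeeklyStore:
    """One SQLite file per week; retention is enforced by deleting whole files."""

    def __init__(self, directory, retention_weeks):
        self.directory = directory
        self.retention_weeks = retention_weeks
        self._conns = {}

    def _path(self, week):
        return os.path.join(self.directory, f"week_{week}.db")

    def _conn(self, week):
        conn = self._conns.get(week)
        if conn is None:
            conn = sqlite3.connect(self._path(week))
            conn.execute("CREATE TABLE IF NOT EXISTS samples "
                         "(metric TEXT, ts INTEGER, value REAL)")
            conn.execute("CREATE TABLE IF NOT EXISTS hourly "
                         "(metric TEXT, hour INTEGER, n INTEGER, total REAL, "
                         "PRIMARY KEY (metric, hour))")
            self._conns[week] = conn
        return conn

    def insert(self, metric, ts, value):
        conn = self._conn(ts // WEEK)
        with conn:  # raw row and aggregate row commit in one transaction
            conn.execute("INSERT INTO samples VALUES (?, ?, ?)",
                         (metric, ts, value))
            cur = conn.execute(
                "UPDATE hourly SET n = n + 1, total = total + ? "
                "WHERE metric = ? AND hour = ?", (value, metric, ts // 3600))
            if cur.rowcount == 0:
                conn.execute("INSERT INTO hourly VALUES (?, ?, 1, ?)",
                             (metric, ts // 3600, value))

    def hourly_avg(self, metric, hour):
        week = (hour * 3600) // WEEK
        if not os.path.exists(self._path(week)):
            return None
        row = self._conn(week).execute(
            "SELECT total / n FROM hourly WHERE metric = ? AND hour = ?",
            (metric, hour)).fetchone()
        return row[0] if row else None

    def expire(self, now):
        """Drop every weekly file older than the retention window."""
        cutoff = now // WEEK - self.retention_weeks
        for name in os.listdir(self.directory):
            if name.startswith("week_") and name.endswith(".db"):
                week = int(name[len("week_"):-len(".db")])
                if week < cutoff:
                    conn = self._conns.pop(week, None)
                    if conn is not None:
                        conn.close()
                    os.remove(os.path.join(self.directory, name))

store = WeeklyStore(tempfile.mkdtemp(), retention_weeks=2)
store.insert("cpu", 10, 1.0)
store.insert("cpu", 20, 3.0)
store.insert("cpu", 5 * WEEK + 5, 7.0)
avg_before = store.hourly_avg("cpu", 0)
store.expire(now=5 * WEEK + 10)
remaining = sorted(os.listdir(store.directory))
```

Expiring a week is then a single file removal instead of a slow row-by-row delete, and the rollup table is always current because it is written in the same transaction as the raw sample.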

thrownaway2424 about 10 years ago

"Reads need to be fast, even though they are rare. There are generally two approaches to dealing with this. The first is to write efficiently, so the data isn’t read-optimized per-series on disk, and deploy massive amounts of compute power in parallel for reads, scanning through all the data linearly. The second is to pay a penalty on writes, so the data is tightly packed by series and optimized for sequential reads of a series."As the little girl says in the GIF: why don't we have both? Write to a write-optimized store of limited size that requires full access during reads, and re-write that into a read-optimized format hourly or daily. Because it's limited in size, you won't care that the most recent data isn't very efficient for reading, or isn't particularly compact.

manigandham about 10 years ago

Recently came across Prometheus (<a href="http://prometheus.io" rel="nofollow">http://prometheus.io</a>).

There's also OpenTSDB (<a href="http://opentsdb.net" rel="nofollow">http://opentsdb.net</a>) that's been around for a while.

rodionos about 10 years ago

I'm one of the developers behind Axibase Time-Series Database, which runs on top of HBase. ATSD is two years into development and has a built-in rule engine, forecasting, and visualization: <a href="http://axibase.com/products/axibase-time-series-database/visualization/" rel="nofollow">http://axibase.com/products/axibase-time-series-database/vis...</a>. The rule engine allows you to write expressions such as abs(forecast_deviation(avg())) > 2.0 to trigger URL/email/command actions if the sliding-window average is more than 2.0 sigmas from the Holt-Winters/ARIMA forecast.

The license is commercial and there's a free CE version which can be scaled vertically without any throughput constraints. Tags are supported for series as well as for entities and metrics, to avoid storing long-term metadata such as location, type, category etc. along with the data itself.

I wouldn't be surprised if the functional differences between TSDBs and historians disappear in just a few years. Right now the historians are good at compressing repetitive data at source and on disk, which makes sense given their heritage in archiving data from SCADA systems.

welder about 10 years ago

This sounds like a job for <a href="http://www.aerospike.com/" rel="nofollow">http://www.aerospike.com/</a>

Rapzid about 10 years ago

I would actually be very interested in hearing more about this MySQL implementation.

nimish about 10 years ago

Druid is quite a good one

aaa667 about 10 years ago

Would be keen to know what people think about: <a href="https://github.com/ambiata/ivory" rel="nofollow">https://github.com/ambiata/ivory</a>

sethev about 10 years ago

I'm kind of curious about the no "tagging" point. Won't there always be some data that doesn't fit the [timestamp double] format?

doodlebugging about 10 years ago

This might not be what he's looking for, but it is one way to manage huge amounts of data: HDF5 (<a href="http://www.hdfgroup.org/" rel="nofollow">http://www.hdfgroup.org/</a>)

hartror about 10 years ago

This smells like a job for Apache Kafka [1]. I've yet to use it personally, but its feature set appears to hit the mark, though it lacks SQL. The application described sounds like it uses something similar to event sourcing [2], which people have used Kafka for successfully. If you're not familiar with Kafka, there is a very good interview with Jun Rao [3] on SE Radio.

[1] <a href="http://kafka.apache.org/" rel="nofollow">http://kafka.apache.org/</a>
[2] <a href="http://martinfowler.com/eaaDev/EventSourcing.html" rel="nofollow">http://martinfowler.com/eaaDev/EventSourcing.html</a>
[3] <a href="http://www.se-radio.net/2015/02/episode-219-apache-kafka-with-jun-rao/" rel="nofollow">http://www.se-radio.net/2015/02/episode-219-apache-kafka-wit...</a>
