Why Not to Build a Time-Series Database

204 points by dgildeh over 6 years ago

11 comments

tuukkah over 6 years ago

TLDR: "Why Not to Build a Time-Series Database? Because we're building one and you should pay us."

> Hopefully our story will make you think twice before trying to build your own TSDB in house using open-source solutions, or if you’re really crazy, building a TSDB from scratch. Building and maintaining a TSDB is a full-time job, and we have dedicated expert engineers who are constantly improving and maintaining our TSDB, and no doubt will iterate the architecture again over time as we hit an even higher magnitude of scale down the line.

> Given our experience in this complex space, I would sincerely recommend you don’t try and do this at home, and if you have the money you should definitely outsource this to the experts who do this as a full-time job, whether it's Outlyer or another managed TSDB solution out there. As so many things turn out in computing, it’s harder than it looks!

manish_gill over 6 years ago

As I was reading through the post I kept wondering why they weren't using some warehousing technique for older data: either dump it to S3 or, better yet, Google BigQuery, which is amazingly fast at that scale. They only did it after doing lots of fire-fighting and per-tenant clusters.

ClickHouse would also be a good option for the aggregating queries that TSDBs are mostly used for.

One of my wishlist items in the data space is a managed ClickHouse offering. :-)

bra-ket over 6 years ago

We used a combination of Kafka + HBase + Phoenix (http://phoenix.apache.org/) for a similar purpose. It takes some effort to set up the initial HBase cluster, but once you have done it manually and automated it with Ansible/systemd, it's pretty robust in operation.

All our development was around the query engine, using plain JDBC/SQL to talk to HBase via Phoenix. Scaling is as simple as adding a node to the cluster.

camel_gopher over 6 years ago

In case anyone is interested in a video on this, here's the talk presented at the local San Francisco monitoring group, monitorSF (https://www.meetup.com/MonitorSF/):

https://youtu.be/lA85vs6e3UA

marshf over 6 years ago

Time-series data handling/storage seems a mostly solved problem in the mining, oil, and manufacturing industries. Deployed in the field since the '80s:

https://www.osisoft.com/about-osisoft/#more-about-pi-system

Disclosure: industry user, now OSIsoft employee.

statictype over 6 years ago

Nice article.

> it's not uncommon for some of our customers to send us millions of metrics every minute

What kind of customers/services generate millions of points a minute?

manigandham over 6 years ago

"Time-series database" is some of the most overhyped nonsense since NoSQL.

Time-series is just data with time as a primary component. It comes in all shapes and volumes, but if you have a lot of data and are running heavy OLAP queries, then we already have an entire class of capable databases.

Use any modern distributed relational column-oriented database, set the primary key to metric id + timestamp, and you'll be able to scale easily with full SQL and joins. You can keep your other business data there too, along with JSON, geospatial, window functions, and all the other rich analytical queries available with relational databases.

We have trillion-row tables that work great. No special "TSDB" needed.
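To make the "metric id + timestamp" primary key concrete, here is a minimal sketch. It is only an illustration: SQLite (via Python's standard `sqlite3` module) stands in for the distributed column-oriented database the comment has in mind, and the table name `samples`, its columns, and the helper `minute_averages` are hypothetical names chosen for this example.

```python
import sqlite3

# Assumption: SQLite is a single-node row store, used here only to show the
# schema shape. Table and column names are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE samples (
        metric_id INTEGER NOT NULL,
        ts        INTEGER NOT NULL,  -- Unix time in seconds
        value     REAL    NOT NULL,
        PRIMARY KEY (metric_id, ts)  -- rows are clustered by metric, then time
    ) WITHOUT ROWID
    """
)

# Two synthetic metrics, one sample every 10 seconds for 10 minutes.
rows = [(1, t, float(t % 7)) for t in range(0, 600, 10)]
rows += [(2, t, 100.0) for t in range(0, 600, 10)]
conn.executemany("INSERT INTO samples VALUES (?, ?, ?)", rows)


def minute_averages(conn, metric_id, start, end):
    """Average value per 60-second bucket for one metric in [start, end)."""
    cur = conn.execute(
        """
        SELECT (ts / 60) * 60 AS bucket, AVG(value)
        FROM samples
        WHERE metric_id = ? AND ts >= ? AND ts < ?
        GROUP BY bucket
        ORDER BY bucket
        """,
        (metric_id, start, end),
    )
    return cur.fetchall()
```

Because the key starts with the metric id, a query for one metric over a time range touches one contiguous slice of the key space, which is the access pattern dashboards and alerts tend to produce. Whether this holds up at the scale discussed in the thread depends on the actual database, not on this sketch.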

Daneel_ over 6 years ago

I’m surprised no one has brought up Splunk in here (that I could see at a cursory glance).

They manage to do time-series storage on a pretty large scale (over 5PB/day for their largest customer).

objektif over 6 years ago

I'm not very knowledgeable in this area, but can someone please explain how kdb fits within this class of time-series DBs, and whether there are any alternatives to kdb?

dgildeh over 6 years ago

As the blog author, great to see the discussion and feedback, so appreciate it!

Without going through the comments one by one: to the main ones, about this being a solved problem or there already being solutions out there that do this, I would just say they remind me exactly of the conversations I had years ago with my team. We all thought it would be much easier, or that there would be something off the shelf that could do everything, and after several years of fire-fighting, the reality was that the problem looks much simpler than it really is, by a long mile.

Now that we've been doing this for a few years, and have spoken directly with the creators of many other TSDBs, we take a very skeptical view of all claims made about any database. They all sound amazing when you first read about them, and maybe even work great in testing, till you hit scale and find all the limitations. If there were a perfect TSDB out there, everyone would be using it and there wouldn't be a new one announced on a weekly basis!

I think the one comment on query loads being different sums things up. I've no doubt all the other options thrown out there work well for data historians, but for monitoring tools with loads of concurrent users, loading dashboards with 10s or 100s of queries each, and alerting systems polling every few seconds in parallel, the query load can get very high very quickly. Making those queries fast while still writing metrics in at scale is a hard problem, and I don't think any individual TSDB has really solved it properly, which is why we ended up building our own distributed architecture.

OldHand2018 over 6 years ago

It's pretty ridiculous that "Time-Series Database" has come to mean ingesting massive amounts of streaming data. They've been around a long time and have many use cases.

They're a great way to store data efficiently; accessing specific data is very fast and simple if you know the time range you are looking for; and you can roll your own in a few dozen lines of C if that's what you want to do. If that's all you need, why not?
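As a rough illustration of how small a roll-your-own store can be, here is a sketch in Python rather than C. It assumes one append-only file per series, fixed-width records, and timestamps that arrive in non-decreasing order; the record layout and the function names `append` and `read_range` are invented for this example, and it ignores concurrency, crash safety, and compression.

```python
import os
import struct

# Assumed record layout: little-endian int64 timestamp + float64 value (16 bytes).
RECORD = struct.Struct("<qd")


def append(path, ts, value):
    """Append one sample. Callers must supply non-decreasing timestamps."""
    with open(path, "ab") as f:
        f.write(RECORD.pack(ts, value))


def _first_index_at_or_after(f, count, ts):
    """Binary search for the first record whose timestamp is >= ts."""
    lo, hi = 0, count
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(mid * RECORD.size)
        mid_ts, _ = RECORD.unpack(f.read(RECORD.size))
        if mid_ts < ts:
            lo = mid + 1
        else:
            hi = mid
    return lo


def read_range(path, start, end):
    """Return [(ts, value), ...] for all samples with start <= ts < end."""
    count = os.path.getsize(path) // RECORD.size
    with open(path, "rb") as f:
        first = _first_index_at_or_after(f, count, start)
        last = _first_index_at_or_after(f, count, end)
        f.seek(first * RECORD.size)
        data = f.read((last - first) * RECORD.size)
    return list(RECORD.iter_unpack(data))
```

Because records are fixed-width and sorted by time, a range lookup is two binary searches plus one sequential read, which is the "fast and simple if you know the time range" property the comment describes.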
