You'd be surprised how many serious tech shops have close to zero performance metrics collected and utilised.<p>I've done this in fintech a few times already and the best stack that worked from my experience was telegraf + influxdb + grafana.
There are many things you can get wrong with metrics collection (what you collect, how, units, how the data is stored, aggregated and eventually presented) and I learned about most of them the hard way.<p>However, when done right and covering all layers of your software execution stack, this can be a game changer, both for capacity planning/picking low-hanging perf fruit and for day-to-day operations.<p>Highly recommend giving it a try as the tools are free, mature and cover a wide spectrum of platforms and services.
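One classic way aggregation goes wrong: averaging per-host percentiles instead of computing the percentile over the pooled data. A quick sketch (made-up latency distributions, naive nearest-rank percentile) of how much that can understate the tail:

```python
import random

random.seed(0)

def p99(xs):
    """Naive nearest-rank 99th percentile."""
    xs = sorted(xs)
    return xs[int(0.99 * (len(xs) - 1))]

# Two hosts with very different latency distributions (ms).
host_a = [random.expovariate(1 / 10) for _ in range(10_000)]
host_b = [random.expovariate(1 / 100) for _ in range(10_000)]

# Wrong: average the per-host p99s.
avg_of_p99s = (p99(host_a) + p99(host_b)) / 2
# Right: compute the p99 over all the raw samples.
true_p99 = p99(host_a + host_b)
# The averaged value badly understates the real tail latency.
```

This is why "store raw data (or at least full histograms), aggregate late" is the usual advice.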
"I don't have anything against hot new technologies, but a lot of useful work comes from plugging boring technologies together and doing the obvious thing."
Starting the article off with "I did this in one day" - complete with a massive footnote disclaiming that it obviously took a lot more than one day - kinda ruined it for me. Why even bother with that totally unnecessary claim?
There's a lot of interest in this space with respect to analytics on top of monitoring and observability data.<p>Anyone interested in this topic might want to check out an issue thread on the Thanos GitHub project. I would love to see M3, Thanos, Cortex and other Prometheus long term storage solutions all be able to benefit from a project in this space that could dynamically pull back data from any form of Prometheus long term storage using the Prometheus Remote Read protocol:
<a href="https://github.com/thanos-io/thanos/issues/2682" rel="nofollow">https://github.com/thanos-io/thanos/issues/2682</a><p>Spark and Presto both support predicate push down to a data layer, which can be a Prometheus long term metrics store, and are able to perform queries on arbitrary sets of data.<p>Spark is also super useful for ETLing data into a warehouse (such as HDFS or other backends, i.e. see the BigQuery connector for Spark[1] that could write a query from say a Prometheus long term store metrics and export it into BigQuery for further querying).<p>[1]: <a href="https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example" rel="nofollow">https://cloud.google.com/dataproc/docs/tutorials/bigquery-co...</a>
This is a really awesome blog. The post about programmer salaries is insightful: <a href="https://danluu.com/bimodal-compensation/" rel="nofollow">https://danluu.com/bimodal-compensation/</a>
Just so I understand, the simple way the headline talks about was "collect all metrics, but store the small fraction you care about in an easily accessible place; delete the raw data every week"?<p>The title didn't live up to the article, imho. But I get it. Thanks for sharing your methods.
Love the section in this about using "boring technology" - and then writing about how you used it, to help counter the much more common narrative of using something exciting and new.
If we consider Graphite, InfluxDB, and Prometheus, at this point in the monitoring industry's evolution we can easily take metrics generated in the format of one of these systems and store them in one of the others.<p>The missing piece is being able to query one system with the query language of another. For example, querying Prometheus using Graphite's query language.
Speaking of high cardinality metrics, what are good options that aren’t as custom as map reduce jobs and a bit more real time?<p>We killed our influx cluster numerous times with high cardinality metrics. We migrated to Datadog which charges based on cardinality so we actively avoid useful tags that have too much cardinality. I’m investigating Timescale since our data isn’t that big and btrees are unaffected by cardinality.
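For anyone wondering why cardinality bites so hard: the number of distinct series is the product of the distinct values per tag, so one high-cardinality tag multiplies everything else. A back-of-the-envelope sketch (made-up tag counts):

```python
# Hypothetical tag set for a request-count metric.
# Each distinct combination of tag values is its own time series.
tags = {
    "endpoint": 50,
    "status": 5,
    "customer_id": 100_000,  # the kind of "useful tag" that gets avoided
}

series = 1
for n in tags.values():
    series *= n  # cardinality multiplies across tags

# 50 * 5 * 100_000 = 25,000,000 distinct series from one metric,
# which is what melts inverted-index-style TSDBs (and Datadog bills).
```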
The boring technology observation (here referring to the challenge of getting publicity for "solving a 'boring' problem") is really true.<p>It extends very well to something that we constantly hammer home on my team: using boring tools is often best because it's easier to manage around known problems than forge into the unknown, especially for use-cases that don't have to do with your core business. Extreme & contrived example: it's much better to build your web backend in PHP over Rust because you're standing on the shoulders of decades of prior work, although people will definitely make fun of you at your next webdev meetup.<p>(Functionality that is core to your business is where you should differentiate and push the boundaries on interesting technology e.g. Search for Google, streaming infrastructure for Netflix. All bets are off here and this is where to reinvent the wheel if you must – this is where you earn your paycheck!)
Thank you for sharing this. I recently started working on metrics at a FAANG and saw some of the challenges you mentioned... the fact that you were able to get good results so quickly is super inspiring!
What's the standard for metrics gathering, push or pull? I prefer pull, but depending on the app it can mean you need to build in a micro HTTP server so there's something to query. That can be a PITA, but pushing a stat on every event seems wasteful, especially if there's a hot path in the code.
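The "micro HTTP server" for pull doesn't have to be much, though. A minimal sketch (hypothetical counter names, Prometheus-style text output): events just bump an in-process counter, which is cheap even on a hot path, and the scraper polls on its own schedule.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-process counters; incrementing a dict entry is cheap even on a hot path.
counters = {"requests_total": 0}

def render_metrics(counters):
    """Render counters in a Prometheus-style plain-text format."""
    return "".join(f"{name} {value}\n" for name, value in sorted(counters.items()))

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics(counters).encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

# To serve: HTTPServer(("", 9100), MetricsHandler).serve_forever()
```

With push you'd instead buffer and flush periodically (statsd-style), which amortizes the per-event cost but adds a local aggregation step.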
There have been a lot of articles posted recently about the 'old' web, and while I like the concept I still have a hard time finding quality information in many of the directories and webrings posted. The level of research and density of information in this blog is very good.
This is a great post! We should have more of these out there. Does anyone have any recommendations for similar posts for Node.js (instead of JVM)?<p>Or any good resource which discusses possible optimizations in the infra stack at a more theoretical, abstract, generalizable level?
I'm not sure I understood the solution there. Storing only 0.1%-0.01% of the interesting metrics data makes sense in the same way that you'd take a poll of a small fraction of the population to make guesses about the whole?
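The polling analogy does hold for aggregate statistics, at least. A quick sketch with fake data (synthetic exponential latencies, uniform 0.1% sampling) showing a tiny sample recovering the mean of the full dataset:

```python
import random

random.seed(42)

# Fake "raw metrics": a million request latencies (ms).
latencies = [random.expovariate(1 / 20) for _ in range(1_000_000)]

# Keep only a ~0.1% uniform sample, discarding the rest.
sample = [x for x in latencies if random.random() < 0.001]

full_mean = sum(latencies) / len(latencies)
sample_mean = sum(sample) / len(sample)
# The sample mean lands within a few percent of the true mean.
```

Where it breaks down is rare events: a 0.01% sample will simply miss most of a one-in-a-million error spike, which is presumably why you'd keep exact aggregates for the metrics you know you care about.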
Funny how stuff like this counts as "groundbreaking" outside of e.g. Google, where you've been able to collect and query metrics in realtime for more than a decade now.
> since i like boring, descriptive, names..<p>I feel like I'm in Inception here. Should "boring, descriptive names" be the default in all of IT?
Great article!<p>The bit about not being able to use a column for each metric because there were too many...<p>The classic solution is to have one column for "metric name" and another for "metric value".<p>Can't spot why they didn't just do that.
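To spell out that narrow layout (a sketch using sqlite3 in memory; the table and metric names are made up): one row per (timestamp, name, value), so adding a new metric needs no schema change.

```python
import sqlite3

# "Classic" narrow layout: one row per observation instead of
# one column per metric.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE metrics (
        ts INTEGER,
        name TEXT,
        value REAL
    )
""")
con.executemany(
    "INSERT INTO metrics VALUES (?, ?, ?)",
    [(1, "cpu_user", 0.42), (1, "mem_rss_bytes", 1.2e9), (2, "cpu_user", 0.40)],
)

# New metrics are just new `name` values -- no ALTER TABLE needed.
rows = con.execute(
    "SELECT value FROM metrics WHERE name = 'cpu_user' ORDER BY ts"
).fetchall()
```

The usual downside is storage and query cost: repeating the name on every row and filtering by it is much less compact than a columnar layout, which may be why the post avoided it, but the article doesn't say.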