You'd be surprised how many serious tech shops have close to zero performance metrics collected and utilised.<p>I've done this in fintech a few times already and the best stack that worked from my experience was telegraf + influxdb + grafana.
There are many things you can get wrong with metrics collection (what you collect, how, units, how the data is stored, aggregated and eventually presented) and I learned about most of them the hard way.<p>However, when done right and covering all layers of your software execution stack, this can be a game changer, both for capacity planning/picking low-hanging perf fruit and for day-to-day operations.<p>Highly recommend giving it a try as the tools are free, mature and cover a wide spectrum of platforms and services.
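One classic way aggregation goes wrong: averaging per-host percentiles instead of computing the percentile over the pooled data. A quick sketch (made-up latency distributions, naive nearest-rank percentile) of how much that can understate the tail:

```python
import random

random.seed(0)

def p99(xs):
    """Naive nearest-rank 99th percentile."""
    xs = sorted(xs)
    return xs[int(0.99 * (len(xs) - 1))]

# Two hosts with very different latency distributions (ms).
host_a = [random.expovariate(1 / 10) for _ in range(10_000)]
host_b = [random.expovariate(1 / 100) for _ in range(10_000)]

# Wrong: average the per-host p99s.
avg_of_p99s = (p99(host_a) + p99(host_b)) / 2
# Right: compute the p99 over all the raw samples.
true_p99 = p99(host_a + host_b)
# The averaged value badly understates the real tail latency.
```

This is why "store raw data (or at least full histograms), aggregate late" is the usual advice.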
"I don't have anything against hot new technologies, but a lot of useful work comes from plugging boring technologies together and doing the obvious thing."
Starting the article off with "I did this in one day" - complete with a massive footnote disclaiming that it obviously took a lot more than one day - kinda ruined it for me. Why even bother with that totally unnecessary claim?
There's a lot of interest in this space with respect to analytics on top of monitoring and observability data.<p>Anyone interested in this topic might want to check out an issue thread on the Thanos GitHub project. I would love to see M3, Thanos, Cortex and other Prometheus long term storage solutions all be able to benefit from a project in this space that could dynamically pull back data from any form of Prometheus long term storage using the Prometheus Remote Read protocol:
<a href="https://github.com/thanos-io/thanos/issues/2682" rel="nofollow">https://github.com/thanos-io/thanos/issues/2682</a><p>Spark and Presto both support predicate push down to a data layer, which can be a Prometheus long term metrics store, and are able to perform queries on arbitrary sets of data.<p>Spark is also super useful for ETLing data into a warehouse (such as HDFS or other backends, i.e. see the BigQuery connector for Spark[1] that could write a query from say a Prometheus long term store metrics and export it into BigQuery for further querying).<p>[1]: <a href="https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example" rel="nofollow">https://cloud.google.com/dataproc/docs/tutorials/bigquery-co...</a>
This is a really awesome blog. The post about programmer salaries is insightful: <a href="https://danluu.com/bimodal-compensation/" rel="nofollow">https://danluu.com/bimodal-compensation/</a>
Just so I understand, the simple way the headline talks about was "collect all metrics, but store the small fraction you care about in an easily accessible place; delete the raw data every week"?<p>The title didn't live up to the article, imho. But I get it. Thanks for sharing your methods.
Love the section in this about using "boring technology" - and then writing about how you used it, to help counter the much more common narrative of using something exciting and new.
If we consider Graphite, InfluxDB, and Prometheus, at this point in the monitoring industry's evolution we can easily take metrics generated in the format of one of these systems and store them in one of the others.<p>The missing piece is being able to query one system with the query language of another. For example, querying Prometheus using Graphite's query language.
Speaking of high cardinality metrics, what are good options that aren’t as custom as map reduce jobs and a bit more real time?<p>We killed our influx cluster numerous times with high cardinality metrics. We migrated to Datadog which charges based on cardinality so we actively avoid useful tags that have too much cardinality. I’m investigating Timescale since our data isn’t that big and btrees are unaffected by cardinality.
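For anyone wondering why cardinality bites so hard: the number of distinct series is the product of the distinct values per tag, so one high-cardinality tag multiplies everything else. A back-of-the-envelope sketch (made-up tag counts):

```python
# Hypothetical tag set for a request-count metric.
# Each distinct combination of tag values is its own time series.
tags = {
    "endpoint": 50,
    "status": 5,
    "customer_id": 100_000,  # the kind of "useful tag" that gets avoided
}

series = 1
for n in tags.values():
    series *= n  # cardinality multiplies across tags

# 50 * 5 * 100_000 = 25,000,000 distinct series from one metric,
# which is what melts inverted-index-style TSDBs (and Datadog bills).
```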
The boring technology observation (here referring to the challenge of getting publicity for "solving a 'boring' problem") is really true.<p>It extends very well to something that we constantly hammer home on my team: using boring tools is often best because it's easier to manage around known problems than forge into the unknown, especially for use-cases that don't have to do with your core business. Extreme & contrived example: it's much better to build your web backend in PHP over Rust because you're standing on the shoulders of decades of prior work, although people will definitely make fun of you at your next webdev meetup.<p>(Functionality that is core to your business is where you should differentiate and push the boundaries on interesting technology e.g. Search for Google, streaming infrastructure for Netflix. All bets are off here and this is where to reinvent the wheel if you must – this is where you earn your paycheck!)
Thank you for sharing this. I recently started working on metrics at a FAANG and saw some of the challenges you mentioned... the fact that you were able to get good results so quickly is super inspiring!
What's the standard for metrics gathering, push or pull? I prefer pull, but depending on the app it can mean you need to build in a micro HTTP server so there's something to query. That can be a PITA, but pushing a stat on every event seems wasteful, especially if there's a hot path in the code.
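The "micro HTTP server" for pull doesn't have to be much, though. A minimal sketch (hypothetical counter names, Prometheus-style text output): events just bump an in-process counter, which is cheap even on a hot path, and the scraper polls on its own schedule.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-process counters; incrementing a dict entry is cheap even on a hot path.
counters = {"requests_total": 0}

def render_metrics(counters):
    """Render counters in a Prometheus-style plain-text format."""
    return "".join(f"{name} {value}\n" for name, value in sorted(counters.items()))

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics(counters).encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

# To serve: HTTPServer(("", 9100), MetricsHandler).serve_forever()
```

With push you'd instead buffer and flush periodically (statsd-style), which amortizes the per-event cost but adds a local aggregation step.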
There have been a lot of articles posted recently about the 'old' web, and while I like the concept I still have a hard time finding quality information in many of the directories and webrings posted. The level of research and density of information in this blog is very good.
This is a great post! We should have more of these out there. Does anyone have any recommendations for similar posts for Node.js (instead of JVM)?<p>Or any good resource which discusses possible optimizations in the infra stack at a more theoretical, abstract, generalizable level?
I'm not sure I understood the solution there. Storing only 0.1%-0.01% of the interesting metrics data makes sense in the same way that you'd take a poll of a small fraction of the population to make guesses about the whole?
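The polling analogy does hold for aggregate statistics, at least. A quick sketch with fake data (synthetic exponential latencies, uniform 0.1% sampling) showing a tiny sample recovering the mean of the full dataset:

```python
import random

random.seed(42)

# Fake "raw metrics": a million request latencies (ms).
latencies = [random.expovariate(1 / 20) for _ in range(1_000_000)]

# Keep only a ~0.1% uniform sample, discarding the rest.
sample = [x for x in latencies if random.random() < 0.001]

full_mean = sum(latencies) / len(latencies)
sample_mean = sum(sample) / len(sample)
# The sample mean lands within a few percent of the true mean.
```

Where it breaks down is rare events: a 0.01% sample will simply miss most of a one-in-a-million error spike, which is presumably why you'd keep exact aggregates for the metrics you know you care about.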
Funny how stuff like this counts as "groundbreaking" outside of e.g. Google, where you've been able to collect and query metrics in realtime for more than a decade now.
> since i like boring, descriptive, names..<p>I feel like I'm in Inception here. Should "boring, descriptive names" be the default in all of IT?
Great article!<p>The bit about not being able to use a column for each metric because there were too many...<p>The classic solution is to have one column for "metric name" and another for "metric value".<p>Can't spot why they didn't just do that.
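To spell out that narrow layout (a sketch using sqlite3 in memory; the table and metric names are made up): one row per (timestamp, name, value), so adding a new metric needs no schema change.

```python
import sqlite3

# "Classic" narrow layout: one row per observation instead of
# one column per metric.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE metrics (
        ts INTEGER,
        name TEXT,
        value REAL
    )
""")
con.executemany(
    "INSERT INTO metrics VALUES (?, ?, ?)",
    [(1, "cpu_user", 0.42), (1, "mem_rss_bytes", 1.2e9), (2, "cpu_user", 0.40)],
)

# New metrics are just new `name` values -- no ALTER TABLE needed.
rows = con.execute(
    "SELECT value FROM metrics WHERE name = 'cpu_user' ORDER BY ts"
).fetchall()
```

The usual downside is storage and query cost: repeating the name on every row and filtering by it is much less compact than a columnar layout, which may be why the post avoided it, but the article doesn't say.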