Came across this, which gives good insight into the 4 golden signals for top-level health tracking: <a href="https://blog.netsil.com/the-4-golden-signals-of-api-health-and-performance-in-cloud-native-applications-a6e87526e74#.uzu89hl16" rel="nofollow">https://blog.netsil.com/the-4-golden-signals-of-api-health-a...</a><p>One thing of note in the graph is the tracking of response size. This would be very useful for catching 200 responses with "Error" in the body, because the response size would drop drastically below a normal successful response payload size.<p>In addition to Latency, Error Rates, Throughput and Saturation, folks like Brendan Gregg @ Netflix have recommended tracking capacity.
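To make the response-size point concrete, here's a minimal sketch (my own illustration, not from the linked post), assuming the Prometheus Python client; the metric name and bucket boundaries are made up:

```python
# Sketch: track response payload size labelled by status code, so that
# "200 with an error body" shows up as an abnormally small response.
# Assumes prometheus_client; metric name and buckets are illustrative.
from prometheus_client import Histogram

RESPONSE_SIZE_BYTES = Histogram(
    "http_response_size_bytes",
    "Size of HTTP response bodies",
    ["status"],
    buckets=[64, 256, 1024, 4096, 16384, 65536],
)

def record_response(status_code: int, body: bytes) -> None:
    # A sudden shift of 200s into the smallest buckets suggests
    # "successful" responses that actually carry error text.
    RESPONSE_SIZE_BYTES.labels(status=str(status_code)).observe(len(body))
```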
> A histogram of the duration it took to serve a response to a request, also labelled by successes or errors.<p>I recommend against this; rather, have one overall duration metric and another metric tracking a count of failures.<p>The reason for this is that very often just the success latency will end up being graphed, and high overall latency due to timing-out failed requests will be missed.<p>The more information you put on a dashboard, the more chance someone will miss a subtlety like this in the interpretation. Particularly if debugging distributed systems isn't their forte, or they've been woken up in the middle of the night by a page.<p>This guide only covers what I'd consider online serving systems, I'd suggest a look at the Prometheus instrumentation guidelines on what sort of things to monitor for other types of systems: <a href="https://prometheus.io/docs/practices/instrumentation/" rel="nofollow">https://prometheus.io/docs/practices/instrumentation/</a>
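A minimal sketch of that layout, assuming the Prometheus Python client (the metric names are illustrative, not prescribed by the guide):

```python
# Sketch: one overall duration histogram plus a separate failure counter,
# rather than a duration histogram labelled by success/error.
# Assumes prometheus_client; names are illustrative.
import time
from prometheus_client import Counter, Histogram

REQUEST_DURATION = Histogram(
    "myapp_request_duration_seconds",
    "Time to serve a request, successes and failures together",
)
REQUEST_FAILURES = Counter(
    "myapp_request_failures_total",
    "Requests that ended in an error",
)

def handle(request, do_work):
    start = time.monotonic()
    try:
        return do_work(request)
    except Exception:
        REQUEST_FAILURES.inc()
        raise
    finally:
        # Timed-out and failed requests still land in the same histogram,
        # so a latency graph of this metric can't quietly exclude them.
        REQUEST_DURATION.observe(time.monotonic() - start)
```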
While this is good advice, I feel it is a bit too over-simplified.<p>Counting incoming and outgoing requests misses a lot of potential data points when determining "is this my fault?"<p>I work mainly in system integrations. If I only check the input:output ratio, then I may miss that some service providers return a 200 with a body of "<message>Error</message>".<p>A better approach is to make sure your systems understand how feedback comes back from downstream submissions, and to have a universal way of translating that feedback into a format your own service understands.<p>HTTP codes are (pretty much) universal. But let's say you forgot to include a header, forgot to base64 encode login details, or are simply using the wrong value for an API key. If your system knows that "this XML element means Y for provider X, and means Z in our own system", then you can better gauge issues as they come up, instead of waiting for customers to complain. This is also where tools like Splunk are handy, so you can be alerted to these kinds of errors as they happen.
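A rough sketch of that translation layer, for a provider that returns errors inside a 200 body; the element names and the mapping table here are hypothetical:

```python
# Sketch: normalise provider-specific feedback into one internal error
# vocabulary, even when the transport-level status is a 200.
# The XML element names and the mapping table are hypothetical.
import xml.etree.ElementTree as ET
from dataclasses import dataclass

@dataclass
class ProviderResult:
    ok: bool
    internal_code: str   # our own vocabulary, shared across providers
    detail: str

# "this XML element value means Y for provider X, and means Z in our system"
PROVIDER_X_MAP = {
    "AUTH_FAILED": "bad_credentials",
    "MISSING_HEADER": "malformed_request",
    "Error": "provider_error",
}

def extract_message(body: str) -> str:
    root = ET.fromstring(body)
    if root.tag == "message":          # body like "<message>Error</message>"
        return (root.text or "").strip()
    return (root.findtext("message") or "").strip()

def translate_provider_x(status_code: int, body: str) -> ProviderResult:
    message = extract_message(body)
    if status_code == 200 and message not in PROVIDER_X_MAP:
        return ProviderResult(True, "ok", message)
    code = PROVIDER_X_MAP.get(message, "unknown_provider_error")
    # This is the point to emit a metric or log line your alerting
    # (Splunk, etc.) can pick up, instead of waiting for a customer call.
    return ProviderResult(False, code, message)
```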
> A histogram of the duration it took to serve a response to a request, also labelled by successes or errors.<p>This is so much easier said than done. Most time series DBs that people use to instrument things quite simply cannot handle histogram data correctly. They make incorrect assumptions about the way roll-ups can happen, or they require you to be specific about resolution requirements before you can know them well.<p>Histogram data also tends to be very expensive to query, so it bogs down, preventing you from making the kinds of queries that are really valuable for diagnosing performance regressions.<p>Finally, visualization of histograms is really difficult because you need a third dimension to see them over time. Heat maps accomplish this but can be hard to read, and most dashboard systems don't have great options for "show this time period next to this time period", which is an incredibly common requirement when comparing latency histograms.
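To illustrate the roll-up problem with synthetic numbers (not tied to any particular TSDB): percentiles computed per interval can't be averaged into a longer window, whereas bucket counts or raw observations can be merged and the percentile recomputed:

```python
# Sketch: averaging per-interval percentiles gives a different (wrong)
# answer than recomputing the percentile over the merged data, which is
# what a correct histogram roll-up (summing bucket counts) does.
# Numbers are synthetic.
import numpy as np

rng = np.random.default_rng(0)
interval_a = rng.exponential(0.05, 10_000)              # a busy, fast minute
interval_b = np.append(rng.exponential(0.05, 100),      # a quiet minute...
                       np.full(50, 2.0))                 # ...with timeouts

# Wrong roll-up: average the two per-interval p99s.
wrong_p99 = np.mean([np.percentile(interval_a, 99),
                     np.percentile(interval_b, 99)])

# Correct roll-up: merge the observations (for a real histogram, sum the
# per-bucket counts) and recompute the percentile over the merged data.
merged_p99 = np.percentile(np.concatenate([interval_a, interval_b]), 99)

print(f"averaged p99s: {wrong_p99:.2f}s  vs  merged p99: {merged_p99:.2f}s")
```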
The whole series:<p><a href="https://honeycomb.io/blog/categories/instrumentation/" rel="nofollow">https://honeycomb.io/blog/categories/instrumentation/</a>
Author appears to use "downstream" and "upstream" to refer to "further down the stack" and "further up the stack".<p>Is this normal usage? Seems reversed to me.
Saturation and utilization are different things. For CPU time, utilization would be the share of cycles spent running user tasks out of total cycles, while saturation would be how much time was spent waiting in the run-queue. For disks, utilization could be IOPS, and saturation is time spent in I/O wait or queue sizes. For a network interface, utilization could be Gbps, and saturation is total time spent waiting to write to the sendq.
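A rough Linux-only sketch of that CPU distinction, assuming /proc is available (field layout per proc(5)); this is illustrative, not a production collector:

```python
# Sketch (Linux only): utilization from /proc/stat jiffies, and a crude
# saturation signal from runnable tasks vs available CPUs.
import os
import time

def read_cpu_jiffies():
    with open("/proc/stat") as f:
        values = list(map(int, f.readline().split()[1:]))  # aggregate "cpu" line
    idle = values[3] + values[4]                            # idle + iowait
    return sum(values), idle

def cpu_utilization(interval=1.0):
    total1, idle1 = read_cpu_jiffies()
    time.sleep(interval)
    total2, idle2 = read_cpu_jiffies()
    busy = (total2 - total1) - (idle2 - idle1)
    return busy / (total2 - total1)

def cpu_saturation():
    # Runnable tasks beyond the number of CPUs are sitting in the run-queue.
    runnable = 0
    with open("/proc/stat") as f:
        for line in f:
            if line.startswith("procs_running"):
                runnable = int(line.split()[1])
    return max(0, runnable - os.cpu_count())

print(f"utilization: {cpu_utilization():.0%}, saturation (queued tasks): {cpu_saturation()}")
```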
Every request to and from the app should be instrumented. Paying attention to the requests into the app is a good start, but you really need detailed instrumentation of all the downstream dependencies your service uses to process its requests in order to understand where the issue is. Often you're slow or throwing errors because a dependency you use is slow or throwing errors. Or maybe the upstream service that's complaining has changed its request pattern and is making more expensive queries. A small minority of requests is often responsible for most of the performance issues, so even if the overall volume hasn't changed, the composition and type of requests matter as well.
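As a hedged sketch of what that per-dependency instrumentation could look like, assuming the Prometheus Python client (the wrapper and label values are illustrative):

```python
# Sketch: time and count every outbound call, labelled by dependency and
# operation, so "which downstream is hurting us?" has an answer.
# Assumes prometheus_client; names and labels are illustrative.
from prometheus_client import Counter, Histogram

DEP_LATENCY = Histogram(
    "myapp_dependency_duration_seconds",
    "Duration of calls to downstream dependencies",
    ["dependency", "operation"],
)
DEP_ERRORS = Counter(
    "myapp_dependency_errors_total",
    "Failed calls to downstream dependencies",
    ["dependency", "operation"],
)

def call_dependency(dependency, operation, fn, *args, **kwargs):
    with DEP_LATENCY.labels(dependency, operation).time():
        try:
            return fn(*args, **kwargs)
        except Exception:
            DEP_ERRORS.labels(dependency, operation).inc()
            raise

# e.g. call_dependency("billing-api", "get_invoice", http_get, invoice_url)
```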
This is a very useful, common-sense post. I've built exactly that sort of thing using Redis. The expiry mechanism, combined with formatting date strings into time buckets for keys, gives you quite a bit of power and is simple to write.
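A minimal sketch of that pattern, assuming the redis-py client (key layout and retention are illustrative):

```python
# Sketch: per-minute counters in Redis, using a formatted timestamp in
# the key as the time bucket and EXPIRE so old buckets age out on their own.
# Assumes redis-py; key names and retention are illustrative.
from datetime import datetime, timedelta, timezone
import redis

r = redis.Redis()

def record_error(service: str, retention_seconds: int = 24 * 3600) -> None:
    bucket = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M")
    key = f"errors:{service}:{bucket}"
    pipe = r.pipeline()
    pipe.incr(key)
    pipe.expire(key, retention_seconds)   # old buckets clean themselves up
    pipe.execute()

def errors_last_minutes(service: str, minutes: int = 5) -> int:
    now = datetime.now(timezone.utc)
    keys = [
        f"errors:{service}:{(now - timedelta(minutes=i)).strftime('%Y-%m-%dT%H:%M')}"
        for i in range(minutes)
    ]
    return sum(int(v) for v in r.mget(keys) if v is not None)
```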