Came across this, which gives good insight into the 4 golden signals for top-level health tracking: <a href="https://blog.netsil.com/the-4-golden-signals-of-api-health-and-performance-in-cloud-native-applications-a6e87526e74#.uzu89hl16" rel="nofollow">https://blog.netsil.com/the-4-golden-signals-of-api-health-a...</a><p>One thing of note in the graph is the tracking of response size. This would be very useful for catching 200 responses with "Error" in the body, because the response size would drop drastically below a normal successful response payload size.<p>In addition to Latency, Error Rates, Throughput and Saturation, folks like Brendan Gregg @ Netflix have recommended tracking capacity.
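To make the response-size point concrete, here's a minimal sketch (my own illustration, not from the linked post), assuming the Prometheus Python client; the metric name and bucket boundaries are made up:

```python
# Sketch: track response payload size labelled by status code, so that
# "200 with an error body" shows up as an abnormally small response.
# Assumes prometheus_client; metric name and buckets are illustrative.
from prometheus_client import Histogram

RESPONSE_SIZE_BYTES = Histogram(
    "http_response_size_bytes",
    "Size of HTTP response bodies",
    ["status"],
    buckets=[64, 256, 1024, 4096, 16384, 65536],
)

def record_response(status_code: int, body: bytes) -> None:
    # A sudden shift of 200s into the smallest buckets suggests
    # "successful" responses that actually carry error text.
    RESPONSE_SIZE_BYTES.labels(status=str(status_code)).observe(len(body))
```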
> A histogram of the duration it took to serve a response to a request, also labelled by successes or errors.<p>I recommend against this; rather, have one overall duration metric and another metric tracking a count of failures.<p>The reason for this is that very often just the success latency will end up being graphed, and high overall latency due to timing-out failed requests will be missed.<p>The more information you put on a dashboard, the more chance someone will miss a subtlety like this in the interpretation. Particularly if debugging distributed systems isn't their forte, or they've been woken up in the middle of the night by a page.<p>This guide only covers what I'd consider online serving systems, I'd suggest a look at the Prometheus instrumentation guidelines on what sort of things to monitor for other types of systems: <a href="https://prometheus.io/docs/practices/instrumentation/" rel="nofollow">https://prometheus.io/docs/practices/instrumentation/</a>
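A minimal sketch of that layout, assuming the Prometheus Python client (the metric names are illustrative, not prescribed by the guide):

```python
# Sketch: one overall duration histogram plus a separate failure counter,
# rather than a duration histogram labelled by success/error.
# Assumes prometheus_client; names are illustrative.
import time
from prometheus_client import Counter, Histogram

REQUEST_DURATION = Histogram(
    "myapp_request_duration_seconds",
    "Time to serve a request, successes and failures together",
)
REQUEST_FAILURES = Counter(
    "myapp_request_failures_total",
    "Requests that ended in an error",
)

def handle(request, do_work):
    start = time.monotonic()
    try:
        return do_work(request)
    except Exception:
        REQUEST_FAILURES.inc()
        raise
    finally:
        # Timed-out and failed requests still land in the same histogram,
        # so a latency graph of this metric can't quietly exclude them.
        REQUEST_DURATION.observe(time.monotonic() - start)
```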
While this is good advice, I feel it is a bit too over-simplified.<p>Counting incoming and outgoing requests misses a lot of potential data points when determining "is this my fault?"<p>I work mainly in system integrations. If I only check the input:output ratio, then I may miss that some service providers return a 200 with a body of "<message>Error</message>".<p>A better approach is to make sure your systems understand how feedback comes back from downstream submissions, and to have a universal way of translating that feedback into a format your own service understands.<p>HTTP codes are (pretty much) universal. But let's say you forgot to include a header, forgot to base64 encode login details, or are simply using the wrong value for an API key. If your system knows that "this XML element means Y for provider X, and means Z in our own system", then you can better gauge issues as they come up, instead of waiting for customers to complain. This is also where tools like Splunk are handy, so you can be alerted to these kinds of errors as they happen.
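A rough sketch of that translation layer, for a provider that returns errors inside a 200 body; the element names and the mapping table here are hypothetical:

```python
# Sketch: normalise provider-specific feedback into one internal error
# vocabulary, even when the transport-level status is a 200.
# The XML element names and the mapping table are hypothetical.
import xml.etree.ElementTree as ET
from dataclasses import dataclass

@dataclass
class ProviderResult:
    ok: bool
    internal_code: str   # our own vocabulary, shared across providers
    detail: str

# "this XML element value means Y for provider X, and means Z in our system"
PROVIDER_X_MAP = {
    "AUTH_FAILED": "bad_credentials",
    "MISSING_HEADER": "malformed_request",
    "Error": "provider_error",
}

def extract_message(body: str) -> str:
    root = ET.fromstring(body)
    if root.tag == "message":          # body like "<message>Error</message>"
        return (root.text or "").strip()
    return (root.findtext("message") or "").strip()

def translate_provider_x(status_code: int, body: str) -> ProviderResult:
    message = extract_message(body)
    if status_code == 200 and message not in PROVIDER_X_MAP:
        return ProviderResult(True, "ok", message)
    code = PROVIDER_X_MAP.get(message, "unknown_provider_error")
    # This is the point to emit a metric or log line your alerting
    # (Splunk, etc.) can pick up, instead of waiting for a customer call.
    return ProviderResult(False, code, message)
```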
> A histogram of the duration it took to serve a response to a request, also labelled by successes or errors.<p>This is so much easier said than done. Most time series DBs that people use to instrument things quite simply cannot handle histogram data correctly. They make incorrect assumptions about the way roll-ups can happen, or they require you to be specific about resolution requirements before you can know them well.<p>Histogram data also tends to be very expensive to query, so it bogs down, preventing you from making the kinds of queries that are really valuable for diagnosing performance regressions.<p>Finally, visualization of histograms is really difficult because you need a third dimension to see them over time. Heat maps accomplish this but can be hard to read, and most dashboard systems don't have great options for "show this time period next to this time period", which is an incredibly common requirement when comparing latency histograms.
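To illustrate the roll-up problem with synthetic numbers (not tied to any particular TSDB): percentiles computed per interval can't be averaged into a longer window, whereas bucket counts or raw observations can be merged and the percentile recomputed:

```python
# Sketch: averaging per-interval percentiles gives a different (wrong)
# answer than recomputing the percentile over the merged data, which is
# what a correct histogram roll-up (summing bucket counts) does.
# Numbers are synthetic.
import numpy as np

rng = np.random.default_rng(0)
interval_a = rng.exponential(0.05, 10_000)              # a busy, fast minute
interval_b = np.append(rng.exponential(0.05, 100),      # a quiet minute...
                       np.full(50, 2.0))                 # ...with timeouts

# Wrong roll-up: average the two per-interval p99s.
wrong_p99 = np.mean([np.percentile(interval_a, 99),
                     np.percentile(interval_b, 99)])

# Correct roll-up: merge the observations (for a real histogram, sum the
# per-bucket counts) and recompute the percentile over the merged data.
merged_p99 = np.percentile(np.concatenate([interval_a, interval_b]), 99)

print(f"averaged p99s: {wrong_p99:.2f}s  vs  merged p99: {merged_p99:.2f}s")
```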
The whole series:<p><a href="https://honeycomb.io/blog/categories/instrumentation/" rel="nofollow">https://honeycomb.io/blog/categories/instrumentation/</a>
Author appears to use "downstream" and "upstream" to refer to "further down the stack" and "further up the stack".<p>Is this normal usage? Seems reversed to me.
Saturation and utilization are different things. For CPU time, utilization would be the share of cycles spent running user tasks out of total cycles, while saturation would be how much time was spent waiting in the run-queue. For disks, utilization could be IOPS, and saturation is time spent in I/O wait or queue sizes. For a network interface, utilization could be Gbps, and saturation is total time spent waiting to write to the sendq.
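A rough Linux-only sketch of that CPU distinction, assuming /proc is available (field layout per proc(5)); this is illustrative, not a production collector:

```python
# Sketch (Linux only): utilization from /proc/stat jiffies, and a crude
# saturation signal from runnable tasks vs available CPUs.
import os
import time

def read_cpu_jiffies():
    with open("/proc/stat") as f:
        values = list(map(int, f.readline().split()[1:]))  # aggregate "cpu" line
    idle = values[3] + values[4]                            # idle + iowait
    return sum(values), idle

def cpu_utilization(interval=1.0):
    total1, idle1 = read_cpu_jiffies()
    time.sleep(interval)
    total2, idle2 = read_cpu_jiffies()
    busy = (total2 - total1) - (idle2 - idle1)
    return busy / (total2 - total1)

def cpu_saturation():
    # Runnable tasks beyond the number of CPUs are sitting in the run-queue.
    runnable = 0
    with open("/proc/stat") as f:
        for line in f:
            if line.startswith("procs_running"):
                runnable = int(line.split()[1])
    return max(0, runnable - os.cpu_count())

print(f"utilization: {cpu_utilization():.0%}, saturation (queued tasks): {cpu_saturation()}")
```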
Every request to and from the app should be instrumented. Paying attention to the requests into the app is a good start, but you really need detailed instrumentation of all the downstream dependencies your service uses to process its requests in order to understand where the issue is. Often you're slow or throwing errors because a dependency you use is slow or throwing errors. Or maybe the upstream service that's complaining has changed its request pattern and is making more expensive queries. A small minority of requests is often responsible for most of the performance issues, so even if the overall volume hasn't changed, the composition and type of requests matter as well.
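As a hedged sketch of what that per-dependency instrumentation could look like, assuming the Prometheus Python client (the wrapper and label values are illustrative):

```python
# Sketch: time and count every outbound call, labelled by dependency and
# operation, so "which downstream is hurting us?" has an answer.
# Assumes prometheus_client; names and labels are illustrative.
from prometheus_client import Counter, Histogram

DEP_LATENCY = Histogram(
    "myapp_dependency_duration_seconds",
    "Duration of calls to downstream dependencies",
    ["dependency", "operation"],
)
DEP_ERRORS = Counter(
    "myapp_dependency_errors_total",
    "Failed calls to downstream dependencies",
    ["dependency", "operation"],
)

def call_dependency(dependency, operation, fn, *args, **kwargs):
    with DEP_LATENCY.labels(dependency, operation).time():
        try:
            return fn(*args, **kwargs)
        except Exception:
            DEP_ERRORS.labels(dependency, operation).inc()
            raise

# e.g. call_dependency("billing-api", "get_invoice", http_get, invoice_url)
```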
This is a very useful, common-sense post. I've built exactly that sort of thing using Redis. The expiry mechanism, combined with formatting date strings into time buckets for keys, gives you quite a bit of power and is simple to write.
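A minimal sketch of that pattern, assuming the redis-py client (key layout and retention are illustrative):

```python
# Sketch: per-minute counters in Redis, using a formatted timestamp in
# the key as the time bucket and EXPIRE so old buckets age out on their own.
# Assumes redis-py; key names and retention are illustrative.
from datetime import datetime, timedelta, timezone
import redis

r = redis.Redis()

def record_error(service: str, retention_seconds: int = 24 * 3600) -> None:
    bucket = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M")
    key = f"errors:{service}:{bucket}"
    pipe = r.pipeline()
    pipe.incr(key)
    pipe.expire(key, retention_seconds)   # old buckets clean themselves up
    pipe.execute()

def errors_last_minutes(service: str, minutes: int = 5) -> int:
    now = datetime.now(timezone.utc)
    keys = [
        f"errors:{service}:{(now - timedelta(minutes=i)).strftime('%Y-%m-%dT%H:%M')}"
        for i in range(minutes)
    ]
    return sum(int(v) for v in r.mget(keys) if v is not None)
```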