In a similar vein, something about queuing that has annoyed me as a developer at multiple large FANG corporations is poor thinking about queue metrics. The TLDR is that metrics provided by the queue itself are rarely helpful for knowing whether your service is healthy, and when it is not healthy they are not very useful for determining why.<p>Most queue processing services that I have seen have an alarm on (a) oldest message age, and (b) number of messages in the queue.<p>On every team I joined I quickly added a custom metric (c): the time of successful processing minus the time that the message was /initially/ added to the queue. This metric tends to uncover lots of nasty edge cases regarding retries, priority starvation, and P99 behavior that are hidden by (a) and (b).<p>Having 100000 messages in the queue is only an issue if they are not being processed at (at least) 100000/s. Having a 6-hour-old message in the queue is concerning, but maybe it is an extreme outlier, so alarming is unnecessary. But you can bet your bottom dollar that if your average processing latency spikes by 10x, you want to know about it.<p>The other nice thing about an end-to-end latency metric is early warning: (a) and (b) both tend to look great all the way up to the point of failure/back pressure and then they blow up excitingly. (c), on the other hand, will pick up on things like a slight increase in application latency, allowing you to diagnose beforehand whether your previously over-provisioned queue is becoming at-capacity or under-provisioned.
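<p>A rough sketch of what metric (c) could look like, in Python. Everything here is hypothetical and for illustration only: the names `Message`, `first_enqueued_at`, `EndToEndLatency`, and `should_alarm` are made up, and it assumes the producer stamps each message once and that retries carry that original stamp along instead of resetting it. The 10x threshold is just the figure from above, not a recommendation.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Message:
    body: str
    # Assumption: set once by the producer and preserved across retries,
    # so a re-enqueued message does not look "fresh" again.
    first_enqueued_at: float
    attempts: int = 0


@dataclass
class EndToEndLatency:
    """Collects metric (c): success time minus initial enqueue time."""
    samples: list = field(default_factory=list)

    def record_success(self, msg: Message, processed_at: float) -> float:
        latency = processed_at - msg.first_enqueued_at
        self.samples.append(latency)
        return latency

    def mean(self) -> float:
        if not self.samples:
            raise ValueError("no samples recorded yet")
        return sum(self.samples) / len(self.samples)

    def percentile(self, p: float) -> float:
        """Nearest-rank percentile, e.g. p=99 for P99."""
        if not self.samples:
            raise ValueError("no samples recorded yet")
        ordered = sorted(self.samples)
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]


def should_alarm(current_mean: float, baseline_mean: float,
                 factor: float = 10.0) -> bool:
    """True when average end-to-end latency is at least `factor` x baseline."""
    return current_mean >= factor * baseline_mean


# A message enqueued at t=100 that fails once, is re-enqueued at t=103,
# and finally succeeds at t=105. Measured from the retry it looks like 2s;
# measured end to end it is 5s.
tracker = EndToEndLatency()
retried = Message("order-1", first_enqueued_at=100.0, attempts=2)
tracker.record_success(retried, processed_at=105.0)
```

<p>In a real service the samples would go to whatever metrics system you already use instead of an in-memory list, and the baseline would come from a trailing window. The only part that matters is that the timestamp survives retries.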