I would add: you should start your monitoring with business metrics. Monitoring low-level things is good to have, but putting the whole emphasis on it misses the point. You should be able to answer at any point in time whether users are having a problem, what problem, how many users are affected, and what they are doing to work around it.

In other words, when a person is in the ER, doctors look for a heartbeat, temperature, and so on, not for some low-level metric like how many grams of oxygen a specific cell is consuming.
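To put that in code terms, here's a minimal sketch of a user-facing health check; the metric names and the 98% threshold are made up for illustration, not from the article:

```python
# Hypothetical business-level health check: alert on what users experience,
# not on low-level host metrics. Names and thresholds are illustrative.
from dataclasses import dataclass

@dataclass
class BusinessHealth:
    attempted_orders: int   # orders users tried to place in the last window
    completed_orders: int   # orders that actually went through

    def completion_rate(self) -> float:
        if self.attempted_orders == 0:
            return 1.0  # no demand is not the same as a failure
        return self.completed_orders / self.attempted_orders

def should_page(health: BusinessHealth, min_rate: float = 0.98) -> bool:
    """Page a human when the user-facing success rate drops, regardless of
    whether any CPU/memory/disk metric looks unhealthy."""
    return health.completion_rate() < min_rate

# Example: 1000 users tried to order, 940 succeeded -> page someone.
print(should_page(BusinessHealth(attempted_orders=1000, completed_orders=940)))
```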
This is a good guide. One thing I'd add:

While you're monitoring for traffic/errors/latency, throw in a minimum success rate. Make a good estimate of how many successful operations the monitored systems will perform per minute during the slowest hour of the year, and put in an alert if the throughput drops *below* that. You'd be surprised how many zero-errors/zero-successes faults happen in a complex system, and a minimum-throughput alarm will catch them.
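A rough sketch of what that alarm logic could look like, with a made-up floor of 50 successes per minute:

```python
# Hypothetical minimum-throughput alarm. The floor is a conservative estimate
# of successes/minute during the slowest hour of the year; the value is made up.
from typing import Optional

MIN_SUCCESSES_PER_MINUTE = 50  # slowest expected legitimate traffic

def throughput_alarm(successes_last_minute: int, errors_last_minute: int) -> Optional[str]:
    if errors_last_minute > 0:
        return "error-rate alarm"
    if successes_last_minute < MIN_SUCCESSES_PER_MINUTE:
        # Catches the "0 errors / 0 successes" failure mode: upstream stopped
        # sending traffic, a queue stalled, or a deploy silently drops requests.
        return "minimum-throughput alarm"
    return None

# 0 errors and 0 successes: an error-rate alert stays quiet, this one fires.
print(throughput_alarm(successes_last_minute=0, errors_last_minute=0))
```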
Nice article.
The amount of in-house procedures/tooling developed by the backend team seems impressive (maybe some not-invented-here syndrome, but I can't really judge).
What I am astonished at, though, is that the backend part of Uber seems so professional while the Uber Android app feels like it was built by two junior outsourced devs. I have rarely used an app that felt so buggy and awkward. (Aside from regular crashes, e.g. when registering for Jump bikes in Germany inside the app, I had to restart the app to make the corresponding menu item appear.)
You need to have a strategy for backward (and forward) compatibility for your components. If the environment is large enough, you don't know exactly which component is running what version of the code, as teams are constantly upgrading (or holding back on upgrading) some part of your system. This includes extra parameters to RPC calls, data type evolution, and schema evolution. Without a decent strategy you'll be in over your head quickly (see the sketch below the tip).
(Tip: a version number as part of your API, e.g. v0.0.1, ain't gonna be enough.)
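As a rough illustration of the tolerant-handling side of this (the endpoint and all field names are invented):

```python
# Hypothetical sketch of a tolerant RPC handler: newer fields are optional with
# safe defaults, and unknown fields from newer callers are ignored rather than
# rejected, so old and new versions can coexist during rollouts.
from typing import Any, Dict

def handle_create_trip(payload: Dict[str, Any]) -> Dict[str, Any]:
    rider_id = payload["rider_id"]                          # required since v1
    # Added later: old clients won't send it, so default instead of failing.
    payment_method = payload.get("payment_method", "default_card")
    # Forward compatibility: silently ignore fields this version doesn't know,
    # so newer callers can deploy before this service is upgraded.
    return {"rider_id": rider_id, "payment_method": payment_method, "status": "created"}

# Old caller (no payment_method) and newer caller (extra field) both work.
print(handle_create_trip({"rider_id": "r1"}))
print(handle_create_trip({"rider_id": "r2", "payment_method": "cash", "promo_code": "X"}))
```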
Interesting, although it is on the light side. For example, it doesn't talk about chaos testing, defining effective and comprehensive metrics (KPIs), alert noise, or running services like databases in an active-active (hot-hot) mode.
> I like to think of the effort to operate a distributed system being similar to operating a large organization, like a hospital.

Clearly never worked for a hospital. Hospitals need good engineers (and often don’t have them). Our ‘nines’ are embarrassing...
Did you do all these things by yourself?

Really great content, but I was really taken aback by the “I” used everywhere. Maybe it’s a new thing that I am not hip to and ought to try - “I built and ran transaction processing software for Bloomberg! This is what I learned!”

But perhaps you really did do all that by yourself; in that case, sorry that I doubted you - it looks like a lot.
I don't really agree that this list could have come about through discussions with engineers at Google, Facebook, etc. The more computers you have, the less important it becomes to monitor junk like CPU and memory utilization of individual machines. Host-level CPU usage alerting can't possibly be a "must-have" if there are extremely large distributed systems operating without it.

If you've designed software where the whole service can degrade based on the CPU consumption of a single machine, that right there is your problem, and no amount of alerting can help you.
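For what it's worth, a minimal sketch of the aggregate-first view I mean; the hosts, request counts, and 99.9% threshold are hypothetical:

```python
# Hypothetical fleet-level check: alert on what the service as a whole serves,
# not on any single host's CPU. Host names and numbers are made up.
hosts = [
    {"host": "web-01", "cpu": 0.97, "requests_ok": 4999, "requests_total": 5000},
    {"host": "web-02", "cpu": 0.35, "requests_ok": 5000, "requests_total": 5000},
    {"host": "web-03", "cpu": 0.40, "requests_ok": 4999, "requests_total": 5000},
]

ok = sum(h["requests_ok"] for h in hosts)
total = sum(h["requests_total"] for h in hosts)
success_rate = ok / total

# One hot host (web-01 at 97% CPU) is not worth a page if the fleet is healthy.
if success_rate < 0.999:
    print(f"page: fleet success rate {success_rate:.4%}")
else:
    print(f"fine: fleet success rate {success_rate:.4%}, no per-host CPU page")
```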