So there is one obscure metric "service is available, i.e. can do its job", and this metric has different attributes: there are actual metric values (SLIs), there are internal goals (SLOs) and there are legally binding promises (SLAs) to users/customers. I would argue that this is not much content here.<p>Content, imo, would be something like this: We define "available" as "processor_load<99% and disk_load<99% and ram_load<99% and server responds with http 200 on port xyz", because reason_a, reason_b, reason_c. But other people could argue that it is not as much about the node but about how service_x is experienced, so one could track the speed of http responses to user requests and they should be under 0.1sec over 95% of the time. etc...<p>That you should track metrics, that you should set goals, and that you should define SLAs with your customers/users is standard business practice, not new knowledge.