My Philosophy on Alerting (2014)

12 points by mikesabbagh almost 4 years ago

3 comments

closeparen almost 4 years ago
The space between "this shouldn't be happening" and "here's exactly what's going wrong and the steps to fix it" is at least half of the intellectual content of software engineering. If you really want comprehensive, actionable runbooks, then you need to budget at least as long to develop them as you did to develop the software. I think it's actually pretty reasonable that this is usually computed "on demand" for the failures that actually happen. Unless you're doing the Space Shuttle or something.
dredmorbius almost 4 years ago
I'd written my own commentary on Bob's excellent suggestions at the time: https://old.reddit.com/r/dredmorbius/comments/2j9xri/alerting_response_google_site_reliability/

Today I'd update my thoughts in a few directions.

Operations is far more about risk management, mitigation, preemption, and response than other elements of tech work. Monitoring is part of that, but so are drilling, analysis, and proceduralising response. Selling ops as a risk-management strategy might also address the comprehension failures management often reveals in response to concerns raised, or cost-based objections. See: https://joindiaspora.com/posts/647a6300a2220139344a002590d8e506

Another piece I've written on problem resolution applies: https://old.reddit.com/r/dredmorbius/comments/2fsr0g/hierarchy_of_failures_in_problem_resolution/

On "cause-based alerts": there are *states* and there are *levels*, and sometimes what you want to track are *levels*. Available storage, file handles, sockets, or other consumable resources would be well worth logging and alerting on. The key is that the alerts should trigger some *action*, which is often not the case.

"Normalization of Deviance" is a term used by Diane Vaughan (I also associate it with Charles Perrow of *Normal Accidents*), and it poses a potential ... risk ... with Ewaschuk's approach: alerts which *don't* lead to an actionable event might be either ignored or muted, though they *do* in fact represent risks. Normalising deviance leads to eventual catastrophic failures: the *Challenger* and *Columbia* disasters, Three Mile Island and Chernobyl, Champlain Towers South, ...

https://en.wikipedia.org/wiki/Normalization_of_deviance
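To make the levels-vs-states point concrete, a minimal Python sketch of level-based alerting on consumable resources (free disk space and Linux system-wide file descriptors) might look like the following; the thresholds, paths, and print-based alert are hypothetical stand-ins for a real monitoring and paging setup:

    #!/usr/bin/env python3
    # Minimal sketch of "level"-based alerting on consumable resources.
    # Thresholds and the print()-based alert are illustrative stand-ins;
    # a real setup would notify through a proper paging channel.
    import shutil

    DISK_FREE_MIN = 0.10  # hypothetical: alert when <10% of the filesystem is free
    FD_USED_MAX = 0.80    # hypothetical: alert when >80% of system-wide fds are in use

    def check_disk(path="/"):
        usage = shutil.disk_usage(path)
        free = usage.free / usage.total
        if free < DISK_FREE_MIN:
            # The alert names the action: reclaim space before the level hits zero.
            return [f"disk {path}: only {free:.1%} free -- reclaim space"]
        return []

    def check_file_descriptors(stat_file="/proc/sys/fs/file-nr"):
        # Linux-specific: the file holds "allocated  unused  maximum".
        with open(stat_file) as f:
            allocated, _unused, maximum = (int(x) for x in f.read().split())
        used = allocated / maximum
        if used > FD_USED_MAX:
            return [f"file descriptors: {used:.1%} of system max in use -- find the leak"]
        return []

    if __name__ == "__main__":
        for alert in check_disk() + check_file_descriptors():
            print("ALERT:", alert)  # replace with a real notification channel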
mdaniel almost 4 years ago
And to complete the loop, the HN link at the top of that doc: https://news.ycombinator.com/item?id=8450147 (which for some reason never turned into a live hyperlink for me; I had to use the PDF version to extract it)