My Philosophy on Alerting (2014)

12 points by mikesabbagh almost 4 years ago

3 comments

closeparen almost 4 years ago
The space between "this shouldn't be happening" and "here's exactly what's going wrong and the steps to fix it" is at least half of the intellectual content of software engineering. If you really want comprehensive, actionable runbooks, then you need to budget at least as long to develop them as you did to develop the software. I think it's actually pretty reasonable that this is usually computed "on demand" for the failures that actually happen. Unless you're doing the Space Shuttle or something.
dredmorbius almost 4 years ago
I'd written my own commentary on Bob's excellent suggestions at the time: https://old.reddit.com/r/dredmorbius/comments/2j9xri/alerting_response_google_site_reliability/

Today I'd update my thoughts in a few directions.

Operations is far more about risk management, mitigation, preemption, and response than other elements of tech work. Monitoring is part of that, but so are drilling, analysis, and proceduralising response. Selling ops as a risk-management strategy might also address the comprehension failures management often reveals in response to concerns raised, or cost-based objections. See: https://joindiaspora.com/posts/647a6300a2220139344a002590d8e506

Another piece I've written on problem resolution applies: https://old.reddit.com/r/dredmorbius/comments/2fsr0g/hierarchy_of_failures_in_problem_resolution/

On "cause-based alerts": there are *states* and there are *levels*, and sometimes what you want to track are *levels*. Available storage, file handles, sockets, or other consumable resources would be well worth logging and alerting on. The key is that the alerts should trigger some *action*, which is often not the case.

"Normalization of Deviance" is a term used by Diane Vaughan (I also associate it with Charles Perrow of *Normal Accidents*), and it poses a potential ... risk ... with Ewaschuk's approach: alerts which *don't* lead to an actionable event might be either ignored or muted, though they *do* in fact represent risks. Normalising deviance leads to eventual catastrophic failures: the *Challenger* and *Columbia* disasters, Three Mile Island and Chernobyl, Champlain Towers South, ...

https://en.wikipedia.org/wiki/Normalization_of_deviance
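To make the levels-vs-states point concrete, a minimal Python sketch of level-based alerting on consumable resources (free disk space and Linux system-wide file descriptors) might look like the following; the thresholds, paths, and print-based alert are hypothetical stand-ins for a real monitoring and paging setup:

    #!/usr/bin/env python3
    # Minimal sketch of "level"-based alerting on consumable resources.
    # Thresholds and the print()-based alert are illustrative stand-ins;
    # a real setup would notify through a proper paging channel.
    import shutil

    DISK_FREE_MIN = 0.10  # hypothetical: alert when <10% of the filesystem is free
    FD_USED_MAX = 0.80    # hypothetical: alert when >80% of system-wide fds are in use

    def check_disk(path="/"):
        usage = shutil.disk_usage(path)
        free = usage.free / usage.total
        if free < DISK_FREE_MIN:
            # The alert names the action: reclaim space before the level hits zero.
            return [f"disk {path}: only {free:.1%} free -- reclaim space"]
        return []

    def check_file_descriptors(stat_file="/proc/sys/fs/file-nr"):
        # Linux-specific: the file holds "allocated  unused  maximum".
        with open(stat_file) as f:
            allocated, _unused, maximum = (int(x) for x in f.read().split())
        used = allocated / maximum
        if used > FD_USED_MAX:
            return [f"file descriptors: {used:.1%} of system max in use -- find the leak"]
        return []

    if __name__ == "__main__":
        for alert in check_disk() + check_file_descriptors():
            print("ALERT:", alert)  # replace with a real notification channel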
mdaniel almost 4 years ago
And to complete the loop, the HN link at the top of that doc: https://news.ycombinator.com/item?id=8450147 (which for some reason never turned into a live hyperlink for me; I had to use the PDF version to extract it)