TechEcho
Night of the Runbooks: a DevOps horror story

85 points by gus_leonel almost 2 years ago

13 comments

sublinear almost 2 years ago
To make this dream come true is far beyond the scope of just dev ops.

It's also silly because if the problems are known this well, the runbook will only be this good for a moment while the devs are working on fixing the bugs, and the runbook will need to be updated constantly after each release.

This requires more testing than most are willing to do, and for dev ops to *still always be on call to find the new bugs first* and document them this way while waiting for devs to fix them, *and* for management to always reserve time to keep this updated.

This means the release cycle has to be long enough for this to ever happen, which is also unlikely.
ang_cire almost 2 years ago
Wait, is the idea that your SRE/Ops person doing rollbacks, failovers, and restarts without understanding why... is a *good* thing?

Runbooks are supposed to give you a clear path during an incident so you don't run down rabbit holes, but you still shouldn't be running them without understanding what the impact of their steps actually is.

I thought the horror part was that some random person, who doesn't know the system, was doing all this stuff without just contacting the people who do know it.

If the runbook was sufficient to handle it, great! The proper team will have just as easy a time doing the steps, and not be mucking around in something they don't own and won't be the ones to fix when the runbook process itself breaks something because Error A is not Error J, but an unfamiliar person can't tell the difference.
8organicbits almost 2 years ago
> I don't know anything about process-payments; it belongs to a completely different team. I wouldn't even know how to check it.

I get that the point of the article is to sell the value of run books, which are amazing, but are people really going on-call for the first time without any training or knowledge of the services they'll be supporting? Even with perfect run books that seems unwise.
midasuni almost 2 years ago
What did Morgan (in the dream) actually accomplish that a small shell script couldn’t?

Literally each step was “if error then do automated action”.

Alas, I have emails which go to a circuit provider when their circuit goes down, as they can’t seem to monitor it. It does automated triage, for example checking my router is up and powered, checking the link light on their adva, etc.

It tells them there’s no power outage. They inevitably ask the first question: “Is there a power problem?”

A run book, or any instructions, is only as good as the person following them.
rjmunro almost 2 years ago
I thought it was going to turn out to be a sophisticated phishing attack.
mynegation almost 2 years ago
I was baffled by the lack of actual troubleshooting. OK, so graphs show increased failures, so what? What is happening? Spike of inbound traffic? Hardware failure? Increased connection pauses because of DNS? Database migration mishap? Without at least a high-level understanding of the underlying reason, restarts and version rollbacks may not help or may even aggravate the situation. Am I missing anything here?
slowhand09 almost 2 years ago
They is responsible for this! And We is gonna make sure They pay for the time We lost working on It, to fix They's problem.

We is out! No diggity.
throwawaaarrgh almost 2 years ago
As expected, this is an Ops story, not DevOps.
wkdneidbwf almost 2 years ago
haha, it makes sense this is on a site with “consulting” in the name: it's unrealistic and pointless. the M. Night Shyamalan twist is that the whole thing is the “horror” portion.

in the story our protagonist has no knowledge of the service she is on call for and displays no ability to troubleshoot, reason about, or understand the issue she’s responding to.

instead, she clicks a series of buttons in a runbook to resolve the most basic and most happy-path production issue you’ll ever see.

aaaaand this is about devops how?
GRBLDeveloped almost 2 years ago
Anyone here have buttons in runbooks? We normally have links to Jenkins jobs in runbooks kept in git. What are you using?
larsrc almost 2 years ago
And here I was waiting for it to turn out to be a phishing attempt due to clicking the link in the alert.
bad_username almost 2 years ago
As a non-native English speaker, I still find it difficult to read texts where a singular person is referred to by a plural pronoun. Every "they" becomes ambiguous and requires more mental cycles to unpack. Is it about Morgan, or some unrelated group of people, or Morgan plus the group?

I understand the cultural reasons, but I regret the loss of linguistic clarity.
amelius almost 2 years ago
Looks more like a meme-hater's horror story.