Making this dream come true is far beyond the scope of dev ops alone.<p>It's also silly: if the problems are known this well, the runbook will only be this good for a moment while the devs are fixing the bugs, and it will need to be updated after every release.<p>This requires more testing than most are willing to do, requires dev ops to <i>still always be on call to find the new bugs first</i> and document them this way while waiting for devs to fix them, <i>and</i> requires management to always reserve time to keep this updated.<p>It also assumes a release cycle long enough for any of this to happen, which is unlikely too.
Wait, is the idea that your SRE/Ops person doing rollbacks, failovers, and restarts without understanding why... is a <i>good</i> thing?<p>Runbooks are supposed to give you a clear path during an incident so you don't run down rabbit holes, but you still shouldn't be running them without understanding what the impact of their steps actually is.<p>I thought the horror part was that some random person, who doesn't know the system, was doing all this stuff without just contacting the people who do know it.<p>If the runbook was sufficient to handle it, great! The proper team will have just as easy a time doing the steps, and nobody will be mucking around in something they don't own and won't be the ones to fix when the runbook process itself breaks something because Error A is not Error J and someone unfamiliar with the system can't tell the difference.
> I don't know anything about process-payments; it belongs to a completely different team. I wouldn't even know how to check it.<p>I get that the point of the article is to sell the value of run books, which are amazing, but are people really going on-call for the first time without any training or knowledge of the services they'll be supporting? Even with perfect run books that seems unwise.
What did Morgan (in the dream) actually accomplish that a small shell script couldn’t?<p>Literally each step was “if error then do automated action”.<p>Alas, I have automated emails that go to a circuit provider when their circuit goes down, since they can’t seem to monitor it themselves. The automation does triage first, for example checking that my router is up and powered, checking the link light on their ADVA, etc.<p>It tells them there’s no power outage. They inevitably ask the first question anyway: “Is there a power problem?”<p>A run book, or any set of instructions, is only as good as the person following it.
I was baffled by the lack of actual troubleshooting. OK, so the graphs show increased failures, so what? What is happening? A spike in inbound traffic? Hardware failure? Increased connection pauses because of DNS? A database migration mishap? Without at least a high-level understanding of the underlying cause, restarts and version rollbacks may not help, or may even aggravate the situation. Am I missing anything here?
They is responsible for this! And We is gonna make sure They pay for the time We lost working on It, to fix They's problem.<p>We is out! No diggity.
haha, it makes sense this is on a site with “consulting” in the name—it's unrealistic and pointless. the M. Night Shyamalan twist is that the whole thing is the “horror” portion.<p>in the story our protagonist has no knowledge of the service she is on call for and displays no ability to troubleshoot, reason about, or understand the issue she’s responding to.<p>instead, she clicks a series of buttons in a runbook to resolve the most basic and most happy-path production issue you’ll ever see.<p>aaaaand this is about devops how?
As a non-native English speaker, I still find it difficult to read texts where a singular person is referred to by a plural pronoun. Every "they" becomes ambiguous and requires more mental cycles to unpack. Is it about Morgan, or some unrelated group of people, or Morgan plus the group?<p>I understand the cultural reasons, but I regret the loss of linguistic clarity.