Fundamentals of Incident Management

171 points by bitfield almost 4 years ago

9 comments

gengelbro almost 4 years ago
The 'cool tactical' framing this article attempts to convey isn't inspiring to me.

I've worked as an on-call for a fundamental backbone service of the internet in the past and been paged into middle-of-the-night outages. It's harrowing and exhausting. Cool names like 'incident commander' do not change this.

We also had a "see ya in the morning" culture. Instead, I'd be much more impressed by a "see ya in the afternoon, get some sleep" culture.
krisoft almost 4 years ago
> which captures all the key log files and status information from the ailing machine.

Machine? As in, a singular machine goes down and you wake up 5 people? That just sounds like bad planning.

> Pearson is spinning up a new cloud server, and Rawlings checks the documentation and procedures for migrating websites, getting everything ready to run so that not even a second is wasted.

Heroic. But in reality you have already wasted minutes. Why is this not all automated?

I understand that this is a simulated scenario. Maybe the situation was simplified for clarity, but if a single machine going down leads to this amount of heroics, then you should work on those fundamentals. In my opinion.
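As a rough sketch of the kind of automation krisoft is arguing for, the hypothetical Python loop below captures diagnostics from an ailing host and triggers a pre-built replacement instead of paging five people. The host names, diagnostics path, and the "cloud-cli" provisioning command are assumptions for illustration only, not anything from the article or the comment.

```python
import subprocess
import time

FLEET = ["web-1", "web-2"]   # hypothetical hosts
UNHEALTHY_AFTER = 3          # consecutive failed checks before acting

def host_is_healthy(host: str) -> bool:
    """Hypothetical check: an ssh no-op exiting 0 counts as healthy."""
    result = subprocess.run(
        ["ssh", "-o", "ConnectTimeout=5", host, "true"],
        capture_output=True,
    )
    return result.returncode == 0

def capture_diagnostics(host: str) -> None:
    """Best-effort grab of key logs and status info from the ailing machine."""
    with open(f"/srv/diagnostics/{host}.tar.gz", "wb") as out:
        subprocess.run(["ssh", host, "tar", "czf", "-", "/var/log"],
                       stdout=out, check=False)

def provision_replacement(host: str) -> None:
    """Stand up a pre-imaged replacement; 'cloud-cli' is a placeholder for
    whatever provisioning tool is actually in use."""
    subprocess.run(["cloud-cli", "servers", "create",
                    "--image", "web-base", "--name", f"{host}-replacement"],
                   check=False)

def main() -> None:
    failures = {host: 0 for host in FLEET}
    while True:
        for host in FLEET:
            failures[host] = 0 if host_is_healthy(host) else failures[host] + 1
            if failures[host] == UNHEALTHY_AFTER:
                capture_diagnostics(host)
                provision_replacement(host)
        time.sleep(30)

if __name__ == "__main__":
    main()
```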
quartz almost 4 years ago
Nice to see articles like this describing a company's incident response process and a positive approach to incident culture via gamedays (disclaimer: I'm a cofounder at Kintaba [1], an incident management startup).

Regarding gamedays specifically: I've found that many company leaders don't embrace them because culturally they're not really aligned with the idea that incidents and outages aren't 100% preventable.

It's a mistake to think of the incident management muscle as one you'd like exercised as little as possible, when in reality it's something that should be in top form, because keeping it there comes with all kinds of downstream value for the company (a positive culture towards resiliency, openness, team building, honesty about technical risk, etc.).

Sadly, this can be a difficult mindset to break out of, especially if you come from a company mired in "don't tell the exec unless it's so bad they'll find out themselves anyway."

Relatedly, the desire to drop the incident count to zero discourages recordkeeping of "near-miss" incidents, which generally deserve the same learning process (postmortem, follow-up action items, etc.) as major incidents and game days.

Hopefully this outdated attitude continues to die off.

If you're just getting started with incident response or are interested in the space, I highly recommend:

- For basic practices: Google's SRE chapters on incident management [2]

- For the history of why we prepare for incidents and how we learn from them effectively: Sidney Dekker's Field Guide to Understanding Human Error [3]

[1] https://kintaba.com

[2] https://sre.google/sre-book/managing-incidents/

[3] https://www.amazon.com/Field-Guide-Understanding-Human-Error/dp/1472439058
blamestross almost 4 years ago
I've done a LOT of incident management and I'm not happy about it. The biggest issue I have run into, other than burnout, is this:

Thinking and reasoning under pressure are the enemy. Make as many decisions in advance as possible. Make flowcharts and decision trees with "decision criteria" already written down.

If you have to figure something out or make a "decision", then things are really, really bad. That happens sometimes, but when teams don't prep at all for incident management (pre-determined plans for common classes of problem), every incident is "really, really bad".

If I have a low-risk, low-cost action with low confidence of a high reward, I'm going to do it and just tell people it happened. Asking means I just lost a half-hour-plus worth of money; if I had just done it and been wrong, we would have lost 2 minutes of money. When management asks me why I did that, I point at the doc I wrote that my coworkers reviewed and mostly forgot about.

A really common example is "it looks like most of the errors are in datacenter X", so you fail out of the datacenter. Maybe it was sampling bias or some other issue and it doesn't help, maybe the problem follows the traffic, maybe it just suddenly makes things better. No matter what, we get signal. Establish well in advance of a situation what the common "solutions" to problems are, and if you are on call and responding, just DO them and document and communicate as you do.
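A minimal sketch of what "decision criteria already written down" could look like in code form. The thresholds, datacenter names, and actions are hypothetical examples, not blamestross's actual runbook; the point is that the conditions and responses are agreed on and reviewed before any incident.

```python
# Hypothetical pre-agreed decision criteria, reviewed ahead of time so the
# on-call executes rather than deliberates in the middle of the night.
RUNBOOK = [
    {
        "name": "errors concentrated in one datacenter",
        # Share of total errors attributable to the worst datacenter.
        "condition": lambda errors_by_dc: max(errors_by_dc.values())
        / sum(errors_by_dc.values()) > 0.7,
        "action": "drain traffic from the worst datacenter, then announce it",
    },
    {
        "name": "error rate elevated everywhere",
        "condition": lambda errors_by_dc: min(errors_by_dc.values()) > 100,
        "action": "roll back the most recent deploy, then announce it",
    },
]

def decide(errors_by_dc: dict[str, int]) -> str:
    """Return the pre-agreed action for the current error distribution."""
    for entry in RUNBOOK:
        if entry["condition"](errors_by_dc):
            return entry["action"]
    return "no pre-written rule matches: escalate and start diagnosing"

# Example: most errors in us-east -> fail out of it first, explain afterwards.
print(decide({"us-east": 900, "us-west": 40, "eu-west": 35}))
```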
mimir almost 4 years ago
It sort of baffles me how much engineer time is seemingly spent here designing and running these "gamedays" versus just improving and automating the underlying systems. Don't glorify getting paged; glorify systems that can automatically heal themselves.

I spend a good amount of time doing incident management and reliability work.

Red team/blue team gamedays seem like a waste of time. Either you are so early in your reliability journey that trivial things like "does my database fail over" are interesting things to test (in which case just fix it), or you're a more experienced team and there's little low-hanging reliability fruit left. In the latter case, gamedays seem unlikely to closely mimic a real-world incident. Since the low-hanging fruit is gone, all your serious incidents tend to be complex failure interactions between various system components. To resolve them quickly, you simply want all the people with deep context on those systems quickly coming up with and testing competing hypotheses about what might be wrong. Incident management only really matters in the sense that you want to allow the people with the most system context to focus on fixing the actual system. Serious incident management really only comes into play when the issue is large enough to threaten the company and require coordinated work from many orgs/teams.

My team and I spend most of our time thinking about how we can automate any repetitive tasks or failover. In cases where something can't be automated, we think about how we can increase the observability of the system, so that future issues can be resolved faster.
spa3thyb almost 4 years ago
There is a month and day, Feb 15, in the header, but no year. I can't figure out if that's ironic or apropos, since this story reads like a thriller from perhaps ten years ago, but the post date appears to have been 2020-02-15. Yikes.
rachelbythebay almost 4 years ago
I may never understand why some places are all about assigning titles and roles in this kind of thing. You need one, maybe two, plus a whole whack of technical skills from everyone else.

Also, conference calls are death.
ipaddr almost 4 years ago
So they are testing against fully awake people at 2:30pm and expecting similar results at 4:30am after heavy drinking.
denton-scratch almost 4 years ago
It doesn't match my experience with a real incident.

I was a dev in a small web company (10 staff), moonlighting as sysadmin. Our webserver had 40 sites on it. It was hit by a not-very-clever zero-day exploit, and most of the websites were now running the attacker's scripts.

It fell to me to sort it out; the rest of the crew were to keep on coding websites. The ISP had cut off the server's outbound email, because it was spewing spam. So I spent about an hour trying to find the malicious scripts, before I realised that I could never be certain I'd found them all.

You get an impulse to panic when you realise that the company's future (and your job) depends on you not screwing up, and you're facing a problem you've never faced before.

So I commissioned a new machine and configured it. I started moving sites across from the old machine to the new one. After about three sites, I decided to script the moving work. Cool.

But the sites weren't all the same: some were Drupal (different versions), some were WordPress, some were custom PHP. The script worked for about 30 of the sites, with a lot of per-site manual tinkering.

Note that for the most part, the sites weren't under revision control; there were backups in zip files, from various dates, for some of the sites. And I'd never worked on most of those sites, each of which had its own quirks. So I spent the next week making every site deploy correctly from the RCS.

I then spent about a week getting this automated, so that in a future incident we could get running again quickly. Happily we had a generously configured Xen server, and I could test the process on VMs.

My colleagues weren't allowed to help out; they were supposed to go on making websites. And I got resistance from my boss, demanding status updates ("are we there yet?").

The happy outcome is that that work became the kernel of a proper CI pipeline, and it provoked a fairly deep change in the way the company worked. By the end, I knew all about every site the company hosted.

We were just a web shop; most web shops are (or were) like this. If I was doing routine sysadmin instead of coding websites, I was watched like a hawk to make sure I wasn't doing anything 'unnecessary'.

This incident gave me the authority to do the sysadmin job properly, and in fact it saved me a lot of sysadmin time: previously, if a dev wanted a new version of a site deployed, I had to interrupt whatever I was doing to deploy it. With the CI pipeline, provided the site had passed a testing and review stage, it could be deployed to production by the dev himself.

It would have been cool to be able to do recovery drills, rotating roles and so on, but it was enough for my bosses that more than one person knew how to rebuild the server from scratch, and that it could be done in 30 minutes.

Life in a small web shop could get exciting, occasionally.
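A rough sketch of the kind of per-site migration script described above, assuming rsync-able document roots and one MySQL database per site. The host names, paths, and site list are hypothetical, and as the comment notes, each Drupal/WordPress/custom-PHP site would still need its own per-site tinkering.

```python
import subprocess

OLD_HOST = "old-server"                     # hypothetical hostnames
NEW_HOST = "new-server"
SITES = ["example-shop", "example-blog"]    # hypothetical site list

def run(cmd: list[str]) -> None:
    """Run a command and fail loudly if it exits non-zero."""
    subprocess.run(cmd, check=True)

def migrate(site: str) -> None:
    docroot = f"/var/www/{site}"
    # Copy the document root across; --delete keeps re-runs idempotent.
    run(["rsync", "-a", "--delete",
         f"{OLD_HOST}:{docroot}/", f"{NEW_HOST}:{docroot}/"])
    # Dump the site's database on the old box and load it on the new one.
    dump = subprocess.run(["ssh", OLD_HOST, "mysqldump", site],
                          capture_output=True, check=True)
    subprocess.run(["ssh", NEW_HOST, "mysql", site],
                   input=dump.stdout, check=True)
    # Per-site quirks (Drupal settings.php, wp-config.php, custom PHP config)
    # still have to be handled by hand or in site-specific hooks.

if __name__ == "__main__":
    for site in SITES:
        migrate(site)
        print(f"migrated {site}")
```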