Ask HN: What is the best postmortem you've seen?

63 points by pbohun, about 2 years ago
It seems like there are a lot of examples of companies handling a security breach or loss of service poorly. Are there examples when a company handled an incident well, especially with a great postmortem writeup?

17 comments

yodon, about 2 years ago
Edward Tufte's analysis of the Space Shuttle Columbia explosion[0] is by far the most informative post mortem I've seen. It directly impacted everything I've written since reading it.
If you hit the link, you'll see the page appears to be a wall of text, not a simple slide or two. As you read deeper into the report, you'll understand that's an intentional aspect of the report. (I'll also note this is the Columbia explosion, not the better-known Challenger disaster O-ring post-mortem discussed by Richard Feynman in his autobiography[1], even though that's a great post mortem as well.)
[0] https://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0001yB
[1] https://www.amazon.com/What-Care-Other-People-Think/dp/0393355640/
geuis, about 2 years ago
An old colleague of mine worked pretty extensively with a backend engineer on a script to do some data migration on user accounts. The other engineer did most of the work on the script itself. Mind you, the company had millions of users and didn't use staging databases. (No idea if this is still true.)
Come D-day, my colleague ran the script on a limited group of users (100k or so) to validate. I forget the details, but something in the script was incorrect and ended up breaking some features for all of those users.
Once reports started coming in, they were super worried and semi-freaked out. A war room was set up that day and all the people involved jumped in.
One of the first things that happened after determining what had gone wrong was to calm them down and reassure them that they weren't in trouble. After that the group worked on a solution for a few hours and established a plan to fix everything.
I was actually surprised that the response was so well handled. There was no finger pointing, just a group effort to fix the problem. To me, that's how every problem should be handled, without instilling fear of losing your job if something bad happens.
To anyone who wants to leave replies about staging databases, bad dev practices, etc., don't bother please. This was years ago and it was how things were done at the company. Our team was not part of the backend team or infra and worked with lots of areas of engineering on different issues.
KronisLV, about 2 years ago
The best postmortem I've seen was actually a conference talk: "Debugging Under Fire: Keep your Head when Systems have Lost their Mind" by Bryan Cantrill (at the GOTO conference, 2017).
Here's a YouTube video of it: https://www.youtube.com/watch?v=30jNsCVLpAE
Here are the slides: https://gotochgo.com/2017/sessions/86/keynote-debugging-under-fire-keeping-your-head-when-systems-have-lost-their-mind
It goes into detail about a pretty bad outage (when an entire data center was brought down), the human aspects, automation, how they handled it, the various risks, architectures, how things fail and about software development in general.
jxf, about 2 years ago
Unfortunately I can't share any of ours because they're all proprietary client work, but I think my teams have done a really masterful job. They're among some of the best I've read anywhere. For me the standout traits of a good postmortem are:
* Honesty about the reality of the situation; no sugarcoating, no spin
* Blameless, factual tone that avoids the passive voice
* Describes technical details at a level helpful for practitioners
* Makes use of other resources as needed (e.g. references corporate wiki, external ideas, blog posts)
* Good writing that's easy to read and is free of grammatical ambiguities and spelling errors
simonblack, about 2 years ago
A young guy driving on a country road without a seatbelt rolled his vehicle and died of his head injuries. Quite fascinating to discover that his only real problem was a badly bruised, contused and bleeding brain. Mind you, he probably would have had heart problems in his 50s because his heart arteries already had plaque and he was only in his twenties.
That was the best because it was the only postmortem I've seen.
abirch, about 2 years ago
I love premortems at work. Gary Klein came up with the idea of asking why a project could fail before you start it: https://hbr.org/2007/09/performing-a-project-premortem
monroewalker, about 2 years ago
Roblox Oct 2021 Outage: https://blog.roblox.com/2022/01/roblox-return-to-service-10-28-10-31-2021/
ahakanbaba, about 2 years ago
I learned quite a lot from this: https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/
Eric_WVGG, about 2 years ago
… the description of this topic didn't follow the title in a way that I was expecting at all…
I immediately thought of the old Gamasutra (now GameDeveloper.com) postmortems: interviews with members of the teams behind many classic videogames, great late-night reading. https://www.gamedeveloper.com/audio/10-seminal-game-postmortems-every-developer-should-read
gadders, about 2 years ago
I used to have a blog compiling a bunch of them, along with articles on best practices for writing post mortems. Unfortunately it never made any money, and I took it offline when cPanel put their prices up 1000% and the hosting cost became too much.
I still have a backup somewhere and the domain names. Could maybe put it back up one day if I could spare the time and find a very cheap solution. It was a WordPress blog.
I realise this doesn't help the OP. I just wanted to vent :-)
Icathian, about 2 years ago
There was a really great podcast called The Downtime Project that dissected and discussed a postmortem in each episode. There were like a dozen episodes in the first season and they never did make a second one. Pity, I really, really liked it. Might be up your alley; it's only a couple years out of date now.
yash1th, about 2 years ago
Definitely this: https://about.gitlab.com/blog/2017/02/10/postmortem-of-database-outage-of-january-31/
mgl, about 2 years ago
https://groups.google.com/g/google-appengine/c/p2QKJ0OSLc8
jFriedensreich, about 2 years ago
We had a payment system for vaccination field workers in Africa that stopped working, so people did not get paid. There was a section in the post mortem template that went something like:
"what is the impact of the error: there is an angry mob with torches demanding to get paid outside"
tpoacher, about 2 years ago
Don't have a link at hand, but the report investigating the infamous Therac incident tops my list.
moremetadata, about 2 years ago
https://gvnshtn.com/posts/maersk-me-notpetya/
It's a long read, but it gives an insight into how the ransomware NotPetya crippled Maersk and how they recovered.
Looking at how the earlier ransomware WannaCry crippled the crown jewels of many countries highlights a weakness in non-diverse systems.
I even know who is behind them, but I can't prove it, so why even mention it? Because I'm getting closer to proving it, which makes this game all the more interesting; even they have weaknesses they have failed to identify!
The WannaCry weekend was when I met Dame Stella Rimington and Baron Jonathon Evans hill walking at Scafell Pike, and you can call me the world famous Walter Mitty because people are so obedient to authority.
idlewords, about 2 years ago
Our lord and savior Jesus Christ