Block Storage Issues Across All Regions: Incident Report for DigitalOcean

70 points by juliand over 5 years ago

9 comments

redredrobot over 5 years ago
That's not seriously the real engineering postmortem, is it? That looks more like a 'Resolved the issue' update - it is way too shallow and vague.

If this was sent out at AWS as a COE (postmortem), it would be ripped apart - it is not going to satisfy anyone reading it that they should have confidence this class of failure isn't going to happen again. It looks like they haven't even identified the root cause(s) of the failure...
unilynx over 5 years ago
Unfortunately, this doesn't explain why this change was applied to 5 datacenters at the same time - or, if they didn't do that, why it still affected 5 of them.

I would have liked to hear more about how they are going to reduce the blast radius of such a change, because it sounds like something that could have been deployed to a single datacenter first.
ebcode over 5 years ago
Adding my voice to the chorus here as a DO customer. This 188-word "postmortem" gives postmortems a bad name. I would like to know the details of the "network configuration change" and *why* it "caused incompatibilities". And also *how* you will ensure that this particular failure will not re-occur.

Trust and transparency are the currencies of the internet, in the same way that cigarettes and contraband are the currencies in prison. This post is worth approx. a 1/2-smoked cigarette.
caiobegotti over 5 years ago
It's because of "reports" like these that I didn't feel like staying on as their customer. Whoever is in charge of [limiting the scope and wording of] these reports should listen to a few things in private at their HQs.
notacoward over 5 years ago
This happened, and was apparently resolved, on Monday. Am I the only one who wonders if the report was released on a Saturday to minimize the amount of attention/commentary it would get (e.g. here)?
lucb1e over 5 years ago
Aside from a timeline and some promises, this is the full post-mortem analysis of what happened:

> The outage was triggered as a result of a networking configuration change on the Block Storage clusters to improve handling packet loss scenarios. The new setting caused incompatibilities

So that doesn't tell us very much about the cause ("a networking configuration change") nor the effect ("incompatibilities").
mlthoughts2018 over 5 years ago
> "The outage was triggered as a result of a networking configuration change on the Block Storage clusters to improve handling packet loss scenarios. The new setting caused incompatibilities, which led to network interfaces becoming unavailable."

I wonder if this just means someone changed an MTU configuration and it led to tons of fragmentation in different components of the network, with any large file transfer timing out constantly enough to render an outage. Just a wild guess, but I've seen this happen with in-house datacenters before, so perhaps.
exabrial over 5 years ago
Nice details in the report. Come on, DO - this was the most annoyingly PC, lawyer-sanitized, non-scientific RCA anyone here has ever read. Here's the BLUF line: Don't cause issues. That causes problems.
adreamingsoul over 5 years ago
Sigh, I really like DO, but this just shows how much they still need to learn about operations. AWS does "raise" the bar for that, but unfortunately you can only really see that from the inside.