The timeline of events was interesting (and much appreciated), but the root cause analysis doesn't really go much deeper than "we had a brief network partition, and our systems weren't designed to cope with it", which still leaves a whole lot of question marks.<p>Of course, without detailed knowledge of how GitHub's internals work, all we can do is speculate. But just based on what was explained in this blog post, it sounds like they're replicating database updates asynchronously, without waiting for the updates to be acknowledged by slaves before the master allows them to commit. Which means the data on slaves is always slightly out-of-date, and becomes more out-of-date when the slaves are partitioned from the master. Which means that promoting a slave to master will <i>by definition</i> lose some committed writes.<p>If "guarding the confidentiality and integrity of user data is GitHub’s highest priority", then why would they build and deploy an automated failover system whose purpose is to preserve availability at the cost of consistency? And why were they apparently caught off-guard when it operated as designed?<p>(Reading point 1 under "technical initiatives", it seems that they consider intra-DC failover to be "safe", and cross-DC failover to be "unsafe". But the exact same failure mode is present in both cases; the only difference is the length of the time during which in-flight writes can be lost.)
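To make that failure mode concrete, here is a toy sketch of asynchronous replication losing an acknowledged write on promotion. It is purely illustrative - the class names and replication logic are invented and have nothing to do with GitHub's actual setup:<p><pre><code>
# Toy model of asynchronous replication: the master acknowledges commits
# before the replica has received them, so promoting the replica during a
# partition silently drops the un-shipped writes. Names and logic invented.

class Replica:
    def __init__(self):
        self.log = []

class Master:
    def __init__(self):
        self.log = []                 # writes already acknowledged to clients

    def commit(self, write):
        self.log.append(write)        # acknowledged immediately, no replica ack
        return "committed"

    def replicate(self, replica):
        replica.log = list(self.log)  # runs in the background; stalls when partitioned

master, replica = Master(), Replica()
master.commit("write-1")
master.replicate(replica)             # replica is caught up
master.commit("write-2")              # acknowledged, but the partition hits
                                      # before replicate() runs again

# Failover: promote the replica. "write-2" was acknowledged to a client,
# yet it does not exist on the new master.
print(replica.log)                    # ['write-1']
</code></pre>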
> With this incident, we failed you, and we are deeply sorry.<p>I'm always impressed when people actually apologize instead of dancing around an almost-apology.<p>Overall this is one of the best post-mortems I've seen - great tone, very well written, super informative, and it has all of the steps (apology, information on the issue, real steps to prevent it happening again) that hurt customers generally want to see. Really impressive timelines too - 43s reconnect after the initial issue, 15 minutes to change the status.<p>GitLab overall seems to have more incidents like this, and I really like their practice of working through them in public Google Docs. It definitely seems like a better idea than relying on your own tech (GitHub Pages) for incident communication.
Disclosure: I work on Google Cloud.<p>I’m a little confused by this part:<p>> While MySQL data backups occur every four hours and are retained for many years, the backups are stored remotely in a public cloud blob storage service. The time required to restore multiple terabytes of backup data caused the process to take hours. A significant portion of the time was consumed transferring the data from the remote backup service. This procedure is tested daily at minimum, so the recovery time frame was well understood, however until this incident we have never needed to fully rebuild an entire cluster from backup and had instead been able to rely on other strategies such as delayed replicas.<p>At first, I had assumed this was Glacier (“it took a long time to download”). But the daily retrieval testing suggests it’s likely just regular S3. Multiple TB sounds like less than 10.<p>So the question becomes “Did GitHub have less than 100 Gbps of peering to AWS?”. I hope that’s an action item if restores were meant to be quick (and likely this will be resolved by migrating to Azure, getting lots of connectivity, etc.).
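For a rough sense of scale, a back-of-the-envelope calculation (assuming ~10 TB, which is just my guess from "multiple terabytes"):<p><pre><code>
# Back-of-the-envelope restore-transfer times. Assumes ~10 TB of backups;
# the post only says "multiple terabytes", so treat these as rough bounds.
backup_bits = 10 * 8 * 10**12         # 10 TB in bits

for gbps in (1, 10, 100):
    hours = backup_bits / (gbps * 10**9) / 3600
    print(f"{gbps:>3} Gbps: {hours:.1f} h")

# Output:
#   1 Gbps: 22.2 h
#  10 Gbps: 2.2 h
# 100 Gbps: 0.2 h
</code></pre><p>So if restores genuinely took "hours" in transfer alone, the effective path to the backup store was likely well under 10 Gbps.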
In my career, the worst outages (longest downtime) I can recall have been due to HA + automatic failover. Everything from early NetApp clustering solutions corrupting the filesystem to cross-country split-brain issues like this.<p>Admittedly, I don't recall all the incidents where automatic failover minimized downtime, and probably if a human had to intervene in each of those, the cumulative downtime would be more significant.<p>But boy, it sure doesn't feel like it.
Reading this post-mortem and their MySQL HA post, this incident deserves a talk titled: "MySQL semi-synchronous replication with automatic inter-DC failover to a DR site: how to turn a 43s outage into a 24h outage requiring manual data fixes"<p><a href="https://githubengineering.com/mysql-high-availability-at-github/#semi-synchronous-replication" rel="nofollow">https://githubengineering.com/mysql-high-availability-at-git...</a><p>> In MySQL’s semi-synchronous replication a master does not acknowledge a transaction commit until the change is known to have shipped to one or more replicas. [...]<p>> Consistency comes with a cost: a risk to availability. Should no replica acknowledge receipt of changes, the master will block and writes will stall. Fortunately, there is a timeout configuration, after which the master can revert back to asynchronous replication mode, making writes available again.<p>> We have set our timeout at a reasonably low value: 500ms. It is more than enough to ship changes from the master to local DC replicas, and typically also to remote DCs.<p><a href="https://blog.github.com/2018-10-30-oct21-post-incident-analysis/#2018-october-21-2252-utc" rel="nofollow">https://blog.github.com/2018-10-30-oct21-post-incident-analy...</a><p>> The database servers in the US East Coast data center contained a brief period of writes that had not been replicated to the US West Coast facility. Because the database clusters in both data centers now contained writes that were not present in the other data center, we were unable to fail the primary back over to the US East Coast data center safely.<p>> However, applications running in the East Coast that depend on writing information to a West Coast MySQL cluster are currently unable to cope with the additional latency introduced by a cross-country round trip for the majority of their database calls.
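A crude sketch of that degradation path. The 500ms figure comes from their HA post; everything else here is invented for illustration and is not GitHub's code or MySQL's implementation:<p><pre><code>
import time

SEMI_SYNC_TIMEOUT = 0.5  # 500 ms, per the GitHub HA post

class SemiSyncMaster:
    """Toy model: commit waits for a replica ack, but only up to the
    timeout -- after that it degrades to asynchronous and acks anyway."""

    def __init__(self, replica_ack):
        self.replica_ack = replica_ack   # callable returning True on ack
        self.async_mode = False

    def commit(self, write):
        if not self.async_mode:
            start = time.monotonic()
            while time.monotonic() - start < SEMI_SYNC_TIMEOUT:
                if self.replica_ack(write):
                    return "committed (replicated)"
                time.sleep(0.01)
            # No ack within 500 ms: fall back to async to stay available.
            self.async_mode = True
        # In async mode the write is acknowledged with no replica guarantee;
        # these are exactly the writes a failover can strand.
        return "committed (NOT yet replicated)"

# During the 43 s partition every ack attempt fails, so writes keep landing
# on the East Coast master without reaching the West Coast replicas.
master = SemiSyncMaster(replica_ack=lambda w: False)
print(master.commit("write-1"))   # blocks ~500 ms, then degrades
print(master.commit("write-2"))   # acked immediately, unreplicated
</code></pre>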
> Connectivity between these locations was restored in 43 seconds, but this brief outage triggered a chain of events that led to 24 hours and 11 minutes of service degradation.<p>Network computing is rough. (I am not being at all sarcastic).
I find their incident management times pretty impressive: 2 minutes to detect & alert, ~2+ minutes to first response, 10 minutes to initial triage, 15 minutes to internal change control, 17 minutes to escalation & public communication, 19 minutes to major incident escalation & broad engagement, 73 minutes to remediation start & further public communication.<p>Initial triage inside 10 minutes, then change control, escalation, major incident declaration, and public communication within another 9 minutes. That's hard to beat with humans in the loop.
It's great to see how open they are about what happened and how it got fixed. I also appreciate the decision to prioritise <i>data integrity</i> over <i>site usability</i>.<p>Looks like GitHub could be OK after the Microsoft acquisition.
Off-topic, but this reminds me: did Google ever issue an incident report on why YouTube went down earlier this month? A quick search didn't turn up anything.
There is often a trade-off between a large distributed central store and several independent ones. The primary con of the former is incidents like this one, whereas the con of the latter is the added complexity of separate systems performing eventually consistent aggregation to support centralized features. I wonder if there is any value in GitHub decentralizing the metadata pipeline. So many of the actions are namespaced that this could be reasonable in theory, though at a large practical cost.<p>On a related note, I often reach for Cassandra when starting projects, knowing that building my application around its limited access patterns pays off in data replication down the road. For all the flexibility SQL/RDBMSs give developers, that flexibility has downsides too.
If I read things correctly, they made a fairly... interesting... tradeoff: 954 as-yet-unreconciled DB writes in exchange for 24 hours of degraded service.<p>I think I'd have made a different choice, but cool that they were upfront about it.
I’m not super versed in MySQL failover, but am I correct to conclude that Orchestrator and Raft also did them in? And isn’t preventing exactly this kind of situation that architectural component's whole reason for existing? Genuine question.
Postmortems are always really interesting. If you're preparing for an interview where system design will be discussed, you could do worse than reading a couple as part of your preparation.
>At 22:52 UTC on October 21, routine maintenance work to replace failing 100G optical equipment resulted in the loss of connectivity between our US East Coast network hub and our primary US East Coast data center. Connectivity between these locations was restored in 43 seconds<p>Lesson: a technician replacing a switch must be able to do it faster than the leader heartbeat timeout of the consensus protocol. (Reminds me how in high school we trained to disassemble/assemble a Kalashnikov rifle very quickly - something like under 20 seconds total - the whole choreographed sequence of movements was learned and practiced like a samurai sword kata :)
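The race in toy form - the timeout values here are invented, since the post doesn't say how Orchestrator/Raft was actually tuned:<p><pre><code>
import random

# Toy illustration: if the network stays down longer than the consensus
# election timeout, the surviving nodes elect a new leader and failover
# begins. ELECTION_TIMEOUT_RANGE is hypothetical, not GitHub's config.
ELECTION_TIMEOUT_RANGE = (5.0, 10.0)   # seconds

def leader_survives(outage_seconds: float) -> bool:
    election_timeout = random.uniform(*ELECTION_TIMEOUT_RANGE)
    return outage_seconds < election_timeout

print(leader_survives(3))    # True  -- maintenance finished fast enough
print(leader_survives(43))   # False -- a 43 s outage always triggers failover
</code></pre>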
>> It’s possible for Orchestrator to implement topologies that applications are unable to support, therefore care must be taken to align Orchestrator’s configuration with application-level expectations.<p>So basically Orchestrator acted correctly, but the application layer is not nicely integrated with it? Sounds like something not well designed on the application side, which offsets the whole point of global consensus. Not much detail is provided on what exactly went wrong here, though.<p>They mentioned that they'll take extra care of this, but I'm still very concerned that the two systems (Orchestrator & the applications) are so loosely coupled.
It seems like the obvious play here is to fail over all workloads to the West Coast, so they don't incur cross-regional latency. Do they explain why this wasn't possible? If so, I cannot find it.
This is a great write up - making these public with all the gory technical details helps us all be better at our jobs.<p>I’m curious whether shutting down the east coast apps entirely and running off of the west coast was considered? Not enough capacity or some other problem?<p>edit: I guess the third technical initiative strongly implies that it was just a capacity issue: "This project has the goal of supporting N+1 redundancy at the facility level. The goal of that work is to tolerate the full failure of a single data center failure without user impact."
On a side note, does anyone know if anything special was used to create the descriptive images in this blog post? They look great and describe connections between related regions really well.<p>Eg <a href="https://blog.github.com/assets/img/2018-10-25-oct21-post-incident-analysis/recovery-flow.png" rel="nofollow">https://blog.github.com/assets/img/2018-10-25-oct21-post-inc...</a>
Basically<p>"..we will also begin a systemic practice of validating failure scenarios before they have a chance to affect you. This work will involve future investment in fault injection and chaos engineering tooling at GitHub."
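For anyone unfamiliar, fault injection can be as simple as a wrapper that randomly adds latency or errors to calls so that timeout and retry handling actually gets exercised. A minimal, hypothetical sketch (not GitHub's tooling):<p><pre><code>
import random
import time

def chaos(p_fail=0.05, max_delay=0.5):
    """Decorator that randomly injects latency or an error into a call,
    so the caller's timeout and retry handling actually gets exercised."""
    def wrap(fn):
        def inner(*args, **kwargs):
            time.sleep(random.uniform(0, max_delay))      # injected latency
            if random.random() < p_fail:
                raise ConnectionError("injected fault")   # injected failure
            return fn(*args, **kwargs)
        return inner
    return wrap

@chaos(p_fail=0.2)
def fetch_user(user_id):
    return {"id": user_id}

# Running callers against this in a staging environment (or a game day)
# surfaces code that hangs or crashes instead of degrading gracefully.
</code></pre>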
The first diagram, labeled "Normal Topology", shows the Master in the East and no other master. Later they acknowledge that lots of stuff doesn't work if the master is not in the East, because of latency. So then there's all this orchestration, and it never could have worked in the first place?<p>That seems incredible - what am I missing?
Untested RPC timeouts strike again. Every service needs integration tests that exercise timeouts. Some config and code can't be exercised in automated tests and needs regular disaster readiness testing.<p>Service client libraries need unit tests that show the library returning expected errors for all the failure scenarios: missing config, name lookup failure, unreachable, refusing connections, closing connections, returning error responses, returning garbage responses, refusing writes, responding with high latency, and responding with low throughput.
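For a flavor of what a couple of those tests can look like with nothing but the standard library - the fetch() client here is hypothetical:<p><pre><code>
import socket
import unittest

def fetch(host, port, timeout=0.2):
    """Hypothetical client call: connect, send a request, read a reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"PING\n")
        return sock.recv(64)

class ClientFailureModes(unittest.TestCase):
    def test_connection_refused(self):
        # Assumes nothing is listening on port 1 locally, so the OS
        # refuses the connection.
        with self.assertRaises(ConnectionRefusedError):
            fetch("127.0.0.1", 1)

    def test_read_timeout(self):
        # A server that accepts the connection but never replies.
        srv = socket.socket()
        srv.bind(("127.0.0.1", 0))
        srv.listen(1)
        _, port = srv.getsockname()
        try:
            with self.assertRaises(socket.timeout):
                fetch("127.0.0.1", port)
        finally:
            srv.close()

if __name__ == "__main__":
    unittest.main()
</code></pre><p>The same pattern extends to garbage responses, refused writes, and low-throughput servers; the hard part is making yourself enumerate the scenarios, not writing the tests.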
This makes me wonder why GitHub is effectively using a single database (though replicated with HA considerations and CQRS-ized).<p>From the article, it appears they are partitioning based on function (commits in this DB, PRs in this cluster)... but I just don't see a strong business need to glob all commit data together into one massive datastore. Perhaps it's an economy-of-scale driver?
So they geographically replicated their MySQL in order to survive exactly such a partition, and instead they split-brained their entire database and now have no reason to believe any of it is consistent at all.