The timeline of events was interesting (and much appreciated), but the root cause analysis doesn't really go much deeper than "we had a brief network partition, and our systems weren't designed to cope with it", which still leaves a whole lot of question marks.<p>Of course, without detailed knowledge of how GitHub's internals work, all we can do is speculate. But just based on what was explained in this blog post, it sounds like they're replicating database updates asynchronously, without waiting for the updates to be acknowledged by slaves before the master allows them to commit. Which means the data on slaves is always slightly out-of-date, and becomes more out-of-date when the slaves are partitioned from the master. Which means that promoting a slave to master will <i>by definition</i> lose some committed writes.<p>If "guarding the confidentiality and integrity of user data is GitHub’s highest priority", then why would they build and deploy an automated failover system whose purpose is to preserve availability at the cost of consistency? And why were they apparently caught off-guard when it operated as designed?<p>(Reading point 1 under "technical initiatives", it seems that they consider intra-DC failover to be "safe", and cross-DC failover to be "unsafe". But the exact same failure mode is present in both cases; the only difference is the length of the time during which in-flight writes can be lost.)
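To make that failure mode concrete, here is a toy sketch of asynchronous replication losing an acknowledged write on promotion. It is purely illustrative - the class names and replication logic are invented and have nothing to do with GitHub's actual setup:<p><pre><code>
# Toy model of asynchronous replication: the master acknowledges commits
# before the replica has received them, so promoting the replica during a
# partition silently drops the un-shipped writes. Names and logic invented.

class Replica:
    def __init__(self):
        self.log = []

class Master:
    def __init__(self):
        self.log = []                 # writes already acknowledged to clients

    def commit(self, write):
        self.log.append(write)        # acknowledged immediately, no replica ack
        return "committed"

    def replicate(self, replica):
        replica.log = list(self.log)  # runs in the background; stalls when partitioned

master, replica = Master(), Replica()
master.commit("write-1")
master.replicate(replica)             # replica is caught up
master.commit("write-2")              # acknowledged, but the partition hits
                                      # before replicate() runs again

# Failover: promote the replica. "write-2" was acknowledged to a client,
# yet it does not exist on the new master.
print(replica.log)                    # ['write-1']
</code></pre>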
> With this incident, we failed you, and we are deeply sorry.<p>I'm always impressed when people actually apologize instead of dancing around an almost-apology.<p>Overall this is one of the best post-mortems I've seen - great tone, very well written, super informative, and it has all of the steps (apology, information on the issue, real steps to prevent it happening again) that hurt customers generally want to see. Really impressive timelines too - 43s reconnect after the initial issue, 15 minutes to change the status.<p>GitLab overall seems to have more incidents like this, and I really like their practice of working through them in public Google Docs. It definitely seems like a better idea than relying on your own tech (GitHub Pages) for incident communication.
Disclosure: I work on Google Cloud.<p>I’m a little confused by this part:<p>> While MySQL data backups occur every four hours and are retained for many years, the backups are stored remotely in a public cloud blob storage service. The time required to restore multiple terabytes of backup data caused the process to take hours. A significant portion of the time was consumed transferring the data from the remote backup service. This procedure is tested daily at minimum, so the recovery time frame was well understood, however until this incident we have never needed to fully rebuild an entire cluster from backup and had instead been able to rely on other strategies such as delayed replicas.<p>At first, I had assumed this was Glacier (“it took a long time to download”). But the daily retrieval testing suggests it’s likely just regular S3. Multiple TB sounds like less than 10.<p>So the question becomes “Did GitHub have less than 100 Gbps of peering to AWS?”. I hope that’s an action item if restores were meant to be quick (and likely this will be resolved by migrating to Azure, getting lots of connectivity, etc.).
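For a rough sense of scale, a back-of-the-envelope calculation (assuming ~10 TB, which is just my guess from "multiple terabytes"):<p><pre><code>
# Back-of-the-envelope restore-transfer times. Assumes ~10 TB of backups;
# the post only says "multiple terabytes", so treat these as rough bounds.
backup_bits = 10 * 8 * 10**12         # 10 TB in bits

for gbps in (1, 10, 100):
    hours = backup_bits / (gbps * 10**9) / 3600
    print(f"{gbps:>3} Gbps: {hours:.1f} h")

# Output:
#   1 Gbps: 22.2 h
#  10 Gbps: 2.2 h
# 100 Gbps: 0.2 h
</code></pre><p>So if restores genuinely took "hours" in transfer alone, the effective path to the backup store was likely well under 10 Gbps.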
In my career, the worst outages (longest downtime) I can recall have been due to HA + automatic failover. Everything from early NetApp clustering solutions corrupting the filesystem to cross-country split-brain issues like this.<p>Admittedly, I don't recall all the incidents where automatic failover minimized downtime, and probably if a human had to intervene in each of those, the cumulative downtime would be more significant.<p>But boy, it sure doesn't feel like it.
Reading this post-mortem and their MySQL HA post, this incident deserves a talk titled: "MySQL semi-synchronous replication with automatic inter-DC failover to a DR site: how to turn a 43s outage into a 24h outage requiring manual data fixes"<p><a href="https://githubengineering.com/mysql-high-availability-at-github/#semi-synchronous-replication" rel="nofollow">https://githubengineering.com/mysql-high-availability-at-git...</a><p>> In MySQL’s semi-synchronous replication a master does not acknowledge a transaction commit until the change is known to have shipped to one or more replicas. [...]<p>> Consistency comes with a cost: a risk to availability. Should no replica acknowledge receipt of changes, the master will block and writes will stall. Fortunately, there is a timeout configuration, after which the master can revert back to asynchronous replication mode, making writes available again.<p>> We have set our timeout at a reasonably low value: 500ms. It is more than enough to ship changes from the master to local DC replicas, and typically also to remote DCs.<p><a href="https://blog.github.com/2018-10-30-oct21-post-incident-analysis/#2018-october-21-2252-utc" rel="nofollow">https://blog.github.com/2018-10-30-oct21-post-incident-analy...</a><p>> The database servers in the US East Coast data center contained a brief period of writes that had not been replicated to the US West Coast facility. Because the database clusters in both data centers now contained writes that were not present in the other data center, we were unable to fail the primary back over to the US East Coast data center safely.<p>> However, applications running in the East Coast that depend on writing information to a West Coast MySQL cluster are currently unable to cope with the additional latency introduced by a cross-country round trip for the majority of their database calls.
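A crude sketch of that degradation path. The 500ms figure comes from their HA post; everything else here is invented for illustration and is not GitHub's code or MySQL's implementation:<p><pre><code>
import time

SEMI_SYNC_TIMEOUT = 0.5  # 500 ms, per the GitHub HA post

class SemiSyncMaster:
    """Toy model: commit waits for a replica ack, but only up to the
    timeout -- after that it degrades to asynchronous and acks anyway."""

    def __init__(self, replica_ack):
        self.replica_ack = replica_ack   # callable returning True on ack
        self.async_mode = False

    def commit(self, write):
        if not self.async_mode:
            start = time.monotonic()
            while time.monotonic() - start < SEMI_SYNC_TIMEOUT:
                if self.replica_ack(write):
                    return "committed (replicated)"
                time.sleep(0.01)
            # No ack within 500 ms: fall back to async to stay available.
            self.async_mode = True
        # In async mode the write is acknowledged with no replica guarantee;
        # these are exactly the writes a failover can strand.
        return "committed (NOT yet replicated)"

# During the 43 s partition every ack attempt fails, so writes keep landing
# on the East Coast master without reaching the West Coast replicas.
master = SemiSyncMaster(replica_ack=lambda w: False)
print(master.commit("write-1"))   # blocks ~500 ms, then degrades
print(master.commit("write-2"))   # acked immediately, unreplicated
</code></pre>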
> Connectivity between these locations was restored in 43 seconds, but this brief outage triggered a chain of events that led to 24 hours and 11 minutes of service degradation.<p>Network computing is rough. (I am not being at all sarcastic).
I find their incident management times pretty impressive: 2 minutes to detect & alert, ~2+ minutes to first response, 10 minutes to initial triage, 15 minutes to internal change control, 17 minutes to escalation & public communication, 19 minutes to major incident escalation & broad engagement, 73 minutes to remediation start & further public communication.<p>Initial triage inside 10 minutes, then change control, escalation, major incident declaration, and public communication within another 9 minutes. That's hard to beat with humans in the loop.
It's great to see how open they are about what happened and how it got fixed. I also appreciate the decision to prioritise <i>data integrity</i> over <i>site usability</i>.<p>Looks like GitHub could be OK after the Microsoft acquisition.
Off-topic, but this reminds me: did Google ever issue an incident report on why YouTube went down earlier this month? A quick search didn't turn up anything.
There is often a trade-off between a large distributed central store and several independent ones. The primary con of the former is incidents like this one, whereas the con of the latter is the added complexity of separate systems performing eventually consistent aggregation to support centralized features. I wonder if there is any value in GitHub decentralizing the metadata pipeline. So many of the actions are namespaced that this could be reasonable in theory, though at a large practical cost.<p>On a related note, I often reach for Cassandra when starting projects, knowing that building my application around its limited access patterns pays off in data replication down the road. For all the flexibility SQL/RDBMSs give developers, that flexibility has downsides too.
If I read things correctly, they made a fairly... interesting... tradeoff: 954 as-yet-unreconciled DB writes in exchange for 24 hours of degraded service.<p>I think I'd have made a different choice, but cool that they were upfront about it.
I’m not super versed in MySQL failover, but am I correct to conclude that Orchestrator and Raft also did them in? And isn’t preventing exactly this kind of situation that architectural component's whole reason for existing? Genuine question.
Postmortems are always really interesting. If you're preparing for an interview where system design will be discussed, you could do worse than reading a couple as part of your preparation.
>At 22:52 UTC on October 21, routine maintenance work to replace failing 100G optical equipment resulted in the loss of connectivity between our US East Coast network hub and our primary US East Coast data center. Connectivity between these locations was restored in 43 seconds<p>Lesson: a technician replacing a switch must be able to do it faster than the leader heartbeat timeout of the consensus protocol. (Reminds me how in high school we trained to disassemble/assemble a Kalashnikov rifle very quickly - something like under 20 seconds total - the whole choreographed sequence of movements was learned and practiced like a samurai sword kata :)
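The race in toy form - the timeout values here are invented, since the post doesn't say how Orchestrator/Raft was actually tuned:<p><pre><code>
import random

# Toy illustration: if the network stays down longer than the consensus
# election timeout, the surviving nodes elect a new leader and failover
# begins. ELECTION_TIMEOUT_RANGE is hypothetical, not GitHub's config.
ELECTION_TIMEOUT_RANGE = (5.0, 10.0)   # seconds

def leader_survives(outage_seconds: float) -> bool:
    election_timeout = random.uniform(*ELECTION_TIMEOUT_RANGE)
    return outage_seconds < election_timeout

print(leader_survives(3))    # True  -- maintenance finished fast enough
print(leader_survives(43))   # False -- a 43 s outage always triggers failover
</code></pre>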
>> It’s possible for Orchestrator to implement topologies that applications are unable to support, therefore care must be taken to align Orchestrator’s configuration with application-level expectations.<p>So basically Orchestrator acted correctly, but the application layer is not nicely integrated with it? Sounds like something not well designed on the application side, which offsets the whole point of global consensus. Not much detail is provided on what exactly went wrong here, though.<p>They mentioned that they'll take extra care of this, but I'm still very concerned that the two systems (Orchestrator & the applications) are so loosely coupled.
It seems like the obvious play here is to fail over all workloads to the West Coast, so they don't incur cross-regional latency. Do they explain why this wasn't possible? If so, I cannot find it.
This is a great write up - making these public with all the gory technical details helps us all be better at our jobs.<p>I’m curious whether shutting down the east coast apps entirely and running off of the west coast was considered? Not enough capacity or some other problem?<p>edit: I guess the third technical initiative strongly implies that it was just a capacity issue: "This project has the goal of supporting N+1 redundancy at the facility level. The goal of that work is to tolerate the full failure of a single data center failure without user impact."
On a side note, does anyone know if anything special was used to create the descriptive images in this blog post? They look great and describe connections between related regions really well.<p>Eg <a href="https://blog.github.com/assets/img/2018-10-25-oct21-post-incident-analysis/recovery-flow.png" rel="nofollow">https://blog.github.com/assets/img/2018-10-25-oct21-post-inc...</a>
Basically<p>"..we will also begin a systemic practice of validating failure scenarios before they have a chance to affect you. This work will involve future investment in fault injection and chaos engineering tooling at GitHub."
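For anyone unfamiliar, fault injection can be as simple as a wrapper that randomly adds latency or errors to calls so that timeout and retry handling actually gets exercised. A minimal, hypothetical sketch (not GitHub's tooling):<p><pre><code>
import random
import time

def chaos(p_fail=0.05, max_delay=0.5):
    """Decorator that randomly injects latency or an error into a call,
    so the caller's timeout and retry handling actually gets exercised."""
    def wrap(fn):
        def inner(*args, **kwargs):
            time.sleep(random.uniform(0, max_delay))      # injected latency
            if random.random() < p_fail:
                raise ConnectionError("injected fault")   # injected failure
            return fn(*args, **kwargs)
        return inner
    return wrap

@chaos(p_fail=0.2)
def fetch_user(user_id):
    return {"id": user_id}

# Running callers against this in a staging environment (or a game day)
# surfaces code that hangs or crashes instead of degrading gracefully.
</code></pre>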
The first diagram, labeled "Normal Topology", shows the Master in the East and no other master. Later they acknowledge that lots of stuff doesn't work if the master is not in the East, because of latency. So then there's all this orchestration, and it never could have worked in the first place?<p>That seems incredible - what am I missing?
Untested RPC timeouts strike again. Every service needs integration tests that exercise timeouts. Some config and code can't be exercised in automated tests and needs regular disaster readiness testing.<p>Service client libraries need unit tests that show the library returning expected errors for all the failure scenarios: missing config, name lookup failure, unreachable, refusing connections, closing connections, returning error responses, returning garbage responses, refusing writes, responding with high latency, and responding with low throughput.
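For a flavor of what a couple of those tests can look like with nothing but the standard library - the fetch() client here is hypothetical:<p><pre><code>
import socket
import unittest

def fetch(host, port, timeout=0.2):
    """Hypothetical client call: connect, send a request, read a reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"PING\n")
        return sock.recv(64)

class ClientFailureModes(unittest.TestCase):
    def test_connection_refused(self):
        # Assumes nothing is listening on port 1 locally, so the OS
        # refuses the connection.
        with self.assertRaises(ConnectionRefusedError):
            fetch("127.0.0.1", 1)

    def test_read_timeout(self):
        # A server that accepts the connection but never replies.
        srv = socket.socket()
        srv.bind(("127.0.0.1", 0))
        srv.listen(1)
        _, port = srv.getsockname()
        try:
            with self.assertRaises(socket.timeout):
                fetch("127.0.0.1", port)
        finally:
            srv.close()

if __name__ == "__main__":
    unittest.main()
</code></pre><p>The same pattern extends to garbage responses, refused writes, and low-throughput servers; the hard part is making yourself enumerate the scenarios, not writing the tests.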
This makes me wonder why GitHub is effectively using a single database (though replicated with HA considerations and CQRS-ized).<p>From the article, it appears they are partitioning based on function (commits in this DB, PRs in this cluster)... but I just don't see a strong business need to glob all commit data together into one massive datastore. Perhaps it's an economy-of-scale driver?
So they geographically replicated their MySQL in order to survive exactly such a partition, and instead they split-brained their entire database and now have no reason to believe any of it is consistent at all.