I know that they have to be apologetic like this, but the simple fact is that GitHub's uptime is fantastic.<p>I run <a href="http://CircleCi.com" rel="nofollow">http://CircleCi.com</a>, and so we have upwards of 10,000 interactions with GitHub per day, whether API calls, clones, pulls, webhooks, etc. A seriously seriously small number of them fail. They know what they're doing, and they do a great job.
I'd like to welcome the github ops/dbas to the club of people who've learned the hard way that automated database failover usually causes more downtime than it prevents.<p>Here's sort of the seminal post on the matter in the mysql community: <a href="http://www.xaprb.com/blog/2009/08/30/failure-scenarios-and-solutions-in-master-master-replication/" rel="nofollow">http://www.xaprb.com/blog/2009/08/30/failure-scenarios-and-s...</a><p>Though it turns into an MMM pile-on, the tool doesn't matter so much as the scenarios: in most of them, automated failover is simply unlikely to make things better, and likely to make things worse.
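To make the failure mode concrete, here's a deliberately oversimplified sketch (Python, with made-up names and thresholds) of the kind of monitoring loop that turns a transient load spike -- say, a schema migration -- into an unwanted failover:<p><pre><code> # Deliberately naive failover monitor; names and thresholds are made up.
 import time

 HEALTH_CHECK_TIMEOUT = 2.0        # seconds -- a heavy migration can easily blow past this
 FAILURES_BEFORE_FAILOVER = 3

 def master_is_healthy(connect, timeout=HEALTH_CHECK_TIMEOUT):
     """Treat 'answers a ping within the timeout' as healthy."""
     start = time.time()
     try:
         conn = connect(timeout=timeout)
         conn.ping()
         return (time.time() - start) < timeout
     except Exception:
         return False

 def monitor(connect, promote_replica, interval=5):
     failures = 0
     while True:
         if master_is_healthy(connect):
             failures = 0
         else:
             # A loaded-but-alive master looks exactly like a dead one here,
             # so sustained load is enough to trigger an unwanted promotion.
             failures += 1
             if failures >= FAILURES_BEFORE_FAILOVER:
                 promote_replica()
                 failures = 0
         time.sleep(interval)
</code></pre>
Nothing in that loop can tell a saturated-but-healthy master from a dead one, which is why load spikes end up triggering promotions nobody asked for.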
Here are the makings of a bad week (Monday, of all things):<p>- MySQL schema migration causes high load, automated HA solution causes cascading database failure<p>- MySQL cluster becomes out of sync<p>- HA solution segfaults<p>- Redis and MySQL become out of sync<p>- Incorrect users have access to private repositories!<p>Cleanup and recovery takes time; all I can say is, <i>I'm glad it was not me who had that mess to clean up</i>. I'm sure they are still working on it too!<p>This brings to mind some of my bad days... OOM killer decides your Sybase database is using too much memory. Hardware error on the DRBD master causes silent data corruption (this took a lot of recovery time on TBs of data). I've been bitten by a MySQL master/slave getting out of sync. That is a bad place to be in... do you copy your master database to the slaves? That takes a long time, even on a fast network.
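On the out-of-sync master/slave problem: before doing the full re-copy, a quick drift check can at least tell you how bad it is. A rough sketch (pymysql assumed, table names are placeholders; the proper tool for this is something like pt-table-checksum):<p><pre><code> # Spot-check whether a replica has drifted from its master.
 import pymysql

 TABLES = ["users", "repositories"]  # placeholder table names

 def table_checksums(host, user, password, db, tables):
     conn = pymysql.connect(host=host, user=user, password=password, database=db)
     checksums = {}
     try:
         with conn.cursor() as cur:
             for t in tables:
                 # CHECKSUM TABLE scans the whole table, so this is not free on big tables.
                 cur.execute("CHECKSUM TABLE `%s`" % t)
                 _, checksum = cur.fetchone()
                 checksums[t] = checksum
     finally:
         conn.close()
     return checksums

 def drifted_tables(master_host, replica_host, **creds):
     master = table_checksums(master_host, tables=TABLES, **creds)
     replica = table_checksums(replica_host, tables=TABLES, **creds)
     return [t for t in TABLES if master[t] != replica[t]]
</code></pre>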
The lack of any negative response on this thread is a testament both to the thoroughness of the post-mortem, and the outstanding quality of GitHub in general.<p>In GitHub we trust. I can't imagine putting my code anywhere else right now.
The part of this post that really blew my mind:<p><pre><code> We host our status site on Heroku to ensure its availability
during an outage. However, during our downtime on Tuesday
our status site experienced some availability issues.
As traffic to the status site began to ramp up, we increased
the number of dynos running from 8 to 64 and finally 90.
This had a negative effect since we were running an old
development database addon (shared database). The number of
dynos maxed out the available connections to the database
causing additional processes to crash.
</code></pre>
Ninety dynos for a status page? What was going on there?
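Back-of-the-envelope, with made-up per-dyno numbers, on why adding dynos just made the connection problem worse:<p><pre><code> # The per-dyno worker count and the plan's connection cap are assumptions,
 # not numbers from the post; the shape of the problem is the point.
 WEB_PROCESSES_PER_DYNO = 4     # e.g. 4 web workers per dyno (assumed)
 CONNECTIONS_PER_PROCESS = 1
 PLAN_CONNECTION_LIMIT = 20     # hypothetical cap on an old shared-db plan

 for dynos in (8, 64, 90):
     needed = dynos * WEB_PROCESSES_PER_DYNO * CONNECTIONS_PER_PROCESS
     verdict = "OK" if needed <= PLAN_CONNECTION_LIMIT else "over the limit"
     print(f"{dynos:3d} dynos -> {needed:4d} connections ({verdict})")
</code></pre>
Every extra dyno is extra demand on a fixed connection cap, so scaling the web tier only pushes the database over sooner.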
Well, I have to say... replication-related issues like this are why I/we are now using a Galera-backed DB cluster. No need to worry about which server is active/passive; you can technically have them all live all the time. In our case we have two live and one failover that only gets accessed by backup scripts and some maintenance tasks.<p>Once we got the kinks worked out it has been performing amazingly! I wonder if GitHub looked into this kind of setup before selecting the cluster they did.
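For what it's worth, the client side gets pretty simple too. A loose sketch (pymysql assumed, hostnames are placeholders) of "every node is live, just try the next one" -- in practice most people put haproxy in front instead:<p><pre><code> import pymysql

 NODES = ["galera1.internal", "galera2.internal", "galera3.internal"]  # placeholders

 def connect_any(user, password, db, nodes=NODES):
     last_error = None
     for host in nodes:
         try:
             return pymysql.connect(host=host, user=user,
                                    password=password, database=db)
         except pymysql.err.OperationalError as exc:
             last_error = exc   # node down or desynced; try the next one
     raise RuntimeError("no Galera node reachable") from last_error
</code></pre>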
If GitHub hasn't gotten their custom HA solution right, will you?<p>Digging into their fix, they disabled automatic failover -- so all DB failures will now require manual intervention. While that addresses this particular (erroneous) failover condition, it does raise the minimum downtime for true failures. Their MySQL replica's misconfiguration upon switching masters is also tied to their (stopgap) approach to preventing the hot failover. So the second problem was due to a misuse/misunderstanding of maintenance mode.<p>How is it possible that a slave could be pointed at the wrong master and have nobody notice for a day? What is the checklist to confirm that failover has occurred correctly?<p>There is also a lesson to be learned in the fact that their status page had scaling issues due to db connection limits. Static files are the most dependable!
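A post-failover checklist can be as simple as asking every replica who it thinks its master is. A rough sketch of that check (pymysql assumed, hostnames are placeholders):<p><pre><code> import pymysql

 EXPECTED_MASTER = "db-new-master.internal"          # placeholder
 REPLICAS = ["db-replica1.internal", "db-replica2.internal"]

 def verify_replication(user, password):
     problems = []
     for host in REPLICAS:
         conn = pymysql.connect(host=host, user=user, password=password)
         try:
             with conn.cursor(pymysql.cursors.DictCursor) as cur:
                 cur.execute("SHOW SLAVE STATUS")
                 status = cur.fetchone()
                 if status is None:
                     problems.append(f"{host}: not replicating at all")
                     continue
                 if status["Master_Host"] != EXPECTED_MASTER:
                     problems.append(f"{host}: replicating from {status['Master_Host']}")
                 if status["Slave_IO_Running"] != "Yes" or status["Slave_SQL_Running"] != "Yes":
                     problems.append(f"{host}: replication threads not running")
         finally:
             conn.close()
     return problems
</code></pre>
Run something like that (plus a replication-lag check) after every master switch and a mis-pointed slave gets caught in minutes, not a day.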
"As traffic to the status site began to ramp up, we increased the number of dynos running from 8 to 64 and finally 90."<p>Wait, why isn't there some caching layer? eg. Generate a static page or use Varnish.<p>This part makes no sense at all.<p>At most you're then firing up another 5 dynos (or none) to handle the traffic. 90 is ridiculous.
<i>"16 of these repositories were private, and for seven minutes from 8:19 AM to 8:26 AM PDT on Tuesday, Sept 11th, were accessible to people outside of the repository's list of collaborators or team members"</i><p>ouch!
The master-first update strategy is interesting. I've always seen it done the other way: update the standby, flip to the standby, verify, then update the original master.
Auto-inc db keys once again cause horribleness. Nothing new there, I suppose.
And as mentioned, the multi-dyno + DB-read status page is craaaazy. Why oh why isn't this a couple of static objects? Automagically generate and push them if you want. Give 'em a 60 second TTL and call it a day. Put them behind a different CDN &amp; DNS than the rest of your site for bonus points.
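The whole thing fits in a few lines. A minimal sketch of "generate a static object, give it a 60 second TTL" -- S3 behind a CDN is my assumption here, not anything GitHub said they use:<p><pre><code> import json
 import boto3

 def publish_status(status_text, bucket="status-page-bucket"):  # placeholder bucket
     body = json.dumps({"status": status_text})
     s3 = boto3.client("s3")
     s3.put_object(
         Bucket=bucket,
         Key="status.json",
         Body=body.encode("utf-8"),
         ContentType="application/json",
         CacheControl="max-age=60",   # the 60 second TTL mentioned above
     )
</code></pre>
Regenerate it on a timer or on each incident update, and the database never sees status-page traffic at all.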
Interesting to read about github using MySQL instead of Postgres. Anyone know why? I am just curious because of all the MySQL bashing I hear in the echo chamber.
Genuine question: GitHub is built upon git, which is a rock-solid system for storing data, and in these reports we read that GitHub relies a lot on MySQL, so... did the GitHub guys ponder using git as their data store? Just as an example, in git one can attach comments to commits; would it be possible to use that for the GitHub comment feature? Or maybe it already is?
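Git does have a built-in mechanism for this -- notes attached to commits -- so a toy version of "comments stored in the repo itself" might look like this (purely illustrative, not how GitHub actually does it):<p><pre><code> import subprocess

 def add_comment(repo_path, commit, text):
     # `git notes append` attaches (or extends) a note on the given commit.
     subprocess.run(
         ["git", "-C", repo_path, "notes", "append", "-m", text, commit],
         check=True,
     )

 def read_comments(repo_path, commit):
     result = subprocess.run(
         ["git", "-C", repo_path, "notes", "show", commit],
         capture_output=True, text=True,
     )
     return result.stdout if result.returncode == 0 else ""
</code></pre>
Of course you'd still want something queryable for cross-repo features (search, notifications, permissions), which is presumably where the relational database comes in.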