I know that they have to be apologetic like this, but the simple fact is that GitHub's uptime is fantastic.<p>I run <a href="http://CircleCi.com" rel="nofollow">http://CircleCi.com</a>, and so we have upwards of 10,000 interactions with GitHub per day, whether API calls, clones, pulls, webhooks, etc. A seriously seriously small number of them fail. They know what they're doing, and they do a great job.
I'd like to welcome the github ops/dbas to the club of people who've learned the hard way that automated database failover usually causes more downtime than it prevents.<p>Here's sort of the seminal post on the matter in the mysql community: <a href="http://www.xaprb.com/blog/2009/08/30/failure-scenarios-and-solutions-in-master-master-replication/" rel="nofollow">http://www.xaprb.com/blog/2009/08/30/failure-scenarios-and-s...</a><p>Though it turns into an MMM pile-on, the tool doesn't matter so much as the scenarios: in most of them, automated failover is simply unlikely to make things better, and likely to make things worse.
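To make the failure mode concrete, here's a deliberately oversimplified sketch (Python, with made-up names and thresholds) of the kind of monitoring loop that turns a transient load spike -- say, a schema migration -- into an unwanted failover:<p><pre><code> # Deliberately naive failover monitor; names and thresholds are made up.
 import time

 HEALTH_CHECK_TIMEOUT = 2.0        # seconds -- a heavy migration can easily blow past this
 FAILURES_BEFORE_FAILOVER = 3

 def master_is_healthy(connect, timeout=HEALTH_CHECK_TIMEOUT):
     """Treat 'answers a ping within the timeout' as healthy."""
     start = time.time()
     try:
         conn = connect(timeout=timeout)
         conn.ping()
         return (time.time() - start) < timeout
     except Exception:
         return False

 def monitor(connect, promote_replica, interval=5):
     failures = 0
     while True:
         if master_is_healthy(connect):
             failures = 0
         else:
             # A loaded-but-alive master looks exactly like a dead one here,
             # so sustained load is enough to trigger an unwanted promotion.
             failures += 1
             if failures >= FAILURES_BEFORE_FAILOVER:
                 promote_replica()
                 failures = 0
         time.sleep(interval)
</code></pre>
Nothing in that loop can tell a saturated-but-healthy master from a dead one, which is why load spikes end up triggering promotions nobody asked for.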
Here are the makings of a bad week (Monday, of all things):<p>- MySQL schema migration causes high load, automated HA solution causes cascading database failure<p>- MySQL cluster becomes out of sync<p>- HA solution segfaults<p>- Redis and MySQL become out of sync<p>- Incorrect users have access to private repositories!<p>Cleanup and recovery takes time; all I can say is, <i>I'm glad it was not me who had that mess to clean up</i>. I'm sure they are still working on it too!<p>This brings to mind some of my bad days... OOM killer decides your Sybase database is using too much memory. Hardware error on the DRBD master causes silent data corruption (this took a lot of recovery time on TBs of data). I've been bitten by a MySQL master/slave getting out of sync. That is a bad place to be in... do you copy your master database to the slaves? That takes a long time, even on a fast network.
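On the out-of-sync master/slave problem: before doing the full re-copy, a quick drift check can at least tell you how bad it is. A rough sketch (pymysql assumed, table names are placeholders; the proper tool for this is something like pt-table-checksum):<p><pre><code> # Spot-check whether a replica has drifted from its master.
 import pymysql

 TABLES = ["users", "repositories"]  # placeholder table names

 def table_checksums(host, user, password, db, tables):
     conn = pymysql.connect(host=host, user=user, password=password, database=db)
     checksums = {}
     try:
         with conn.cursor() as cur:
             for t in tables:
                 # CHECKSUM TABLE scans the whole table, so this is not free on big tables.
                 cur.execute("CHECKSUM TABLE `%s`" % t)
                 _, checksum = cur.fetchone()
                 checksums[t] = checksum
     finally:
         conn.close()
     return checksums

 def drifted_tables(master_host, replica_host, **creds):
     master = table_checksums(master_host, tables=TABLES, **creds)
     replica = table_checksums(replica_host, tables=TABLES, **creds)
     return [t for t in TABLES if master[t] != replica[t]]
</code></pre>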
The lack of any negative response on this thread is a testament both to the thoroughness of the post-mortem, and the outstanding quality of GitHub in general.<p>In GitHub we trust. I can't imagine putting my code anywhere else right now.
The part of this post that really blew my mind:<p><pre><code> We host our status site on Heroku to ensure its availability
during an outage. However, during our downtime on Tuesday
our status site experienced some availability issues.
As traffic to the status site began to ramp up, we increased
the number of dynos running from 8 to 64 and finally 90.
This had a negative effect since we were running an old
development database addon (shared database). The number of
dynos maxed out the available connections to the database
causing additional processes to crash.
</code></pre>
Ninety dynos for a status page? What was going on there?
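Back-of-the-envelope, with made-up per-dyno numbers, on why adding dynos just made the connection problem worse:<p><pre><code> # The per-dyno worker count and the plan's connection cap are assumptions,
 # not numbers from the post; the shape of the problem is the point.
 WEB_PROCESSES_PER_DYNO = 4     # e.g. 4 web workers per dyno (assumed)
 CONNECTIONS_PER_PROCESS = 1
 PLAN_CONNECTION_LIMIT = 20     # hypothetical cap on an old shared-db plan

 for dynos in (8, 64, 90):
     needed = dynos * WEB_PROCESSES_PER_DYNO * CONNECTIONS_PER_PROCESS
     verdict = "OK" if needed <= PLAN_CONNECTION_LIMIT else "over the limit"
     print(f"{dynos:3d} dynos -> {needed:4d} connections ({verdict})")
</code></pre>
Every extra dyno is extra demand on a fixed connection cap, so scaling the web tier only pushes the database over sooner.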
Well, I have to say... replication-related issues like this are why I/we are now using a Galera-backed DB cluster. No need to worry about which server is active/passive; you can technically have them all live all the time. In our case we have two live and one failover that only gets accessed by backup scripts and some maintenance tasks.<p>Once we got the kinks worked out it has been performing amazingly! I wonder if GitHub looked into this kind of setup before selecting the cluster they did.
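For what it's worth, the client side gets pretty simple too. A loose sketch (pymysql assumed, hostnames are placeholders) of "every node is live, just try the next one" -- in practice most people put haproxy in front instead:<p><pre><code> import pymysql

 NODES = ["galera1.internal", "galera2.internal", "galera3.internal"]  # placeholders

 def connect_any(user, password, db, nodes=NODES):
     last_error = None
     for host in nodes:
         try:
             return pymysql.connect(host=host, user=user,
                                    password=password, database=db)
         except pymysql.err.OperationalError as exc:
             last_error = exc   # node down or desynced; try the next one
     raise RuntimeError("no Galera node reachable") from last_error
</code></pre>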
If GitHub hasn't gotten their custom HA solution right, will you?<p>Digging into their fix, they disabled automatic failover -- so all DB failures will now require manual intervention. While that addresses this particular (erroneous) failover condition, it does raise the minimum downtime for true failures. Their MySQL replica's misconfiguration upon switching masters is also tied to their (stopgap) approach to preventing the hot failover. So the second problem was due to a misuse/misunderstanding of maintenance mode.<p>How is it possible that a slave could be pointed at the wrong master and have nobody notice for a day? What is the checklist to confirm that failover has occurred correctly?<p>There is also a lesson to be learned in the fact that their status page had scaling issues due to db connection limits. Static files are the most dependable!
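A post-failover checklist can be as simple as asking every replica who it thinks its master is. A rough sketch of that check (pymysql assumed, hostnames are placeholders):<p><pre><code> import pymysql

 EXPECTED_MASTER = "db-new-master.internal"          # placeholder
 REPLICAS = ["db-replica1.internal", "db-replica2.internal"]

 def verify_replication(user, password):
     problems = []
     for host in REPLICAS:
         conn = pymysql.connect(host=host, user=user, password=password)
         try:
             with conn.cursor(pymysql.cursors.DictCursor) as cur:
                 cur.execute("SHOW SLAVE STATUS")
                 status = cur.fetchone()
                 if status is None:
                     problems.append(f"{host}: not replicating at all")
                     continue
                 if status["Master_Host"] != EXPECTED_MASTER:
                     problems.append(f"{host}: replicating from {status['Master_Host']}")
                 if status["Slave_IO_Running"] != "Yes" or status["Slave_SQL_Running"] != "Yes":
                     problems.append(f"{host}: replication threads not running")
         finally:
             conn.close()
     return problems
</code></pre>
Run something like that (plus a replication-lag check) after every master switch and a mis-pointed slave gets caught in minutes, not a day.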
"As traffic to the status site began to ramp up, we increased the number of dynos running from 8 to 64 and finally 90."<p>Wait, why isn't there some caching layer? eg. Generate a static page or use Varnish.<p>This part makes no sense at all.<p>At most you're then firing up another 5 dynos (or none) to handle the traffic. 90 is ridiculous.
<i>"16 of these repositories were private, and for seven minutes from 8:19 AM to 8:26 AM PDT on Tuesday, Sept 11th, were accessible to people outside of the repository's list of collaborators or team members"</i><p>ouch!
The master-first update strategy is interesting. I've always seen it done the other way: update the standby, flip to the standby, verify, then update the original master.
Auto-inc db keys once again cause horribleness. Nothing new there, I suppose.
And as mentioned, the multi-dyno + DB-read status page is craaaazy. Why oh why isn't this a couple of static objects? Automagically generate and push them if you want. Give 'em a 60 second TTL and call it a day. Put them behind a different CDN &amp; DNS than the rest of your site for bonus points.
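The whole thing fits in a few lines. A minimal sketch of "generate a static object, give it a 60 second TTL" -- S3 behind a CDN is my assumption here, not anything GitHub said they use:<p><pre><code> import json
 import boto3

 def publish_status(status_text, bucket="status-page-bucket"):  # placeholder bucket
     body = json.dumps({"status": status_text})
     s3 = boto3.client("s3")
     s3.put_object(
         Bucket=bucket,
         Key="status.json",
         Body=body.encode("utf-8"),
         ContentType="application/json",
         CacheControl="max-age=60",   # the 60 second TTL mentioned above
     )
</code></pre>
Regenerate it on a timer or on each incident update, and the database never sees status-page traffic at all.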
Interesting to read about github using MySQL instead of Postgres. Anyone know why? I am just curious because of all the MySQL bashing I hear in the echo chamber.
Genuine question: GitHub is built upon git, which is a rock-solid system for storing data, and in these reports we read that GitHub relies a lot on MySQL, so... did the GitHub guys ponder using git as their data store? Just as an example, in git one can attach comments to commits; would it be possible to use that for the GitHub comment feature? Or maybe it already is?
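Git does have a built-in mechanism for this -- notes attached to commits -- so a toy version of "comments stored in the repo itself" might look like this (purely illustrative, not how GitHub actually does it):<p><pre><code> import subprocess

 def add_comment(repo_path, commit, text):
     # `git notes append` attaches (or extends) a note on the given commit.
     subprocess.run(
         ["git", "-C", repo_path, "notes", "append", "-m", text, commit],
         check=True,
     )

 def read_comments(repo_path, commit):
     result = subprocess.run(
         ["git", "-C", repo_path, "notes", "show", commit],
         capture_output=True, text=True,
     )
     return result.stdout if result.returncode == 0 else ""
</code></pre>
Of course you'd still want something queryable for cross-repo features (search, notifications, permissions), which is presumably where the relational database comes in.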