> One of the biggest customer-facing effects of this delay was that status.github.com wasn't set to status red until 00:32am UTC, eight minutes after the site became inaccessible. We consider this to be an unacceptably long delay, and will ensure faster communication to our users in the future.<p>Amazon could learn a thing or two from Github in terms of understanding customer expectations.
There's no mention of why they don't have redundant systems in more than one datacenter. As they say, it is unavoidable to have power or connectivity disruptions in a datacenter. This is why reliable configurations have redundancy in another datacenter elsewhere in the world.
For all that work to be done in just two hours is amazing, especially with degraded internal tools, and both hardware and ops teams working simultaneously.
I don't know enough about server infrastructure to comment on whether or not Github was adequately prepared or reacted appropriately to fix the problem.<p>But wow it is refreshing to hear a company take full responsibility and own up to a mistake/failure and apologize for it.<p>Like people, all companies will make mistakes and have momentary problems. It's normal. So own up to it and learn how to avoid the mistake in the future.
Does GitHub run anything like Netflix's Simian Army against its services? As a company built by engineers for engineers, at the scale GitHub has reached, I'm a bit surprised they don't have a bit more redundancy. Though they may not need the uptime of Netflix, an outage of more than a few minutes on GitHub could affect businesses that rely on the service.
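For anyone curious what the smallest version of that kind of fault injection looks like, here's a rough Python sketch in the spirit of the Simian Army. Everything in it (the host list, the container names, the use of `ssh` plus `docker stop`) is a hypothetical placeholder, not anything GitHub or Netflix actually runs:<p><pre><code>"""Minimal chaos-testing sketch: randomly stop one service instance
and let monitoring/failover prove the system recovers.
Hosts, containers, and the docker command are placeholders.
"""
import random
import subprocess

# Hypothetical staging fleet: (host, container) pairs safe to kill.
CANDIDATES = [
    ("staging-web-1", "app"),
    ("staging-web-2", "app"),
    ("staging-redis-1", "redis"),
]

def kill_random_instance(dry_run=True):
    host, container = random.choice(CANDIDATES)
    cmd = ["ssh", host, "docker", "stop", container]
    if dry_run:
        # Print instead of acting, so the sketch is safe to run as-is.
        print("Would run:", " ".join(cmd))
        return
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    kill_random_instance(dry_run=True)
</code></pre>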
Every time I read about a massive systems failure, I think of Jurassic Park and am mildly grateful that the velociraptor paddock wasn't depending on the system's operation.
This just shows how difficult it is to avoid hidden dependencies without a complete, cleanly isolated testing environment of sufficient scale to replicate production operations, where you can run strange system-fault scenarios without killing production.
> ... Updating our tooling to automatically open issues for the team when new firmware updates are available will force us to review the changelogs against our environment.<p>That's an awesome idea. I wish all companies published the firmware releases in simple rss feeds, so everyone could easily integrate them with their trackers.<p>(If someone's bored, that may be a nice service actually ;) )
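For a sense of how little glue that would take: a minimal sketch, assuming a hypothetical vendor RSS feed URL and a placeholder tracker repo, that polls the feed and opens a GitHub issue per new release via the issues API:<p><pre><code>"""Sketch: poll a (hypothetical) firmware release RSS feed and open a
GitHub issue for each new entry so the team reviews the changelog.
Requires: pip install feedparser requests
"""
import os
import feedparser
import requests

FEED_URL = "https://vendor.example.com/firmware/releases.rss"  # hypothetical feed
REPO = "myorg/ops-tracker"                                      # placeholder repo
TOKEN = os.environ["GITHUB_TOKEN"]

def open_issues_for_new_firmware(seen_ids):
    feed = feedparser.parse(FEED_URL)
    for entry in feed.entries:
        entry_id = entry.get("id", entry.link)
        if entry_id in seen_ids:
            continue
        resp = requests.post(
            f"https://api.github.com/repos/{REPO}/issues",
            headers={"Authorization": f"token {TOKEN}"},
            json={
                "title": f"Review firmware release: {entry.title}",
                "body": f"Changelog: {entry.link}\n\nCheck applicability to our fleet.",
            },
        )
        resp.raise_for_status()
        seen_ids.add(entry_id)
</code></pre><p>Run it from cron with a persisted set of seen entry IDs and you get roughly the workflow the postmortem describes.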
> Remote access console screenshots from the failed hardware showed boot failures because the physical drives were no longer recognized.<p>I'm getting flashbacks. All of the servers in the DC reboot and NONE of them come online. No network or anything. Even remotely rebooting them again we had nothing. Finally getting a screen (which is a pain in itself), we saw they were all stuck on a grub screen. Grub had detected an error and decided not to boot automatically. Needless to say we patched grub and removed this "feature" promptly!
You can very clearly see two kinds of people posting on this thread: those who have actually dealt with failures of complex distributed systems, and those who think it's easy.
<i>"We identified the hardware issue resulting in servers being unable to view their own drives after power-cycling as a known firmware issue that we are updating across our fleet."</i><p>Tell us which vendor shipped that firmware, so everyone else can stop buying from them.
I feel it was a good incident for the Open Source community, to see how dependent we are on GitHub today. I feel sad whenever I see another large project like Python moving to GitHub, a closed-source company. I know GitLab is there as an alternative, but I would love to see all the big Open Source projects putting pressure on GitHub to open their source code, as right now they are a big player in open source, like it or not.
It must be nice to know that the majority of your customers are familiar enough with the nature of your work that they'll actually understand a relatively complex issue like this. Almost by definition, we've all been there.
If only Bitbucket could give such comprehensive reports. A few months back, outages seemed almost daily. Things are more stable now; I hope that holds for the long term.
> Over the past week, we have devoted significant time and effort towards understanding the nature of the cascading failure which led to GitHub being unavailable for over two hours.<p>I don't mean to be blasphemous, but from a high level, are the performance issues with Ruby (and Rails) that necessitate close binding with Redis (i.e., lots of caching) part of the problem?<p>It sounds like the fundamental issue is not Ruby, nor Redis, but the close coupling between them. That's sort of interesting.
If you use Redis, you should try out Dynomite at <a href="http://github.com/Netflix/Dynomite" rel="nofollow">http://github.com/Netflix/Dynomite</a>. It can provide HA for Redis servers.
I would have expected there to be a notification system owned by the DC that literally sends an email to clients saying "Power blipped / failed".<p>That would have given them immediate context instead of wasting time on DDoS protection.
So, while it sounds like they have reasonable HA, they fell down on DR.
Unrelated, but I could not comprehend what this means:
> technicians to bring these servers back online by draining the flea power to bring<p>Flea power?
TL;DR: "We don’t believe it is possible to fully prevent the events that resulted in a large part of our infrastructure losing power, ..."<p>This doesn't sound very good.
> We had inadvertently added a hard dependency on our Redis cluster being available within the boot path of our application code.<p>I seem to recall a recent post on here about how you shouldn't have such hard dependencies. It's good advice.<p>Incidentally, this type of dependency is unlikely to happen if you have a shared-nothing model (like PHP has, for instance), because in such a system each request is isolated and tries to connect on its own.
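To make the contrast concrete, here's a minimal sketch in Python (not GitHub's stack; the cache host, timeout, and `load_from_database` helper are all assumptions) of treating the cache as a soft dependency: connect lazily per request and degrade gracefully, rather than requiring Redis in the boot path:<p><pre><code>"""Sketch: treat the cache as a soft dependency.
Instead of connecting to Redis at boot (a hard dependency), connect
lazily and fall back to the underlying data source when the cache is
unreachable. Requires: pip install redis
"""
import redis

_client = None

def _cache():
    # Lazy connection: nothing here runs at application boot.
    global _client
    if _client is None:
        _client = redis.Redis(host="cache.internal", port=6379,
                              socket_connect_timeout=0.1)
    return _client

def get_user(user_id, load_from_database):
    key = f"user:{user_id}"
    try:
        cached = _cache().get(key)
        if cached is not None:
            return cached
    except redis.exceptions.RedisError:
        pass  # Cache unavailable: take the slow path instead of failing.
    value = load_from_database(user_id)
    try:
        _cache().set(key, value, ex=300)
    except redis.exceptions.RedisError:
        pass  # Best-effort write-back; ignore cache outages.
    return value
</code></pre><p>A shared-nothing runtime nudges you toward this shape by default, but nothing stops you from doing the same in a long-running app server.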
> Because we have experience mitigating DDoS attacks, our response procedure is now habit and we are pleased we could act quickly and confidently without distracting other efforts to resolve the incident.<p>The thing that fixed the last problem doesn't always fix the current problem.
I seriously doubt this version of the story. While it's possible for hardware or firmware to fail in all your datacenters, for them to fail at the same time is highly unlikely. This may just be PR spin to make it seem like they're not vulnerable to security attacks.<p>While this was happening at GitHub, I noticed several other companies facing the same issue at the same time. Atlassian was down for the most part. It could have been an issue with a service GitHub uses, but they won't admit that. Notice they never said what the firmware issue was, instead blaming it on 'hardware'.<p>I think they should be transparent with people about such a vulnerability, but I suspect they would never say so because then they would lose revenue.<p>Here on my blog I talked about this issue: <a href="http://julesjaypaulynice.com/simple-server-malicious-attacks/" rel="nofollow">http://julesjaypaulynice.com/simple-server-malicious-attacks...</a><p>I think there was some DDoS campaign going on across the web.