> One of the biggest customer-facing effects of this delay was that status.github.com wasn't set to status red until 00:32am UTC, eight minutes after the site became inaccessible. We consider this to be an unacceptably long delay, and will ensure faster communication to our users in the future.<p>Amazon could learn a thing or two from Github in terms of understanding customer expectations.
There's no mention of why they don't have redundant systems in more than one datacenter. As they say, it is unavoidable to have power or connectivity disruptions in a datacenter. This is why reliable configurations have redundancy in another datacenter elsewhere in the world.
For all that work to be done in just two hours is amazing, especially with degraded internal tools, and both hardware and ops teams working simultaneously.
I don't know enough about server infrastructure to comment on whether or not Github was adequately prepared or reacted appropriately to fix the problem.<p>But wow it is refreshing to hear a company take full responsibility and own up to a mistake/failure and apologize for it.<p>Like people, all companies will make mistakes and have momentary problems. It's normal. So own up to it and learn how to avoid the mistake in the future.
Does GitHub run anything like Netflix's Simian Army against its services? As a company built by engineers for engineers, at the scale GitHub has reached, I'm a bit surprised they don't have a bit more redundancy. Though they may not need the uptime of Netflix, an outage of more than a few minutes on GitHub could affect businesses that rely on the service.
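For anyone curious what the smallest version of that kind of fault injection looks like, here's a rough Python sketch in the spirit of the Simian Army. Everything in it (the host list, the container names, the use of `ssh` plus `docker stop`) is a hypothetical placeholder, not anything GitHub or Netflix actually runs:<p><pre><code>"""Minimal chaos-testing sketch: randomly stop one service instance
and let monitoring/failover prove the system recovers.
Hosts, containers, and the docker command are placeholders.
"""
import random
import subprocess

# Hypothetical staging fleet: (host, container) pairs safe to kill.
CANDIDATES = [
    ("staging-web-1", "app"),
    ("staging-web-2", "app"),
    ("staging-redis-1", "redis"),
]

def kill_random_instance(dry_run=True):
    host, container = random.choice(CANDIDATES)
    cmd = ["ssh", host, "docker", "stop", container]
    if dry_run:
        # Print instead of acting, so the sketch is safe to run as-is.
        print("Would run:", " ".join(cmd))
        return
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    kill_random_instance(dry_run=True)
</code></pre>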
Every time I read about a massive systems failure, I think of Jurassic Park and am mildly grateful that the velociraptor paddock wasn't depending on the system's operation.
This just shows how difficult it is to avoid hidden dependencies without a complete, cleanly isolated testing environment of sufficient scale to replicate production operations, where you can run strange system-fault scenarios without killing production.
> ... Updating our tooling to automatically open issues for the team when new firmware updates are available will force us to review the changelogs against our environment.<p>That's an awesome idea. I wish all companies published the firmware releases in simple rss feeds, so everyone could easily integrate them with their trackers.<p>(If someone's bored, that may be a nice service actually ;) )
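For a sense of how little glue that would take: a minimal sketch, assuming a hypothetical vendor RSS feed URL and a placeholder tracker repo, that polls the feed and opens a GitHub issue per new release via the issues API:<p><pre><code>"""Sketch: poll a (hypothetical) firmware release RSS feed and open a
GitHub issue for each new entry so the team reviews the changelog.
Requires: pip install feedparser requests
"""
import os
import feedparser
import requests

FEED_URL = "https://vendor.example.com/firmware/releases.rss"  # hypothetical feed
REPO = "myorg/ops-tracker"                                      # placeholder repo
TOKEN = os.environ["GITHUB_TOKEN"]

def open_issues_for_new_firmware(seen_ids):
    feed = feedparser.parse(FEED_URL)
    for entry in feed.entries:
        entry_id = entry.get("id", entry.link)
        if entry_id in seen_ids:
            continue
        resp = requests.post(
            f"https://api.github.com/repos/{REPO}/issues",
            headers={"Authorization": f"token {TOKEN}"},
            json={
                "title": f"Review firmware release: {entry.title}",
                "body": f"Changelog: {entry.link}\n\nCheck applicability to our fleet.",
            },
        )
        resp.raise_for_status()
        seen_ids.add(entry_id)
</code></pre><p>Run it from cron with a persisted set of seen entry IDs and you get roughly the workflow the postmortem describes.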
> Remote access console screenshots from the failed hardware showed boot failures because the physical drives were no longer recognized.<p>I'm getting flashbacks. All of the servers in the DC reboot and NONE of them come online. No network or anything. Even remotely rebooting them again we had nothing. Finally getting a screen (which is a pain in itself), we saw they were all stuck on a grub screen. Grub had detected an error and decided not to boot automatically. Needless to say we patched grub and removed this "feature" promptly!
You can very clearly see two kinds of people posting on this thread: those who have actually dealt with failures of complex distributed systems, and those who think it's easy.
<i>"We identified the hardware issue resulting in servers being unable to view their own drives after power-cycling as a known firmware issue that we are updating across our fleet."</i><p>Tell us which vendor shipped that firmware, so everyone else can stop buying from them.
I feel it was a good incident for the Open Source community, to see how dependent we are on GitHub today. I feel sad whenever I see another large project like Python moving to GitHub, a closed-source company. I know GitLab is there as an alternative, but I would love to see all the big Open Source projects putting pressure on GitHub to open their source code, as right now they are a big player in open source, like it or not.
It must be nice to know that the majority of your customers are familiar enough with the nature of your work that they'll actually understand a relatively complex issue like this. Almost by definition, we've all been there.
If only Bitbucket could give such comprehensive reports. A few months back, outages seemed almost daily. Things are more stable now; I hope that holds for the long term.
> Over the past week, we have devoted significant time and effort towards understanding the nature of the cascading failure which led to GitHub being unavailable for over two hours.<p>I don't mean to be blasphemous, but from a high level, are the performance issues with Ruby (and Rails) that necessitate close binding with Redis (i.e., lots of caching) part of the problem?<p>It sounds like the fundamental issue is not Ruby, nor Redis, but the close coupling between them. That's sort of interesting.
If you use Redis, you should try out Dynomite at <a href="http://github.com/Netflix/Dynomite" rel="nofollow">http://github.com/Netflix/Dynomite</a>. It can provide HA for Redis servers.
I would have expected there to be a notification system owned by the DC that literally sends an email to clients saying "Power blipped / failed".<p>That would have given them immediate context instead of wasting time on DDoS protection.
So, while it sounds like they have reasonable HA, they fell down on DR.
Unrelated, but I could not comprehend what this means:
> technicians to bring these servers back online by draining the flea power to bring<p>Flea power?
TL;DR: "We don’t believe it is possible to fully prevent the events that resulted in a large part of our infrastructure losing power, ..."<p>This doesn't sound very good.
> We had inadvertently added a hard dependency on our Redis cluster being available within the boot path of our application code.<p>I seem to recall a recent post on here about how you shouldn't have such hard dependencies. It's good advice.<p>Incidentally, this type of dependency is unlikely to happen if you have a shared-nothing model (like PHP has, for instance), because in such a system each request is isolated and tries to connect on its own.
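To make the contrast concrete, here's a minimal sketch in Python (not GitHub's stack; the cache host, timeout, and `load_from_database` helper are all assumptions) of treating the cache as a soft dependency: connect lazily per request and degrade gracefully, rather than requiring Redis in the boot path:<p><pre><code>"""Sketch: treat the cache as a soft dependency.
Instead of connecting to Redis at boot (a hard dependency), connect
lazily and fall back to the underlying data source when the cache is
unreachable. Requires: pip install redis
"""
import redis

_client = None

def _cache():
    # Lazy connection: nothing here runs at application boot.
    global _client
    if _client is None:
        _client = redis.Redis(host="cache.internal", port=6379,
                              socket_connect_timeout=0.1)
    return _client

def get_user(user_id, load_from_database):
    key = f"user:{user_id}"
    try:
        cached = _cache().get(key)
        if cached is not None:
            return cached
    except redis.exceptions.RedisError:
        pass  # Cache unavailable: take the slow path instead of failing.
    value = load_from_database(user_id)
    try:
        _cache().set(key, value, ex=300)
    except redis.exceptions.RedisError:
        pass  # Best-effort write-back; ignore cache outages.
    return value
</code></pre><p>A shared-nothing runtime nudges you toward this shape by default, but nothing stops you from doing the same in a long-running app server.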
> Because we have experience mitigating DDoS attacks, our response procedure is now habit and we are pleased we could act quickly and confidently without distracting other efforts to resolve the incident.<p>The thing that fixed the last problem doesn't always fix the current problem.
I seriously doubt this version of the story. While it's possible for hardware or firmware to fail in all your datacenters, for them to fail at the same time is highly unlikely. This may just be PR spin to make it seem like they're not vulnerable to security attacks.<p>While this was happening at GitHub, I noticed several other companies facing the same issue at the same time. Atlassian was down for the most part. It could have been an issue with a service GitHub uses, but they won't admit that. Notice they never said what the firmware issue was, instead blaming it on 'hardware'.<p>I think they should be transparent with people about such a vulnerability, but I suspect they would never say so because then they would lose revenue.<p>Here on my blog I talked about this issue: <a href="http://julesjaypaulynice.com/simple-server-malicious-attacks/" rel="nofollow">http://julesjaypaulynice.com/simple-server-malicious-attacks...</a><p>I think there was some DDoS campaign going on across the web.