Today's Outage Post Mortem

216 points by buttscicles about 12 years ago

27 comments

druiid about 12 years ago
As always, I'm glad to see CloudFlare post such detailed outage reports. They are one of the few providers I know of that is willing to go into such depth, and that is one of the things I appreciate about them. That said, the outage that occurred was indeed fully preventable. We don't have nearly as many locations as they do, but for internal resources at least, not pushing configuration changes to all devices at once (network gear included) is pretty standard practice. I imagine a good routine for them might be to script changes so that they are 'rolled out': push manual changes to a scripted 'random' router set (one in country A, B, C), wait 15 minutes, and then push to the remaining router sets. That wouldn't work for all situations, such as when the entire network is seeing a DDoS, but I imagine they could adapt a routine that would prevent this particular scenario.

With all of that said, as a CloudFlare customer who already has a call scheduled with them tomorrow over the WAF stuff, I find it a bit... frustrating that this kind of mistake is occurring now.

Edit: As an aside, I wonder if the Puppet module for Junos will be extended to support route statements. That would make this kind of deployment much easier.
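A minimal sketch of that staged roll-out, in Python, assuming a toy inventory keyed by site; the push and health-check functions are placeholders for whatever mechanism (NETCONF, a Puppet run, a CLI script) actually applies the change:

    import random
    import time

    # Hypothetical inventory: edge routers grouped by site.
    ROUTERS_BY_SITE = {
        "ams": ["edge-ams-1", "edge-ams-2"],
        "sjc": ["edge-sjc-1", "edge-sjc-2"],
        "hkg": ["edge-hkg-1", "edge-hkg-2"],
    }

    def push_rule(router, rule):
        # Placeholder for the real config push (NETCONF, Puppet, CLI script, ...).
        print(f"applying {rule!r} to {router}")

    def healthy(router):
        # Placeholder health check: device reachable, BGP sessions up, CPU sane.
        return True

    def phased_rollout(rule, soak_minutes=15):
        # Phase 1: one randomly chosen router per site.
        canaries = [random.choice(routers) for routers in ROUTERS_BY_SITE.values()]
        for router in canaries:
            push_rule(router, rule)

        time.sleep(soak_minutes * 60)  # let the change soak before going wider

        if not all(healthy(r) for r in canaries):
            raise RuntimeError("canary routers unhealthy; aborting roll-out")

        # Phase 2: the remaining routers.
        for routers in ROUTERS_BY_SITE.values():
            for router in routers:
                if router not in canaries:
                    push_rule(router, rule)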
ryguytilidie about 12 years ago
This is pretty impressive. Keep in mind most of the team is on the west coast so this happened at 1am on a Sunday and they put up a post mortem within hours. Obviously you would prefer it not happen at all, but that is a great response imo.
powertower about 12 years ago
> CloudFlare currently runs 23 data centers worldwide.

Shouldn't that always say "CloudFlare currently runs *in* 23 data centers worldwide"?

Or is that just how you phrase it when you rent multiple racks or a cage in a data center? I've seen that wording plenty of times before, from just about everyone.

Just curious.
yRetsyM about 12 years ago
I'm not very educated on this end of the spectrum, but I wonder if a process is possible where a rule or router update of some description is applied to one router only, testing the specific change before pushing it to the rest of the routers, thereby failing one router rather than all of them. I understand the need to respond as quickly as possible, but as stated, in this case the response was already manual.

With my limited knowledge and non-existent experience, it seems like this would go a long way toward preventing this from happening again.
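A compressed sketch of that one-router-first idea; the apply/health/rollback helpers are passed in because they are hypothetical, and the only real content is the control flow:

    def canary_then_fleet(rule, routers, apply_rule, is_healthy, rollback):
        # Apply the new rule to a single router; only fan out to the rest
        # if that router stays healthy, otherwise roll back and stop.
        canary, *rest = routers
        apply_rule(canary, rule)
        if not is_healthy(canary):
            rollback(canary)
            raise RuntimeError(f"{canary} failed after the rule change; roll-out aborted")
        for router in rest:
            apply_rule(router, rule)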
jcr about 12 years ago
To the CloudFlare folks: it's refreshing to see you take responsibility, but I think you've been a bit too hard on yourselves by taking all the blame. First of all, what you hit was an unknown bug in JunOS, and Juniper is to blame for their part. Using some form of staging to slow the roll-out of rule changes *might* have saved you from a full meltdown, but when you're getting attacked, every second counts. Slow versus fast roll-out is one of those really tough balancing acts in your situation. You did a great job with it; by the time I saw the "cloudflare is down" post in the newest queue, it was already back up and running again.
rainsford about 12 years ago
That was a pretty interesting writeup, and I always like it when companies are totally (and quickly) upfront about negative events.

One thing that occurred to me, though, is that performing a hard reboot of the routers required calling people to physically access the devices and took some time (as you would expect). Although I wouldn't expect it to be needed very often, I'm somewhat surprised CloudFlare doesn't have out-of-band remote power-cycle capability.

There may be some factor I'm not considering that makes that an unattractive option, but it does seem like it could cut an already quick response time down even further for any similar events in the future.
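Out-of-band power control is usually a one-liner against the device's management controller or a managed PDU. A sketch of the remote hard reset rainsford has in mind, assuming IPMI-reachable hardware and hypothetical hostnames and credentials:

    import subprocess

    def remote_power_cycle(bmc_host, user, password):
        # Hard power cycle a wedged device via its out-of-band BMC,
        # without waiting for remote hands in the data center.
        subprocess.run(
            ["ipmitool", "-I", "lanplus",
             "-H", bmc_host, "-U", user, "-P", password,
             "chassis", "power", "cycle"],
            check=True,
        )

    # Hypothetical usage:
    # remote_power_cycle("edge-ams-1-oob.example.net", "admin", "secret")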
dododo about 12 years ago
If you want to build a reliable system, one useful thing to do is use equipment from multiple vendors. Sure, it's inconvenient, but by doing this you can often de-correlate failures, especially if your job is to improve someone else's reliability.

E.g., from simple things like building a RAID out of hard drives from different vendors, up to n-version programming in safety-critical systems (like airplanes).
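The de-correlation argument is just independence arithmetic: a vendor-specific bug can only take down that vendor's gear, so a total outage of a two-vendor fleet needs two independent bugs to land at once. A toy calculation (the probability is made up purely for illustration):

    # Made-up probability that a fleet-wide, vendor-specific bug bites in a given year.
    p_vendor_bug = 0.05

    # Single-vendor fleet: one such bug takes everything down.
    p_total_outage_single = p_vendor_bug          # 0.05

    # Two-vendor fleet: a total outage needs both vendors' bugs at once,
    # assuming the failures really are independent (which is the whole point).
    p_total_outage_dual = p_vendor_bug ** 2       # 0.0025

    print(p_total_outage_single, p_total_outage_dual)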
senthilnayagam about 12 years ago
So as far as that goes, the DoS attack was very successful: it took down the site it was aimed at, and took the network down along with lots and lots of other sites.

Hope lessons are learned and your next generation is less prone to these attacks.
DigitalSea about 12 years ago
Rather unfortunate for CloudFlare's credibility as a network provider, but you've got to admire them for their honesty, and it'll work out better for them in the end. It's amazing how a few lines of code managed to bring down CloudFlare. They could have told us anything and nobody would have been able to question it; instead they gave us the truth, and I really respect that. They didn't blame an intern, they didn't blame their hardware or make an excuse about a power outage. In terms of honesty, CloudFlare seems to be leading the way, regardless of their public credibility or image being tainted. Very impressive response time and resolution of the issue as well. Good job, CloudFlare!
rdl about 12 years ago
Wow, that's pretty fast turnaround for a post-mortem (although it looks to have been a simple problem, so easier to figure out what to write).
BoyWizard about 12 years ago
Two things:

1. That video of the BGP routes disappearing is awesome, and

2. A 40-minute outage sounds bad, but consider the following timeline (based on the writeup):

> T+0: route change made, propagates

> T+10: Response team online, attempting local fixes

> T+30: Routers across 23 data centres in 14 countries hard reset and networks coming back up.
DoubleMalt about 12 years ago
Funny that now the post mortem is down ...
onemorepassword about 12 years ago
> Even though some data centers came back online initially, they fell back over again because all the traffic across our entire network hit them and overloaded their resources.

I know very little about networking, but this seems to be a recurring pattern that aggravates many major outages. What surprises me is that this scenario so often seems not to be accounted for.
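One common defence against that pattern is to readmit traffic gradually and shed anything above the recovering site's capacity, rather than letting the whole backlog land at once. A rough sketch with made-up numbers:

    def readmit_traffic(site_capacity_rps, global_demand_rps, ramp_steps=5):
        # Bring a recovered data center back in steps instead of exposing it
        # to the entire network's traffic immediately.
        for step in range(1, ramp_steps + 1):
            admitted = min(global_demand_rps, site_capacity_rps * step / ramp_steps)
            shed = max(0.0, global_demand_rps - admitted)
            print(f"step {step}: admit {admitted:.0f} rps, shed {shed:.0f} rps")

    # Made-up numbers; note the final step still sheds load, because one site
    # cannot absorb the whole network's demand on its own.
    readmit_traffic(site_capacity_rps=50_000, global_demand_rps=400_000)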
jaequery about 12 years ago
This is the kind of reason why I stopped using CloudFlare. There are just too many eggs in one basket; it's as if their entire service becomes a SPOF for your infrastructure.
Flow about 12 years ago
I think you should have investigated why you got ~90 KB packets despite having a max packet size of ~4 KB, instead of putting in that rule. :)
noselasd about 12 years ago
> attack packets were between 99,971 and 99,985 bytes long.

This should raise a red flag, as it must be impossible. Ethernet NICs would just bail out on packets longer than the configured MTU, and the Ethernet frames would be coming from the next hop in most cases. And the IP packet length field is only 16 bits.
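That impossibility is easy to check: the IPv4 Total Length header field is 16 bits wide, so no single IP packet can exceed 65,535 bytes, which the quoted sizes all do:

    # The IPv4 "Total Length" field is 16 bits wide.
    MAX_IPV4_PACKET = 2 ** 16 - 1             # 65,535 bytes

    reported_attack_sizes = (99_971, 99_985)  # figures quoted from the post mortem
    print(all(size > MAX_IPV4_PACKET for size in reported_attack_sizes))  # True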
brokentone about 12 years ago
Impressive response. A 30-minute outage for something most of the hosts I've worked with in the past would have been mystified about for hours. Then a quick RFO and a promise of proactive SLA adjustments? Next time I need a CDN or attack mitigation, I'll be talking to CloudFlare.
tedchs about 12 years ago
What I don't understand is why CloudFlare is making changes to their border routers in the process of protecting their customers. I am a network engineer and I love Juniper, but the reality is that with any complex system, every change you make has some possibility of inducing an unexpected failure. I would think CloudFlare would get more stability from an architecture where the border routers have a mostly static config, with a set of firewalls (e.g. Juniper SRX 5800) behind them doing the actual filtering and changing configs in response to threats.
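A toy illustration of that separation of concerns: dynamic mitigation rules are only ever pushed to the filtering tier, and any attempt to touch a border router outside a maintenance window is refused. The tier names and the deploy function are hypothetical:

    BORDER_TIER = {"edge-rtr-1", "edge-rtr-2"}   # mostly static configuration
    FILTER_TIER = {"fw-1", "fw-2"}               # where dynamic rules belong

    def deploy_mitigation_rule(rule, targets, maintenance_window=False):
        # Refuse to modify border routers for routine mitigations; those
        # changes belong on the firewall tier behind them.
        touched_border = targets & BORDER_TIER
        if touched_border and not maintenance_window:
            raise PermissionError(
                f"refusing to modify border routers {sorted(touched_border)}; "
                "target the filter tier instead")
        for device in targets & FILTER_TIER:
            print(f"pushing {rule!r} to {device}")  # placeholder for the real push

    deploy_mitigation_rule("drop packets longer than 65535 bytes", FILTER_TIER)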
random42 about 12 years ago
OT: I want to pitch CloudFlare for our CDN needs. Can someone estimate the scale of CloudFlare relative to Akamai (our current provider), in terms of operations, customers, etc.?
contingencies about 12 years ago
*Developing good software comes down to consistently carrying out fundamental practices (regardless of the technology)* - Paul M. Duvall

In this case: Development. Versioned change. Test or staging environment. Tests pass. Production.
lazyjones about 12 years ago
So what are they going to change as a consequence? It seems logical not to rely on a single router vendor anymore, or at least to test new rules on a staging setup for a very short time before pushing them to all routers.
TranceMan about 12 years ago
Just wondering if the large packets came from a [large] range of hosts or maybe a single host?

Ouch if a single host's activity took down ~750k websites, whether deliberate and direct or not.
ralph about 12 years ago
Presumably Juniper's Junos is closed-source, making investigation more difficult? Do they provide it to some of their bigger clients under an agreement?
Ecio78 about 12 years ago
I got an "Oops there was a problem" page from Posterous while trying to open the blog page/site...
rschmitty about 12 years ago
Now we need a Post Mortem on the Post Mortem, as it is now down.

"Oh noes! Something went wrong."
newman314 about 12 years ago
Posterous seems to be down.
graycat about 12 years ago
Yes, case number 384,449,194 of systems management causing a system problem. Also case number 439,224 of what looked like a localized problem quickly causing a huge system, e.g., all 23 data centers around the world, to crash.

They have my sympathy: so, they typed in a 'rule'. At one time I was working in 'artificial intelligence' (AI), actually 'expert systems', based on using 'rules' to implement real-time management of server farms and networks. Of course, in that work, goals included 'lights out data centers', that is, not needing people walking around doing manual work (though not 'lights out' in the sense of the CloudFlare outage), and very high reliability.

Looking into reliability, that is, putting the causes of outages into a few broad categories, a category causing a large fraction of the outages was humans doing system management, or, as in the words of the HAL 9000, "human error". Yup.

And the whole thing went down? Yup. One example we worked with was system management of a 'cluster'. Well, one of the computers in the cluster "went a little funny, a little funny in the head" and was throwing all its incoming work into its 'bit bucket'. So the CPU-busy metric on that computer was not very high, and the load leveling started sending nearly all the work to that one computer and, thus, into its bit bucket, effectively killing the work of the whole cluster.

As one response, I decided that real-time monitoring of a cluster, or of any system that is supposed to be 'well balanced' via some version of 'load leveling', should include looking for 'out of balance' situations.

So, let's see: such monitoring can have false positives (false alarms) and false negatives (missed detections). So such monitoring is essentially a case of statistical hypothesis testing, typically with the 'null hypothesis' that the system is healthy, applied continually in near real-time. For monitoring 'balancing', we will likely have to work with multi-dimensional data. Next, our chances of knowing the probability distribution of that data, even for a healthy system, are slim to none. So we need a statistical hypothesis test that is both multi-dimensional and distribution-free.

So CloudFlare's problems are not really new!

I went ahead and did some work, math, prototype software, etc., and maybe someday it will be useful, but it wouldn't have helped CloudFlare here, if only because they needed no help noticing that all their systems around the world were crashing.

In our work on AI, we at times visited some high-end sites, and in some cases we found extreme, off-the-top-of-the-charts concern and discipline around who could take any system-management action, and what and why. E.g., they had learned the lesson that you can't let someone just type a new rule into a production system. Why? Because, it was explained, one outage in a year and the CIO would lose his bonus; two outages and he would lose his job. Net, we're talking very high concern. No doubt CloudFlare will install lots of discipline around humans taking system-management actions on their production systems.

Net, I can't blame CloudFlare. If my business gets big enough to need their services, they will be high on the list of companies I call first!
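A much cruder stand-in for the kind of imbalance monitor described above: compare each node's share of completed work against an equal split and flag outliers. This is not the multi-dimensional, distribution-free test the comment alludes to, just the flavor of the check:

    def detect_imbalance(completed_work, tolerance=0.5):
        # Flag nodes whose share of completed work deviates from an equal split
        # by more than `tolerance` of that share; a node silently dropping work
        # into its 'bit bucket' shows up far below its expected share.
        total = sum(completed_work.values())
        expected_share = 1.0 / len(completed_work)
        suspects = []
        for node, done in completed_work.items():
            share = done / total if total else 0.0
            if abs(share - expected_share) > tolerance * expected_share:
                suspects.append(node)
        return suspects

    # The "bit bucket" node from the anecdote: low CPU, almost no completed work.
    print(detect_imbalance({"node-a": 1000, "node-b": 1000, "node-c": 50}))
    # -> ['node-c']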