Cloudflare outage on July 17, 2020

522 points by tomklein almost 5 years ago

37 comments

QuentinM almost 5 years ago
Head of DevOps at a major financial exchange where latency & resiliency are at the heart of our business, and yes, we pay Cloudflare millions. I see two things here:

# Just be ready

This is most definitely not the first time Cloudflare has had trouble, and, just like any other system, it will fail eventually. If you're complaining about the outage, ask yourself: why were you not prepared for this eventuality?

Spread your name servers, and use short-TTL weighted CNAMEs, defaulting to, say, 99% Cloudflare, 1% your internal load balancer. The minute Cloudflare seems problematic, make it 0% / 100% to bypass Cloudflare's infrastructure completely. This should be tested periodically to ensure that your backends are able to scale and take the load without shedding due to the lack of a CDN.

# Management practices

Cloudflare's core business is networking. It actually embarrasses me to see that Cloudflare YOLO'd a BGP change in a Juniper terminal, without peer review and without a proper administration dashboard exposing safeguarded operations, a simulation engine, and so on. In particular, re-routing traffic / bypassing POPs must be a frequent task at their scale; how can that not be automated so as to avoid human mistakes?

If you look at the power rails of serious data centers, you will quickly notice that those systems, although built 3x for the purpose of still being redundant during maintenance periods, are heavily safeguarded and automated. While technicians often have to replace power elements, maintenance access is highly restricted, with unsafe functions tiered behind physical restrictions. A common example of a safeguarded function is the automatic denial of an input command that would shift electrical load onto lines beyond their designed capacity, which could happen by mistake if the technician made a bad assumption (e.g. the load-sharing line is assumed up while it's down) or if the assumption became violated since the last check (e.g. the load-sharing line was up when checked, then went down milliseconds before the input).
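
A minimal sketch of the weighted-failover idea above, assuming a hypothetical DNS provider client (the DnsClient class and its upsert_weighted_cname method are invented for illustration); the record names, weights, and health-check URL are placeholders:

    import urllib.request

    class DnsClient:
        """Stand-in for a managed DNS provider's API client (hypothetical)."""
        def upsert_weighted_cname(self, name, target, weight, ttl):
            print(f"{name} -> {target} weight={weight} ttl={ttl}")

    def cdn_healthy(probe_url="https://www.example.com/health", timeout=3):
        # Probe the site *through* the CDN; any error counts as unhealthy.
        try:
            with urllib.request.urlopen(probe_url, timeout=timeout) as resp:
                return resp.status == 200
        except Exception:
            return False

    def set_weights(dns, cdn_weight, origin_weight):
        # Short TTL so a weight flip propagates quickly.
        dns.upsert_weighted_cname("www.example.com", "example.cdn-provider.net",
                                  weight=cdn_weight, ttl=60)
        dns.upsert_weighted_cname("www.example.com", "origin-lb.example.internal",
                                  weight=origin_weight, ttl=60)

    if __name__ == "__main__":
        dns = DnsClient()
        if cdn_healthy():
            set_weights(dns, cdn_weight=99, origin_weight=1)   # steady state
        else:
            set_weights(dns, cdn_weight=0, origin_weight=100)  # bypass the CDN

As the comment notes, the 100%-origin path only helps if the backends are load-tested regularly so they can actually absorb traffic without the CDN in front.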

redm almost 5 years ago
CloudFlare is a good company and everyone has outages. IMHO the post-mortems they post are not only some of the best I've read from a big company, but they are also produced quickly.

I only wish they could update cloudflarestatus.com more quickly. Shouldn't there be some mechanism to update it immediately when there is an incident? When the entire internet knows you're down and your status page says "All Systems GO!", it reflects very poorly on them.
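
A rough sketch of the kind of mechanism the comment asks for: an external watchdog that probes a few endpoints from outside the network and opens an incident automatically when several fail at once. The open_incident function is a stand-in for whatever status-page API is actually used (hypothetical here), and the probe URLs are just examples:

    import urllib.request

    PROBES = [
        "https://www.cloudflarestatus.com/",    # the status page itself
        "https://one.one.one.one/",             # served via the 1.1.1.1 resolver network
        "https://example-customer-site.com/",   # a representative proxied site
    ]

    def probe(url, timeout=5):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except Exception:
            return False

    def open_incident(message):
        # Stand-in for a real status-page API call (e.g. an authenticated POST).
        print("INCIDENT:", message)

    if __name__ == "__main__":
        failures = [url for url in PROBES if not probe(url)]
        # Only open an incident when failures are widespread, not a single blip.
        if len(failures) >= 2:
            open_incident("multiple external probes failing: " + ", ".join(failures))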

eastdakota almost 5 years ago
Here's our blog post detailing what happened today and the mitigations we've put in place to ensure it doesn't happen again in the future: https://blog.cloudflare.com/cloudflare-outage-on-july-17-2020/

katzgrau almost 5 years ago
I was on a call with an investor and an employee mouthed to me, silently, "everything is down!"

Immediate hot flash.

After I got off the call (thank god he had an appointment to run to), I checked it out. Our internal dashboards were all green so we realized it was a DNS issue pretty quickly.

Since we couldn't get into Cloudflare we searched Twitter and realized it was their issue and I stopped worrying.

One of the benefits of CF and other major internet vendors is that when they're down, you can kind of shrug it off and tell your customers to wait a bit. Not so if you're using a smaller/unknown CDN company.

cflewis almost 5 years ago
Compare this to Facebook's SDK “postmortem” and you can tell which company cares more about its customers.

combatentropy almost 5 years ago
It seems these major infrastructure outages always come from a configuration change. I remember Google had a catastrophic outage a few years ago, and the postmortem said it all began as a benign configuration update that snowballed worldwide. In fact, I tried googling for it and found the postmortem of a more recent outage, also from a configuration change.

Some seasoned sysadmin will say to me, "Of course it's always from a configuration change. What else could it be?" I don't know, it seems like there are other possible causes. But in today's superautomated infrastructures, maybe config files are the last soft spot.

Rapzid almost 5 years ago
> We saw traffic drop by about 50% across our network. Because of the architecture of our backbone this outage didn’t affect the entire Cloudflare network and was localized to certain geographies.

I'm not even sure... Is that second sentence supposed to signal some sort of success? Dropping 50% of your traffic isn't isolated. If you're gonna try to spin it, at least bury the damn lede. Further:

> The affected locations were San Jose, Dallas, Seattle, Los Angeles, Chicago, Washington, DC, Richmond, Newark, Atlanta, London, Amsterdam, Frankfurt, Paris, Stockholm, Moscow, St. Petersburg, São Paulo, Curitiba, and Porto Alegre. Other locations continued to operate normally.

Those are the locations with THEIR equipment, but certainly not all "affected" locations. I live 4 hours from Dallas and can assure you that I was impacted. That coverage is, like, most of the United States, Europe, Brazil, and who knows how much of South America? Oh right, 50% of their traffic!

spenczar5 almost 5 years ago
Wow, BGP brings down globally-used DNS. It’s like a perfect lesson in weak points of the modern web’s accidental design.

rexarex almost 5 years ago
Network Engineering: Invisible when you’re killing it at your job, instantly the enemy when you make a mistake.

dahfizz almost 5 years ago
We need to do something about BGP.

Just in the past year Verizon, IBM, Apple, and now Cloudflare have seen outages from BGP misconfiguration. The Verizon issue took down a significant part of the internet.

BGP is a liability to society. We need something that doesn't constantly cause widespread outages.

simonswords82 almost 5 years ago
In the last three years we have hosted our enterprise software on Azure, and the only outages we've had have been caused by mistakes or issues at Cloudflare. Azure has been rock solid, but our customers don't understand that and assume that we're "just down", which impacts our SLAs.

During the most recent outage a few weeks ago, Azure were available to discuss the issue by phone. I wish I could say the same for Cloudflare.

I would be interested to hear from anybody who knows of a good alternative to Cloudflare. I'm completely fed up with them.

bogomipz almost 5 years ago
> "We are making the following changes:
>
> Introduce a maximum-prefix limit on our backbone BGP sessions - this would have shut down the backbone in Atlanta, but our network is built to function properly without a backbone. This change will be deployed on Monday, July 20.
>
> Change the BGP local-preference for local server routes. This change will prevent a single location from attracting other locations’ traffic in a similar manner. This change has been deployed following the incident."

It should be noted that configuring prefix limits for your BGP peers is kind of BGP 101. It's mentioned in every "BGP Best Practices" type document [1]. It's there for exactly this purpose: to prevent router meltdown and resource exhaustion. For a company that blows their horn as much as these folks do about their network, this is embarrassing.

I think it's worth mentioning that it was this time last year when Verizon bungled their own BGP configuration and brought down parts of the internet. When that incident occurred, Cloudflare's CEO was front and center excoriating them for accepting routes without basic filtering [2]. This is the exact same class of misconfiguration that befell them yesterday.

[1] https://team-cymru.com/community-services/templates/secure-bgp-template/

[2] https://twitter.com/eastdakota/status/1143182575680143361?lang=en

rob-olmos almost 5 years ago
Commented on the original outage: Hopefully with this outage Cloudflare will provide non-Enterprise plans a CNAME record, allowing us to not use Cloudflare DNS and more quickly bypass Cloudflare if the need arises.

iso947 almost 5 years ago
2 years ago, about 1 in 3 people in the UK were watching England in the World Cup.

Towards the end of the game, a CDN the BBC used crashed, taking a million people’s live streams offline.

Traditional TV with its 20 million plus viewers worked fine.

A 15-minute global outage during the World Cup or Super Bowl is not acceptable in the world of boring old TV.

Meanwhile, GitHub has been down how many times this year?

IT is still a terrible industry of cowboys. It’s just hidden under the veneer of abstraction, microservices and outsourcing. Other industries like the national grid or water or radio have outages that affect a local geographic area or a limited number of people, but they are far more distributed than the modern internet. It’s ironic that a network designed to survive nuclear war can’t survive a typo.

https://m.huffingtonpost.co.uk/entry/bbc-iplayer-crashes-in-final-minutes-of-england-v-sweden-in-the-world-cup_uk_5b40e196e4b09e4a8b2d88fc

maxdo almost 5 years ago
What is funny is that if you go to Cloudflare's customers page and check all these companies' status pages, they're all down, yet none of them admits it's due to a third-party cloud provider, e.g. Cloudflare. In most cases it was a "performance issue". It's so silly... you're in an interconnected world. It's OK that your major cloud providers went down...

itsjloh almost 5 years ago
Why did it take so long for a status page to be published, I wonder?

From the timeline in the blog post, the issue with Atlanta was fixed between 21:39 and 21:47, but a status page wasn't published until 21:37. Everything had been broken for over 20 minutes at that stage, with lots of people already posting about it or other status pages reflecting issues. See https://twitter.com/OhNoItsFusl/status/1284239769548005376 or https://twitter.com/npmstatus/status/1284235702540984321

Without an accurate status page, it leaves businesses pointing the finger everywhere, wondering whether it's their hosting provider having issues, their CDN, their DNS provider, etc.

superkuh almost 5 years ago
You mean the service that everyone is centralizing on caused problems because everyone centralized on it? Pikachu shock face. If you're a web or network dev, act responsibly. Don't just pick Cloudflare because everyone else does. Don't pick Cloudflare *because* everyone else does.

eric_khun almost 5 years ago
I got the same issue today when I was on call. It took me an hour to figure out it was Cloudflare.

I'm currently working on a project to monitor the whole third-party stack you use for your applications. Hit me up if you want access; I'll give free access for a year+ to some folks to get feedback.

iJohnDoe almost 5 years ago
Cloudflare is experiencing a bit of karma and a bit of Murphy’s Law.

They slung some mud not long ago and now it came back to bite them. They were a bit righteous about their reliability. However, anyone in this game long enough knows it’s only a matter of time before shit goes down. If they didn’t have any graybeards over there to tell them this, then hopefully they earned some gray.

Stuff was down long enough for Google's and OpenDNS's caches to expire, and to take down DigitalOcean in some respects.

Thankfully CF can afford to learn and make improvements for the future. Not all organizations are that lucky.

devy almost 5 years ago
This is not the first time human error in a BGP routing configuration has taken down a significant portion of the Internet. Is there any kind of configuration validator that can be implemented to prevent and catch this type of error? I am fairly sure this won't be the last time we hear about human error in a BGP routing config causing an Internet outage.

Or is BGP intrinsically an unsafe protocol, without built-in protections against this sort of human mistake?
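
A minimal sketch of the kind of pre-commit validator the comment asks about, assuming the candidate change has already been parsed into a plain Python dictionary; the field names and thresholds are invented for illustration and don't correspond to any vendor's actual schema:

    # Illustrative pre-commit lint for a parsed BGP session config.
    # The dict schema and limits below are made up for this sketch.
    def lint_bgp_session(session):
        problems = []

        # Catch the failure mode from this incident: no cap on accepted prefixes.
        if "prefix_limit" not in session:
            problems.append("no maximum-prefix limit configured")
        elif session["prefix_limit"] > 250_000:
            problems.append("prefix limit %d looks effectively unbounded"
                            % session["prefix_limit"])

        # Don't let one site's routes win over every other site's local routes.
        if session.get("local_preference", 100) > 200:
            problems.append("local-preference high enough to attract global traffic")

        return problems

    if __name__ == "__main__":
        candidate = {"peer": "backbone-atl01", "local_preference": 300}
        for issue in lint_bgp_session(candidate):
            print("REJECT:", issue)

In practice a check like this would run in CI or in the deployment tooling, before any change reaches a production router.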

apaprocki almost 5 years ago
It would be interesting to classify network outages and determine the number that involved practices that would be obviated by a standard VCS / release process like those found in software. Routers/firewalls seem to be a particular pain point everywhere.

jlgaddis almost 5 years ago
I guess it's easy to become complacent when you're a networking expert at Cloudflare and likely making several of these ad-hoc, on-the-fly config changes every single day, but it's always good to remember why Juniper introduced the automatic rollback feature.

Of course, this particular outage would not have been *prevented* even if they had used

    # commit confirmed

as it can't stop you from screwing up, but it almost certainly would have limited the duration of the outage to ~10 minutes (plus a minute or two, perhaps, for the network to recover after the rollback) -- and it could have been shorter than that had they used "commit confirmed 3", for example.

Even as a lowly network engineer working on networks much, much smaller than Cloudflare's, for pretty much my entire professional career it's been my standard practice to start off pretty much *ANY* change -- no matter how trivial -- with

    # commit confirmed 3

or

    # reload in 3

or similar, depending on what type of gear I was working on (and, of course, assuming the vendor supported such a feature).

This applies even when making changes that are so "simple" that they just "can't" go wrong or have any unexpected or unintended effects.

In fact, it applies *ESPECIALLY* in those cases! It's when you let your guard down that you'll get hit.

---

Fortunately, all that was necessary (I assume) to recover in this case was to

    # rollback

to the previous configuration. Then, the correct configuration could be made. That still had to be done manually, however, and resulted in a 27-minute outage instead of what could have been a 5- or 10-minute outage.

I would hope that Cloudflare has full out-of-band access to all of their gear and is able to easily recover from mistakes like this. If they had lost access to the Atlanta router and weren't able to log in and revert the configuration manually, this outage could have lasted much, much longer.
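
As a generic illustration of the commit-confirmed pattern described above (apply a change, then roll it back automatically unless it is confirmed within a time window), here is a sketch in Python; the apply_change and rollback callables stand in for whatever device API is actually used and are purely hypothetical:

    import threading

    def commit_confirmed(apply_change, rollback, confirm_window_s=180):
        """Apply a change, then auto-roll it back unless confirmed in time."""
        apply_change()
        timer = threading.Timer(confirm_window_s, rollback)
        timer.start()

        def confirm():
            # Operator verified the network is still healthy; keep the change.
            timer.cancel()

        return confirm

    if __name__ == "__main__":
        confirm = commit_confirmed(
            apply_change=lambda: print("candidate config committed"),
            rollback=lambda: print("no confirmation received -- rolling back"),
            confirm_window_s=3,
        )
        # If the device is still reachable after the change, call confirm();
        # otherwise the timer fires and the change is undone automatically.

The value of the pattern is exactly what the comment describes: a fat-fingered change that cuts off your own access to the device undoes itself.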

mathattack almost 5 years ago
Did this sink any eCommerce websites?

jitbit almost 5 years ago
Everyone has outages; CloudFlare is a decent company making a good product.

BUT

What's interesting here is that so many non-CloudFlare services went down (including, partially, even AWS) because of the DNS outage, since every sysadmin and his mom uses 1.1.1.1 as their DNS.
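
A small sketch of the obvious mitigation: don't hard-wire a single provider's resolver. This assumes the third-party dnspython package (pip install dnspython); the resolver list and domain are just examples:

    import dns.resolver  # third-party: dnspython

    # Resolvers from *different* providers, so one provider's outage isn't fatal.
    RESOLVERS = ["1.1.1.1", "8.8.8.8", "9.9.9.9"]

    def resolve_with_fallback(name):
        last_error = None
        for server in RESOLVERS:
            resolver = dns.resolver.Resolver(configure=False)
            resolver.nameservers = [server]
            resolver.lifetime = 2  # seconds before moving on to the next provider
            try:
                return [rr.to_text() for rr in resolver.resolve(name, "A")]
            except Exception as exc:
                last_error = exc
        raise RuntimeError("all resolvers failed for %s" % name) from last_error

    if __name__ == "__main__":
        print(resolve_with_fallback("example.com"))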

sm2i almost 5 years ago
Looks like Tesuto was a good buy.

https://investors.fastly.com/news/news-details/2020/Fastly-Achieves-100-Tbps-of-Edge-Capacity-Milestone/default.aspx:

> By emulating networks at scale, Tesuto’s technology can be used to create sandbox environments that simulate the entire Fastly network, providing a view into the potential impact of a deployment cycle before it is put into production.

kissgyorgy almost 5 years ago
It's fine that they made mistakes and there was an outage; shit happens to everybody.

What is really scary, though, is that half of the internet stopped working. That's not OK!

bamboozled almost 5 years ago
It's amazing to me that in this situation, humans still had to intervene and update a configuration file.

I'm surprised stuff like this doesn't happen more often, and that there isn't at least a well-tested, automated remediation step in place which also validates the change prior to going live.

I get that they may be busy solving other issues, but it's interesting that this isn't a more foolproof procedure given the huge impact a mistake can have.

microcolonel almost 5 years ago
I wish they would go into why the rather complete outage was not visible on cloudflarestatus.com. I fully understand that mistakes can be made, but I'm really not pleased with how hard it was to tell whether I was experiencing a localized issue. During the entire outage, cloudflarestatus.com displayed "all systems operational" for me, once I accessed it with a functioning DNS resolver.

tristor almost 5 years ago
Fun, I got hit by this since I use cloudflared behind my Pi-Hole. I was able to troubleshoot the issue, localized the cause to Cloudflare, found the partial outage in various regions, assumed I was affected, and switched to using Level 3 DNS temporarily. I'm glad to see it's back up, and this is a great retrospective.

Fritsdehacker almost 5 years ago
I don't understand why Cloudflare is making direct configuration changes to a router like this. If these are changes that are made regularly, why not use a tool to make them? You can then ensure that only certain changes are possible, preventing simple mistakes like this.

MayeulC almost 5 years ago
Hmm, duckduckgo and qwant also seem to have some trouble, time to head for https://searx.space/ I guess?

cosud almost 5 years ago
Fascinating to see that airport codes are used as datacenter names. I know at least one other company that does that, but I thought it was something peculiar to them.

badrabbit almost 5 years ago
Is there any work being done to replace BGP or the current IGPs? Wondering if modern computing, memory capacity, and algorithms could be used to make more fail-safe protocols.

kj4ips almost 5 years ago
Random question:

The Cloudflare resolvers definitely went down (1.1.1.1 and 1.0.0.1); do we know if authoritative DNS did?

H8crilA almost 5 years ago
Large portion of some network is down? Oh, right, it's BGP. It's always BGP.

exabrial almost 5 years ago
What configuration error? Was this human or automatic? What was done to mitigate?

m3kw9 almost 5 years ago
It shows that the system is pretty fragile.