Cloudflare outage caused by bad software deploy

348 points by TomAnthony, nearly 6 years ago

28 comments

aleem, nearly 6 years ago

If a single regex can take down the Internet for half an hour, that's definitely not good -- especially for a class of errors that can be easily prevented, tested, etc.

The timing is unfortunate too, after calling out Verizon for lack of due process and negligence.

I'm sure they have an undo or rollback for deployments, but it's probably worth investing in further.

They also need to resolve the catch-22 where people could not log in and disable the Cloudflare proxy ("orange cloud") since cloudflare.com itself was down.

neom, nearly 6 years ago

I'm always beyond impressed with how responsive and transparent CF is with incident and post-mortem communication. Given who the CEO and COO are, I suppose this shouldn't be surprising; nevertheless, as a customer it builds a great deal of trust. Kudos.

eganist, nearly 6 years ago

Kinda wonder at this point what findings exist on their Availability SOC 2, assuming they've gotten one.

The repeated outages plus the constant malicious advertising by scammy ad providers through Cloudflare are slowly turning me off the service as a potential enterprise customer. Unfortunate too, since plenty of superlatively qualified people build great things there (hat tip to Nick Sullivan), but it seems like the build-fast culture may now be impeding the availability requirements of their clients.

This is also a great example of a case where SLAs are meaningless without rigorous enforcement provisions negotiated in by enterprise clients. Cloudflare advertises 100% uptime (https://www.cloudflare.com/business-sla/) but every time they fall over, they're down for what, an hour at a time? Just this one issue would've blown anyone else's 99.99% SLA out of the water -- https://www.cloudflarestatus.com/incidents/tx4pgxs6zxdr

I love the service, but if I'm to consider consuming the service, they'd do well to have the equivalent of a long-term servicing branch as its own isolated environment, one where changes are only merged in once they've proven to be hyper-stable.

ti_ranger, nearly 6 years ago

> We make software deployments constantly across the network and have automated systems to run test suites and a procedure for deploying progressively to prevent incidents.

Good.

> Unfortunately, these WAF rules were deployed globally in one go and caused today’s outage.

Wow. This seems like a very immature operational stance. Any deployment of any kind should be subject to the minimum deployment safety that they claim they have.

> At 1402 UTC we understood what was happening and decided to issue a ‘global kill’ on the WAF Managed Rulesets, which instantly dropped CPU back to normal and restored traffic. That occurred at 1409 UTC.

Many large companies would have had automatic rollback of this kind of change in less time than it took Cloudflare to (apparently) have humans decide to roll back, and possibly before a single (usually not global) deployment had actually completed on all hosts/instances.

However, what is more concerning is that it seems you shouldn't rely on Cloudflare's "WAF Managed Rulesets" at all, since they seem to be willing to turn it off instead of correctly rolling back a bad deployment, which they only did more than 43 minutes later:

> We then went on to review the offending pull request, roll back the specific rules, test the change to ensure that we were 100% certain that we had the correct fix, and re-enabled the WAF Managed Rulesets at 1452 UTC.

How were they not able to trivially roll back to the previous deployment?
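For readers unfamiliar with the progressive deployment and automatic rollback discussed in this comment, here is a minimal sketch of the idea. It is not Cloudflare's deployment system; the stage fractions, the `healthy` signal, and every function name are hypothetical and chosen only for illustration.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[float],
    apply_to_fraction: Callable[[float], None],
    healthy: Callable[[], bool],
    rollback: Callable[[], None],
) -> bool:
    """Deploy stage by stage; roll back and stop at the first unhealthy stage.

    All four arguments are hypothetical hooks supplied by the caller.
    Returns True only if every stage was applied and stayed healthy.
    """
    for fraction in stages:
        apply_to_fraction(fraction)  # push the change to this share of hosts
        if not healthy():            # e.g. a CPU or error-rate check
            rollback()               # automatic, no human decision needed
            return False
    return True


if __name__ == "__main__":
    state = {"fraction": 0.0}

    def apply_to_fraction(fraction: float) -> None:
        state["fraction"] = fraction

    def healthy() -> bool:
        # Assumption for the demo: the change misbehaves once it reaches
        # more than 1% of hosts.
        return state["fraction"] <= 0.01

    def rollback() -> None:
        state["fraction"] = 0.0

    ok = staged_rollout([0.01, 0.10, 0.50, 1.0], apply_to_fraction, healthy, rollback)
    print("rollout succeeded:", ok, "final fraction:", state["fraction"])
```

The point of the design is that a bad change is stopped at the first stage where the health check fails, so a global deployment never happens unless the smaller stages were clean.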

grey-area, nearly 6 years ago

I really want to know the regexp and corresponding input(s) which killed the internet now :) Was it just aaaaaaaaaaaah?

https://swtch.com/~rsc/regexp/regexp1.html

rob-olmos, nearly 6 years ago

For the size and importance of Cloudflare, some insight into a couple of questions would be nice:

1. Why are WAF rules not progressively deployed, since there's already a system to do so?
2. Maybe there should also be a testing environment that receives a mirror of production traffic before deployments reach real users?

(I understand the WAF change was not set to take action, but a separate environment would be less likely to affect production.)

souterrain, nearly 6 years ago

Cloudflare should write a guide to doing post-event communication. Or perhaps they shouldn’t, as this seems to be a potential differentiator.

This is direct and doesn’t attempt to avoid blame. Well done.

lgats, nearly 6 years ago

> At 1402 UTC we understood what was happening and decided to issue a ‘global kill’ on the WAF Managed Rulesets, which instantly dropped CPU back to normal and restored traffic. That occurred at 1409 UTC.

So for about 50 minutes, those who relied on the WAF were open to attack?

txcwpalpha, nearly 6 years ago

Peculiar that eastdakota (Cloudflare's CEO) doesn't seem to be tweeting at the Cloudflare team responsible for this, telling them they should be ashamed and are guilty of malpractice.

When it was Verizon that took down the internet he felt it was appropriate to do that to the Verizon teams, after all.

edit: right after posting this comment, he did tweet the following: https://twitter.com/eastdakota/status/1146196836035620864

> I'd say both we and Verizon deserve to be ashamed.

As well as this: https://twitter.com/eastdakota/status/1146170209780113408

> Our team should be and is ashamed. And we deserve criticism. ...

I still don't think that publicly shaming anyone is a good leadership style, nor is it a good way to motivate people to perform better in the future, but kudos for the self-awareness, at least.

peterwwillis, nearly 6 years ago

How to implement a multi-CDN strategy (streamroot.io): https://news.ycombinator.com/item?id=18399523

Etsy implementing multiple CDNs (7 years ago, the CDNcontrol project looks abandoned): https://speakerdeck.com/ickymettle/integrating-multiple-cdn-providers-our-experience-at-etsy https://dyn.com/blog/speaking-with-etsy-about-multi-cdns-and-dns/

Basically: you can try to keep a low DNS TTL, but it'll mean more DNS traffic, and 5-10% of traffic takes forever to cut over because nobody respects TTL. Worst case you have just as much downtime as before; best case most of your traffic is recovered in a few minutes.

cfors, nearly 6 years ago

I wonder what happened with that poor regex.

My thoughts are immediately shifting to one of my favorite articles of all time, "Regular Expression Matching can be Simple and Fast..." [0]

[0] https://swtch.com/~rsc/regexp/regexp1.html

UI_at_80x24, nearly 6 years ago

Can anybody suggest a Systems Engineer-centric forum/site? (Not Windows 'help I can't print' level, more DataCenter grade.)

HN does have some great content/replies that touch on these topics, but I'd like something more.

sequoia, nearly 6 years ago

What sort of regular expression pitfalls can cause this sort of CPU utilization? I know they're possible but I am curious about specific examples of something similar to what caused Cloudflare's issue here.
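One commonly cited pitfall is catastrophic backtracking caused by nested quantifiers. The sketch below is a textbook illustration using Python's `re` module, not the expression involved in the Cloudflare incident (which this thread does not show). The patterns `RISKY` and `SAFE` and the helper `time_match` are made up for the demo.

```python
import re
import time

# Nested quantifiers: the inner a+ and the outer (...)+ can split a run of
# "a" characters in exponentially many ways, and a backtracking engine tries
# them all before it can report a failed match.
RISKY = re.compile(r"^(a+)+$")

# Matches exactly the same strings, but there is only one way to match each.
SAFE = re.compile(r"^a+$")


def time_match(pattern: "re.Pattern[str]", text: str):
    """Return (matched, seconds) for a single match attempt."""
    start = time.perf_counter()
    matched = pattern.match(text) is not None
    return matched, time.perf_counter() - start


if __name__ == "__main__":
    for n in (8, 12, 16, 20):
        text = "a" * n + "b"  # the trailing "b" forces the match to fail
        _, risky_s = time_match(RISKY, text)
        _, safe_s = time_match(SAFE, text)
        print(f"n={n:2d}  risky={risky_s:.5f}s  safe={safe_s:.6f}s")
```

On a backtracking engine such as Python's, the time for `RISKY` roughly doubles with each extra "a", while `SAFE` stays flat. Run against every request on every host, a pattern of this kind can plausibly saturate CPU.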

djhworld, nearly 6 years ago

A good war story there; at least the root cause was relatively simple and quick to identify, rather than something deeper.

Would be interested to see what the gnarly regex was that was bombing their CPUs so hard!

mrzasa, nearly 6 years ago

Shameless plug: understanding regex engine implementation can help with avoiding performance pitfalls: https://medium.com/textmaster-engineering/performance-of-regular-expressions-81371f569698

ksara, nearly 6 years ago

> "It doesn't cost a provider like Verizon anything to have such limits in place. And there's no good reason, other than sloppiness or laziness, that they wouldn't have such limits in place." [1]

[1] https://blog.cloudflare.com/how-verizon-and-a-bgp-optimizer-knocked-large-parts-of-the-internet-offline-today/

nodesocket, nearly 6 years ago

It is interesting that NGINX returned 502 nearly instantly under very heavy CPU load. I would have expected requests to just hang or time out.

BentFranklin, nearly 6 years ago

Kind of funny that it was a regexp.

gist, nearly 6 years ago

Nothing like having what should be a world-class company falling prey to the same type of screw-ups that plague 'the local guy maintaining some WordPress site on a shared server'.

Separately, there is nothing that says that a company like Cloudflare has to air their dirty laundry (as the saying goes). The vast majority of 'customers' really don't care why something happened at all or the reason. All they know is that they don't have service.

Pretend that the local electric company had a power outage (and it wasn't caused by some obvious weather event). Does it really matter if they tell people that 'some hardware we deployed failed and we are making sure it never happens again'? I know tech thinks they are great for these types of post-mortems, but the truth is only tech people really care to hear them. (And guess what, all it probably means is that that issue won't happen again...)

rubyn00bie, nearly 6 years ago

This line kills me:

> We were seeing an unprecedented CPU exhaustion event, which was novel for us as we had not experienced global CPU exhaustion before.

I'd imagine it was quite novel for most anyone affected /s

quickthrower2, nearly 6 years ago

Probably for the kind of work they are doing they should avoid regex? Or at least the very complicated modern regex (simple automata that you can compile in advance might be OK).
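To make the "automata compiled in advance" idea concrete, here is a minimal sketch of a hand-built deterministic finite automaton (DFA) for the language "one or more `a`". The table, state names, and `dfa_match` function are illustrative assumptions, not taken from any real WAF or regex library.

```python
# States of a tiny DFA that accepts one or more "a" characters.
START, SEEN_A, DEAD = 0, 1, 2

# Transition table built ahead of time; any missing entry leads to DEAD.
TRANSITIONS = {
    (START, "a"): SEEN_A,
    (SEEN_A, "a"): SEEN_A,
}
ACCEPTING = {SEEN_A}


def dfa_match(text: str) -> bool:
    """Return True if the whole text is accepted by the DFA.

    Each character costs one table lookup, so the running time is linear in
    len(text) and there is nothing to backtrack over.
    """
    state = START
    for ch in text:
        state = TRANSITIONS.get((state, ch), DEAD)
        if state == DEAD:
            return False  # no transition can ever leave DEAD
    return state in ACCEPTING


if __name__ == "__main__":
    print(dfa_match("aaaa"))             # accepted input
    print(dfa_match("a" * 100000 + "b")) # rejected after one linear pass
```

The trade-off is expressiveness: features such as backreferences cannot be expressed as a plain finite automaton, which is part of why backtracking engines exist.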

bdibs, nearly 6 years ago

Seems like a terrible idea to deploy changes to such a vital piece of their software GLOBALLY without some sort of rollout procedure.

tolgahanuzun, nearly 6 years ago

It is very difficult to explain this to customers who don't understand technology. 30 minutes is a very long time. :/

suchow, nearly 6 years ago

Is there a usage error in the first sentence or has English lost the blog / blog post distinction?

tomcam, nearly 6 years ago

I run a service placing bids in the last few seconds on eBay. Every time this happens I lose measurable business (we place thousands of bids per day). While it doesn’t affect scheduled bids, customers can’t place bids and are likely to move to a competitor. These recent outages have been costly.

Does anyone know a more reliable provider?

BuddhaSource, nearly 6 years ago

Do builds go through a staged rollout, for a service like Cloudflare?

pearapps, nearly 6 years ago

No way!?!?!??!?!?!?!??!?!??!?!?!

rodgerd, nearly 6 years ago

Karma for shitting on Verizon, maybe.