
Details of yesterday's Bunny CDN outage

175 points by aSig almost 4 years ago

24 comments

ram_rar almost 4 years ago
> On June 22nd at 8:25 AM UTC, we released a new update designed to reduce the download size of the optimization database. Unfortunately, this managed to upload a corrupted file to the Edge Storage.

I wonder if simple checksum verification of the file would have helped avoid this outage altogether.

> Turns out, the corrupted file caused the BinaryPack serialization library to immediately execute itself with a stack overflow exception, bypassing any exception handling and just exiting the process. Within minutes, our global DNS server fleet of close to a 100 servers was practically dead

This is exactly why one needs canary-based deployments. I have seen umpteen issues caught in canary, which has saved my team tons of firefighting time.
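To make the checksum suggestion concrete, here is a minimal sketch in Python; the function names and the source of the expected digest are hypothetical, and (as Bunny notes further down the thread) this only catches corruption introduced after the file was generated:

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file so a large database does not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def load_optimization_db(path: str, expected_sha256: str) -> bytes:
    """Refuse to load the database unless it matches the published digest."""
    actual = sha256_of(path)
    if actual != expected_sha256:
        raise ValueError(f"checksum mismatch: expected {expected_sha256}, got {actual}")
    with open(path, "rb") as f:
        return f.read()
```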
YetAnotherNick almost 4 years ago
They are making it sound like they did everything right and it was an issue with a third-party library. If we listed all the libraries our code depends on, it would be in the 1000s. I can't comprehend how a CDN does not have any canary or staging setup, and how a single update could send everything haywire in seconds. I think it is standard practice at any decent-sized company to have staging/canary and rollbacks.
string almost 4 years ago
Good and clear explanation. This is a risk you take when you use a CDN; I still think the benefits outweigh the occasional downtime. I'm a big fan of BunnyCDN, they've saved me a lot of money over the past few years.

I'm sure I'd be fuming if I worked at some multi-million-dollar company, but as someone who mainly works for smaller businesses it's not the end of the world. I suspect most of my clients haven't even noticed yet.
foobarbazetc almost 4 years ago
I like how something counts as "auto-healing" when it just has `Restart=on-failure` in systemd.

Anyway, it's always DNS. Always.

"Unfortunately, that allowed something as simple as a corrupted file to crash down multiple layers of redundancy with no real way of bringing things back up."

You can spend many, many millions of $ on multi-AZ Kubernetes microservices blah blah blah and it'll still be taken down by a SPOF, which, 99% of the time, is DNS.

Actual redundancy, as opposed to "redundancy", is extremely difficult to achieve because the incremental cost of one more 9 is almost exponential.

And then a customer updates their configuration and your entire global service goes down for hours, à la Fastly.

Or a single corrupt file crashes your entire service.
zamalek almost 4 years ago
This brings up one of my pet peeves: recursion. Of course there should have been other mitigations in place, but recursion is *such* a dangerous tool. As far as reasonably possible, I consider its only real purpose to be confusing students in 101 courses.

I assume that they are using .NET, as stack overflow exceptions bring down .NET processes. While that sounds like a strange implementation detail, the philosophy of the .NET team has always been "how do you reasonably recover from a stack overflow?" Even in C++, what happens if, for example, the allocator experiences a stack overflow while deallocating some RAII resource, or a finally block calls a function and allocates stack space, or... you get the idea.

The obvious thing to do here would be to limit recursion in the library (which amounts to safe recursion usage). BinaryPack does not have a recursion-limit option, which makes it unsafe for any untrusted data (and that can include data that you produce yourself, as Bunny experienced). Time to open a PR, I guess.

This applies to JSON, too. I would suggest that OP configure their serializer with a limit:

[1]: https://www.newtonsoft.com/json/help/html/MaxDepth.htm
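The MaxDepth link above is specific to Newtonsoft.Json on .NET. As a language-neutral illustration of the same idea, here is a minimal Python sketch that rejects untrusted data nesting deeper than a fixed limit instead of letting recursion run away; the 64-level cap is arbitrary:

```python
import json

MAX_DEPTH = 64  # arbitrary cap; pick something generous for legitimate payloads

def check_depth(node, depth: int = 0) -> None:
    """Raise a clean error instead of blowing the stack when data nests too deeply.

    The explicit cap is what makes this recursion safe: it can never go more
    than MAX_DEPTH frames deep.
    """
    if depth > MAX_DEPTH:
        raise ValueError(f"payload exceeds maximum nesting depth of {MAX_DEPTH}")
    if isinstance(node, dict):
        for value in node.values():
            check_depth(value, depth + 1)
    elif isinstance(node, list):
        for item in node:
            check_depth(item, depth + 1)

def parse_untrusted(text: str):
    data = json.loads(text)
    check_depth(data)  # enforce our own, much stricter nesting policy on the result
    return data
```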
patrickbolle almost 4 years ago
Great write-up. I've just switched from Cloudinary to Backblaze B2 + Bunny CDN, and I am saving a pretty ridiculous amount of money hosting thousands of customer images.

Bunny has a great interface and service; I'm really surprised how few people know about it. I think I discovered it on some 'top 10 CDNs' list that I usually ignore, but the pricing was too good to pass up.

The team is really on the ball from what I've seen. Appreciate the descriptive post, folks!
nathanganser almost 4 years ago
I'm impressed by the transparency and clarity of their explanation! It definitely makes me want to use their solution, even though they messed up big time!
zerop almost 4 years ago
On a different note, this outage news will give them more publicity than the product itself, I believe...
dejangp almost 4 years ago
Dejan here from bunny.net. I was reading some of the comments but wasn't sure where to reply, so I guess I'll post some additional details here. I tried to keep the blog post somewhat technical but not overwhelm non-technical readers.

So to add some details: we already use multiple deployment groups (one for each DNS cluster). We always deploy each cluster separately to make sure we're not doing something destructive. Unfortunately, this deployment went to a system that we believed was not a critical part of the infrastructure (oh, look how wrong we were) and was not made redundant, since the rest of the code was supposed to handle it gracefully if this whole system was offline or broken.

It was not my intention to blame the library; obviously this was our own fault. But I must admit we did not expect a stack overflow out of it, which completely obliterated all of the servers immediately when the "non-critical" component got corrupted.

This piece of data is highly dynamic and is reprocessed every 30 seconds or so based on hundreds of thousands of metrics. Running a checksum would have done no good here, because the distributed file was perfectly fine. The issue happened when it was being generated, not distributed.

Now for the DNS itself, which is a critical part of our infrastructure.

We of course operate a staging environment with both automated testing and manual testing before things go live.

We also operate multiple deployment groups, so separate clusters are deployed first, before others go live, so we can catch issues.

We do the same for the CDN and always use canary testing if possible. We unfortunately never assumed this piece of software could cause all the DNS servers to stack overflow.

Obviously, as I mentioned, we are not perfect, but we are trying to improve on what happened. The biggest flaw we discovered was the reliance on our own infrastructure to handle our own infrastructure deployments.

We have code versioning and CI in place, as well as the option to do rollbacks as needed. If the issue had happened under normal circumstances, we would have had the ability to roll back all the software instantly and maybe experience a 2-5 minute downtime. Instead, we brought down the whole system like dominoes because it all relied on each other.

Migrating deployment services to third-party solutions is therefore our biggest fix at this point.

The reason we are moving away from BinaryPack is that it simply wasn't providing that much benefit. It was helpful, but it wasn't having a significant impact on the overall behavior, so we would rather stick with something that has worked fine for years without issues. As a small team, we don't have the time or resources to spend improving it at this point.

I'm somewhat exhausted after yesterday, so I hope this is not super unstructured, but I hope that answers some questions and doesn't create more of them :)

If I missed any suggestions or something that was unclear, please let me know. We're actively trying to improve all the processes to avoid similar situations in the future.
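Since the corruption happened at generation time, a transfer checksum would not have caught it. One complementary mitigation, shown here as a rough Python sketch (the `validate_artifact` module and the publish step are hypothetical, not bunny.net's actual pipeline), is to round-trip the freshly generated artifact in a throwaway process before publishing it, so a parse that dies with a stack overflow only kills that child process:

```python
import subprocess
import sys

def artifact_is_loadable(path: str, timeout_s: int = 30) -> bool:
    """Try to deserialize the freshly generated artifact in a throwaway process.

    A crash (stack overflow, segfault, unhandled exception) only kills the
    child, so the publishing pipeline can refuse to ship the file.
    """
    try:
        # 'validate_artifact' is a hypothetical module that loads the file the
        # same way the DNS servers would.
        result = subprocess.run(
            [sys.executable, "-m", "validate_artifact", path],
            timeout=timeout_s,
            capture_output=True,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

def publish_if_valid(path: str) -> None:
    if not artifact_is_loadable(path):
        raise RuntimeError(f"refusing to publish {path}: validation process failed")
    # push_to_edge_storage(path)  # hypothetical: whatever distributes the file
```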
slackerIII almost 4 years ago
Oh, this is a great writeup. I co-host a podcast on outages, and over and over we see cases where circular dependencies end up making recovery much harder. Also, not using a staged deployment is a recipe for disaster!

We just wrapped up the first season, but I'm going to put this on the list of episodes for the second season: https://downtimeproject.com.
corobo almost 4 years ago
I didn't notice, but I do appreciate the automatic SLA honouring, plus the letting me know.

Nice work, Bunny CDN.
manigandham almost 4 years ago
All this focus on redundancy should be replaced with a focus on recovery. Perfect availability is already impossible. For all practical uses, something that recovers within minutes is better than trying to always be online and failing horribly.
qaq almost 4 years ago
And thanks to the write-up making it to the top of HN, I (and probably many more people here) have now learned about the existence of Bunny CDN.
EricE almost 4 years ago
*It's not DNS*

*There's no way it's DNS*

*It was DNS*

One of the most bittersweet haikus for any sysadmin :p
lclarkmichalek almost 4 years ago
These follow-ups aren't super compelling, IMO.

> To do this, the first and smallest step will be to phase out the BinaryPack library and make sure we run a more extensive testing on any third-party libraries we work with in the future.

Sure. Not exactly a structural fix, but maybe worth doing. Another view would be that you've just "paid" a ton to find issues in the BinaryPack library, and maybe should continue to invest in it.

Also, "do more tests" isn't a follow-up. What's your process for testing these external libs, if you're making this a core part of your reliability effort?

> We are currently planning a complete migration of our internal APIs to a third-party independent service. This means if their system goes down, we lose the ability to do updates, but if our system goes down, we will have the ability to react quickly and reliably without being caught in a loop of collapsing infrastructure.

OK, now tell me how you're going to test it. Changing architectures is fine, but until you're running drills of core services going down, you don't actually know you've mitigated the "loop of collapsing infrastructure" issue.

> Finally, we are making the DNS system itself run a local copy of all backup data with automatic failure detection. This way we can add yet another layer of redundancy and make sure that no matter what happens, systems within bunny.net remain as independent from each other as possible and prevent a ripple effect when something goes wrong.

Additional redundancy isn't a great way of mitigating issues caused by a change being deployed. Being 10x redundant usually adds quite a lot of complexity, provides less safety than it seems (again, do you have a plan to regularly test that this failover mode is working?), and can be less effective than preventing issues getting to prod.

What would be nice to see is a full review of the detection, escalation, remediation, and prevention for this incident.

More specifically, the triggering event here, the release of a new version of software, isn't super novel. More discussion of follow-ups that are systematic improvements to the release process would be useful. Some options:

- Replay tests to detect issues before landing changes

- Canaries to detect issues before pushing to prod

- Gradual deployments to detect issues before they hit 100%

- Even better, isolated gradual deployments (i.e. deploy region by region, zone by zone) to mitigate the risk of issues spreading between regions (see the sketch after this comment)

Beyond that, start thinking about all the changing components of your product and their lifecycle. It sounds like here some data file got screwed up as it was changed. Do you stage those changes to your data files? Can you isolate regional deployments entirely, and control the rollout of new versions of this data file on a regional basis? Can you do the same for all other changes in your system?
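To make the "isolated gradual deployments" option concrete, here is a rough Python sketch of a zone-by-zone rollout loop that lets each zone bake, checks its health, and halts and rolls back on the first failure. The zone names and the deploy/health-check/rollback hooks are placeholders, not anyone's real tooling:

```python
import time

# Placeholder hooks: in a real system these would talk to your deploy tooling
# and monitoring; here they are stubs so the control flow is runnable.
def deploy_to_zone(zone: str, version: str) -> None:
    print(f"deploying {version} to {zone}")

def zone_is_healthy(zone: str) -> bool:
    print(f"checking error rates and DNS answers in {zone}")
    return True

def roll_back_zone(zone: str) -> None:
    print(f"rolling {zone} back to the previous known-good build")

ZONES = ["eu-frankfurt", "eu-london", "us-east", "us-west"]  # example names
BAKE_TIME_S = 600  # let each zone soak before touching the next one

def rollout(version: str) -> None:
    """Deploy one zone at a time; halt and roll back at the first failed check."""
    deployed = []
    for zone in ZONES:
        deploy_to_zone(zone, version)
        deployed.append(zone)
        time.sleep(BAKE_TIME_S)
        if not zone_is_healthy(zone):
            for done in reversed(deployed):
                roll_back_zone(done)
            raise RuntimeError(f"halted rollout of {version}: {zone} failed health checks")
```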
busymom0 almost 4 years ago
One of the comments on the post is:

> One thing you could do in future is to url redirect any BunnyCDN url back to the clients original url, in essence disabling the CDN and getting your clients own hosts do what they were doing before they connected to BunnyCDN, yes it means our sites won't be as fast but its better than not loading up the files at all. I wonder if that is possible in technical terms?

Isn't this a horrible idea? If you use Bunny, this would cause a major spike in traffic, and thus costs, at your origin server.
path411 almost 4 years ago
Sounds like they got really lucky that they could get it back up so quickly. They must have some very talented engineers working there.

My takeaways, though, are that they should have tested the update better. They should have their production environment more segmented, with staggered updates, so that disasters are much more contained. And they should have had much better catastrophic-failure plans in place.
debarshri almost 4 years ago
I can imagine how stressful the situation was, but it was a pleasure to read. It again goes to show that no matter how prepared or how optimized/over-optimized you want to be, there will always be a situation you never accounted for, and sh*t always hits the fan; that is the reality of IT ops.
iJohnDoe almost 4 years ago
Slightly off-topic, but has anyone else noticed higher latency with internet traffic going in or out of Germany? Just in general?

Frankfurt was mentioned in the post, and I immediately thought it would be a bad idea because I've always seen USA-to-Germany traffic have higher latency. Maybe within Europe it's fine.
xarope almost 4 years ago
I would think critical systems and updates should also have some form of out-of-band access channel?
TacticalCoder almost 4 years ago
Slightly off-topic, but what about the big outage from a few days/weeks ago where half the Internet was down (exaggerating only a little bit)? Has there been a postmortem that I missed?
ruuda almost 4 years ago
Not to say that additional mitigations are inappropriate, but a stack overflow when parsing a corrupt file sounds like something that could have easily been found by a fuzzer.
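As a crude illustration of that point, a mutation fuzzer can be as simple as flipping random bytes in a known-good file and feeding the result to the parser in a child process, so a stack-overflow crash shows up as a non-zero exit code. The `parse_db` entry point below is hypothetical; a real effort would reach for a coverage-guided fuzzer such as AFL or libFuzzer:

```python
import random
import subprocess
import sys

def mutate(data: bytes, flips: int = 8) -> bytes:
    """Flip a handful of random bytes in an otherwise valid input."""
    buf = bytearray(data)
    for _ in range(flips):
        buf[random.randrange(len(buf))] = random.randrange(256)
    return bytes(buf)

def fuzz(seed_path: str, iterations: int = 10_000) -> None:
    with open(seed_path, "rb") as f:
        seed = f.read()
    for i in range(iterations):
        with open("fuzz_input.bin", "wb") as f:
            f.write(mutate(seed))
        # 'parse_db' is a hypothetical wrapper around the deserializer under test;
        # running it in a child process means a crash does not kill the fuzzer.
        result = subprocess.run(
            [sys.executable, "-m", "parse_db", "fuzz_input.bin"],
            capture_output=True,
        )
        if result.returncode != 0:
            print(f"iteration {i}: parser crashed, keeping fuzz_input.bin for triage")
            break
```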
christophilus almost 4 years ago
Happy BunnyCDN user here. Thanks for the writeup.

> Both SmartEdge and the deployment systems we use rely on Edge Storage and Bunny CDN to distribute data to the actual DNS servers. On the other hand, we just wiped out most of our global CDN capacity.

That's the TLDR. What a stressful couple of hours that must have been for their team.
busymom0 almost 4 years ago
> On June 22nd at 8:25 AM UTC, we released a new update designed to reduce the download size of the optimization database.

That's around 4:25 a.m. EST. Are updates usually done around this time at other companies? It seems like that's cutting pretty close to the 8 a.m. mark, when a lot of employees start working.

The details of the whole incident sound pretty terrifying, and I am inspired by how much pressure their admins were under and that they still got it working again. Good work.