Dejan here from bunny.net. I was reading some of the comments but wasn't sure where to reply, so I guess I'll post some additional details here. I tried to keep the blog post somewhat technical without overwhelming non-technical readers.

To add some details: we already use multiple deployment groups (one for each DNS cluster), and we always deploy each cluster separately to make sure we're not doing something destructive. Unfortunately, this deployment went to a system we believed was not a critical part of the infrastructure (oh, how wrong we were) and that was not made redundant, since the rest of the code was supposed to handle it gracefully if this system was offline or broken.

It was not my intention to blame the library; obviously this was our own fault. But I must admit we did not expect a stack overflow from it, which completely obliterated all of the servers the moment the data from the "non-critical" component got corrupted.

This piece of data is highly dynamic and is regenerated every 30 seconds or so from hundreds of thousands of metrics. Running a checksum would not have helped here, because the distributed file arrived exactly as it was generated. The issue happened while the file was being generated, not while it was being distributed.

Now for the DNS itself, which is a critical part of our infrastructure.

We of course operate a staging environment with both automated and manual testing before things go live.

We also use multiple deployment groups, so separate clusters are deployed before the others go live, which lets us catch issues early.

We do the same for the CDN and always use canary testing where possible. We unfortunately never anticipated that this piece of software could cause every DNS server to stack overflow.

Obviously, as I mentioned, we are not perfect, but we are trying to learn from what happened. The biggest flaw we discovered was the reliance on our own infrastructure to handle our own infrastructure deployments.

We have code versioning and CI in place, as well as the option to roll back as needed. If the issue had happened under normal circumstances, we would have been able to roll back all the software almost instantly and maybe experience 2-5 minutes of downtime. Instead, we brought the whole system down like dominoes because everything relied on everything else.

Migrating our deployment services to third-party solutions is therefore our biggest fix at this point.

The reason we are moving away from BinaryPack is that it simply wasn't providing that much benefit. It was helpful, but it wasn't having a significant impact on overall behavior, so we would rather stick with something that has worked fine for years without issues. As a small team, we don't have the time or resources to spend on improving it at this point.

I'm somewhat exhausted after yesterday, so I hope this isn't too unstructured, and that it answers some questions rather than creating more of them :)

If I missed any suggestions or anything was unclear, please let me know. We're actively trying to improve all of our processes to avoid similar situations in the future.
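
One last thing, to make the generation-vs-distribution point a bit more concrete. Here's a rough sketch of the kind of check that has to live on the generator side, before the file is ever shipped (Python and JSON purely for illustration; the file names, keys, and format are made up and not our actual stack):

```python
# Illustrative sketch only: validate the freshly generated file by parsing it
# the same way the consumers would, *before* publishing it. A checksum only
# proves the bytes arrive intact; it says nothing about whether the content
# was generated correctly in the first place.
import json
import shutil
import sys

GENERATED = "optimization_data.generated.json"   # hypothetical paths
PUBLISHED = "optimization_data.published.json"
MAX_BYTES = 50 * 1024 * 1024                     # reject absurdly sized output

def validate(path: str) -> None:
    with open(path, "rb") as f:
        blob = f.read()
    if not blob or len(blob) > MAX_BYTES:
        raise ValueError(f"unexpected size: {len(blob)} bytes")
    data = json.loads(blob)                      # parse like a consumer would
    if not isinstance(data, dict) or "metrics" not in data:
        raise ValueError("missing expected top-level keys")

if __name__ == "__main__":
    try:
        validate(GENERATED)
    except Exception as exc:
        # Never publish a file that fails its own parser; keep the previous one.
        print(f"validation failed, keeping previous file: {exc}", file=sys.stderr)
        sys.exit(1)
    shutil.copyfile(GENERATED, PUBLISHED)
```

The file can pass every transport-level check and still be garbage, so the only check that matters is whether it parses and looks sane to the code that will consume it.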
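
And on the consumer side, the "handle it gracefully" part roughly means something like this (again just an illustrative sketch, not our implementation): the data is treated as optional, and a refresh that fails for any reason keeps the last known-good copy instead of taking the process down.

```python
# Illustrative sketch only: the DNS process treats the optimization data as
# optional. If the file is missing or fails to parse, it keeps the last
# known-good copy in memory and keeps answering queries.
import json
import logging

log = logging.getLogger("dns")

class OptimizationData:
    def __init__(self) -> None:
        self._current: dict = {}    # last known-good data; empty means "no hints"

    def refresh(self, path: str) -> None:
        try:
            with open(path, "rb") as f:
                candidate = json.loads(f.read())
            if not isinstance(candidate, dict):
                raise ValueError("unexpected shape")
        except Exception:
            # Any failure must degrade to "slightly less optimal routing",
            # never to "the DNS process dies".
            log.exception("failed to refresh optimization data, keeping previous")
            return
        self._current = candidate

    def current(self) -> dict:
        return self._current
```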