1.1.1.1 outage explanation

459 points by adamch, almost 7 years ago

17 comments

gcommer, almost 7 years ago

This is a great write-up. It's also why the DNS root servers have a policy of surviving DDoS through massively over-provisioned, multi-org, anycasted redundancy rather than this sort of smart DDoS mitigation that drops traffic: DNS is so critical that any risk of dropping real traffic is unacceptable. (Obviously, such a scale is impractical for 99% of services.)

A good takeaway from this outage for the average user would be to make sure that your fallback DNS resolvers are operated by totally separate providers (e.g., configure 1.1.1.1 with 8.8.8.8 as a fallback, rather than 1.1.1.1 and 1.0.0.1). (Edit: fixed Cloudflare's secondary address.)
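As a rough illustration of the "separate providers" advice, here is a minimal Python sketch that checks whether a resolver list spans more than one operator. The `RESOLVER_OPERATORS` table and both function names are hypothetical and hand-written for this example; the table covers only a few well-known public resolver addresses, and anything not listed is simply ignored.

```python
# Hypothetical, hand-maintained map of a few well-known public resolver
# addresses to the organisations operating them. Not exhaustive.
RESOLVER_OPERATORS = {
    "1.1.1.1": "Cloudflare",
    "1.0.0.1": "Cloudflare",
    "8.8.8.8": "Google",
    "8.8.4.4": "Google",
    "9.9.9.9": "Quad9",
}


def known_operators(resolvers):
    """Return the set of known operators behind the given resolver IPs.

    Addresses missing from RESOLVER_OPERATORS are skipped, so the result
    only counts operators we can actually identify.
    """
    return {RESOLVER_OPERATORS[ip] for ip in resolvers if ip in RESOLVER_OPERATORS}


def has_independent_fallback(resolvers):
    """True if at least two distinct known operators are configured."""
    return len(known_operators(resolvers)) >= 2


if __name__ == "__main__":
    print(has_independent_fallback(["1.1.1.1", "1.0.0.1"]))  # same operator
    print(has_independent_fallback(["1.1.1.1", "8.8.8.8"]))  # two operators
```

The check is deliberately conservative: an address it does not recognise never counts towards independence.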

saghm, almost 7 years ago

> Our FRP framework allows us to express this in clear and readable code. For example, this is part of the code responsible for performing DNS attack mitigation:
>
>     def action_gk_dns(...):
>         [...]
>
>         if port != 53:
>             return None
>
>         if whitelisted_ip.get(ip):
>             return None
>
>         if ip not in ANYCAST_IPS:
>             return None
>
>         [...]

What does this code sample have to do with FRP? This code seems extremely trivial and doesn't give any real indication to me why you'd need a framework of any sort. It seems like they really want to emphasize that they use FRP, but this code just seems completely unrelated.

obeattie, almost 7 years ago

Major credit to Cloudflare for publishing a clear, honest, and detailed description of what happened. I wish more companies would do this.

One thing I’d be interested to know more about is why it took 17 minutes to fix. While you can and always should strive to make them less likely, outages are inevitable, so how you respond is crucial. Here the outage was very obviously caused by a deployment that I’d assume was supervised by humans – why did it take 17 minutes to roll back?

drawkbox, almost 7 years ago

Great information on the outage. Looks like another side effect of version 2 syndrome [1].

> Today, in an effort to reclaim some technical debt, we deployed new code that introduced Gatebot to Provision API.

> What we did not account for, and what Provision API didn’t know about, was that 1.1.1.0/24 and 1.0.0.0/24 are special IP ranges. Frankly speaking, almost every IP range is "special" for one reason or another, since our IP configuration is rather complex. But our recursive DNS resolver ranges are even more special: they are relatively new, and we're using them in a very unique way. Our hardcoded list of Cloudflare addresses contained a manual exception specifically for these ranges.

> As you might be able to guess by now, we didn't implement this manual exception while we were doing the integration work. Remember, the whole idea of the fix was to remove the hardcoded gotchas!

When porting legacy code, it is important not only to understand the edge cases and technical debt built up over time, but also to test more heavily in production. You never know whether you caught them all: some smart person built them long ago, and/or there are unknown hacks that became cornerstones of the system, for better or worse.

Phased and alpha/beta rollouts, done in an almost A/B-testing way, are good for replacement systems. Version 2 systems can also add new attack vectors or other single points of failure that aren't as well known as the legacy problems; the Provision API seems like a candidate for that.

Over time the Version 2 system will be hardened, just before it reaches EOL and is replaced again to fix all the new problems that arose along the way. Version 2s do innovate, but in fixing old issues and pain points they also bring in new, unknown problems.

[1] https://en.wikipedia.org/wiki/Second-system_effect
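One way to keep an edge case like this from getting lost in a port is to pin it down as data plus a regression test. The sketch below is only an illustration using Python's standard `ipaddress` module; the two ranges come from the quoted post-mortem, while `RESOLVER_RANGES`, `is_resolver_address`, and `should_mitigate` are hypothetical names and say nothing about how Gatebot or Provision API actually work.

```python
import ipaddress

# The two ranges named in the quoted post-mortem.
RESOLVER_RANGES = [
    ipaddress.ip_network("1.1.1.0/24"),
    ipaddress.ip_network("1.0.0.0/24"),
]


def is_resolver_address(ip):
    """True if ip falls inside one of the special resolver ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in RESOLVER_RANGES)


def should_mitigate(ip, flagged_as_attack):
    """Hypothetical decision function: never auto-mitigate resolver ranges."""
    if is_resolver_address(ip):
        return False
    return flagged_as_attack


def test_resolver_ranges_are_exempt():
    # Regression test: the exception must survive any future rewrite.
    for ip in ("1.1.1.1", "1.0.0.1"):
        assert should_mitigate(ip, flagged_as_attack=True) is False


if __name__ == "__main__":
    test_resolver_ranges_are_exempt()
    print("resolver ranges stay exempt")
```

With the exception encoded in a test, a replacement system that forgets it fails before deployment rather than in production.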

Robin_Message, almost 7 years ago

What's interesting here is that the automatic cure (DDoS protection) was worse than the disease (even if there was an attack, blocking all access to the DNS servers is potentially worse than letting them get overloaded).

I wonder if it would be possible to express the idea that if a block being applied drops traffic well below expected levels, it must be a mistake?
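A very rough sketch of that idea in Python, under stated assumptions: the function name, the queries-per-second inputs, and the 20% floor are all made up for illustration, and a real system would need a much better model of "expected" traffic.

```python
def block_looks_mistaken(expected_qps, observed_qps, min_fraction=0.2):
    """Flag a mitigation as suspicious if it cut traffic too deeply.

    expected_qps: traffic level expected without any block (assumed known).
    observed_qps: traffic still getting through after the block was applied.
    min_fraction: hypothetical floor; below this share of expected traffic
                  the block is assumed to be dropping legitimate queries.
    """
    if expected_qps <= 0:
        # No meaningful baseline, so no judgement can be made.
        return False
    return observed_qps < expected_qps * min_fraction


if __name__ == "__main__":
    # A block that leaves 1% of normal traffic is flagged as a likely mistake.
    print(block_looks_mistaken(expected_qps=1000, observed_qps=10))
    # A block that leaves 60% of normal traffic is not flagged.
    print(block_looks_mistaken(expected_qps=1000, observed_qps=600))
```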

walrus01, almost 7 years ago

Commendable honesty and level of detail in the public RFO.

frenchie4111, almost 7 years ago

Understandable outage. I switched back to Cloudflare when it came back up, but this did prompt me to drop 8.8.8.8 in as a third fallback.

IronWolve, almost 7 years ago

Doesn't seem fixed. I still can't resolve archive.is; Cloudflare is returning Cloudflare DNS web errors.

https://i.imgur.com/APpQPTJ.png

cm2187, almost 7 years ago

Do they just use Python as pseudocode, or do they actually run their attack detection in Python?

jonnismash, almost 7 years ago

> The next time we mitigate 1.1.1.1 traffic, we will make sure there is a legitimate attack hitting us.

I fucking love these guys

xtrimsky1234, almost 7 years ago

This downtime annoyed me at a critical time. Even though I think it's great they are transparent about it, I don't see myself using their DNS again.

8_hours_ago, almost 7 years ago

I wonder if the change was manually reverted after 17 minutes, or if Cloudflare has a system that watches for a spike in failures and automatically reverts the most recent change.
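For what the second option might look like, here is a minimal Python sketch of a failure-spike check that could trigger a rollback. Nothing in the write-up says Cloudflare runs anything like this; the function name, the spike factor, the 1% floor, and the "three consecutive samples" rule are all assumptions chosen for illustration.

```python
def should_auto_revert(baseline_failure_rate, failure_rates_since_deploy,
                       spike_factor=5.0, consecutive=3):
    """Decide whether the most recent change should be rolled back.

    baseline_failure_rate: failure rate observed before the deploy (0..1).
    failure_rates_since_deploy: samples taken after the deploy, oldest first.
    spike_factor / consecutive: hypothetical tuning knobs. A revert is
    suggested once `consecutive` samples in a row exceed the threshold.
    """
    # Use a small absolute floor so a near-zero baseline does not make
    # the check fire on noise.
    threshold = max(baseline_failure_rate * spike_factor, 0.01)
    run = 0
    for rate in failure_rates_since_deploy:
        run = run + 1 if rate > threshold else 0
        if run >= consecutive:
            return True
    return False


if __name__ == "__main__":
    # Sustained failures after a deploy suggest a revert.
    print(should_auto_revert(0.001, [0.001, 0.9, 0.95, 0.99]))
    # Isolated blips do not.
    print(should_auto_revert(0.001, [0.9, 0.001, 0.9, 0.001]))
```

Requiring several consecutive bad samples trades a slower reaction for fewer false rollbacks.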

known, almost 7 years ago

Error 1001 (DNS resolution error) when trying to access archive.li.

But it works with 9.9.9.9.

throwaway2048, almost 7 years ago

Another decent alternative fallback with the same feature set is Quad9: https://www.quad9.net/

dingo_bat, almost 7 years ago

TL;DR: we should have used an IP that is not traditionally used for testing and internal stuff by everybody including Cisco.

baby, almost 7 years ago

I switched to 1.1.1.1 when it was released, and since then I’ve had multiple issues with free Wi-Fi networks, where they fail to hijack my DNS requests to let me log in to their portal. I assume this is a good thing, but can someone explain to me why this is happening and what the state of improving these Wi-Fi portals is?

PS: sorry for hijacking the thread.

decker, almost 7 years ago

I wonder if they are going to factor the availability into all the blog posts about how fast they are.