Problems with low DNS TTLs

185 points by JimWestergren, about 4 years ago

37 comments

sparrish, about 4 years ago
As a sysadmin with 20+ years of experience, I've had long TTLs cause issues on several occasions.

I've never regretted a short TTL.

hiq, about 4 years ago
I think it'd be more interesting to measure the impact on the end user. The article mentions a drop in queries, but aren't DNS queries a drop in the bucket compared to the size of most web pages anyway? Is the difference really noticeable?

Do you get faster web pages if you cache for a longer time? If you do, shouldn't web browsers "soft-invalidate" (use the entry, but update it right after) the cache entry when you're just past the TTL, and "hard-invalidate" (update it before using) after that? Do they do that already?

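One way to picture the soft/hard split proposed here is the sketch below - a minimal illustration in Python, not any browser's actual behaviour; the cache structure, grace period, and resolver call are all illustrative:

```python
import socket
import threading
import time
from dataclasses import dataclass

SOFT_GRACE_SECONDS = 30  # illustrative: how far past the TTL we still "soft"-serve a stale entry

@dataclass
class CacheEntry:
    addresses: list
    expires_at: float  # wall-clock time at which the TTL runs out

_cache = {}

def _lookup(hostname: str) -> list:
    """Stand-in for a real DNS query: ask the system resolver directly."""
    infos = socket.getaddrinfo(hostname, None)
    return sorted({info[4][0] for info in infos})

def _refresh(hostname: str, ttl: float) -> list:
    addresses = _lookup(hostname)
    _cache[hostname] = CacheEntry(addresses, time.time() + ttl)
    return addresses

def resolve(hostname: str, ttl: float = 300.0) -> list:
    now = time.time()
    entry = _cache.get(hostname)

    if entry and now < entry.expires_at:
        return entry.addresses  # fresh: ordinary cache hit

    if entry and now < entry.expires_at + SOFT_GRACE_SECONDS:
        # "Soft-invalidate": answer with the stale entry now, refresh in the background.
        threading.Thread(target=_refresh, args=(hostname, ttl), daemon=True).start()
        return entry.addresses

    # "Hard-invalidate" (or cold cache): refresh before answering.
    return _refresh(hostname, ttl)

if __name__ == "__main__":
    print(resolve("example.com"))
```
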
jrockway, about 4 years ago
The article claims that web browsers will automatically pick a healthy backend when you return multiple A records, but the behavior doesn't seem acceptable to me. I was going to post "I've never seen it work", but I just tried it and it does indeed work -- the browser hangs for 30 seconds while it waits for the faulty IP address to time out, and then it eventually tries the other IP address, and it does work. (It then retains its selection for a while; I was too lazy to see what happens if I invert the healthiness of the two backends. I also didn't try more than 2.)

I think most people would call a website down if it was just a white screen for 30 seconds, so while it's a nice try on the part of the browsers, you can see why people use short TTLs to get bad backends out of the pool as quickly as possible.

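The 30-second hang described above is essentially the operating system's default connect timeout. A rough stdlib-only sketch of the failover idea - try every address returned for the name, but with a short per-address timeout so a dead backend costs seconds rather than half a minute; the host, port, and timeout are placeholders, and this is not how any particular browser implements it:

```python
import socket

def connect_any(hostname: str, port: int = 443, timeout: float = 3.0) -> socket.socket:
    """Try each A/AAAA record for hostname in turn, with a short per-address timeout."""
    last_error = None
    for family, socktype, proto, _canon, sockaddr in socket.getaddrinfo(
        hostname, port, type=socket.SOCK_STREAM
    ):
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(sockaddr)   # a dead backend fails after `timeout`, not ~30 s
            sock.settimeout(None)    # back to blocking mode for the caller
            return sock
        except OSError as exc:
            last_error = exc
            sock.close()
    raise ConnectionError(f"no reachable address for {hostname}:{port}") from last_error

# Usage: sock = connect_any("example.com"); sock.close()
```
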
xbar, about 4 years ago
Customer-facing DNS should have TTLs on the order of 15 to 30 minutes. Halving those values to estimate the TTL seen by the end user, you get 7 to 15 minutes of cached DNS. That's about right for most user interactions on the web.

Much longer and you run into all the trouble that operators have with keeping DNS accurate. DNS is hard. It is easy to break. And 15 to 30 minutes of waiting is about as much normal human attention span as you can apply to a problem that sounds like, "OK, we're all done, is DNS OK?"

5 to 10 minute TTLs only benefit operators. Certainly, any TTL less than 5 minutes is an indicator that your operators have no faith whatsoever in their ability to manage DNS.

yjftsjthsd-h, about 4 years ago
Once upon a time, I worked at a SaaS company that would sometimes switch customers to a new instance of a service by switching DNS records:

1. Create an instance of the service running version n+1
2. Switch the public DNS records to point at the new servers
3. Wait for the TTL to expire
4. Turn off the old servers

(Obviously I'm simplifying; if nothing else there should be testing steps in there.)

Unless I've missed something, wouldn't the author's suggestion to artificially raise the TTL by ignoring the upstream TTL result in the application breaking for customers if they used a DNS resolver that did this?

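A sketch of step 3 from the list above, assuming the third-party dnspython package (2.x API): rather than sleeping a fixed interval, poll a public resolver until only the new address comes back. As the comment points out, this gives no protection against a resolver that ignores the upstream TTL and caches longer - such a resolver will keep serving the old IP regardless of what the resolvers you can observe say.

```python
import time

import dns.resolver  # third-party "dnspython" package, assumed installed

def wait_for_cutover(hostname: str, new_ip: str, poll_seconds: int = 30) -> None:
    """Poll until the resolver we can see answers only with the new address."""
    resolver = dns.resolver.Resolver()
    while True:
        answer = resolver.resolve(hostname, "A")
        addresses = {rr.address for rr in answer}
        if addresses == {new_ip}:
            print(f"{hostname} now resolves to {new_ip}; old servers can be retired")
            return
        print(f"still seeing {addresses} (remaining TTL {answer.rrset.ttl}s); waiting...")
        time.sleep(poll_seconds)

# Usage (placeholders): wait_for_cutover("app.example.com", "203.0.113.10")
```
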
2ion, about 4 years ago
Maybe for loosely coupled systems. Unavoidable in tightly coupled systems, because it's a convenient way to do things unless you already have elaborate HA infra and protocols in place.

For example, if you offer an "entrypoint" that you can guarantee and technically make to be stable, then use longish TTLs. Anycast IPs are an extreme, but in between there are many useful modes of exploiting longish but not too long TTLs.

On the other hand, if you implement system failover in a locally redundant system and want to exploit DNS so you don't have to manage additional technology to make an "entrypoint" HA (VRRP, other IP movements, ...), low TTLs are nice. AWS is, I think, using 5s TTLs on ElastiCache nodes' primary DNS names.

Finally, 15m max is what I'm comfortable with. Any longer, or much longer, and ANY MISTAKE, and you can easily be in a world of hurt. It's no fun sitting out a DNS mistake propagating around the world with the fix lagging behind.

And this is only a view on "respectable TTL" values. DNS services like Google's public DNS probably ignore any or all TTLs for records they pull, and refresh them as fast as possible anyway, at least according to my observation. In that sense, I doubt that most of the internet is still using "respectable" TTLs --- I suspect most systems will RACE to get new data ASAP.

avidiax, about 4 years ago
The problem is that the DNS TTL is a feature designed for the static internet of the '70s or '80s.

What this points to is a need for authenticated DNS pushes for refresh/invalidation.

All supporting resolvers could keep a list of supporting clients that were told that "foo is at address 42". If the record changes, the authoritative DNS server sends a DNSSEC-signed unsolicited response to all previous requesters to update their records. Obviously the TTL can be extended to keep the cache of requester IPs reasonably sized.

Will this happen? Well, for UDP DNS it depends on DNSSEC, which is already not well supported, and it fixes something that is broken but not terribly so. One could imagine Google arranging this between its DNS resolvers and Chrome, for instance.

For DNS over HTTPS, this becomes much more feasible.

askbill, about 4 years ago
> The urban legend that DNS-based load balancing depends on TTLs (it doesn't - since Netscape Navigator, clients pick a random IP from a RR set, and transparently try another one if they can't connect)

That's just not how this works at all. While you could use RR records for this purpose, I believe the author is suggesting that load balancing will happen automatically when the client simply can't connect to one of the addresses. That's not load balancing. That's failover.

Additionally, most of the use cases for this that I'm aware of are CNAME -> A record. This is to say, this method is being used with precision rather than RR.

I agree that running 60-second TTLs regardless of need is inefficient, but at a fast glance, the full argument doesn't hold up for me.

skynet-9000, about 4 years ago
This only applies to the first request, until the cache expires.

If a client makes 50 requests before the cache expires, then those will all be based on the cached result.

This is still efficient enough that there's probably no more than a single DNS hit for every web page load, even with a short (say, 5 second) TTL, because most web assets will be loaded within that five-second window. (If your web page takes longer than 5 seconds to load, you have far more significant issues than a few UDP DNS requests.)

Whether the list of invalid use cases consists of straw-man arguments is left as an exercise for the reader, but this article seems to be arguing only one side of the perfectly valid trade-off between flexibility (low TTLs) and latency (high TTLs).

In other words, if high TTLs are so great and there are no compelling reasons not to use them, why not make them one year? Ten years?

On the other hand, many (probably most) applications can probably absorb a five-minute outage without anyone screaming too loudly.

Clearly there is a balance between "long" and "short" (probably somewhere between one second and infinity). It's good to think about these things and optimize for lower latency, but if five-minute or longer TTLs simply don't fit your use case, then don't feel bad about it.

gregsadetsky, about 4 years ago
I was happy to have a low 10-minute TTL a few days ago when Netlify's apex domain IP address stopped working and I had to change it to the new IP that they announced on their status page...! :-) [0]

Netlify's "previous" IP was down for ~4 hours.

[0] https://news.ycombinator.com/item?id=26581027

eximius, about 4 years ago
Okay, I thought this would be a little more hyperbolic than it is. TTLs under a minute are a little ridiculous. 5m is plenty long for sessions and plenty short for migrations/recovery/what have you.

meltedcapacitor, about 4 years ago
LOL, lots of arguments for a feature that makes sysadmin/dev life easy once a year at the expense of a degraded user experience every day (lots of sporadically broken ISP etc. DNS servers that civilians can't be expected to bypass). Digital littering.

gertrunde, about 4 years ago
I've seen issues with some DNS caches not honouring TTLs if they're too short (less than 1 hour IIRC, although my memory is a bit hazy, it was some years ago) - in particular, academic institutions tended to be the biggest culprits for this.

smitop, about 4 years ago
CloudFlare has an "Auto" TTL option, which is the default, and is required when reverse proxying through CloudFlare. There is nothing magical about the "Auto" TTL, though: it appears to literally always be 299 seconds. A lot of the low TTLs you see are probably caused by CloudFlare.

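For what it's worth, the returned TTL is easy to inspect yourself; a small sketch assuming the dnspython package, with the host name and resolver IP as placeholders. Against the zone's authoritative servers you see the configured value; against a caching resolver you see the remaining TTL counting down between queries.

```python
import dns.resolver  # third-party "dnspython" package, assumed installed

def observe_ttl(hostname: str, nameserver: str = "1.1.1.1") -> int:
    """Return the TTL the given resolver hands back for hostname's A record."""
    resolver = dns.resolver.Resolver()
    resolver.nameservers = [nameserver]
    answer = resolver.resolve(hostname, "A")
    return answer.rrset.ttl

# print(observe_ttl("www.example.com"))
```
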
cmeacham98, about 4 years ago
I doubt long TTLs matter that much, given that plenty of software also has a *maximum* TTL value [1], including all popular browsers (Chrome(ium), WebKit aka Safari, Necko aka Firefox, Trident aka IE) and the most popular mobile OS (Android). You could maybe get lucky with some caching on your router, but in my experience cheap consumer routers just act as DNS forwarders and have little to no caching (I could not find any explicit data on this, however).

[1] https://www.ctrl.blog/entry/dns-client-ttl.html

ShakataGaNai, about 4 years ago
Part of the problem is that so many devices are poorly behaved when it comes to DNS. At one point I worked for a company that had a large mobile app presence. We set up new authoritative name servers to conduct a test for a week or so. After the test was completed we removed the name server records. A lot of clients went away very quickly... but far more stuck around longer than they should have.

Two months after the test, those test servers were still getting some traffic.

karmakaze, about 4 years ago
I had to get to the very end to see that 'ridiculously low' meant anything shorter than "between 40 minutes (2400 seconds) and 1 hour."

No thank you - if there's an outage that needs a DNS update to resolve it, 5 to 15 minutes is much more reasonable.

VectorLock, about 4 years ago
If we changed 5-minute TTLs to 1 hour and lost that ability to recover, what would we gain in saved traffic? My guess would be not very much.

antattack, about 4 years ago
A short TTL can be used for activity tracking.

You can use dnsmasq's --min-cache-ttl= option to set the minimum.

Unfortunately you have to recompile to allow a minimum longer than 1h.

notyourday, about 4 years ago
The problem with generalities is that they tend to pick examples that don't generalize well.

In the case of the GitHub example, the author is fixated on DNS, where in reality the DNS entry is an entry point into Fastly's anycast CDN endpoints, and DNS is only used to point clients in the general direction of the correct anycast entry point. Fastly's CTO gave a great talk a few years ago about load balancing which addressed the DNS issues based on the actual data they have from edges that serve billions of requests.

TL;DR of the DNS portion of that talk: "use as low a TTL as you can humanly get away with".

rntksi, about 4 years ago
Unrelated to the author's post, but for LetsEncrypt TXT records (to get wildcard SSL certificates), I've always set the TTL very low (in the 1-2 minute range or so). This is because when I renew the certificates, I don't want to wait for DNS caching of those TXT records to resolve all over the Internet.

I think that doesn't really affect anything traffic-wise. Just a thought I had in mind while reading the article.

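The records in question are the _acme-challenge.<domain> TXT records used by the ACME DNS-01 challenge, and the wait exists because the CA's resolver may still hold a cached stale or missing answer. A sketch of checking propagation before triggering validation, assuming dnspython; the resolver IP, polling interval, and token are placeholders:

```python
import time

import dns.resolver  # third-party "dnspython" package, assumed installed

def txt_visible(domain: str, expected_token: str, nameserver: str = "8.8.8.8") -> bool:
    """Is the DNS-01 challenge TXT record visible from a public resolver yet?"""
    resolver = dns.resolver.Resolver()
    resolver.nameservers = [nameserver]
    try:
        answer = resolver.resolve(f"_acme-challenge.{domain}", "TXT")
    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
        return False
    return expected_token in {rr.to_text().strip('"') for rr in answer}

def wait_until_visible(domain: str, token: str, poll_seconds: int = 15) -> None:
    while not txt_visible(domain, token):
        time.sleep(poll_seconds)  # with a 1-2 minute TTL, stale answers age out quickly

# Usage (placeholders): wait_until_visible("example.com", "the-challenge-token")
```
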
JimWestergren, about 4 years ago
What are the use cases for having a TTL shorter than 5 minutes?

xg15, about 4 years ago
Reading the article and then reading the comments is interesting. I guess this is a good example of a feature which in theory would benefit both users and sites - but which falls flat because it's infeasible for ops.

annoyingnoob, about 4 years ago
Interestingly, in my experience there is always a long tail of laggards after IP changes, where some folks do not notice the change for a very long time, or at all. Having a long TTL makes this worse / take longer.

speleding, about 4 years ago
I noticed CloudFront sets a TTL of 60 seconds on its distributions and also on the elastic load balancers. You pay for every Route 53 lookup if you have an ALIAS record pointing there, as is typical. So AWS has no incentive to set it any higher.

But if I understand it correctly, you can point a CNAME with a long TTL at the appropriate cloudfront.net record, and then you only pay for the CNAME lookup. The cloudfront.net lookup will not cost you anything. But the latency for your users will be worse because it adds a lookup (an ALIAS record gets resolved without an extra lookup).

encoderer, about 4 years ago
> The urban legend that DNS-based load balancing depends on TTLs (it doesn't - since Netscape Navigator, clients pick a random IP from a RR set, and transparently try another one if they can't connect)

Sure, but if the client can connect and then pukes out on something like a bad SSL handshake or a broken app, it's not going back to try another host.

So, when using DNS for load balancing, it's preferable to have a low TTL with a DNS record tied to a host health check. If a host goes unhealthy it takes itself out of rotation, auto-scaling brings a new one in, and it's fully warmed up within a minute.

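A sketch of that pattern as one host's self-check loop. The dns_api_upsert and dns_api_delete helpers are hypothetical stand-ins for whatever API your DNS provider actually exposes, and the health URL, record name, address, and TTL are placeholders:

```python
import time
import urllib.request

HEALTH_URL = "http://127.0.0.1:8080/healthz"  # assumed local health endpoint
RECORD_NAME = "app.example.com"               # placeholder record this host serves
MY_IP = "203.0.113.10"                        # placeholder public address of this host
LOW_TTL = 60                                  # bounds how long a dead host lingers in caches

def is_healthy() -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

def dns_api_upsert(name: str, ip: str, ttl: int) -> None:
    """Hypothetical provider call: ensure an A record name -> ip with the given TTL."""
    raise NotImplementedError("replace with your DNS provider's API")

def dns_api_delete(name: str, ip: str) -> None:
    """Hypothetical provider call: remove this host's A record from the rotation."""
    raise NotImplementedError("replace with your DNS provider's API")

def run() -> None:
    in_rotation = False
    while True:
        healthy = is_healthy()
        if healthy and not in_rotation:
            dns_api_upsert(RECORD_NAME, MY_IP, LOW_TTL)
            in_rotation = True
        elif not healthy and in_rotation:
            dns_api_delete(RECORD_NAME, MY_IP)  # clients may still see us for up to LOW_TTL
            in_rotation = False
        time.sleep(10)
```
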
adrianstoll, about 4 years ago
In 2016 Dyn DNS suffered a DDoS attack and sites including Twitter and Spotify became inaccessible. Higher TTLs would have extended availability for browsers with cached resource records.

AtNightWeCode, about 4 years ago
One purpose of a low TTL in the solutions I have built is that you want to change the IP. First you hit the DNS and get an IP from some main location. Then, after the first request, you figure out where the user is located - perhaps spin up a container close to the user. On subsequent requests the user then gets an IP much closer to them.

Another use is to load balance a lot of users across different web nodes, for instance.

Edit: spelling

hkt, about 4 years ago
DNS issues could be handled better by many of those running resolvers, for instance by keeping caches primed for sites to reduce latency for end users - as opposed to extending TTLs.

This is probably the cheapest and best solution available for improving DNS-related UX issues, and is likely to be something a commercial DNS provider could do well.

darylteo, about 4 years ago
From my short experience, the issue isn't that "the new service isn't available for the user" but that "the new service isn't available FOR THE CLIENT". Cue the "why isn't it up yet" emails and calls, answered with "it will take up to x hours to propagate".

thexa4, about 4 years ago
Wouldn't imposing a lower bound on the TTL push more people to using anycast instead?

z3t4, about 4 years ago
So you set a high TTL thinking that DNS servers will cache your IP. Yeah, right - DNS servers like Google DNS will only cache it for a few minutes. It doesn't matter whether your TTL is high or low.

billpg, about 4 years ago
I've often wished there was a way a web server could respond to requests with "Your DNS is out of date. Use this IP instead".

bvrmn, about 4 years ago
I'm running a local caching dnsmasq with a minimum TTL of 1h. The modern internet experience is really awful without it.

donaldihunter, about 4 years ago
I wonder how low TTLs compare to browser URL-bar queries in their impact on the DNS user experience.

intricatedetail, about 4 years ago
The author probably never had to switch servers because of a failure and then wait 24 hours for the traffic to come back, losing money and getting angry emails from clients who had e.g. bought advertising.

jeffbee, about 4 years ago
It doesn't sound like the author has ever operated a large-scale service. There are reasons why every big operator uses short TTLs, and it isn't because they are stupid.