For about the past 72 hours, our EC2 instances have been encountering intermittent DNS resolution failures when talking to the default AWS DNS server, 172.16.0.23.

These failures pretty much always occur at around one minute after the hour (i.e., between 12:01 and 12:02, between 1:01 and 1:02, etc.), and the form the failures take is that the AWS nameserver returns SERVFAIL.

I have attempted to isolate the problem to AWS's name servers, as opposed to the DNS servers that AWS is speaking to, by running code outside of AWS that looks up exactly the same DNS records at the same time. I have a script which runs out of cron at one minute after the hour and spends about 60 seconds repeatedly looking up host names that don't exist, reporting any lookup that returns SERVFAIL instead of NXDOMAIN. The script returns occasional errors when run in AWS, but returns absolutely no errors when run outside of AWS.

The domain I'm looking up records in is hosted by Dyn through their DynECT service, but I'm not sure that's relevant since I've confirmed that the errors only occur when the AWS nameserver is in the loop.

Amazon's DNS servers are notoriously unreliable, but we've never seen this particular failure mode before; the usual failure mode is that DNS simply doesn't work at all on a particular instance and we have to terminate and replace it. Certainly, we've never seen a failure mode where all the errors occur on an hourly cycle like this.

What I'm looking for from HN is:

1) Are you seeing similar behavior in your AWS deployments?

2) Would you be able to run a script similar to the one I'm running to find out if you can reproduce the issue?

If I can collect evidence that this is happening to lots of people rather than just us, I have a better chance of convincing Amazon to pay attention to it.

Thanks!

16 comments

namecast, over 10 years ago

I'm in for #2. And no to #1, but I'm not sure we'd notice if we did.

Here's a theory that you might be able to chase down with AWS EC2 support folks:

Many, many EC2 instances are either scheduled to be created on the hour (e.g. by cloudformation/knife ec2/whatever) or are running cron jobs that run hourly;

EC2 provisioning tasks and cron jobs usually require connections to outside servers (package installs, apt-get updates, sending logs to s3, etc.), and that means looking up hostnames;

Lots of hostnames are being looked up on the hour as a result, and <resource X> is being exhausted hourly when a flood of lookups goes to the DNS server.

Important caveat: resource X may not be the AWS internal DNS server itself! It can be the port it's connected to being saturated, or a particular uplink on a two-port portchannel being flaky (with the flakiness only evident when it's under high load), or the elastic interface that is attached to the DNS server, or any one of a dozen other things.

Are you seeing this behavior across multiple AZs and regions, or just one?

(This is just a theory, mind you, but I've seen this same behavior when managing other large DNS clusters, and it sounds like a good fit.)
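One way to check the on-the-hour theory against your own data is to bucket failure timestamps by minute of the hour. This is only a sketch: it assumes a hypothetical log with one failure per line, each line starting with a UTC timestamp of the form YYYY-MM-DDTHH:MM:SSZ, and the function name `failures_by_minute` is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: read failure log lines on stdin, each beginning with
# a timestamp of the form YYYY-MM-DDTHH:MM:SSZ, and print "MM count" for
# every minute-of-hour that saw at least one failure, sorted by minute.
failures_by_minute() {
    # Characters 15-16 of the timestamp are the minute field.
    cut -c15-16 | sort | uniq -c | awk '{ print $2, $1 }'
}
```

If the failures really are tied to the top of the hour, nearly all of the counts should land on `01` (and perhaps `02`), whichever AZ or region the log came from.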

spaceapesam, over 10 years ago

Can also confirm. We've been seeing them for over a week, more severely in the last 2 days. It is not specific to AZs or accounts. We know from Amazon that none of the AZs in our accounts overlap.

The problem doesn't seem to lie with the recursive resolver assigned by the VPC DHCP options set, since resolution from within the VPC via 8.8.8.8, say, still results in occasional SERVFAILs against zones that Route53 is authoritative for.

EDIT: some tcpdump confirmation

<pre><code>
17:01:27.988201 IP 10.0.0.2.domain > xxxx.54322: 61062 3/0/0 CNAME xxxx., A 10.x.x.x, A 10.x.x.x (151)
17:01:28.278093 IP 10.0.0.2.domain > xxxx.53047: 49767 ServFail 0/0/0 (61)
</code></pre>
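For anyone collecting similar captures, a rough way to summarise them is to count how many DNS responses appear and how many of those are ServFail. The sketch below assumes tcpdump's default one-line-per-packet text output, as in the excerpt above; the function name `servfail_summary` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read tcpdump text output on stdin and report how many
# DNS responses were seen and how many of them carried ServFail.
# Assumes tcpdump's default one-line-per-packet format, in which responses
# come from a source ending in ".domain" (or ".53" when run with -n).
servfail_summary() {
    awk '
        $2 == "IP" && ($3 ~ /\.domain$/ || $3 ~ /\.53$/) {
            total++
            if ($0 ~ / ServFail /) failed++
        }
        END { printf "responses=%d servfail=%d\n", total, failed }
    '
}
```

It could be fed from a saved capture, for example `tcpdump -r capture.pcap udp port 53 | servfail_summary`, where `capture.pcap` stands in for whatever file you recorded.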

jik, over 10 years ago

I received the following from AWS support at 9:33am US/Eastern today (8+ hours ago):

"... We have now been able to reproduce the behavior in tests similar to your scripts to pinpoint where the UDP packets were disappearing, and yesterday evening the team tested a fix that unfortunately had some unexpected problems. I am hesitant to provide any time estimate since any software development has risk, but I'm hopeful it will be fixed today...."

The last DNS blip we saw was less than an hour ago, so I don't think it's fixed yet, but the day is not yet over...

jik, over 10 years ago

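Here's the script I'm running from cron both inside AWS and outside it at one minute past the hour (with our internal DNS domain replaced with example.com):

<pre><code>
#!/bin/bash
# Once a second for about 60 seconds, look up a random name that cannot
# exist and report any answer that is not NXDOMAIN (e.g. SERVFAIL).
tf=/tmp/out.$$
for turn in $(yes | head -60); do
    start=$(date)
    # Keep the command in a variable so the failure report can name it.
    cmd="host $(uuidgen).example.com"
    if ! $cmd 2>&1 | tee $tf | grep -q -s -w NXDOMAIN; then
        end=$(date)
        echo "$cmd failed from $start to $end:"
        cat $tf
    fi
    sleep 1
done
rm -f $tf
</code></pre>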
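The script above infers the result by grepping the output of `host` for NXDOMAIN. As an alternative sketch, `dig` prints the response code in its header line (for example `status: NXDOMAIN` or `status: SERVFAIL`), which makes it easy to log the exact code and to aim the same query at a specific resolver with `@server`. The function names `dig_status` and `check_resolver` are made up here, and example.com is a placeholder as before.

```shell
#!/bin/bash
# Hypothetical helper: extract the response code (NOERROR, NXDOMAIN,
# SERVFAIL, ...) from dig's ";; ->>HEADER<<-" line read on stdin.
# Prints NOREPLY if no header is present (for example, on a timeout).
dig_status() {
    awk -F'status: ' '
        /->>HEADER<<-/ { split($2, a, ","); print a[1]; found = 1; exit }
        END { if (!found) print "NOREPLY" }
    '
}

# Query one random, nonexistent name against the given resolver and print
# "<resolver> <status>". Requires dig and uuidgen; example.com is a placeholder.
check_resolver() {
    local server=$1
    printf '%s %s\n' "$server" \
        "$(dig "@$server" "$(uuidgen).example.com" A +time=2 +tries=1 | dig_status)"
}
```

Running `check_resolver 172.16.0.23` and `check_resolver 8.8.8.8` side by side in the same loop would show whether a SERVFAIL comes from one resolver only or from both.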

cce_, over 10 years ago

Hi, we are seeing this here too, starting around September 27. Some of our worker processes have been getting DNS exception notifications, always at 1 minute past the hour, and we had been scratching our heads about it. We'll open a ticket with AWS too. Thanks for helping us find out we're not the only ones seeing this!

jik, over 10 years ago

It looks like this was fixed on October 7.

hltbra, over 10 years ago

I've been experiencing the same issue since the reboot events, but only RDS DNS errors: https://forums.aws.amazon.com/thread.jspa?messageID=573380

jbarnard, over 10 years ago

I heard rumours on reddit that it was fixed; however, I'm still seeing the issue. Lookups to my RDS instance from an EC2 instance will fail. The last few failures were at 6:01am and 7:02am.

jeffbarr, over 10 years ago

I have asked the AWS DNS team to take a look at this thread!

Bobbickel, over 10 years ago

We isolated it to East 1a zone and have taken our webserver in that zone offline until the issue is cleared up.

mrdavid, over 10 years ago

We also see the same exact behavior. I opened a ticket with AWS earlier today and just forwarded them this thread.

davedash, over 10 years ago

Yup, a client of mine's nagios keeps fritzing every minute past the hour with DNS issues. Thanks.

helper, over 10 years ago

Yes, we're seeing similar failures at 1 minute past the hour.

jnankin, over 10 years ago

Yep, seeing this too across multiple machines!

dkuebric, over 10 years ago

We're seeing these as well.

jik, over 10 years ago

Still broken this morning.

Ask HN: Intermittent EC2 DNS failures at 1 minute past the hour
