TechEcho (科技回声)

13 comments

jcollins, more than 10 years ago

We saw this earlier this year after upgrading to a new Linux kernel. The solution for us was to set this in sysctl.conf: net.ipv4.neigh.default.gc_thresh1=0 https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1331150/ https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1331150/comments/12
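For anyone wanting to check what a host is currently using before changing it, here is a minimal Python sketch that reads the neighbor-table garbage collection thresholds. It assumes a Linux host where the sysctl name net.ipv4.neigh.default.gc_thresh1 is exposed under /proc/sys/net/ipv4/neigh/default/; the function name `read_gc_thresholds` is hypothetical and not from the comment above.

```python
from pathlib import Path
from typing import Dict, Optional

# Assumption: standard Linux mapping of net.ipv4.neigh.default.* to /proc/sys.
DEFAULT_BASE = "/proc/sys/net/ipv4/neigh/default"


def read_gc_thresholds(base: str = DEFAULT_BASE) -> Dict[str, Optional[int]]:
    """Return the current gc_thresh1..3 values, or None where unreadable."""
    result: Dict[str, Optional[int]] = {}
    for name in ("gc_thresh1", "gc_thresh2", "gc_thresh3"):
        path = Path(base) / name
        try:
            result[name] = int(path.read_text().strip())
        except (OSError, ValueError):
            # File missing (non-Linux host, container restrictions) or malformed.
            result[name] = None
    return result


if __name__ == "__main__":
    for key, value in read_gc_thresholds().items():
        print(f"{key} = {value}")
```

As far as I understand the kernel documentation, gc_thresh1 is the minimum number of entries kept in the cache, and the garbage collector does not run while the table holds fewer entries than that, so a value of 0 lets it clean up stale entries even in a small table.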

danesparza, more than 10 years ago

Am I weird because I actually muttered 'ARP caching issue' halfway through your article? :-) Love the technical write-up -- thanks!

ChuckMcM, more than 10 years ago

Interesting trace down to stale ARP entries. It gets worse when the switches are running MAC address filtering and their tables get out of date. We had that issue with some Blade G8052 top-of-rack switches and their upstream 10G ports. They sometimes "forget" which upstream port has the MAC address that they are switching to, and those packets just spew out into the data center, leaving a mess. The "fix" is to force the switch to ping up through a specific upstream port periodically to the center switch's management IP address. Sigh.

spectre256, more than 10 years ago

This reminds me of a time at a previous company years ago, where we experienced an issue that felt similar, although the root cause was quite different. Basically, we had multiple teams all launching/terminating web servers. Unfortunately, they were all in the same EC2 deployment, and more often than not our load balancers from one team would send traffic to the web servers of another team. Furthermore, our setups were similar enough that this would sometimes cause bad results for users. We fixed it by making sure that our web servers on every team spoke on different ports. Not elegant, but effective (until two teams accidentally picked the same ports). These days we have good enough infrastructure tools that this problem should never happen. But in 2009, at a company that was overwhelmed with growth, that sort of thing happens.

falcolas, more than 10 years ago

Try an arping from the new workers on first startup? Ran into this quite a bit when using VIPs for DB failover, and an arping fixed the caching issue in most cases.
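To make the suggestion concrete, here is a rough Python sketch of the kind of frame involved: a gratuitous ARP request, in which a host announces its own IP-to-MAC mapping to the broadcast address so neighbors can refresh their caches. This illustrates the frame layout only and is not a replacement for the arping tool. The function names, the example MAC and IP, and the interface name are all hypothetical, and the send helper assumes Linux with root privileges.

```python
import socket
import struct

ETHERTYPE_ARP = 0x0806
ARP_REQUEST = 1


def build_gratuitous_arp(mac: str, ip: str) -> bytes:
    """Build an Ethernet frame carrying a gratuitous ARP request for `ip`."""
    src_mac = bytes.fromhex(mac.replace(":", ""))
    if len(src_mac) != 6:
        raise ValueError("MAC address must be 6 bytes")
    ip_bytes = socket.inet_aton(ip)

    # Ethernet header: broadcast destination, our MAC as source, ARP ethertype.
    eth_header = b"\xff" * 6 + src_mac + struct.pack("!H", ETHERTYPE_ARP)

    # ARP body: Ethernet/IPv4, request opcode, sender and target IP both ours.
    arp_body = struct.pack(
        "!HHBBH6s4s6s4s",
        1,            # hardware type: Ethernet
        0x0800,       # protocol type: IPv4
        6,            # hardware address length
        4,            # protocol address length
        ARP_REQUEST,
        src_mac,      # sender hardware address
        ip_bytes,     # sender protocol address
        b"\x00" * 6,  # target hardware address (unknown)
        ip_bytes,     # target protocol address equals sender: "gratuitous"
    )
    return eth_header + arp_body


def send_frame(interface: str, frame: bytes) -> None:
    """Send a raw frame. Assumption: Linux only (AF_PACKET), requires root."""
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as sock:
        sock.bind((interface, 0))
        sock.send(frame)


if __name__ == "__main__":
    frame = build_gratuitous_arp("02:00:00:aa:bb:cc", "10.0.0.5")
    print(len(frame), frame.hex())
    # send_frame("eth0", frame)  # uncomment on a Linux host, as root
```

In practice the arping utility is the simpler choice on a real worker; the sketch is only meant to show what goes on the wire.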

schimmy_changa, more than 10 years ago

I think the biggest thing I was surprised by with this investigation was the lack of documentation about data-layer tools. At one point I was looking through the source of the 'ip' command to try to find out exactly which conditions caused a 'STALE' entry in the ARP table...
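For anyone poking at the same thing, here is a small Python sketch that picks STALE entries out of the text printed by `ip neigh show`. It assumes the common one-line-per-entry layout (address, `dev <name>`, optional `lladdr <mac>`, state as the last token); other iproute2 versions or extra flags may print additional fields. The function names and the sample output are made up for illustration.

```python
from typing import Dict, List

# Made-up sample in the usual `ip neigh show` layout, not real output.
SAMPLE_OUTPUT = """\
10.0.1.5 dev eth0 lladdr 02:00:00:aa:bb:01 REACHABLE
10.0.1.9 dev eth0 lladdr 02:00:00:aa:bb:02 STALE
10.0.1.12 dev eth0 FAILED
"""


def parse_ip_neigh(output: str) -> List[Dict[str, str]]:
    """Parse `ip neigh show` text into dicts with ip, state, dev, lladdr."""
    entries: List[Dict[str, str]] = []
    for line in output.splitlines():
        tokens = line.split()
        if len(tokens) < 2:
            continue  # skip blank or unexpected lines
        entry = {"ip": tokens[0], "state": tokens[-1]}
        # Keyword/value pairs sit between the address and the state.
        middle = tokens[1:-1]
        for key in ("dev", "lladdr"):
            if key in middle:
                idx = middle.index(key)
                if idx + 1 < len(middle):
                    entry[key] = middle[idx + 1]
        entries.append(entry)
    return entries


def stale_entries(output: str) -> List[Dict[str, str]]:
    """Return only the entries whose state is STALE."""
    return [e for e in parse_ip_neigh(output) if e["state"] == "STALE"]


if __name__ == "__main__":
    for entry in stale_entries(SAMPLE_OUTPUT):
        print(entry["ip"], entry.get("lladdr", "?"), entry["state"])
```

On a live host the input could come from `subprocess.run(["ip", "neigh", "show"], capture_output=True, text=True).stdout` instead of the sample string.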

maerF0x0, more than 10 years ago

I wonder if this is a problem for any cloud provider. I also wonder if IPv6 could help mitigate it? Then IP collisions would be rarer.

wahnfrieden, more than 10 years ago

FYI, Clever: I click "Engineering Blog" at the top, and all links to blog posts on that page 404.

girvo, more than 10 years ago

We had a fascinating bug on EC2 -- we could connect to the instance, but no network traffic made it out. It wasn't a security group problem; it was literally a really weird bug in EC2's network that we somehow triggered. The engineer over at Amazon who looked at it was really excited when he came across our case because it was so weird, heh. They fixed it. I can't remember exactly what was done on their end, but it was one of the weirder problems I've attempted to debug. Nothing I tried worked!

perlgeek, more than 10 years ago

Wouldn't it be a better solution to not reuse IP addresses quickly? If I understood it correctly, they are in a private network anyway, so they could afford it.

zenocon, more than 10 years ago

I just experienced this early this week. Very frustrating. I also posted to the AWS forums and got zero assistance; I am currently not paying for an AWS support plan. This article came at an opportune moment -- it makes sense and removes the shroud of mystery around why it "works sometimes", which leaves me with an uneasy feeling for a production setup.

kiyoto, more than 10 years ago

Looking at the port number, it looks like Clever is a MongoDB user =)

legohead, more than 10 years ago

I was going to respond with my little story, but I see your article already linked it! ;)

When your IP traffic in AWS disappears into a black hole