How we found a bug in Amazon ELB

151 points by davideschiera about 9 years ago

14 comments

nzoschke about 9 years ago
Great article. The Sysdig team really knows how to root cause tough problems. The Sysdig tools can be invaluable for getting and making sense of low level data.

If you want to play with ELBs, rolling deploys, connection draining to ECS containers, I humbly submit the open source Convox project I am working on.

https://github.com/convox/rack

It sets up a peer reviewed, production tested, batteries-included VPC, ECS, ASG, ELB, etc. cluster in minutes.

If the conclusion of this Sysdig post was that you always need to run 2 instances per AZ for the best reliability, I would strongly consider adding that knowledge into the tools either as a default or a production check.

Since it sounds like an ELB bug I'll keep the 3 instances in 3 AZs default.
jlgaddis about 9 years ago
As a network engineer, I'm constantly having to prove that "it's not the network", so I love reading others' technical analyses of similar things. Great troubleshooting and technical detail in this write-up.
azundo about 9 years ago
We were told ELBs are explicitly not designed for long-running connections when we ran into this exact same issue, so know that you will always be working around this design constraint if you do long-running connections through ELBs.

There's another case that the article doesn't really discuss (though the evidence of it is in the beginning, when all connections drop simultaneously), where the ELB nodes themselves scale vertically at a particular threshold. I believe the setup described is still vulnerable to those scaling events.
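One way to live with that constraint (not from the article or the comment, just a sketch): cap the lifetime of each long-lived connection on the client side and reconnect cleanly, so no stream depends on a single ELB node for too long. The endpoint, the 300-second cap, and send_events() below are hypothetical placeholders.

```python
# Minimal sketch: recycle a long-lived connection before an assumed ELB
# lifetime limit. Endpoint, cap, and send_events() are placeholders.
import socket
import time

ENDPOINT = ("collector.example.com", 6666)  # hypothetical collector behind an ELB
MAX_CONN_AGE = 300                          # seconds; assumed safe upper bound

def send_events(sock):
    # Placeholder for whatever the agent actually streams.
    sock.sendall(b"event\n")

while True:
    sock = socket.create_connection(ENDPOINT)
    opened = time.monotonic()
    try:
        while time.monotonic() - opened < MAX_CONN_AGE:
            send_events(sock)
            time.sleep(1)
    finally:
        # Close cleanly and reconnect; the new connection may land on a
        # different, fresh ELB node.
        sock.close()
```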
djb_hackernews about 9 years ago
In general, if you are using ELBs you should have at least 2 instances per AZ or cross-zone load balancing enabled. I've seen this get teams several times.

The other thing to consider when deploying to the cloud with load balancers is to use an immutable architecture. Taking hosts out of service, updating them, and putting them back in service is a bit cumbersome at best and leaves you vulnerable to service outages.
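For reference, cross-zone load balancing (and connection draining) on a Classic ELB can be toggled through the API. A minimal boto3 sketch follows; the load balancer name and draining timeout are placeholders to adjust to your own setup.

```python
# Sketch: enable cross-zone load balancing and connection draining on a
# Classic ELB. The load balancer name and timeout are placeholders.
import boto3

elb = boto3.client("elb")

elb.modify_load_balancer_attributes(
    LoadBalancerName="my-collector-elb",  # hypothetical name
    LoadBalancerAttributes={
        "CrossZoneLoadBalancing": {"Enabled": True},
        "ConnectionDraining": {"Enabled": True, "Timeout": 300},
    },
)
```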
narsil about 9 years ago
We recently discovered that the NAT Gateway also terminates connections by issuing a RST packet when it receives the next packet for a connection that it believes to have timed out, effectively causing the new request to fail. The previous recommended approach of NATing in VPC was to use NAT instances, which sent FIN packets when the timeout was hit, cleanly closing the connection. That behavior was far better, since it indicated that a new request should re-connect first.

AWS Support indicated that this was a feature of the new NAT Gateways, even though it breaks outbound connections made by popular implementations such as the Requests python library's urllib3 connection pools. This is pretty unfortunate, and has been a roadblock in migrating to the NAT Gateways.
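One client-side mitigation, sketched below under the assumption that the failing calls go through a requests Session: mount an adapter with a urllib3 Retry policy so a request that dies on a stale, RST-killed pooled connection is retried on a fresh one (by default only idempotent methods are retried). The URL is a placeholder.

```python
# Sketch: retry requests that fail because a pooled connection was reset
# behind our back (e.g. by a NAT Gateway idle timeout). URL is a placeholder.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(total=2, connect=2, read=2, backoff_factor=0.5)
session.mount("https://", HTTPAdapter(max_retries=retries))

resp = session.get("https://api.example.com/health")  # hypothetical endpoint
print(resp.status_code)
```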
seliopou about 9 years ago
Somewhat unrelated to the ELB problem identified, but an alternative solution to the original deployment problem: assuming that the collectors are stateless (they seem to be), start off the deployment by spinning up a new collector with the new code installed. Then proceed with the deployment in the original fashion. Once that's over, kill the extra collector. This will ensure that load is distributed roughly in the same manner, over the same number of nodes, during the deployment as before the deployment. Depending on the load caused by initiating a connection, more than one extra node may be utilized. In any case, this is a much simpler approach than baking in application-level connection termination. All for a few extra bucks per deploy and a small amount of engineering time up front.
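If the collectors happen to run in an Auto Scaling group (an assumption, not something the comment states), the surge-then-shrink idea reduces to two desired-capacity changes around the normal deploy; the group name and the deploy step below are placeholders.

```python
# Sketch: add one surge collector for the duration of a deploy, assuming
# the collectors are in an Auto Scaling group. Group name is a placeholder.
import boto3

asg = boto3.client("autoscaling")
GROUP = "collector-asg"  # hypothetical ASG name

def current_desired(name):
    groups = asg.describe_auto_scaling_groups(AutoScalingGroupNames=[name])
    return groups["AutoScalingGroups"][0]["DesiredCapacity"]

desired = current_desired(GROUP)

# 1. Spin up one extra collector before touching anything
#    (MaxSize must allow desired + 1).
asg.set_desired_capacity(AutoScalingGroupName=GROUP, DesiredCapacity=desired + 1)

# 2. ...run the normal rolling deploy here (placeholder)...

# 3. Drop back to the original size once the deploy is done.
asg.set_desired_capacity(AutoScalingGroupName=GROUP, DesiredCapacity=desired)
```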
earless1 about 9 years ago
I don't really see a benefit in updating existing instances in this manner. Launching replacement instances with the new code is much easier for us, and it also provides a super fast means of rollback.
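With a Classic ELB, that replace-instead-of-update flow can be driven by registering the new instances and then deregistering the old ones (connection draining lets them finish in-flight work); rollback is the reverse. The load balancer name and instance IDs in this sketch are placeholders.

```python
# Sketch: swap new instances in behind a Classic ELB and drain the old
# ones out. Load balancer name and instance IDs are placeholders.
import boto3

elb = boto3.client("elb")
LB = "my-collector-elb"
new_ids = ["i-0aaa1111", "i-0aaa2222"]  # replacements running the new code
old_ids = ["i-0bbb1111", "i-0bbb2222"]  # kept around briefly for rollback

elb.register_instances_with_load_balancer(
    LoadBalancerName=LB,
    Instances=[{"InstanceId": i} for i in new_ids],
)

# Once the new instances pass health checks, drain the old ones out.
elb.deregister_instances_from_load_balancer(
    LoadBalancerName=LB,
    Instances=[{"InstanceId": i} for i in old_ids],
)
# Rollback: re-register old_ids and deregister new_ids.
```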
bhz about 9 years ago
We experienced something like this a long while ago, something like 4-5 years. We still employ our workaround, which is to have a tiny "keepalive" instance in each AZ in the ELB.
DanielDent about 9 years ago
When the cloud works as desired, life is grand.

But when it doesn't, debugging might actually be simpler with fewer black boxes between you and the metal.
jdreaver about 9 years ago
Hmm, this seems like a pretty big bug in connection draining. I feel like one instance per AZ is a pretty common scenario. Great article!
simonebrunozzi about 9 years ago
This is a great article to read.

The author mentions Wireshark - fun fact: the founder of Sysdig, Loris, is also the creator of Wireshark.
atomicbeanie about 9 years ago
Nice article.
stevesun21 about 9 years ago
Great work!
sivalingam about 9 years ago
Wonderfully debugged issue.