Great article. The Sysdig team really knows how to root-cause tough problems, and the Sysdig tools can be invaluable for gathering and making sense of low-level data.<p>If you want to play with ELBs, rolling deploys, and connection draining to ECS containers, I humbly submit the open source Convox project I am working on.<p><a href="https://github.com/convox/rack" rel="nofollow">https://github.com/convox/rack</a><p>It sets up a peer-reviewed, production-tested, batteries-included VPC, ECS, ASG, ELB, etc. cluster in minutes.<p>If the conclusion of this Sysdig post were that you always need to run 2 instances per AZ for the best reliability, I would strongly consider adding that knowledge to the tools, either as a default or as a production check.<p>Since it sounds like an ELB bug, I'll keep the 3-instances-in-3-AZs default.
As a network engineer, I'm constantly having to prove that "it's not the network" so I love reading others' technical analyses of similar things. Great troubleshooting and technical detail in this write-up.
We were told that ELBs are explicitly not designed for long-running connections when we ran into this exact same issue, so know that you will always be working around this design constraint if you keep long-running connections open through ELBs.<p>There's another case that the article doesn't really discuss (though the evidence of it is there at the beginning, when all connections drop simultaneously): the ELB nodes themselves scale vertically at a particular threshold. I believe the setup described is still vulnerable to those scaling events.
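If you do have to keep long-lived connections through an ELB, the practical consequence is that clients need to treat disconnects as routine. A minimal sketch of that idea in Python, assuming a hypothetical collector endpoint and an application protocol that can resume after reconnecting:

```python
import socket
import time

# Hypothetical endpoint; the point is the reconnect loop, not the protocol.
COLLECTOR_HOST = "collector-elb.example.com"
COLLECTOR_PORT = 6666


def run_with_reconnect(stream, max_backoff=60):
    """Keep a long-lived connection through the ELB, reconnecting with
    exponential backoff whenever the ELB (or anything else) drops it."""
    backoff = 1
    while True:
        try:
            with socket.create_connection(
                (COLLECTOR_HOST, COLLECTOR_PORT), timeout=10
            ) as sock:
                backoff = 1          # reset once we are connected again
                stream(sock)         # blocks until the connection dies
        except OSError as exc:
            print(f"connection dropped ({exc}); retrying in {backoff}s")
            time.sleep(backoff)
            backoff = min(backoff * 2, max_backoff)
```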
In general, if you are using ELBs you should have at least 2 instances per AZ or cross-zone load balancing enabled. I've seen this bite teams several times.<p>The other thing to consider when deploying to the cloud behind load balancers is to use an immutable architecture. Taking hosts out of service, updating them, and putting them back in service is cumbersome at best and leaves you vulnerable to service outages.
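For reference, cross-zone load balancing on a classic ELB is a single attribute change. A sketch with boto3, using a placeholder load balancer name and region:

```python
import boto3

# Placeholder load balancer name and region; classic ELB API.
elb = boto3.client("elb", region_name="us-east-1")

# Enable cross-zone load balancing so requests are spread across all
# registered instances regardless of which AZ the ELB node sits in.
elb.modify_load_balancer_attributes(
    LoadBalancerName="my-collector-elb",
    LoadBalancerAttributes={
        "CrossZoneLoadBalancing": {"Enabled": True},
    },
)
```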
We recently discovered that the NAT Gateway also terminates connections by issuing a RST packet when it receives the next packet for a connection that it believes to have timed out, effectively causing the new request to fail. The previous recommended approach of NATing in VPC was to use NAT instances, which sent FIN packets when the timeout was hit, cleanly closing the connection. That behavior was far better, since it indicated that a new request should re-connect first.<p>AWS Support indicated that this was a feature of the new NAT Gateways, even though it breaks outbound connections made by popular implementations such as the Requests python library's urllib3 connection pools. This is pretty unfortunate, and has been a roadblock in migrating to the NAT Gateways.
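Until that behavior changes, one client-side mitigation is to retry requests that blow up on a stale pooled connection. A rough sketch with Requests, assuming a hypothetical endpoint (by default urllib3 only retries idempotent methods, which is the safe choice here):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry requests that fail because a pooled connection was silently
# invalidated (e.g. an RST from the NAT Gateway after its idle timeout).
retries = Retry(total=3, connect=3, read=3, backoff_factor=0.5)

session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retries))
session.mount("http://", HTTPAdapter(max_retries=retries))

# Hypothetical endpoint reached through the NAT Gateway.
resp = session.get("https://api.example.com/health", timeout=10)
```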
Somewhat unrelated to the ELB problem identified, but an alternative solution to the original deployment problem: assuming the collectors are stateless (they seem to be), start the deployment by spinning up a new collector with the new code installed. Then proceed with the deployment in the original fashion. Once that's over, kill the extra collector. This ensures that load is distributed roughly the same way, over the same number of nodes, during the deployment as before it. Depending on the load caused by initiating a connection, more than one extra node may be needed. In any case, this is a much simpler approach than baking in application-level connection termination, all for a few extra bucks per deploy and a small amount of engineering time up front.
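If the collectors sit in an Auto Scaling group, the "one extra node" trick is just a desired-capacity bump around the deploy. A sketch with boto3, with "collector-asg" standing in for whatever the real group is called:

```python
import boto3

# "collector-asg" stands in for the real Auto Scaling group name.
autoscaling = boto3.client("autoscaling", region_name="us-east-1")
ASG_NAME = "collector-asg"


def desired_capacity(name):
    group = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[name]
    )["AutoScalingGroups"][0]
    return group["DesiredCapacity"]


original = desired_capacity(ASG_NAME)

# Spin up one extra collector before the rolling update starts
# (the ASG's MaxSize must allow for the +1).
autoscaling.set_desired_capacity(
    AutoScalingGroupName=ASG_NAME, DesiredCapacity=original + 1
)

# ... run the rolling deployment as before ...

# Kill the extra collector once every original instance is updated.
autoscaling.set_desired_capacity(
    AutoScalingGroupName=ASG_NAME, DesiredCapacity=original
)
```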
I don't really see a benefit in updating existing instances in this manner. Launching replacement instances with the new code is much easier for us, and it also provides a super fast means of rollback.
We experienced something like this a long while ago, 4-5 years back. We still employ our workaround, which is to have a tiny "keepalive" instance in each AZ in the ELB.
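For the curious, that workaround amounts to registering one tiny, always-healthy instance per AZ with the classic ELB so no zone ever goes empty. A sketch with boto3, with placeholder instance IDs and load balancer name:

```python
import boto3

# Placeholder instance IDs (one tiny instance per AZ) and LB name.
elb = boto3.client("elb", region_name="us-east-1")

elb.register_instances_with_load_balancer(
    LoadBalancerName="my-collector-elb",
    Instances=[
        {"InstanceId": "i-0aaaaaaaaaaaaaaaa"},  # keepalive in us-east-1a
        {"InstanceId": "i-0bbbbbbbbbbbbbbbb"},  # keepalive in us-east-1b
        {"InstanceId": "i-0cccccccccccccccc"},  # keepalive in us-east-1c
    ],
)
```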
When the cloud works as desired, life is grand.<p>But when it doesn't, debugging might actually be simpler with fewer black boxes between you and the metal.