This reminds me strongly of another distribution system that suffers from the same effect: differential gearboxes in cars.<p>Because an open differential delivers equal torque to every wheel, and power = rotational velocity * torque, a slipping wheel very quickly soaks up almost all of the engine's power. Only when the wheel speeds are similar is the power allocated similarly.<p>In cars, the solution is to keep power distributed fairly evenly even when a wheel loses traction. Limited-slip differentials used to do this with a viscous coupling (essentially very thick fluid) that limits the difference in speeds - more recently, complex gearboxes and traction control achieve the same thing by constantly monitoring each wheel's speed and braking any wheel that is spinning out of control.<p>It's interesting how such disparate distribution systems seem to have such similar failure modes. I wonder if the two sides could learn something from each other here.
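A back-of-the-envelope version of that effect (all numbers made up, and real drivetrains have losses I'm ignoring):<p><pre><code>
# Open differential: both output shafts see roughly the same torque,
# but power = torque * angular velocity, so the faster-spinning
# (slipping) wheel soaks up most of the engine's power.

torque_per_wheel = 50.0       # N*m, equal on both sides

gripping_wheel_speed = 10.0   # rad/s, wheel with traction
slipping_wheel_speed = 100.0  # rad/s, wheel spinning on ice

p_grip = torque_per_wheel * gripping_wheel_speed   # 500 W
p_slip = torque_per_wheel * slipping_wheel_speed   # 5000 W

share = p_slip / (p_grip + p_slip)
print(f"slipping wheel takes {share:.0%} of the power")  # ~91%
</code></pre>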
And that's why you always set health checks on servers behind a load balancer, so the balancer takes bad ones out of rotation as soon as possible. You then get another interesting problem: if the server only spits errors when it's actually serving requests, then once it's pulled out it looks healthy again, gets re-added, and toggles between healthy and unhealthy forever. But that you can also solve.
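One common way to solve the flapping is hysteresis: require several consecutive failed checks before pulling a server out, and several consecutive passing checks before putting it back. A minimal sketch of the idea, with made-up thresholds rather than any particular load balancer's API:<p><pre><code>
# A backend must fail UNHEALTHY_AFTER checks in a row to leave the pool,
# and pass HEALTHY_AFTER checks in a row to rejoin it, so a flaky box
# doesn't bounce in and out on every check cycle.

UNHEALTHY_AFTER = 3
HEALTHY_AFTER = 5

class BackendHealth:
    def __init__(self):
        self.in_pool = True
        self.fail_streak = 0
        self.ok_streak = 0

    def record_check(self, passed: bool) -> bool:
        """Update streaks; return whether the backend should be in the pool."""
        if passed:
            self.ok_streak += 1
            self.fail_streak = 0
            if not self.in_pool and self.ok_streak >= HEALTHY_AFTER:
                self.in_pool = True
        else:
            self.fail_streak += 1
            self.ok_streak = 0
            if self.in_pool and self.fail_streak >= UNHEALTHY_AFTER:
                self.in_pool = False
        return self.in_pool
</code></pre>
Most real load balancers expose the same knob as healthy/unhealthy threshold settings (HAProxy calls them rise and fall).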
At a real estate webhost in 2014, we had a small web farm behind a single load balancer. I've written in previous HN posts about the interesting architecture choices made by the lead architect. On top of that, it was the days of "move fast and break things", so developers got admin access to servers and would develop against live sites. Fun times trying to keep a web farm online all night long.<p>Partly because it was such a small operation, we heavily instrumented the web servers with PRTG, and also hit a number of key sites every minute on each web server. "When XYZRealty goes down, so do all of these other sites!" "We'll put a sensor on XYZRealty."<p>This gave us great data about the health of the servers, including identifying bad apples, and even helped with performance testing of new modules. We were able to catch memory leaks and processing spikes before they broke our sites. And when 64-bit modules were ready to replace the 32-bit ones, we had baseline data ready to compare against.<p>Not that this won't scale - quite the contrary - though it does generate a lot of data and requires dedication to maintain.
I see that on my continuous integration system. We use TeamCity with ~50 agents for build tasks that take 20-30 minutes. Each agent can only run a single task at a time.<p>During the day, all agents are busy and the queue fills up with 30 or so pending tasks.<p>If one agent gets into a bad state where, say, it fails to check out from source control and fails within the first 20 seconds of a build, it will very quickly chew through the entire queue of pending tasks, failing them all.<p>You'd think the more agents you have, the better insulated you are from the failure of a single one, but this particular failure mode actually becomes more common the more agents you add!
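A toy simulation of that failure mode, with numbers close to the ones above (all invented, and the model is much dumber than TeamCity's real scheduler):<p><pre><code>
import heapq

# Toy model: 50 agents, one of them (agent 0) broken so its "builds" fail
# after 20 seconds; healthy builds take 25 minutes. Every agent is partway
# through a build, and whichever agent frees up first grabs the next of the
# 30 queued jobs.

NUM_AGENTS = 50
QUEUE_SIZE = 30
GOOD_BUILD_S = 25 * 60
BAD_BUILD_S = 20

# (time_when_free, agent_id); healthy agents finish at evenly spread times,
# the broken agent "finishes" (fails) in 20 seconds.
agents = [(i * GOOD_BUILD_S / (NUM_AGENTS - 1), i) for i in range(1, NUM_AGENTS)]
agents.append((float(BAD_BUILD_S), 0))
heapq.heapify(agents)

eaten = 0
for _ in range(QUEUE_SIZE):
    t, agent = heapq.heappop(agents)          # next agent to become free
    if agent == 0:
        eaten += 1
    duration = BAD_BUILD_S if agent == 0 else GOOD_BUILD_S
    heapq.heappush(agents, (t + duration, agent))

print(f"the broken agent took {eaten} of the {QUEUE_SIZE} queued builds")
# -> 18 of 30 with these numbers, from a single bad agent out of 50
</code></pre>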
To summarize: “implement health checks to take bad servers out of the service pool, always!”<p>Love the very didactic way of writing. Perfect for managers, not interns!
That's why I like random load balancing. If each machine is powerful enough to handle a few thousand users, the distribution averages out.<p>Smart load balancers are only really necessary if you have inefficient servers that can't handle more than 100 or so connections per second, and they're difficult to get right.<p>It's the same reason that if you toss a coin 10 times, you're much more likely to get >=80% heads than if you toss that coin 1000 times.
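The coin-toss claim is easy to check with the exact binomial tail - a quick sketch:<p><pre><code>
from math import ceil, comb

def prob_at_least(n: int, frac: float, p: float = 0.5) -> float:
    """Exact probability of getting at least frac*n heads in n fair tosses."""
    k_min = ceil(frac * n)
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

print(f"10 tosses,   P(>=80% heads) = {prob_at_least(10, 0.8):.4%}")    # ~5.5%
print(f"1000 tosses, P(>=80% heads) = {prob_at_least(1000, 0.8):.1e}")  # astronomically small
</code></pre>
Same law of large numbers applies to random request assignment: with enough users per server, each server's load clusters tightly around the average.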
Not sure how relevant this is to Rachel's incident, but it looks like they use ECMP/BGP -> shiv (L4) -> proxygen (L7), so it's hard to believe that health checking wasn't in the mix. If the nodes were passing the health check but still not properly serving requests, then I'd assume one of the post-mortem items would have been improving the health checks.<p>Found this pretty cool presentation/PDF about FB's load-balancing architecture. Stays fairly high level: <a href="http://www.esnog.net/gore16/gore16-files/Mikel_Billion_user.pdf" rel="nofollow">http://www.esnog.net/gore16/gore16-files/Mikel_Billion_user....</a>
Great "teaching by showing" example but I don't get the initial technical scenario; you have sophisticated (ie, not RR / random) load balancers that keep track of queues in web servers, so they have to get some information back from them (when job was completed, from HTTP response for ex), but somehow don't react to 500 errors? seems like badly configured LBs. Still scenario can be used as teaching or interview question.
Maybe some people dismissed her problem as 'impossible' because she didn't say what kind of load-balancing technique the load balancer was using?<p>> "I suspect what happened is that they didn't understand the problem, and so resorted to ineffective means to steer the attention away from their own inadequacies."<p>^ This statement is kinda harsh.
> "some of the commenters dismissed either it ("impossible") or me ("doesn't know anything about load balancing"). "<p>This to me is odd. Was this posted when AWS's ELBs were new and shiny?<p>Most of the big failure cases I've dealt with are along these lines.<p>One server does something stupid and gobbles up the world.<p>That being said, this is a very neat way of describing the problem, I shall be referencing this in the future, I might start putting that in an interview question....
>and so on down the line until everyone had bread. At various points, a toaster would finish and would pop up. ... I'd notice that they were done and would run over to give them more toast.<p>You'd give them more bread, not toast. Toast is already processed bread :).
A great story, but I'm distracted by the word <i>caromed.</i> How did I not know about that word? Now I feel the need to inject it into my active lexicon.
Who is Rachel and how do her short and simple stories always hit the front page?<p>They're interesting but I always think they're a little _too_ simple. I mean, this entire thing can be summed up as:<p><i>500s (and other errors) are returned faster than successfully processed requests, so load balancers will find a misbehaving server's queue empty more often and give it all the requests.</i>
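That summary is also easy to demonstrate with a toy "send the next request to the least-busy server" model - nine servers that take 100 ms per request and one that fails in 1 ms (numbers invented):<p><pre><code>
import heapq

# Assign each request to whichever backend will be free soonest. The broken
# backend (id 0) "serves" a request in 1 ms by throwing a 500; healthy ones
# take 100 ms, so the broken one is free ~100x more often.

NUM_BACKENDS = 10
GOOD_MS = 100.0
BAD_MS = 1.0
REQUESTS = 10_000

backends = [(0.0, i) for i in range(NUM_BACKENDS)]   # (time_when_free, id)
heapq.heapify(backends)

served = [0] * NUM_BACKENDS
for _ in range(REQUESTS):
    free_at, b = heapq.heappop(backends)             # least-busy backend
    served[b] += 1
    cost = BAD_MS if b == 0 else GOOD_MS
    heapq.heappush(backends, (free_at + cost, b))

print(f"broken backend got {served[0] / REQUESTS:.0%} of all requests")
# -> about 92% here: exactly the "one bad server eats the traffic" effect
</code></pre>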
>When it landed on certain web sites, some of the commenters dismissed either it ("impossible") or me ("doesn't know anything about load balancing").<p>Does anyone have a link to what she is talking about?
> This is what happened when one bad web server decided it was going to fail all of its requests, and would do so while incurring the absolute minimum amount of load on itself.<p>Good ELI5 explanation, but it doesn't really explain why the webserver failed the requests as it did. Or maybe I'm missing something?
I don't really see how this is an issue engineers need to be particularly wary of?<p>Firstly, your typical load balancer doesn't work this way anyway. It will just keep feeding requests to the application hosts on a round robin or random basis. Most don't keep track of how busy each instance is.<p>Secondly, any decent (HTTP/layer 7) load balancer will notice if an instance is returning exclusively 5xx errors and will stop routing requests to it. Would fail even the most basic of health checks.
Based on how verbose this article is, I think maybe her original article was misunderstood because it was a chore to sort out the junk from the content, just like this one.<p>Too much flourishing, poor pacing.