This reminds me strongly of another distribution system that suffers from the same effect: differential gearboxes in cars.<p>Because an open differential delivers equal torque to every wheel, and power = rotational velocity * torque, a slipping wheel very quickly soaks up almost all of the engine's power. Only when the wheel speeds are similar is the power allocated similarly.<p>In cars, the solution is to keep power distributed fairly evenly even when a wheel loses traction. Limited-slip differentials used to do this with a viscous coupling (essentially very thick fluid) that limits the difference in speeds - more recently, complex gearboxes and traction control achieve the same thing by constantly monitoring each wheel's speed and braking any wheel that is spinning out of control.<p>It's interesting how such disparate distribution systems seem to have such similar failure modes. I wonder if the two sides could learn something from each other here.
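A back-of-the-envelope version of that effect (all numbers made up, and real drivetrains have losses I'm ignoring):<p><pre><code>
# Open differential: both output shafts see roughly the same torque,
# but power = torque * angular velocity, so the faster-spinning
# (slipping) wheel soaks up most of the engine's power.

torque_per_wheel = 50.0       # N*m, equal on both sides

gripping_wheel_speed = 10.0   # rad/s, wheel with traction
slipping_wheel_speed = 100.0  # rad/s, wheel spinning on ice

p_grip = torque_per_wheel * gripping_wheel_speed   # 500 W
p_slip = torque_per_wheel * slipping_wheel_speed   # 5000 W

share = p_slip / (p_grip + p_slip)
print(f"slipping wheel takes {share:.0%} of the power")  # ~91%
</code></pre>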
And that's why you always set health checks on servers behind a load balancer, so the balancer takes bad ones out of rotation as soon as possible. You then get another interesting problem: if the server only spits errors when it's actually serving requests, then once it's pulled out it looks healthy again, gets re-added, and toggles between healthy and unhealthy forever. But that you can also solve.
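One common way to solve the flapping is hysteresis: require several consecutive failed checks before pulling a server out, and several consecutive passing checks before putting it back. A minimal sketch of the idea, with made-up thresholds rather than any particular load balancer's API:<p><pre><code>
# A backend must fail UNHEALTHY_AFTER checks in a row to leave the pool,
# and pass HEALTHY_AFTER checks in a row to rejoin it, so a flaky box
# doesn't bounce in and out on every check cycle.

UNHEALTHY_AFTER = 3
HEALTHY_AFTER = 5

class BackendHealth:
    def __init__(self):
        self.in_pool = True
        self.fail_streak = 0
        self.ok_streak = 0

    def record_check(self, passed: bool) -> bool:
        """Update streaks; return whether the backend should be in the pool."""
        if passed:
            self.ok_streak += 1
            self.fail_streak = 0
            if not self.in_pool and self.ok_streak >= HEALTHY_AFTER:
                self.in_pool = True
        else:
            self.fail_streak += 1
            self.ok_streak = 0
            if self.in_pool and self.fail_streak >= UNHEALTHY_AFTER:
                self.in_pool = False
        return self.in_pool
</code></pre>
Most real load balancers expose the same knob as healthy/unhealthy threshold settings (HAProxy calls them rise and fall).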
At a real estate webhost in 2014, we had a small web farm behind a single load balancer. I've written in previous HN posts about the interesting architecture choices made by the lead architect. On top of that, it was the days of "move fast and break things", so developers got admin access to servers and would develop against live sites. Fun times trying to keep a web farm online all night long.<p>Partly because it was such a small operation, we heavily instrumented the web servers with PRTG, and also hit a number of key sites every minute on each web server. "When XYZRealty goes down, so do all of these other sites!" "We'll put a sensor on XYZRealty."<p>This gave us great data about the health of the servers, including identifying bad apples, and even helped with performance testing of new modules. We were able to catch memory leaks and processing spikes before they broke our sites. And when 64-bit modules were ready to replace the 32-bit ones, we had baseline data ready to compare against.<p>Not that this won't scale - quite the contrary - though it does generate a lot of data and requires dedication to maintain.
I see that on my continuous integration system. We use TeamCity with ~50 agents for build tasks that take 20-30 minutes. Each agent can only run a single task at a time.<p>During the day, all agents are busy and the queue fills up with 30 or so pending tasks.<p>If one agent gets into a bad state where, say, it fails to check out from source control and fails within the first 20 seconds of a build, it will very quickly chew through the entire queue of pending tasks, failing them all.<p>You'd think the more agents you have, the better insulated you are from the failure of a single one, but this particular failure mode actually becomes more common the more agents you add!
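A toy simulation of that failure mode, with numbers close to the ones above (all invented, and the model is much dumber than TeamCity's real scheduler):<p><pre><code>
import heapq

# Toy model: 50 agents, one of them (agent 0) broken so its "builds" fail
# after 20 seconds; healthy builds take 25 minutes. Every agent is partway
# through a build, and whichever agent frees up first grabs the next of the
# 30 queued jobs.

NUM_AGENTS = 50
QUEUE_SIZE = 30
GOOD_BUILD_S = 25 * 60
BAD_BUILD_S = 20

# (time_when_free, agent_id); healthy agents finish at evenly spread times,
# the broken agent "finishes" (fails) in 20 seconds.
agents = [(i * GOOD_BUILD_S / (NUM_AGENTS - 1), i) for i in range(1, NUM_AGENTS)]
agents.append((float(BAD_BUILD_S), 0))
heapq.heapify(agents)

eaten = 0
for _ in range(QUEUE_SIZE):
    t, agent = heapq.heappop(agents)          # next agent to become free
    if agent == 0:
        eaten += 1
    duration = BAD_BUILD_S if agent == 0 else GOOD_BUILD_S
    heapq.heappush(agents, (t + duration, agent))

print(f"the broken agent took {eaten} of the {QUEUE_SIZE} queued builds")
# -> 18 of 30 with these numbers, from a single bad agent out of 50
</code></pre>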
To summarize: “implement health checks to take bad servers out of the service pool, always!”<p>Love the very didactic way of writing. Perfect for managers, not interns!
That's why I like random load balancing. If each machine is powerful enough to handle a few thousand users, the distribution averages out.<p>Smart load balancers are only really necessary if you have inefficient servers that can't handle more than 100 or so connections per second, and they're difficult to get right.<p>It's the same reason that if you toss a coin 10 times, you're much more likely to get >=80% heads than if you toss that coin 1000 times.
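The coin-toss claim is easy to check with the exact binomial tail - a quick sketch:<p><pre><code>
from math import ceil, comb

def prob_at_least(n: int, frac: float, p: float = 0.5) -> float:
    """Exact probability of getting at least frac*n heads in n fair tosses."""
    k_min = ceil(frac * n)
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

print(f"10 tosses,   P(>=80% heads) = {prob_at_least(10, 0.8):.4%}")    # ~5.5%
print(f"1000 tosses, P(>=80% heads) = {prob_at_least(1000, 0.8):.1e}")  # astronomically small
</code></pre>
Same law of large numbers applies to random request assignment: with enough users per server, each server's load clusters tightly around the average.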
Not sure how relevant this is to Rachel's incident, but it looks like they use ECMP/BGP -> shiv (L4) -> proxygen (L7), so it's hard to believe that health checking wasn't in the mix. If the nodes were passing the health check but still not properly serving requests, then I'd assume one of the post-mortem items would have been improving the health checks.<p>Found this pretty cool presentation/PDF about FB's load-balancing architecture. Stays fairly high level: <a href="http://www.esnog.net/gore16/gore16-files/Mikel_Billion_user.pdf" rel="nofollow">http://www.esnog.net/gore16/gore16-files/Mikel_Billion_user....</a>
Great "teaching by showing" example but I don't get the initial technical scenario; you have sophisticated (ie, not RR / random) load balancers that keep track of queues in web servers, so they have to get some information back from them (when job was completed, from HTTP response for ex), but somehow don't react to 500 errors? seems like badly configured LBs. Still scenario can be used as teaching or interview question.
Maybe some people dismissed her problem as 'impossible' because she didn't say what kind of load-balancing technique the load balancer was using?<p>> "I suspect what happened is that they didn't understand the problem, and so resorted to ineffective means to steer the attention away from their own inadequacies."<p>^ This statement is kinda harsh.
> "some of the commenters dismissed either it ("impossible") or me ("doesn't know anything about load balancing"). "<p>This to me is odd. Was this posted when AWS's ELBs were new and shiny?<p>Most of the big failure cases I've dealt with are along these lines.<p>One server does something stupid and gobbles up the world.<p>That being said, this is a very neat way of describing the problem, I shall be referencing this in the future, I might start putting that in an interview question....
>and so on down the line until everyone had bread. At various points, a toaster would finish and would pop up. ... I'd notice that they were done and would run over to give them more toast.<p>You'd give them more bread, not toast. Toast is already processed bread :).
A great story, but I'm distracted by the word <i>caromed.</i> How did I not know about that word? Now I feel the need to inject it into my active lexicon.
Who is Rachel and how do her short and simple stories always hit the front page?<p>They're interesting but I always think they're a little _too_ simple. I mean, this entire thing can be summed up as:<p><i>500s (and other errors) are returned faster than successfully processed requests, so load balancers will find a misbehaving server's queue empty more often and give it all the requests.</i>
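That summary is also easy to demonstrate with a toy "send the next request to the least-busy server" model - nine servers that take 100 ms per request and one that fails in 1 ms (numbers invented):<p><pre><code>
import heapq

# Assign each request to whichever backend will be free soonest. The broken
# backend (id 0) "serves" a request in 1 ms by throwing a 500; healthy ones
# take 100 ms, so the broken one is free ~100x more often.

NUM_BACKENDS = 10
GOOD_MS = 100.0
BAD_MS = 1.0
REQUESTS = 10_000

backends = [(0.0, i) for i in range(NUM_BACKENDS)]   # (time_when_free, id)
heapq.heapify(backends)

served = [0] * NUM_BACKENDS
for _ in range(REQUESTS):
    free_at, b = heapq.heappop(backends)             # least-busy backend
    served[b] += 1
    cost = BAD_MS if b == 0 else GOOD_MS
    heapq.heappush(backends, (free_at + cost, b))

print(f"broken backend got {served[0] / REQUESTS:.0%} of all requests")
# -> about 92% here: exactly the "one bad server eats the traffic" effect
</code></pre>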
>When it landed on certain web sites, some of the commenters dismissed either it ("impossible") or me ("doesn't know anything about load balancing").<p>Does anyone have a link to what she is talking about?
> This is what happened when one bad web server decided it was going to fail all of its requests, and would do so while incurring the absolute minimum amount of load on itself.<p>Good ELI5 explanation, but it doesn't really explain why the webserver failed the requests as it did. Or maybe I'm missing something?
I don't really see how this is an issue engineers need to be particularly wary of?<p>Firstly, your typical load balancer doesn't work this way anyway. It will just keep feeding requests to the application hosts on a round robin or random basis. Most don't keep track of how busy each instance is.<p>Secondly, any decent (HTTP/layer 7) load balancer will notice if an instance is returning exclusively 5xx errors and will stop routing requests to it. Would fail even the most basic of health checks.
Based on how verbose this article is, I think maybe her original article was misunderstood because it was a chore to sort out the junk from the content, just like this one.<p>Too much flourishing, poor pacing.