A self-killing web site requested by a customer (2011)

194 points by ciprian_craciun, about 5 years ago

11 comments

ciprian_craciun, about 5 years ago
I found it interesting because such a simple task (requiring at least a certain number of on-line servers before the load-balancer starts serving requests) required a custom binary that controlled the webserver and had to cross-monitor each server.

For example, with HAProxy (my favorite load-balancer and HTTP "router") this can be easily achieved by using `nbsrv`, creating an ACL, and only routing requests to the backend based on that ACL. Based on the documentation below:

* http://cbonte.github.io/haproxy-dconv/2.1/configuration.html#7.3.1-nbsrv
* http://cbonte.github.io/haproxy-dconv/2.1/configuration.html#4.2-monitor%20fail

One can write this:

    frontend www
        mode http
        acl site_alive nbsrv(dynamic) gt 2
        use_backend dynamic if site_alive

[This article was linked from the original article discussed at https://news.ycombinator.com/item?id=23099347.]
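
For context, nbsrv(dynamic) counts the servers in the `dynamic` backend that are currently passing health checks, and the second linked document covers `monitor fail`. A fuller sketch combining the two might look like this (the health-check URL, server addresses, and server count are assumptions for illustration, not taken from the comment or the article):

    frontend www
        bind *:80
        mode http
        acl site_alive nbsrv(dynamic) gt 2
        # Expose the same "enough servers are up" condition to any upstream
        # health checker, per the monitor fail documentation linked above.
        monitor-uri /site-status
        monitor fail if !site_alive
        # With no default_backend, requests arriving while site_alive is false
        # get an immediate 503 instead of overloading the surviving servers.
        use_backend dynamic if site_alive

    backend dynamic
        mode http
        option httpchk GET /health
        server web1 10.0.0.11:8080 check
        server web2 10.0.0.12:8080 check
        server web3 10.0.0.13:8080 check
        server web4 10.0.0.14:8080 check
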
[comment #23101922 not loaded]
[comment #23102177 not loaded]
nicbou, about 5 years ago
This is called a cascading failure. It's also a problem with the electric grid, and more terrifyingly with global finance.

https://en.wikipedia.org/wiki/Cascading_failure
[comment #23102160 not loaded]
[comment #23102043 not loaded]
[comment #23103586 not loaded]
ninkendo, about 5 years ago
Wait, why are the servers “crashing” when under too much load in the first place?

If there’s some sort of natural limit to how many simultaneous connections they can handle, why can’t they just return some 4xx error code for connections beyond that? (And have clients implement an exponential back-off?)

Or if that’s too difficult, the load balancer could keep track of some maximum number of connections (or even requests per second) each backend is capable of, and throttle (again with some 4xx error code) when the limit has been reached by all backends. This is pretty basic functionality for load balancers.

You’re going to need actual congestion control anyway, when the number of client connections is unbounded like this. Even when your servers aren’t crashing, what if the client apps whose clicks you’re tracking suddenly become more popular and you can’t handle the load even with all of your servers up?
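
A rough HAProxy sketch of that kind of admission control, since HAProxy is the balancer discussed in the top comment (the backend name, server addresses, the 800-connection budget, and the choice of 429 are all assumptions for illustration):

    frontend www
        bind *:80
        mode http
        # Reject new requests outright once the backend is at its connection
        # budget, instead of queueing them until the servers tip over.
        http-request deny deny_status 429 if { be_conn(trackers) ge 800 }
        default_backend trackers

    backend trackers
        mode http
        # Cap each server at roughly what it can actually handle; excess
        # connections queue at the balancer rather than on the server.
        server web1 10.0.0.11:8080 check maxconn 200
        server web2 10.0.0.12:8080 check maxconn 200
        server web3 10.0.0.13:8080 check maxconn 200
        server web4 10.0.0.14:8080 check maxconn 200
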
[comment #23101790 not loaded]
[comment #23104655 not loaded]
[comment #23102461 not loaded]
MaxBarraclough, about 5 years ago
> The load now rebalanced to four remaining machines is just far too big, and they all die as a result.

Perhaps I'm missing something terribly obvious here, but why would that happen?

I can understand requests being dropped and processing times worsening, but a full system-wide crash?

Edit: My bad, I'd missed this in the article:

> they could have rewritten their web site code so it didn't send the machines into a many-GB-deep swap fest. They could have done that without getting any hosting people involved. They didn't, and so now I have a story
[comment #23102213 not loaded]
[comment #23101934 not loaded]
[comment #23101955 not loaded]
[comment #23102054 not loaded]
klausjensen, about 5 years ago
Lovely hack, and an example of how thinking outside the box can create solutions that are an order of magnitude cheaper than the "obvious" solution.
londons_explore, about 5 years ago
A better solution would be to simply configure the load balancer to have a maximum number of requests per second per endpoint and then to drop any requests over that.

An even better load balancer will poll a load endpoint, representing CPU load, queue length, percentage of time GC'ing, or some similar metric, and scale back requests as that metric gets too high.
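
In HAProxy terms (keeping to the balancer from the top comment), the first idea can be sketched with a per-backend rate check, and the second roughly corresponds to agent checks; the backend name, the 500 req/s ceiling, and the agent port are assumptions for illustration:

    frontend www
        mode http
        # Shed load once the backend's request rate exceeds what it is assumed
        # to sustain, instead of letting the queue grow without bound.
        http-request deny deny_status 429 if { be_sess_rate(dynamic) gt 500 }
        default_backend dynamic

    backend dynamic
        mode http
        # agent-check polls a side-channel "load endpoint" on each server; the
        # agent can report a reduced weight (or drain/down) as load climbs.
        server web1 10.0.0.11:8080 check agent-check agent-port 9700 agent-inter 5s
        server web2 10.0.0.12:8080 check agent-check agent-port 9700 agent-inter 5s
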
[comment #23102522 not loaded]
mgkimsal, about 5 years ago
Was it just a cost thing that would prevent people from just adding another server into the mix? Given that 4 was the magic number, add another server or two to add buffer to the time between servers dying and 'it all breaks'? I'm realizing the cost factor may have been it, depending on size/location/etc. - would there be any other reason?
[comment #23102069 not loaded]
[comment #23104187 not loaded]
hangonhn, about 5 years ago
Oh boy! I had a similar cascading failure situation once with a Nagios "cluster" I inherited. The previous engineer distributed the work between a master and 3 slave nodes, with a backup mechanism such that if any of the slaves died, its load would go to the master. This was fine when he first created it, but as more slaves were added, the master ended up running at capacity just dealing with the incoming data. So with each additional slave node, the probability of one of them failing and sending its load to overwhelm the master increased. Sometimes a poorly designed distributed system is worse than a single big server.

I ended up leveraging Consul to do leader election (only for the alerting bit) and to monitor the health of all the nodes in the cluster. If one of them failed, the load was redistributed equally among the remaining nodes.
rjkennedy98, about 5 years ago
HA is definitely super tricky. Not many products do it well. For instance, one of the last NoSQL databases I used was quicker to restart than its failover was to be detected, so during an upgrade the DBAs would just restart the cluster instead of waiting for failover to kick in.
jrockway, about 5 years ago
There is actually quite a bit of complexity with load balancing, but the good news is that a lot of the complexity is understood and is configurable on the load balancer.

I think what Rachel calls a "suicide pact" is now commonly called a circuit breaker. After a certain number of requests fail, the load balancer simply removes all the backends for a certain period of time, and causes all requests to immediately fail. This attempts to mitigate the cascading failure by simply isolating the backend from the frontend for a period of time. If you have something like a "stateless" web app that shares a database with the other replicas, and the database stops working, this is exactly what you want. No replica will be able to handle the request, so don't send it to any replica.

Another option to look into is the balancer's "panic threshold". Normally your load balancer will see which backends are healthy, and only route requests to those. That is what the load balancer in the article did, and the effect was that it overloaded the other backends to the point of failure (and this is a somewhat common failure mode). With a panic threshold set, when that many backends become unhealthy, the balancer stops using health as a routing criterion. It will knowingly send some requests to an unhealthy backend. This means that the healthy backends will receive traffic load that they can handle, so at least (healthy/total)% of requests will be handled successfully (instead of causing a cascading failure).

Finally, other posts mention a common case like running ab against apache/mysql/php on a small machine. The OOM killer eventually kicks in and starts killing things. Luckily, people are also more careful on that front now. Envoy, for example, has the overload manager, so you can configure exactly how much memory you are going to use, and what happens when you get close to the limits. For my personal site, I use 64M of RAM for Envoy, and when it gets to 99% of that, it just stops accepting new connections. This sucks, of course, but it's better than getting OOM-killed entirely. (A real website would probably want to give it more than 64M of RAM, but with my benchmarking I can't get anywhere close with 8000 requests/second going through it... and I'm never going to see that kind of load.)

I guess the TL;DR is that in 2011 it sounded scary to have a "suicide pact" but now it's normal. Sometimes you've got to kill yourself to save others. If you're a web app, that is.
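
HAProxy (the balancer from the top comment) has no direct equivalent of Envoy's panic threshold or overload manager, but the fail-fast half of the circuit-breaker idea can be approximated; this is an assumed sketch of that, not the Envoy configuration described above:

    frontend www
        mode http
        # If no server in the backend is passing health checks (for example,
        # the shared database is down), answer 503 immediately rather than
        # queueing requests that no replica can serve anyway.
        acl no_healthy_servers nbsrv(dynamic) lt 1
        http-request deny deny_status 503 if no_healthy_servers
        default_backend dynamic
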
[comment #23102468 not loaded]
Random_ernest, about 5 years ago
I am not a webdev, but isn't that a task for the load balancer in the first place?
[comment #23102243 not loaded]