
Reining in the thundering herd: Getting to 80% CPU utilization with Django

160 points | by domino | almost 4 years ago

20 comments

stingraycharles almost 4 years ago
Tangent, but I always had a different understanding of the “thundering herd” problem; that is, if a service is down for whatever reason, and it’s brought back online, it immediately grinds to a halt again because there are a bazillion requests waiting to be handled.

And the solution to this problem is to bring the service back online slowly, in a rate-limited way, rather than letting the whole thundering herd through the door immediately.
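This reading of the thundering herd (every waiting client stampeding the service the moment it comes back) is usually tamed on the client side with retries that back off exponentially and add jitter, so a recovering service sees a trickle of traffic instead of a wall. A minimal sketch in Python, assuming `call_service` is whatever flaky call you want to protect:

```python
import random
import time


def call_with_backoff(call_service, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry a flaky call with capped exponential backoff plus full jitter."""
    for attempt in range(max_attempts):
        try:
            return call_service()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the capped exponential delay,
            # so retries from many clients spread out instead of arriving together.
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))
```

Server-side, the same idea shows up as rate-limited admission or slow-start in the load balancer, which is closer to what the parent describes.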
luhn almost 4 years ago
Unfortunately HAProxy doesn't buffer requests*, which is necessary for a production deployment of gunicorn. And for anybody using AWS, ALB doesn't buffer requests either. Because of this I'm actually running both HAProxy and nginx in front of my gunicorn instances—nginx in front for request buffering and HAProxy behind that for queuing.

If anybody is interested, I've packaged both as Docker containers:

HAProxy queuing/load shedding: https://hub.docker.com/r/luhn/spillway

nginx request buffering: https://hub.docker.com/r/luhn/gunicorn-proxy

* It does have an http_buffer_request option, but this only buffers the first 8kB (?) of the request.
jhgg almost 4 years ago
This is somewhat suspect. At my place of work, we operate a rather large Python API deployment (over an order of magnitude more QPS than the OP's post). However, our setup is... pretty simple. We only run nginx + gunicorn (gevent reactor), 1 master process + 1 worker per vCPU. In front of that we have an envoy load-balancing tier that does p2c backend selection to each node. I actually think the nginx is pointless now that we're using envoy, so that'll probably go away soon.

Works amazingly well! We run our Python API tier at 80% target CPU utilization.
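A setup along those lines (gevent workers, one per vCPU, behind a reverse proxy) maps onto a short gunicorn.conf.py. This is a rough sketch of that shape, not the commenter's actual configuration:

```python
# gunicorn.conf.py -- sketch of a gevent deployment with one worker per vCPU
import multiprocessing

bind = "127.0.0.1:8000"                # the reverse proxy (nginx/envoy) points here
workers = multiprocessing.cpu_count()  # one worker process per vCPU
worker_class = "gevent"                # cooperative I/O instead of sync workers
worker_connections = 1000              # max concurrent connections per worker
keepalive = 5                          # seconds to hold idle keep-alive connections
```

Run with something like `gunicorn -c gunicorn.conf.py myproject.wsgi:application`, where the module path is a placeholder for your own project.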
kvalekseev almost 4 years ago
HAProxy is a beautiful tool, but it doesn't buffer requests, which is why NGINX is recommended in front of gunicorn; otherwise it's susceptible to slowloris attacks. So either Clubhouse can be easily DDoS'd right now, or they have some tricky setup that prevents slow POST requests from reaching gunicorn. The blog post doesn't mention that problem while recommending that others try to replace NGINX with HAProxy.
TekMol almost 4 years ago
Performance is the only thing holding me back from considering Python for bigger web applications.

Of the 3 main languages for web dev these days - Python, PHP and Javascript - I like Python the most. But it is scary how slow the default runtime, CPython, is. Compared to PHP and Javascript, it crawls like a snake.

PyPy could be a solution, as it seems to be about 6x faster on average.

Is anybody here using PyPy for Django?

Did Clubhouse document somewhere whether they are using CPython or PyPy?
petargyurov almost 4 years ago
> Which exacerbated another problem: uWSGI is so confusing. It’s amazing software, no doubt, but it ships with dozens and dozens of options you can tweak.

I am glad I am not the only one. I've had so many issues setting up sockets, both with gevent and uWSGI, only to be left even more confused after reading the documentation.
j4mie almost 4 years ago
If you’re delegating your load balancing to something else further up the stack and would prefer a simpler WSGI server than Gunicorn, Waitress is worth a look: https://github.com/pylons/waitress
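For reference, Waitress needs only a couple of lines to serve a WSGI app; a minimal sketch, where `myproject.wsgi` stands in for your own project module:

```python
# serve.py -- minimal Waitress sketch; "myproject" is a placeholder name
from waitress import serve

from myproject.wsgi import application

# Waitress is a single-process, multi-threaded WSGI server and buffers
# requests itself, so it pairs well with a load balancer doing the queuing.
serve(application, host="0.0.0.0", port=8000, threads=8)
```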
tbrock almost 4 years ago
Aside: AWS only allows registering 1000 targets in a target group… I wonder if that's the limit they hit. If so, it's documented.
tarasglek almost 4 years ago
Have to wonder how well HAProxy works vs. balancing by making gunicorn listen via SO_REUSEPORT and letting the kernel balance instead (a la https://talawah.io/blog/extreme-http-performance-tuning-one-point-two-million/)
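Gunicorn exposes this as a `reuse_port` setting in reasonably recent versions; a sketch of what opting in looks like, with the caveat that reproducing the per-worker accept queues from the linked article depends on how many listening sockets you actually run:

```python
# gunicorn.conf.py -- opt into kernel-level connection balancing
bind = "0.0.0.0:8000"
workers = 4  # e.g. one per vCPU
# Sets SO_REUSEPORT on the listening socket, so multiple processes bound to
# the same port have the kernel spread new connections among them instead of
# a user-space proxy doing the balancing.
reuse_port = True
```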
JanMa almost 4 years ago
Interesting to read that they are using Unix sockets to send traffic to their backend processes. I know that it's easily done when using HAProxy, but I have never read about people actually using it. I guess the fact that they are not using Docker or another container runtime makes sockets rather simple to use.
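Switching gunicorn from a TCP port to a Unix socket is essentially a one-line change; a sketch with illustrative paths:

```python
# gunicorn.conf.py -- listen on a Unix domain socket instead of a TCP port
import multiprocessing

# A proxy on the same host (HAProxy, nginx, ...) then points its backend at
# this socket path rather than at host:port.
bind = "unix:/run/myapp/gunicorn.sock"
workers = multiprocessing.cpu_count()
```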
ram_rar almost 4 years ago
> Python's model of running N separate processes for your app is not as unreasonable as people might have you believe! You can achieve reasonable results this way, with a little digging.

I have been through this journey; we eventually migrated to Golang and it saved a ton of money and firefighting time. Unfortunately, the Python community hasn't been able to remove the GIL. It has its benefits (especially for single-threaded programs), but I believe the costs (lack of concurrency abstractions; async/await doesn't cut it) far outweigh them.

Apart from what the article mentions, other low-hanging fruit worth exploring:

[1] Moving to PyPy (this should give some perf for free)

[2] Bifurcating metadata and streaming, if not done already. All the Django CRUD stuff could be one service, but the actual streaming should be separated into another service altogether.
stu2010 almost 4 years ago
Interesting to see this. It sounds like they're not on AWS, given that they mentioned that having 1000 instances for their production environment made them one of the bigger deployments on their hosting provider.

If not for the troubles they experienced with their hosting provider and with managing deployments / cutting over traffic, it possibly could have been cheaper to just keep scaling horizontally vs. putting in the time to investigate these issues. I'd also love to see some actual latency graphs: what's the P90 like at 25% CPU usage with a simple Gunicorn / gevent setup?
dilyevsky almost 4 years ago
Kinda funny that they decided paying a ton of money to AWS was OK, but paying for NGINX Plus was not.
vvatsa almost 4 years ago
Yeah, I pretty much agree with the 3 suggestions at the end:

* use uWSGI (read the docs, so many options...)

* use HAProxy, it's very, very good

* scale Python apps by using processes.
latchkey almost 4 years ago
If it is just a backend, why not port it over to one of the myriad cloud autoscaling solutions that are out there?

Weighing the opportunity cost of spending time figuring out why only 29 workers are receiving requests against adding new features that generate more revenue seems like a quick decision.

Personally, I just start off with that now in the first place; the development load isn't any greater and the solutions that are out there are quite good.
trinovantes almost 4 years ago
I've always used nginx for my servers. Is HAProxy so much better that it's worth learning/switching?
lmilcin almost 4 years ago
1M requests per minute on 1000 web instances is not an achievement, it is a disaster.

It is ridiculous that people brag about it.

Guys, if you have the budget, maybe I can help you up this by a couple of orders of magnitude.
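For context, the per-instance rate behind those numbers is easy to work out (a quick sanity check, not part of the original comment):

```python
# Back-of-the-envelope math for 1M requests/minute spread over 1,000 instances
requests_per_minute = 1_000_000
instances = 1_000

total_rps = requests_per_minute / 60      # ~16,667 requests/second overall
per_instance_rps = total_rps / instances  # ~16.7 requests/second per instance

print(f"{total_rps:.0f} req/s total, {per_instance_rps:.1f} req/s per instance")
```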
sdze almost 4 years ago
use PHP ;)
catillac almost 4 years ago
Famous last words, but I get the sense that the need to handle this sort of load on Clubhouse is plateauing and will decline from here. The app seems to have shed all the people that drew other people in initially; it has lost its small, intimate feel and turned into either crowded rooms where no one can say anything, or hyper-specific rooms where no one has anything to say.

Good article though! I’ve dealt with these exact issues and they can be very frustrating.
polote almost 4 years ago
I wouldn't be very proud of writing an article like that.

Usually engineering blogs exist to show that there is fun stuff to do at a company. But here it just seems they have no idea what they are doing. Which is fine; I'd put myself in the same category.

Reading the article, I don't feel like they have solved their issue; they have just created more future problems.