The most important part of this article is the concept of back pressure and being able to detect it. It's common in a ton of other engineering disciplines but especially important when designing fault tolerant or load balancing systems at scale.<p>Basically it is just some type of feedback so that you don't overload subsystems. One of the most common failure modes I see in load balanced systems is when one box goes down the others try to compensate for the additional load. But there is nothing that tells the system overall "hey there is less capacity now because we lost a box". So you overwhelm all the other boxes and then you get this crazy cascade of failures.
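The cascade the comment describes can be avoided with even a crude form of backpressure. A minimal sketch (plain Python, nothing Discord-specific; `submit` and the queue size are made up for illustration): a bounded queue rejects new work instead of absorbing unbounded load, so callers find out the system is at capacity and can back off.

```python
import queue

# Bounded buffer: holding at most 3 jobs is an arbitrary choice for the demo.
work = queue.Queue(maxsize=3)

def submit(job):
    """Try to enqueue a job; shed load when the buffer is full."""
    try:
        work.put_nowait(job)
        return True
    except queue.Full:
        # The rejection IS the backpressure signal: the caller now
        # knows capacity is gone and can retry later or drop the job.
        return False

accepted = [submit(i) for i in range(5)]
print(accepted)  # [True, True, True, False, False]
```

The key design point is that overload becomes an explicit, observable signal at the edge, rather than an invisible pile-up that eventually takes down every box at once.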
Hate to be a party pooper, but I'd like to give people here a more generic mental tool to solve this problem.<p>Ignoring Elixir and Erlang - when you discover you have a backpressure problem, that is, any kind of throttling of connections or req/sec, you need to immediately tell yourself "I need a queue", and more importantly "I need a queue that has prefetch capabilities". Don't try to build this. Use something that's already solid.<p>I solved this problem 3 years ago, pushing 5M msg/minute _reliably_ without loss of messages, and each of these messages was checked against a couple of assertion rules per user (to not bombard users with messages, to find the best time to push to a user, etc.), which adds complexity. Approved messages were then bundled into groups of 1,000 and passed on to GCM HTTP (today, Firebase/FCM).<p>I used Java, Storm, and RabbitMQ to build a scalable, dynamic, streaming cluster of workers.<p>You can also do this with Kafka, but it'll be less transactional.<p>After tackling this problem a couple of times, I'm completely convinced Discord's solution is suboptimal. Sorry guys, I love what you do, and this article is a good nudge for Elixir.<p>The second time I solved this, I used XMPP. I knew there were risks, because essentially I was moving from a stateless protocol to a stateful one. Eventually, it wasn't worth the effort and I kept using the old system.
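"Prefetch" here is the RabbitMQ-style consumer limit: the broker stops delivering once the consumer holds N unacknowledged messages. A minimal synchronous sketch of that semantics (plain Python standing in for a real broker; `PREFETCH`, `deliver`, and `ack_one` are invented names for the demo):

```python
from collections import deque

PREFETCH = 2  # broker delivers at most 2 unacknowledged messages at a time

broker = deque(range(5))   # messages still sitting on the broker
in_flight = []             # delivered but not yet acked
processed = []

def deliver():
    # Broker-side prefetch: stop delivering once the consumer already
    # holds PREFETCH unacked messages. The consumer can never be buried.
    while broker and len(in_flight) < PREFETCH:
        in_flight.append(broker.popleft())

def ack_one():
    # Acking frees a prefetch slot, which lets the broker deliver again.
    msg = in_flight.pop(0)
    processed.append(msg)

while broker or in_flight:
    deliver()
    ack_one()

print(processed)  # [0, 1, 2, 3, 4], never more than 2 in flight
```

With a real broker the same knob is `basic.qos` / `prefetch_count`; the point is that the consumer's ack rate, not the producer's send rate, governs throughput.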
Quick serious question: How does this company plan to make money? They're surely well funded[1], but what's their end game?<p>[1] "We've raised over $30,000,000 from top VCs in the valley like Greylock, Benchmark, and Tencent. In other words, we’ll be around for a while."
That's awesome, and it just goes to show how simple something can be that would otherwise involve a certain degree of concurrent (and distributed) programming.<p>GenStage has a lot of uses at scale, and GenStage Flow (<a href="https://hexdocs.pm/gen_stage/Experimental.Flow.html" rel="nofollow">https://hexdocs.pm/gen_stage/Experimental.Flow.html</a>) will have even more. It will be a game changer for a lot of developers.
"Obviously a few notifications were dropped. If a few notifications weren’t dropped, the system may never have recovered, or the Push Collector might have fallen over."<p>How many is a few? It looks like the buffer reaches about 50k, does a few mean literally in the single digits or 100s?
"requests per minute" is such a useless unit of measurement. Please always quote request rates per second (i.e. Hz).<p>Makes me think of the Abraham Simpson quote: "My car gets 40 rods to the hogshead and that's the way I likes it!"
50k seems like a low bar to start losing messages at. If this was done with Celery and a decently sized RabbitMQ box, I would expect it to get into the millions before problems started happening.
I love Discord, and love Elixir too, so this is a pretty great post.<p>Unfortunate that the final bottleneck was an upstream provider, though it's good that they documented rate limits. I feel like my last attempt to find documented rate limits for GCM/APNS was fruitless, perhaps Firebase messaging has improved that?
What is up with Discord? I feel like it's quietly (maybe not so quietly) one of the bigger startups to come out in the last two years.<p>It seems to have totally taken over a space that wasn't even clearly defined before they got there.
I'd like to say that the official performance unit is the "request per second". And its cousin, requests per second at peak.<p>The average per minute only gets used because many systems have so little load that the number per second is negligible.
Anyone know of equivalent libraries to GenStage for other languages? (Java, NodeJS, etc.)<p>I'd definitely be able to put things like flow limiters and queuing to use, but none of my company's projects use Elixir :(
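The core GenStage idea ports to any language: the consumer *asks* the producer for a batch (demand), so the producer can never outrun the consumer. A minimal sketch of that demand-driven pull, loosely modeled on the counter producer from the GenStage docs (the class and method names here are made up, not any library's API):

```python
# Demand-driven producer: it only emits events when asked, and only
# as many as were asked for. Backpressure falls out for free.
class CounterProducer:
    def __init__(self):
        self.state = 0  # next number to emit

    def handle_demand(self, demand):
        """Return exactly `demand` events and advance the counter."""
        events = list(range(self.state, self.state + demand))
        self.state += demand
        return events

producer = CounterProducer()
# A consumer that can handle 3 events at a time asks for 3, twice.
batches = [producer.handle_demand(3) for _ in range(2)]
print(batches)  # [[0, 1, 2], [3, 4, 5]]
```

In Java land, Reactive Streams implementations (e.g. Project Reactor, RxJava with backpressure) follow the same request-n protocol.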
I spend a lot of time in the PCMR Discord, which is pretty lively. The technology seems to be solid, while the UI has issues (notifications from half a day ago are really hard to find for example on mobile devices). Otherwise I'm on Discord every day and love using the service. I miss some slack features, but the VOIP is very good.
Just wondering, what's the difference if I use two kinds of [producer, consumer] message queues (say RabbitMQ) instead of this? Does GenStage being an Erlang system make a difference?
How does one achieve this in Celery 4? I remember there was a Celery "batch" contrib module that allowed this kind of batching behavior, but I don't see it in 4.
> <i>"Firebase requires that each XMPP connection has no more than 100 pending requests at a time. If you have 100 requests in flight, you must wait for Firebase to acknowledge a request before sending another."</i><p>So... get 100 firebase accounts and blast them in parallel.
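Parallel connections aside, the per-connection limit itself is easy to enforce client-side: cap the number of unacknowledged requests and only free a slot when an ack arrives. A minimal synchronous sketch (the `send`/`on_ack` names and the bookkeeping are invented for illustration, not Firebase's API):

```python
MAX_IN_FLIGHT = 100  # Firebase's documented per-XMPP-connection limit

in_flight = set()  # request ids sent but not yet acknowledged

def send(req_id):
    """Send only if we have a free slot; otherwise report backpressure."""
    if len(in_flight) >= MAX_IN_FLIGHT:
        return False  # caller must wait for an ack before retrying
    in_flight.add(req_id)
    return True

def on_ack(req_id):
    # An ack from the server frees one in-flight slot.
    in_flight.discard(req_id)

sent = sum(send(i) for i in range(150))
print(sent)        # 100: the rest were refused, not silently queued
on_ack(0)          # one ack arrives...
extra = send(150)  # ...so exactly one more request may go out
print(extra)       # True
```

In a threaded client the same pattern is usually a `threading.Semaphore(100)` acquired on send and released on ack.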