TechEcho

How Discord handles over a million requests per minute with Elixir’s GenStage

382 points by Sikul over 8 years ago

19 comments

jtchang over 8 years ago
The most important part of this article is the concept of back pressure and being able to detect it. It's common in a ton of other engineering disciplines but especially important when designing fault-tolerant or load-balanced systems at scale.

Basically it is just some type of feedback so that you don't overload subsystems. One of the most common failure modes I see in load-balanced systems is that when one box goes down, the others try to compensate for the additional load. But there is nothing that tells the system overall "hey there is less capacity now because we lost a box". So you overwhelm all the other boxes and then you get this crazy cascade of failures.
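As a rough illustration of the feedback described above, here is a minimal Python sketch in which a bounded queue makes the producer wait whenever the consumer falls behind. The function name `run_pipeline` and the tiny capacity are assumptions made for the example; this is not how Discord or GenStage implement it.

```python
import queue
import threading

def run_pipeline(n_items, capacity):
    """Push n_items through a bounded queue and report the deepest backlog seen."""
    q = queue.Queue(maxsize=capacity)  # put() blocks while the queue is full
    consumed = []
    max_depth = 0

    def producer():
        for n in range(n_items):
            q.put(n)      # back pressure: waits here when the consumer is behind
        q.put(None)       # sentinel meaning "no more work"

    t = threading.Thread(target=producer)
    t.start()
    while True:
        max_depth = max(max_depth, q.qsize())
        item = q.get()
        if item is None:
            break
        consumed.append(item)
    t.join()
    return consumed, max_depth

consumed, max_depth = run_pipeline(n_items=10, capacity=2)
print(len(consumed), max_depth)
```

The backlog never grows past `capacity`, so a slow consumer slows the producer down instead of letting work pile up without limit.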
jondot over 8 years ago
Hate to be a party pooper, but I'd like to give people here a more generic mental tool to solve this problem.

Ignoring Elixir and Erlang - when you discover you have a backpressure problem, that is, any kind of throttling (connections or req/sec), you need to immediately tell yourself "I need a queue", and more importantly "I need a queue that has prefetch capabilities". Don't try to build this. Use something that's already solid.

I solved this problem 3 years ago, having 5M msg/minute pushed _reliably_ without loss of messages, and each of these messages was checked against a couple of rules for assertion per user (to not bombard users with messages, when is the best time to push to a user, etc.), so this adds complexity. Later, approved messages were bundled into groups of 1000 and passed on to GCM HTTP (today, Firebase/FCM).

I used Java, Storm and RabbitMQ to build a scalable, dynamic, streaming cluster of workers.

You can also do this with Kafka but it'll be less transactional.

After tackling this problem a couple of times, I'm completely convinced Discord's solution is suboptimal. Sorry guys, I love what you do, and this article is a good nudge for Elixir.

The second time I solved this, I used XMPP. I knew there were risks, because essentially I was moving from a stateless protocol to a stateful protocol. Eventually, it wasn't worth the effort and I kept using the old system.
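The bundling step mentioned above (approved messages grouped into batches of 1000) could look roughly like the following Python sketch. The helper name `batches` is hypothetical, and the commenter's actual system was built on Java, Storm and RabbitMQ, so this only shows the grouping logic.

```python
from itertools import islice

def batches(messages, size=1000):
    """Yield consecutive lists of at most `size` messages."""
    it = iter(messages)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# 2500 stand-in messages become two full batches and one partial batch.
sizes = [len(b) for b in batches(range(2500))]
print(sizes)
```

Each yielded batch would then be handed to whatever sends it upstream; the last batch may be smaller than `size`.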
coverband over 8 years ago
Quick serious question: How does this company plan to make money? They're surely well funded[1], but what's their end game?

[1] "We've raised over $30,000,000 from top VCs in the valley like Greylock, Benchmark, and Tencent. In other words, we’ll be around for a while."
poorman over 8 years ago
That's awesome and it just goes to show how simple something can be that would otherwise involve a certain degree of concurrent (and distributed) programming.

GenStage has a lot of uses at scale. Even more so is going to be GenStage Flow (https://hexdocs.pm/gen_stage/Experimental.Flow.html). It will be a game changer for a lot of developers.
hotdogs over 8 years ago
"Obviously a few notifications were dropped. If a few notifications weren’t dropped, the system may never have recovered, or the Push Collector might have fallen over."

How many is a few? It looks like the buffer reaches about 50k, does a few mean literally in the single digits or 100s?
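To make the "how many is a few?" question answerable, a bounded buffer can count what it sheds. The Python sketch below assumes a drop-newest policy and a tiny `max_size` for readability; the thread does not say which policy or exact limit Discord used, and `DroppingBuffer` is a hypothetical name.

```python
from collections import deque

class DroppingBuffer:
    """Fixed-size buffer that rejects new items when full and counts the drops."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.items = deque()
        self.dropped = 0

    def push(self, item):
        """Store the item, or drop it (returning False) if the buffer is full."""
        if len(self.items) >= self.max_size:
            self.dropped += 1   # assumed policy: the newest item is the one shed
            return False
        self.items.append(item)
        return True

buf = DroppingBuffer(max_size=5)
for n in range(8):              # a burst of 8 against room for 5
    buf.push(n)
print(len(buf.items), buf.dropped)
```

Exposing `dropped` as a metric would give the exact number rather than "a few".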
erikbern over 8 years ago
"requests per minute" is such a useless unit of measurement. Please always quote request rates per second (i.e. Hz).

Makes me think of the Abraham Simpson quote: "My car gets 40 rods to the hogshead and that's the way I likes it!"
pwf over 8 years ago
50k seems like a low bar to start losing messages at. If this was done with Celery and a decently sized RabbitMQ box, I would expect it to get into the millions before problems started happening.
bpicolo over 8 years ago
I love Discord, and love Elixir too, so this is a pretty great post.

Unfortunate that the final bottleneck was an upstream provider, though it's good that they documented rate limits. I feel like my last attempt to find documented rate limits for GCM/APNS was fruitless, perhaps Firebase messaging has improved that?
dimino over 8 years ago
What is up with Discord? I feel like it's quietly (maybe not so quietly) one of the bigger startups to come out in the last two years.

It seems to have totally taken over a space that wasn't even clearly defined before they got there.
user5994461 over 8 years ago
I'd like to say that the official performance unit is "requests per second". And its cousin, peak requests per second.

The average per minute only gets used because many systems have so little load that the number per second is negligible.
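For reference, converting the headline figure to the per-second unit argued for above is a one-line calculation. The Python snippet below takes the title's "a million requests per minute" as exactly 1,000,000; it gives an average only, and the thread does not state a peak rate.

```python
requests_per_minute = 1_000_000          # headline figure, taken as exact
requests_per_second = requests_per_minute / 60

print(round(requests_per_second))        # → 16667
```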
AgentK20 over 8 years ago
Anyone know of any equivalent libraries like GenStage for other languages? (Java, NodeJS, etc)

I'd definitely be able to put to use things like flow limiters and queuing and such, but none of my company's projects use Elixir :(
mevile over 8 years ago
I spend a lot of time in the PCMR Discord, which is pretty lively. The technology seems to be solid, while the UI has issues (for example, notifications from half a day ago are really hard to find on mobile devices). Otherwise I'm on Discord every day and love using the service. I miss some Slack features, but the VOIP is very good.
snambi over 8 years ago
A million requests per minute: is this a big deal?
manigandham over 8 years ago
Akka(.NET) or any actor system is a perfect fit for this and brings the same functionality to other languages and frameworks.
sbov over 8 years ago
Is the ratio of Push Collectors to Pushers constant, or can it vary based upon notification load?
rv11 over 8 years ago
Just wondering, what is the difference if I use two kinds of [producer, consumer] message queues (say RabbitMQ) instead of this? Does GenStage being an Erlang system make a difference?
sandGorgon over 8 years ago
How does one achieve this in Celery 4? I remember there was a Celery "batch" contrib module that allowed this kind of batching behavior. But I don't see that in 4.
IOT_Apprentice over 8 years ago
Why not use Kafka for back pressure?
imaginenore over 8 years ago
> "Firebase requires that each XMPP connection has no more than 100 pending requests at a time. If you have 100 requests in flight, you must wait for Firebase to acknowledge a request before sending another."

So... get 100 Firebase accounts and blast them in parallel.
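The quoted rule (at most 100 unacknowledged requests per connection) is a sliding-window limit. A hedged Python sketch of that idea uses a semaphore as the window; `send_all` is a hypothetical name and `asyncio.sleep(0)` merely stands in for waiting on an acknowledgement, so no real Firebase or XMPP call is made.

```python
import asyncio

MAX_IN_FLIGHT = 100  # the per-connection limit quoted above

async def send_all(n_requests):
    """Simulate n_requests sends, never exceeding MAX_IN_FLIGHT unacknowledged."""
    window = asyncio.Semaphore(MAX_IN_FLIGHT)
    in_flight = 0
    peak = 0

    async def send(i):
        nonlocal in_flight, peak
        async with window:              # wait until a slot is free
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0)      # placeholder for "wait for the ack"
            in_flight -= 1              # ack received, slot released on exit

    await asyncio.gather(*(send(i) for i in range(n_requests)))
    return peak

peak = asyncio.run(send_all(250))
print(peak)
```

Running one such window per connection is what lets several connections proceed in parallel while each one stays within the limit.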