The most important part of this article is the concept of back pressure and being able to detect it. It's common in a ton of other engineering disciplines but especially important when designing fault tolerant or load balancing systems at scale.<p>Basically it is just some type of feedback so that you don't overload subsystems. One of the most common failure modes I see in load balanced systems is when one box goes down the others try to compensate for the additional load. But there is nothing that tells the system overall "hey there is less capacity now because we lost a box". So you overwhelm all the other boxes and then you get this crazy cascade of failures.
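The cascade the comment describes can be avoided with even a crude form of backpressure. A minimal sketch (plain Python, nothing Discord-specific; `submit` and the queue size are made up for illustration): a bounded queue rejects new work instead of absorbing unbounded load, so callers find out the system is at capacity and can back off.

```python
import queue

# Bounded buffer: holding at most 3 jobs is an arbitrary choice for the demo.
work = queue.Queue(maxsize=3)

def submit(job):
    """Try to enqueue a job; shed load when the buffer is full."""
    try:
        work.put_nowait(job)
        return True
    except queue.Full:
        # The rejection IS the backpressure signal: the caller now
        # knows capacity is gone and can retry later or drop the job.
        return False

accepted = [submit(i) for i in range(5)]
print(accepted)  # [True, True, True, False, False]
```

The key design point is that overload becomes an explicit, observable signal at the edge, rather than an invisible pile-up that eventually takes down every box at once.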
Hate to be a party pooper, but I'd like to give people here a more generic mental tool to solve this problem.<p>Ignoring Elixir and Erlang - when you discover you have a backpressure problem, that is, any kind of throttling of connections or req/sec, you need to immediately tell yourself "I need a queue", and more importantly "I need a queue that has prefetch capabilities". Don't try to build this. Use something that's already solid.<p>I solved this problem 3 years ago, pushing 5M msg/minute _reliably_ without loss of messages, and each of these messages was checked against a couple of assertion rules per user (to not bombard users with messages, to find the best time to push to a user, etc.), which adds complexity. Approved messages were then bundled into groups of 1,000 and passed on to GCM HTTP (today, Firebase/FCM).<p>I used Java, Storm, and RabbitMQ to build a scalable, dynamic, streaming cluster of workers.<p>You can also do this with Kafka, but it'll be less transactional.<p>After tackling this problem a couple of times, I'm completely convinced Discord's solution is suboptimal. Sorry guys, I love what you do, and this article is a good nudge for Elixir.<p>The second time I solved this, I used XMPP. I knew there were risks, because essentially I was moving from a stateless protocol to a stateful one. Eventually, it wasn't worth the effort and I kept using the old system.
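"Prefetch" here is the RabbitMQ-style consumer limit: the broker stops delivering once the consumer holds N unacknowledged messages. A minimal synchronous sketch of that semantics (plain Python standing in for a real broker; `PREFETCH`, `deliver`, and `ack_one` are invented names for the demo):

```python
from collections import deque

PREFETCH = 2  # broker delivers at most 2 unacknowledged messages at a time

broker = deque(range(5))   # messages still sitting on the broker
in_flight = []             # delivered but not yet acked
processed = []

def deliver():
    # Broker-side prefetch: stop delivering once the consumer already
    # holds PREFETCH unacked messages. The consumer can never be buried.
    while broker and len(in_flight) < PREFETCH:
        in_flight.append(broker.popleft())

def ack_one():
    # Acking frees a prefetch slot, which lets the broker deliver again.
    msg = in_flight.pop(0)
    processed.append(msg)

while broker or in_flight:
    deliver()
    ack_one()

print(processed)  # [0, 1, 2, 3, 4], never more than 2 in flight
```

With a real broker the same knob is `basic.qos` / `prefetch_count`; the point is that the consumer's ack rate, not the producer's send rate, governs throughput.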
Quick serious question: How does this company plan to make money? They're surely well funded[1], but what's their end game?<p>[1] "We've raised over $30,000,000 from top VCs in the valley like Greylock, Benchmark, and Tencent. In other words, we’ll be around for a while."
That's awesome, and it just goes to show how simple something can be that would otherwise involve a certain degree of concurrent (and distributed) programming.<p>GenStage has a lot of uses at scale, and GenStage Flow (<a href="https://hexdocs.pm/gen_stage/Experimental.Flow.html" rel="nofollow">https://hexdocs.pm/gen_stage/Experimental.Flow.html</a>) will have even more. It will be a game changer for a lot of developers.
"Obviously a few notifications were dropped. If a few notifications weren’t dropped, the system may never have recovered, or the Push Collector might have fallen over."<p>How many is a few? It looks like the buffer reaches about 50k, does a few mean literally in the single digits or 100s?
"requests per minute" is such a useless unit of measurement. Please always quote request rates per second (i.e. Hz).<p>Makes me think of the Abraham Simpson quote: "My car gets 40 rods to the hogshead and that's the way I likes it!"
50k seems like a low bar to start losing messages at. If this was done with Celery and a decently sized RabbitMQ box, I would expect it to get into the millions before problems started happening.
I love Discord, and love Elixir too, so this is a pretty great post.<p>Unfortunate that the final bottleneck was an upstream provider, though it's good that they documented rate limits. I feel like my last attempt to find documented rate limits for GCM/APNS was fruitless, perhaps Firebase messaging has improved that?
What is up with Discord? I feel like it's quietly (maybe not so quietly) one of the bigger startups to come out in the last two years.<p>It seems to have totally taken over a space that wasn't even clearly defined before they got there.
I'd like to say that the official performance unit is the "request per second". And its cousin, requests per second at peak.<p>The average per minute only gets used because many systems have so little load that the number per second is negligible.
Anyone know of equivalent libraries to GenStage for other languages? (Java, NodeJS, etc.)<p>I'd definitely be able to put things like flow limiters and queuing to use, but none of my company's projects use Elixir :(
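The core GenStage idea ports to any language: the consumer *asks* the producer for a batch (demand), so the producer can never outrun the consumer. A minimal sketch of that demand-driven pull, loosely modeled on the counter producer from the GenStage docs (the class and method names here are made up, not any library's API):

```python
# Demand-driven producer: it only emits events when asked, and only
# as many as were asked for. Backpressure falls out for free.
class CounterProducer:
    def __init__(self):
        self.state = 0  # next number to emit

    def handle_demand(self, demand):
        """Return exactly `demand` events and advance the counter."""
        events = list(range(self.state, self.state + demand))
        self.state += demand
        return events

producer = CounterProducer()
# A consumer that can handle 3 events at a time asks for 3, twice.
batches = [producer.handle_demand(3) for _ in range(2)]
print(batches)  # [[0, 1, 2], [3, 4, 5]]
```

In Java land, Reactive Streams implementations (e.g. Project Reactor, RxJava with backpressure) follow the same request-n protocol.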
I spend a lot of time in the PCMR Discord, which is pretty lively. The technology seems to be solid, while the UI has issues (notifications from half a day ago are really hard to find for example on mobile devices). Otherwise I'm on Discord every day and love using the service. I miss some slack features, but the VOIP is very good.
Just wondering, what's the difference if I use two kinds of [producer, consumer] message queues (say RabbitMQ) instead of this? Does GenStage being an Erlang system make a difference?
How does one achieve this in Celery 4? I remember there was a Celery "batch" contrib module that allowed this kind of batching behavior, but I don't see it in 4.
> <i>"Firebase requires that each XMPP connection has no more than 100 pending requests at a time. If you have 100 requests in flight, you must wait for Firebase to acknowledge a request before sending another."</i><p>So... get 100 firebase accounts and blast them in parallel.
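Parallel connections aside, the per-connection limit itself is easy to enforce client-side: cap the number of unacknowledged requests and only free a slot when an ack arrives. A minimal synchronous sketch (the `send`/`on_ack` names and the bookkeeping are invented for illustration, not Firebase's API):

```python
MAX_IN_FLIGHT = 100  # Firebase's documented per-XMPP-connection limit

in_flight = set()  # request ids sent but not yet acknowledged

def send(req_id):
    """Send only if we have a free slot; otherwise report backpressure."""
    if len(in_flight) >= MAX_IN_FLIGHT:
        return False  # caller must wait for an ack before retrying
    in_flight.add(req_id)
    return True

def on_ack(req_id):
    # An ack from the server frees one in-flight slot.
    in_flight.discard(req_id)

sent = sum(send(i) for i in range(150))
print(sent)        # 100: the rest were refused, not silently queued
on_ack(0)          # one ack arrives...
extra = send(150)  # ...so exactly one more request may go out
print(extra)       # True
```

In a threaded client the same pattern is usually a `threading.Semaphore(100)` acquired on send and released on ack.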