What queues do is smooth out the mismatch between supply and demand. If the mismatch lasts long enough, the queue will overflow, and then you need to load shed (and you need to plan for what the least bad way of load shedding is). But queues do increase overall throughput, up to a point. If the demand is bursty on short timescales and you only allow a small queue to build before load shedding, you may be wasting capacity. Increasing the allowed queue size lets you smooth out the demand, and so reduce the amount you need to load shed. But the crucial thing here is that you're trading off latency against capacity. If that latency increase itself has no significant downside, then you should probably provision the queues so that they can grow until the latency <i>does</i> start to have a downside, because below that point the increase in capacity is worth it. Above that point, the latency itself is a problem, and you should be load shedding rather than queuing. The trick with any queuing system is to understand <i>when</i> this transition should take place.<p>Edit: I should add that standing queues serve no purpose. If the demand is such that you're persistently queuing for a long time without the queue ever going idle, your transition between queuing and load shedding is in the wrong place, and you should increase load shedding, because you're adding latency for no reason.
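To make the transition point concrete, here's a rough sketch in Go (the names like avgService and maxWait are made up for illustration, not from any particular system): admit work only while the estimated queueing delay stays under the latency budget, and shed beyond it.<p><pre><code> // admit enqueues a job only while the estimated wait (backlog times average
 // service time) stays under the latency budget; past that point queuing only
 // adds latency, so we shed instead.
 func admit(jobs chan func(), job func(), avgService, maxWait time.Duration) error {
     if time.Duration(len(jobs))*avgService > maxWait {
         return errors.New("overloaded: shedding")
     }
     select {
     case jobs <- job:
         return nil
     default:
         return errors.New("queue full: shedding")
     }
 }</code></pre>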
This is a weird article because it points out that queues don’t solve overload, but neither do load shedding or back pressure.<p>All 3 techniques are just different trade-offs on what to do in the face of overload. All 3 have negative ramifications for the users of the system. Load shedding reduces availability, back pressure increases complexity, and queues increase latency.<p>In “critical” systems you need all 3. And all 3 can be overloaded. Frankly, your load shedding or back pressure system is probably implemented on a queue one layer down the abstraction stack.
Basically Little's law. It is queues all the way down.<p><a href="https://en.wikipedia.org/wiki/Little's_law" rel="nofollow">https://en.wikipedia.org/wiki/Little's_law</a><p>Additionally, here is a great talk on queuing theory and load shedding. One argument this talk makes is that autoscaling is not the silver bullet you think it is (similar to queues).<p><a href="https://www.youtube.com/watch?v=-oQl1xv0hDk" rel="nofollow">https://www.youtube.com/watch?v=-oQl1xv0hDk</a>
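For a concrete reading of Little's law, L = λW: a service absorbing 200 requests/s with an average of 0.5 s spent in the system holds about 100 requests in flight. If queueing pushes W up, L grows with it even though the throughput λ hasn't changed at all.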
Queues aren't really the problem here.<p>It's that the people making changes don't have a decent understanding of the system they are trying to fix. If you don't actually know what the problem is, your fix is not likely to work.<p>I have seen groups put huge amounts of work into a "fix" for a system when they are only really guessing at what the problem is. (Besides queues, people also seem to like adding "caches" -- often stateful and with no validation strategy so not really caches -- or really any kind of wrapper or shim.)<p>I think it is really useful to raise awareness of this kind of thing, but I think the article could put it a little better. For one thing, queues can "fix overload" depending on what you mean by that. They don't increase system capacity but can let a system handle bursty usage better.
The other thing to bear in mind about queues is that once they start showing symptoms of something being wrong, collapse might be just around the corner, or it might not be, depending on the nature of the load.<p>When congestion spikes start showing, it is helpful to know some queuing theory to estimate how close the situation is to eating someone's weekend. Congestion collapses are an interesting time: because most people don't know queuing theory or how to reason using balance equations, it is possible to misdiagnose the problem or waste a stressful few days trying to work out a congestion situation by experiment.
Ring buffers and clever batching abstractions can help get you much closer to ideal.<p>> When consumers are waiting on an advancing cursor sequence in the ring buffer an interesting opportunity arises that is not possible with queues. If the consumer finds the ring buffer cursor has advanced a number of steps since it last checked it can process up to that sequence without getting involved in the concurrency mechanisms. This results in the lagging consumer quickly regaining pace with the producers when the producers burst ahead thus balancing the system. This type of batching increases throughput while reducing and smoothing latency at the same time. Based on our observations, this effect results in a close to constant time for latency regardless of load, up until the memory sub-system is saturated, and then the profile is linear following Little’s Law [6]. This is very different to the “J” curve effect on latency we have observed with queues as load increases.<p><a href="https://lmax-exchange.github.io/disruptor/disruptor.html#_batching_effect" rel="nofollow">https://lmax-exchange.github.io/disruptor/disruptor.html#_ba...</a>
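A very rough approximation of that batching effect with a plain Go channel (just a sketch; the Disruptor itself uses a pre-allocated ring and sequence counters, not channels): the consumer takes everything that has accumulated since it last looked, paying the synchronization cost once per batch instead of once per item.<p><pre><code> // drainBatch blocks for the first item, then opportunistically grabs whatever
 // else is already buffered, letting a lagging consumer catch up in one pass.
 func drainBatch[T any](ch <-chan T) []T {
     batch := []T{<-ch}
     for {
         select {
         case v := <-ch:
             batch = append(batch, v)
         default:
             return batch
         }
     }
 }</code></pre>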
The US Veterans Affairs system has waiting lists for medical care. Due to Congressional oversight the length of this waiting list was scrutinized. Adding capacity through hiring more medical personnel takes more budget and time. Load shedding through not letting people get on the wait list is unacceptable. So the bureaucracy added an additional buffer. They added patients to a secret overflow wait list and moved them to the official wait list only when it was not too long. <a href="https://www.cnn.com/2014/04/23/health/veterans-dying-health-care-delays/index.html" rel="nofollow">https://www.cnn.com/2014/04/23/health/veterans-dying-health-...</a>
Depends.<p>An overflowing queue that drops jobs from the front while the backend chugs along as fast as it can is in many cases a better outcome than the backend overloading and everything grinding to a halt.<p>Compare it to a physical service desk. The one that has a queue serves one person at a time, and people arriving will try again another day if the queue is too long. The one without a queue has people fistfighting over who gets to go first, and no one ends up getting service.
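A sketch of that drop-from-the-front behaviour (names are illustrative): when the buffer is full, evict the oldest waiting job rather than refusing the new one.<p><pre><code> // offer appends a job to a bounded FIFO, dropping the oldest waiting job when
 // full, so the backend always works on the freshest backlog.
 func offer[T any](q []T, job T, max int) []T {
     if len(q) >= max {
         q = q[1:] // shed from the front: the oldest job is least likely to still matter
     }
     return append(q, job)
 }</code></pre>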
(2014)<p>Fred is a big Erlang advocate from 10+ years ago.<p>He's written multiple books on Erlang, from his real-world experience of using it to build Heroku (Erlang underpinned much of the original Heroku platform, and large pieces still run on it today).<p>Much of his writing is influenced by said experience and Erlang's native queues.<p><a href="https://www.erlang-in-anger.com" rel="nofollow">https://www.erlang-in-anger.com</a><p><a href="https://learnyousomeerlang.com" rel="nofollow">https://learnyousomeerlang.com</a>
Really good thread. Several comments:<p>Flow Queuing allows applications not creating queues to bypass those that are. It is mildly different from fair queuing: <a href="https://ieeexplore.ieee.org/document/8469111" rel="nofollow">https://ieeexplore.ieee.org/document/8469111</a><p>Load shedding, at least for packets, benefits from head drop more than tail drop. Only the CoDel algorithm does this, but CoDel has been applied to other forms of queuing, like Uber dealing with a concertgoer overload.<p>Kathie Nichols has given some great presentations lately: <a href="https://www.understandinglatency.com/" rel="nofollow">https://www.understandinglatency.com/</a><p>There are a lot of unbounded queues in many applications & libraries, even in Rust. They make me twitchy. Even with bounded queues, the number chosen is usually arbitrary. I wish a timed queue were the default rather than a length.<p>I highly recommend Kleinrock's work on these subjects.<p>I am grumpy that big vendors like Juniper have yet to adopt smart queues, despite the obvious benefits. <a href="https://blog.cerowrt.org/post/juniper/" rel="nofollow">https://blog.cerowrt.org/post/juniper/</a>
Been doing a lot of work on electronics, and I think queues in a system are very similar to capacitors. They smooth out load either by buffering new work/load or by holding onto pending work/load...
Ultimately, if you overload, you overload. A system has a capacity, and a load. If the load is over the capacity, then not everything is gonna happen. This seems like a pretty basic law of, like, doing stuff.<p>[reads more]<p>Oh, OK, that's a lot of words to say "benchmark to find bottlenecks" and "focus on bottlenecks when optimising", but I understand, sometimes it takes a lot of words to get this simple idea across. There have been a few times over the years since it was published that this article would have been perfect to send to a special someone who keeps insisting we implement Merkle trees or whatever in order to speed up a product that reads in all its files using<p><pre><code> while (fscanf(cur_file, "%c", &ch) == 1) {
// process character here
}</code></pre>
See also: RFC 970: On Packet Switches With Infinite Storage [1]<p>[1] <a href="https://www.rfc-editor.org/rfc/rfc970.html" rel="nofollow">https://www.rfc-editor.org/rfc/rfc970.html</a>
> <i>All of a sudden, the buffers, queues, whatever, can't deal with it anymore. You're in a critical state where you can see smoke rising from your servers, or if in the cloud, things are as bad as usual, but more!</i><p>There's a valid point here, which is that queues can mask problems. Everything seems fine for a while. Until suddenly it isn't.<p>Queues take away important feedback about load. Without feedback, you don't know that problems are looming. You also may falsely believe you've fixed something when you really haven't.<p>But, there's a solution: monitoring and alerting on the queue. Don't alert when it is just about full or just about as bad as you can allow. Alert as soon as it starts to grow beyond normal.<p>Some possible metrics:<p>(1) Number of enqueued items exceeds some threshold. Simple, but you may have to readjust the threshold from time to time. (And if the threshold is too low, people may learn to ignore this alert.)<p>(2) Length of time the most recently dequeued item had been sitting in the queue. When you dequeue an item, if it's been in there for hours (and you were expecting minutes), something is probably wrong.<p>(3) How long it has been since the queue was (within epsilon of) empty. If you kinda mostly need items to be processed immediately and the queue is a fallback, it shouldn't be used very often or for long stretches of time. (You could also alert on what percentage of the time, over some time window, it was / wasn't empty.)<p>(4) How long it has been since a worker (successfully) processed an item taken from the queue. If all your workers die, you might as well know about it right now. (You need to somehow account for empty queues, i.e. workers not doing anything because there's no work to do.)
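As a sketch of what (2) and (4) can look like in code (the names and thresholds are made up): stamp items on enqueue, check their age on dequeue, and track when any worker last made progress.<p><pre><code> type item[T any] struct {
     payload    T
     enqueuedAt time.Time
 }

 // onDequeue emits the two staleness signals: how long this item sat in the
 // queue, and when a worker last made progress at all (a separate watchdog
 // alerts if the latter stops advancing).
 func onDequeue[T any](it item[T], lastProgress *time.Time, maxWait time.Duration) {
     if wait := time.Since(it.enqueuedAt); wait > maxWait {
         log.Printf("queue wait %v exceeds threshold %v", wait, maxWait)
     }
     *lastProgress = time.Now()
 }</code></pre>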
So far as I know there is no theoretical alternative to load shedding or increasing handling capacity if your average request arrival rate is greater than your average request handling rate. At least, not if you want to handle every accepted request using a finite queue[1]. It would appear that with an unbounded queue every request will eventually be handled, but with unbounded latency. Which appears equivalent to “grinds to a halt” for sufficiently large n.<p>However, that may very well change with fair queueing. If unbounded queue storage is available and you can prioritize requests in a timely fashion, then instead of shedding excess requests they can go into unbounded queue limbo instead, while still meeting the SLA for priority requests. I imagine there is developed queueing theory for that.<p>[1] I have been taught here, though, that classical queuing theory isn’t adequate for all cases. I think it is here, but I will gratefully accept correction if I’m wrong.
Real life queues can be scary too. I think of how complicated Disney's fast pass system got <a href="https://www.youtube.com/watch?v=9yjZpBq1XBE" rel="nofollow">https://www.youtube.com/watch?v=9yjZpBq1XBE</a>. Luckily with software it is way easier to get more servers than it is to build more theme park rides.
I recently ran into an overload issue and it turned out to be a not-obvious "hard limit" that was mentioned. Everything would be smooth for a bit and then my throughput would be halved after I walked away, backing up the queues indefinitely and paging me again.<p>I had moved a single-broker RabbitMQ from GCP to AWS and the instance type I chose had bandwidth "up to" 10Gbps. Being less familiar with AWS, I did not realize they will actively throttle based on credits because "up to" means "burstable to" regardless of available capacity. My messages are pretty large and I was running out of credits after about an hour.<p>Bandwidth was the last thing I considered since I hadn't had the issue on GCP. Switching to a larger instance with guaranteed bandwidth was a band-aid. Clustering to spread the load between multiple instances will be my longer term fix. Lesson learned, hopefully this helps someone someday.
This is fitting. I just spent 99 minutes trying to register my kid for summer camps. Eventually got a “payment has been processed redirecting to receipt” message… which errored out.
FIFO queues do not fix overload. Fair queuing queues, though, can fix it if the problem is coming from a specific source.<p>I have a web site set up that way. An in-memory table in MySQL is used to handle queuing for a slow operation. The queuing system has some fairness. This works well enough that one site sent bogus requests for a month at a high rate, and it had no effect on actual users.<p>Fair queuing in SQL:<p><pre><code> SELECT domain, requestor_ip_hash, rating_state_int, base_domain FROM ratingqueue AS rqueue
WHERE (rating_state_int = 3 OR rating_state_int = 4)
AND NOT EXISTS(SELECT * FROM ratingqueue
WHERE base_domain = rqueue.base_domain
AND (rating_state_int = 1 OR rating_state_int = 2))
ORDER BY rating_state_int, request_timestamp
LIMIT 1;</code></pre>
What's the snarky bit at the end of the article about?<p>> And then of course, there's the use case where you use the queue as a messaging mechanism between front-end threads/processes [...] because your language doesn't support inter-process communications.<p>Isn't it the other way around? A message broker is used as a simple queue, but then you immediately have the option to scale up to n producers and m consumers if demand requires it.<p>But even if there is only one single producer and one single consumer, you still get the advantage of decoupling the two sides of the queue.
Another thing: queues often lack prioritization of messages; for instance, a new user signup may end up waiting behind a backlog of less important address updates.
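One cheap way to bolt prioritization onto a channel-based queue (a sketch, not how any particular broker does it): keep two queues and always try the high-priority one first.<p><pre><code> // next prefers high-priority work (e.g. signups) and only falls back to the
 // low-priority queue (e.g. address updates) when nothing urgent is waiting.
 func next[T any](high, low <-chan T) T {
     select {
     case v := <-high:
         return v
     default:
     }
     select {
     case v := <-high:
         return v
     case v := <-low:
         return v
     }
 }</code></pre>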
What's tricky is that queues are generally difficult to avoid; they are everywhere in computing. The network has queues, hardware has queues, services have queues, operating systems have queues, frameworks and runtimes have queues, your clients will have queues. You often have very limited visibility into many of those, never mind any control, especially if you are not even aware that the queue exists.
In the first paragraph, Fred mentions and links to Erlang in Anger.<p>When I was coming up to speed on the BEAM this was such an amazing resource. The fact that it's free is bananas. In addition to the load management stuff, he also talked about some observability details that are really helpful (like how to more accurately understand CPU load on the BEAM). Highly recommend.
This article is a version of Theory of Constraints[0] aka Value Stream Mapping:<p>- every system has a bottleneck<p>- fixing things _after_ the bottleneck will have no effect<p>- fixing things _before_ the bottleneck <i>will make the bottleneck worse</i><p><a href="https://en.wikipedia.org/wiki/Theory_of_constraints" rel="nofollow">https://en.wikipedia.org/wiki/Theory_of_constraints</a>
Anyone looking to build a practical solution that involves weighted-fair queueing for request prioritization and load shedding should check out - <a href="https://github.com/fluxninja/aperture">https://github.com/fluxninja/aperture</a><p>The overload problem is quite common in generative AI apps, necessitating a sophisticated approach. Even when using external models (e.g. by OpenAI), the developers have to deal with overloads in the form of service rate limits imposed by those providers. Here is a blog post that shares how Aperture helps manage OpenAI gpt-4 overload with WFQ scheduling - <a href="https://blog.fluxninja.com/blog/coderabbit-openai-rate-limits" rel="nofollow">https://blog.fluxninja.com/blog/coderabbit-openai-rate-limit...</a>
I think we all need to read Enterprise Integration Patterns.<p>You would have an idempotent queue + a circuit breaker pattern. If the queue depth is too large, you break the circuit.<p>But if your profit per request is higher than your infra cost per request, why not autoscale till the cows come home? Within reason, of course.
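A sketch of the breaker side of that (hypothetical names; a real implementation would add a half-open probe): trip on queue depth rather than error rate, and reset only once the backlog has drained well below the trip point.<p><pre><code> // breakerOpen trips when the backlog crosses tripAt and, once open, stays
 // open until the backlog drains below resetAt (simple hysteresis).
 func breakerOpen(wasOpen bool, depth, tripAt, resetAt int) bool {
     if depth >= tripAt {
         return true
     }
     if wasOpen && depth > resetAt {
         return true
     }
     return false
 }</code></pre>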
For anyone interested in this subject, the book <i>Performance Modeling and Design of Computer Systems: Queueing Theory in Action</i> is really, really good. TFA introduced me to queueing theory and that book made me understand the subject better than anything else I have read.
Speaking of queues, any book on in-depth practical queuing theory? This book is highly recommended by many people, including those on HN: <a href="https://www.cs.cmu.edu/~harchol/PerformanceModeling/book.html" rel="nofollow">https://www.cs.cmu.edu/~harchol/PerformanceModeling/book.htm...</a>. However, a reading group by seasoned researchers said the book was not necessarily practical on production systems: <a href="https://emptysqua.re/blog/review-queue-theory-book/" rel="nofollow">https://emptysqua.re/blog/review-queue-theory-book/</a>.
And this understanding is extremely important when you are working on rate limiting. Instead of controlling the rate of requests, one must control the number of concurrent requests - <a href="https://docs.fluxninja.com/concepts/concurrency-limiter" rel="nofollow">https://docs.fluxninja.com/concepts/concurrency-limiter</a><p>Throughput (the number of requests processed) is different from capacity (the number of requests that can be handled). Managing capacity is a more practical and optimal solution than managing throughput.
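In Go, for example, the usual way to cap concurrency rather than rate is a semaphore channel (a sketch; the linked concurrency limiter layers weighted scheduling on top of the same idea):<p><pre><code> var slots = make(chan struct{}, 100) // at most 100 requests in flight

 // handle admits a request only if a concurrency slot is free; otherwise it
 // sheds immediately instead of letting a queue of waiters build up.
 func handle(work func()) error {
     select {
     case slots <- struct{}{}:
         defer func() { <-slots }()
         work()
         return nil
     default:
         return errors.New("at capacity")
     }
 }</code></pre>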
Discussed at the time:<p><i>Queues Don't Fix Overload</i> - <a href="https://news.ycombinator.com/item?id=8632043">https://news.ycombinator.com/item?id=8632043</a> - Nov 2014 (59 comments)
Corollary: The part of the system with the least amount of queuing will tend to fail the soonest in a situation of overload, thus you can always move the blame to a different part of the system by increasing queuing at the part that is failing.
The hard truth is that you need a ring buffer and to start discarding data, if you can! If you can't, then you'll have to wait or batch updates :)
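A minimal ring buffer of that sort (just a sketch): writes never block and simply overwrite the oldest entry once the buffer is full, which is the discard-if-you-can case.<p><pre><code> type ring[T any] struct {
     buf   []T // fixed capacity, e.g. make([]T, 1024)
     head  int // next slot to write
     count int
 }

 // push overwrites the oldest entry when full instead of blocking the producer.
 func (r *ring[T]) push(v T) {
     r.buf[r.head] = v
     r.head = (r.head + 1) % len(r.buf)
     if r.count < len(r.buf) {
         r.count++
     }
 }</code></pre>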
That's why functions like writev/readv exist
It's extremely easy to introduce a backpressure mechanism into your tech stack by using Go as a valve. If you can arrange data flow through a Go process, even if it's just a small tool inside your PHP/Rust/Javascript dream stack, then you can get backpressure done in about 5 lines of code:<p><pre><code> var ErrSlowDownPlease = errors.New("slow down, please")

 // valve attempts a nonblocking send; if the channel can't take the value
 // right now, the caller is told to back off rather than queue unboundedly.
 func valve[T any](ch chan<- T, value T) (ok bool, err error) {
select {
case ch <- value:
return true, nil
default:
return false, ErrSlowDownPlease
} // nonblocking select
}
</code></pre>
The other half of the valve is a goroutine that reads from `ch` and pushes values further down your pipeline. The channel can be buffered or unbuffered, it doesn't matter. Flow through this coupling will only proceed at the rate supported by the goroutine, and you can use this fact to partition your stack into rated flow domains.<p>Sometimes it's better to deploy a simple tool into existing stacks to help patch overload conditions until a "native" fix can be developed.
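For completeness, a sketch of that consumer half (names are illustrative): a goroutine draining the channel at whatever rate the next stage can sustain, which is exactly the rate the valve ends up exposing to upstream producers.<p><pre><code> // drain is the downstream half of the valve: it forwards values at the pace
 // the next stage can handle, and that pace is what upstream callers feel
 // through the valve's nonblocking send.
 func drain[T any](ch <-chan T, process func(T)) {
     go func() {
         for v := range ch {
             process(v)
         }
     }()
 }</code></pre>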