I think you want controls around:<p>0) Load - I'm too loaded right now, so I'm not processing your message. Maybe another worker will. This might be enqueued back into the low-latency queue.<p>1) Rate Limit - You've sent too many messages in too short a time, so I'm dequeueing this into the high-latency queue.<p>2) Cost Limit - You've sent too many <i>expensive</i> messages, so I'm dequeueing this into the high-latency queue.<p>3) Dead Letter - Your messages are failing to be processed. We're going to dequeue them into the dead-letter queue and maybe stop processing them for now.<p>So you have the normal low-latency queue, the high-latency backlog queue, and the dead-letter queue. Ideally you have SLOs around end-to-end queue times and monitoring/alerting to let you and your senders know when their expectations may not be met (e.g. 99% of messages are processed within n ms).
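A rough sketch of those four controls as a single routing decision (all names and thresholds here are hypothetical, just to make the tiers concrete):

```python
# Hypothetical thresholds - tune per workload.
MAX_IN_FLIGHT = 100       # 0) load: this worker is saturated
RATE_LIMIT_PER_MIN = 60   # 1) messages per sender per minute
COST_BUDGET = 1_000       # 2) cost units per sender per window
MAX_FAILURES = 5          # 3) dead-letter threshold

def route(in_flight: int, sender_rate: int, sender_cost: int,
          failures: int) -> str:
    """Return which queue a message should land in, or 'process' to handle it now."""
    if failures >= MAX_FAILURES:
        return "deadletter"       # stop retrying for now
    if in_flight >= MAX_IN_FLIGHT:
        return "low_latency"      # re-enqueue; another worker may pick it up
    if sender_rate > RATE_LIMIT_PER_MIN or sender_cost > COST_BUDGET:
        return "high_latency"     # demote noisy or expensive senders
    return "process"
```

The ordering matters: the dead-letter check comes first so a failing message doesn't keep bouncing between the latency tiers.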
I found the Amazon Builders' Library to have a very insightful list of how to handle building multi-tenant queue systems: <a href="https://aws.amazon.com/builders-library/avoiding-insurmountable-queue-backlogs/" rel="nofollow">https://aws.amazon.com/builders-library/avoiding-insurmounta...</a><p>I think one of the big tools is shuffle sharding. The article talks about standard sharding by itself not being enough to provide robustness in multi-tenant queues. But shuffle sharding - i.e. assigning users to virtual groups of underlying queues and enqueueing to the queue with the smallest size - gets you pretty far. It can limit throughput for individual users, but implementing some simple work-stealing logic on the consumer helps make sure you keep your throughput up.
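The core of shuffle sharding fits in a few lines: hash each tenant to a small, deterministic subset of the queues, then enqueue to the least-loaded queue in that subset. A toy sketch (queue counts and shard sizes are made up):

```python
import hashlib
import random

NUM_QUEUES = 8
SHARD_SIZE = 2  # each tenant only ever touches this many queues

def tenant_shard(tenant_id: str, num_queues: int = NUM_QUEUES,
                 shard_size: int = SHARD_SIZE) -> list:
    """Deterministically pick a per-tenant subset of queues (the 'shuffle shard')."""
    seed = int(hashlib.sha256(tenant_id.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    return rng.sample(range(num_queues), shard_size)

def enqueue(queues: list, tenant_id: str, job) -> int:
    """Enqueue to the least-loaded queue within the tenant's shard; return its index."""
    shard = tenant_shard(tenant_id, len(queues))
    target = min(shard, key=lambda q: len(queues[q]))
    queues[target].append(job)
    return target
```

With 8 queues and shards of size 2, two tenants only fully collide when they hash to the same pair, so one tenant's flood degrades a small fraction of other tenants instead of everyone.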
There are plenty of articles about writing simple job queues in Postgres using SKIP LOCKED, and I personally like that version quite a bit for cases where you don't want to add another separate component just for the queue. But I haven't seen anyone addressing fairness in such a queue, e.g. for the multi-tenant case.<p>Is there a simple way to get a very rough kind of fairness between tenants in this case? Doesn't have to be actually fair, just fair enough that no tenant gets fully stuck behind a large block of jobs from another tenant.<p>The idea behind SKIP LOCKED is that you just get the first free job (potentially ordered by e.g. a priority column). Doing that in a way that switches between tenants doesn't seem straightforward. You could execute the query once per tenant, but that would be rather inefficient for many tenants. Is there a reasonably straightforward way to achieve this?
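One rough approach (untested sketch, assuming a `jobs` table with `id`, `tenant_id`, `status`, `created_at`): rank jobs per tenant with a window function in a CTE - Postgres doesn't allow `FOR UPDATE` in the same query level as a window function - then lock in the outer query, ordering so every tenant's first pending job comes before anyone's second:

```sql
WITH candidates AS (
  SELECT id,
         row_number() OVER (PARTITION BY tenant_id
                            ORDER BY created_at) AS rn
  FROM jobs
  WHERE status = 'pending'
)
SELECT j.id
FROM jobs j
JOIN candidates c USING (id)
ORDER BY c.rn, j.created_at   -- interleave tenants: all rn=1 jobs first
FOR UPDATE OF j SKIP LOCKED
LIMIT 1;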
>> with Postgres and SKIP LOCKED always cropping up<p>Note that SKIP LOCKED is available in Microsoft SQL Server and MySQL, and I believe possibly also in Oracle and IBM Db2.<p>To the article, I suspect there will be a lot of detail challenged here.<p>There's talk in here about "creating queues" to handle the fairness problem. I might be misunderstanding what the hard bit is, but with a database-backed queue you don't have to create a queue - queues are virtual, depending on what messages are in there and what other information is available, such as username or tag. If you want to be "fair" around username, then round-robin on that; if you want to be fair around some other bit of data, tag the message when it goes into the queue and round-robin on that tag when processing.<p>What is the hard bit here? What am I missing?
This problem can be extended to systems like project management, where your workers are not computers but people.<p>For instance: Suppose you have a team of six people, and you're being asked to tackle two projects for two clients, each of equal value to the business. One project will take two weeks to complete; another will take four months to complete.<p>How do you determine a "fair" way to divide up those projects? Do you do one before the other? Do you divide up your team's capacity equally and do them in parallel?<p>This is the same sort of queuing problem described in the post. So a solution to the computer queuing problem might have interesting implications for the broader problem.
Is "step function" a term of art I'm just ignorant of in computer science? AWS Step Functions, sure. How it's used here? No idea.<p>> It enables developers to write declarative step functions...<p>> With thousands of customers running thousands of step functions containing parallel steps....<p>> If you're familiar with step functions, you probably already see what's happening here...
Fair queuing in multi-tenant scenarios also describes what goes on in some financial trading systems: those queuing mechanisms are often opened up to regulators to demonstrate their fairness.
Great job abstracting away so much complexity!<p>> each step is a code-level transaction backed by its own job in the queue. If the step fails, it retries automatically. Any data returned from the step is automatically captured into the function run's state and injected on each step call.<p>This is one of those things I've seen so many companies spend tons of time implementing themselves, and it happens _everywhere_ -- no-code apps, finance software, hospitals, anything that deals with ordering systems... the list goes on.<p>Glad I no longer need to write this from scratch!
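The semantics in that quote - retried steps whose results are memoized into the run's state - can be sketched in a few lines. This is a toy illustration, not the product's actual API:

```python
class Run:
    """Toy step-function run: each step retries on failure, and its return
    value is captured into run state so a replay injects the saved result
    instead of re-executing the step."""

    def __init__(self, max_attempts: int = 3):
        self.state = {}               # step name -> captured result
        self.max_attempts = max_attempts

    def step(self, name: str, fn, *args, **kwargs):
        if name in self.state:        # already completed: inject saved result
            return self.state[name]
        last_err = None
        for _ in range(self.max_attempts):
            try:
                result = fn(*args, **kwargs)
                self.state[name] = result   # capture into the run's state
                return result
            except Exception as e:          # failed: retry automatically
                last_err = e
        raise last_err
```

In a real system each `step` call would be backed by its own job in the queue and `state` would be durable, which is exactly the part that's tedious to build yourself.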
Another place to look for inspiration is the HPC world. Fairshare job queues have been running on multi-tenant HPC clusters for decades. Each job typically has a priority value assigned to it based on its submission time, the resources it's expected to require, and how much of the cluster the account/user has used recently. Priorities can be continuously updated until the job is released for processing.
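A toy version of such a priority function, combining those three factors (the weights are invented; real schedulers such as Slurm's multifactor priority plugin use more elaborate, decaying formulas):

```python
def fairshare_priority(submit_time: float, requested_cpu_hours: float,
                       recent_usage_share: float, now: float) -> float:
    """Higher value = scheduled sooner. Older jobs gain priority; large
    requests and heavy recent users lose it. Weights are illustrative."""
    age_hours = (now - submit_time) / 3600.0
    return (1.0 * age_hours                  # waiting longer -> higher priority
            - 0.1 * requested_cpu_hours      # big asks sink
            - 10.0 * recent_usage_share)     # share of cluster used recently, in [0, 1]
```

The key multi-tenant property is the last term: a tenant who has consumed a lot of the cluster recently is pushed down relative to light users with equally old jobs, and re-evaluating priorities periodically lets idle tenants drift back up.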
For a small side project, instead of queueing all the tasks or writing a custom fair multi-tenant queueing system, I added all the "tasks" to the db, then with a periodic job I query for pending tasks with SKIP LOCKED and enqueue them on my queue system (Sidekiq, GoodJob, etc.). I only enqueue them if there are fewer than some number of jobs pending; that limit is hardcoded, but I could determine it dynamically.<p>The query logic takes the tenants into consideration, so a big (or misbehaving) tenant can't block small ones.<p>The periodic job is executed every few seconds, but it's also enqueued when new tasks are added to the db (unless it's already enqueued) or when the query logic detects that more jobs are available, so the lag is minimal.<p>At the moment I am doing this with only one type of job, but I want to expand it to most of the jobs and let the "query logic" balance the queues.
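The dispatcher half of that pattern is easy to sketch. In the real system the fair selection would be the SKIP LOCKED query; here it's simulated in memory (cap and names are hypothetical):

```python
from collections import deque

MAX_PENDING = 10  # hardcoded cap on jobs sitting in the worker queue

def pick_fairly(tasks_by_tenant: dict, slots: int) -> list:
    """Round-robin across tenants so one big backlog can't starve the rest."""
    picked = []
    while slots > 0:
        progressed = False
        for tenant, tasks in tasks_by_tenant.items():
            if tasks and slots > 0:
                picked.append(tasks.popleft())
                slots -= 1
                progressed = True
        if not progressed:   # every tenant drained
            break
    return picked

def dispatch(queue_depth: int, tasks_by_tenant: dict) -> list:
    """Top the worker queue up to MAX_PENDING; return the tasks to enqueue."""
    slots = max(0, MAX_PENDING - queue_depth)
    return pick_fairly(tasks_by_tenant, slots)
```

Running this every few seconds (plus on task insert) gives roughly the behavior described: the worker queue stays short, and small tenants always get a slot per round.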
This is cool but pretty light on the details. Judging by how it reads, they create queues for each function, then create queues of queues for workers. It sounds similar to the QuiCK paper from Apple: <a href="https://www.foundationdb.org/files/QuiCK.pdf" rel="nofollow">https://www.foundationdb.org/files/QuiCK.pdf</a>
A "nuclear option" I've seen in the wild is Queue-It. Puts your entire flow behind a queue that's user-visible, with a waiting list page. Then presumably your own site's logic doesn't do any queuing.
btw "fair" is an entirely subjective and unproductive way to describe a system. There are simply properties and priorities. Many of us have learned "fair" to mean very different things: from equal load per capita, to you get what you pay for, to anything goes as long as no one gets hurt. In families we might see this as each child getting the same number of cookies. In capitalism, fair is whatever you negotiate and agree to. The point isn't what is or isn't fair, but that there is no universal "fair".<p>Should a user be able to jump the queue? What if they're your number-one driver of revenue? Or they have a time-sensitive function with a deadline? Or they've produced no load on the system in the past $TIME_UNIT? None of these is universally fair - they're just properties to be designed into a product.