Queue Despair: Ordering and Poison Messages

68 points, by tbonesteaks, over 3 years ago

17 comments

ydnaclementine, over 3 years ago

> Adding a timestamp to each message is an easy way for consumers to discard any out-of-order messages.

Not correct, but it's very easy to think timestamps will solve this. Timestamps aren't good because system times aren't synced precisely across different computers. If Producer A creates the first event and Producer B creates a second event 50ms later (imagine a single row getting updated twice in quick succession), but Producer A's system time is 100ms ahead of Producer B's and Producer B's event reaches the consumers first (variable network latency), then Producer A's event will look like the latest event from a timestamp perspective and overwrite Producer B's event.

One way to solve it is to not use timestamps, but a monotonically increasing version number associated with the row, incremented on every event/update and sent along with the event message payload. The book Designing Data-Intensive Applications goes into this problem in depth; I recommend it to anyone discussing architecture. Issues like this will seem obvious after reading it.

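A minimal sketch of the version-number idea described above, with hypothetical names (the store and handler are placeholders, not from any particular library): the consumer remembers the highest version it has applied per row and silently discards anything older.

    # Sketch: last-write-wins by monotonically increasing version, not timestamp.
    # `applied` stands in for whatever store the consumer uses (a DB table in practice).
    applied = {}  # row_id -> highest version applied so far

    def handle(event):
        """Apply an event only if it is newer than what we have already applied."""
        row_id, version = event["row_id"], event["version"]
        if version <= applied.get(row_id, -1):
            return  # stale or duplicate update: discard
        apply_update(event)        # hypothetical: write the new row state somewhere
        applied[row_id] = version

    def apply_update(event):
        print("applying", event)

    handle({"row_id": "r1", "version": 2, "value": "new"})
    handle({"row_id": "r1", "version": 1, "value": "old"})  # arrives late, ignored
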
mfateev, over 3 years ago

I think queues are the wrong abstraction for modeling business processes. That's why a trivial issue like a non-recoverable failure while processing a message becomes such a headache. The same goes for ordering. An orchestrator like temporal.io lets you model your business use case with higher-level abstractions that hide all this low-level complexity.

Disclaimer: I'm the tech lead of the temporal.io open source project and the CEO of the affiliated company.

danesparza, over 3 years ago

Ordering is too expensive. Don't ever count on it when using an asynchronous queue. It's akin to storing session state in a cache -- you're mixing your metaphors.

A queue should NEVER drop messages -- otherwise it's a shit queue, or you have a bug in your application code that needs to be fixed.

Poison messages are DEFINITELY a smell. They mean you essentially have a broken interface contract: the code adding messages expects one thing, the code processing messages expects another. Fix it by clearly documenting your message queue expectations and fixing your code. Most likely you need to add clear expectations for the lifetime of a message.

outsomnia, over 3 years ago

Rejecting a poison message explicitly IS sufficiently processing it.

It's common to have windows in time where two or more sides haven't agreed on what happened before they lose communication. Most of the problem can be solved with idempotency: when the peer retries, the receiver recognizes it is looking at a duplicate transaction and can discard it while indicating that it succeeded.

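A rough sketch of that idempotent-receiver pattern (all names are illustrative): the receiver records transaction IDs it has already handled and acknowledges duplicates without doing the work again.

    # Sketch: idempotent consumer that treats a retried duplicate as already succeeded.
    processed_ids = set()  # in practice a DB table or key-value store keyed by message/transaction id

    def receive(message):
        msg_id = message["id"]
        if msg_id in processed_ids:
            return "ok"            # duplicate retry: report success, do no work again
        do_work(message)           # hypothetical business logic
        processed_ids.add(msg_id)  # ideally recorded in the same transaction as do_work
        return "ok"

    def do_work(message):
        print("processing", message)

    receive({"id": "txn-1", "amount": 10})
    receive({"id": "txn-1", "amount": 10})  # retry from the peer, discarded as a duplicate
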
cheradenine_uk, over 3 years ago

Yes, this is a hard problem. Not even partitioning by tenant will always help you.

It's fundamentally equivalent to database ACID constraints, and the other modes described are great in the same way: if you're able to relax some of the ACID constraints in your code (say, by not being SERIALIZABLE), you get nice things in return (like reads never deadlocking).

If you can't know whether message N will change the outcome of processing message N+M, then you have to resolve that before you can proceed -- just as surely as a serializable database will wait for the outcome of transaction N before it can proceed with N+M.

kimi, over 3 years ago

Apart from your unit tests, there is no such thing as "messages need to be strictly ordered and messages cannot be lost". You can WISH for messages to come in the right sequence, and even count on it as an optimization, but if an event tracks something that happened and that event comes late -- even a week late -- your system cannot say "too bad, I told you, only events in the right order here" (which, at that point, is invalid in itself, as you have missed some).

matsemann, over 3 years ago

> Even if we ignore poison messages, strict ordering on its own isn't that easy to pull off. Namely, you're limited to a single consumer and can only fetch one message at a time.

Yup. If you want to be sure, you need to persist which message for each entity X you have already processed and ignore older ones. You also have to handle the race condition where two messages are handled almost at the same time by two different consumers, using a lock in a DB or similar. Both are annoying; ideally I could just process messages without any care.

I spent all day trying to architect this for a new queue where ordering matters but we need lots and lots of consumers working at the same time, prefetching messages. My case is actually quite similar to the "stream of vehicle positions" example in the article: I only really care about the latest one. The problem is that it's hard to know which is the latest without a DB to check whether I've already processed something more recent. Any other ideas on how to solve this efficiently? As in, strategies for handling messages arriving out of order, so I can avoid that as a requirement.

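One common answer to the question above, sketched with made-up table and column names: store the latest sequence number per vehicle and apply a position only through a conditional write, so two consumers racing on the same vehicle cannot overwrite a newer position with an older one. Here the database's own conflict handling replaces an explicit lock.

    # Sketch: keep only the newest position per vehicle using a conditional upsert.
    # The WHERE clause makes the write a no-op if a newer message already landed,
    # which also covers two consumers racing on the same vehicle.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE positions (vehicle_id TEXT PRIMARY KEY, seq INTEGER, lat REAL, lon REAL)")

    def apply_position(msg):
        # Insert the row if unseen, otherwise update only when this message is newer.
        db.execute(
            """INSERT INTO positions (vehicle_id, seq, lat, lon) VALUES (?, ?, ?, ?)
               ON CONFLICT(vehicle_id) DO UPDATE
               SET seq = excluded.seq, lat = excluded.lat, lon = excluded.lon
               WHERE excluded.seq > positions.seq""",
            (msg["vehicle_id"], msg["seq"], msg["lat"], msg["lon"]),
        )
        db.commit()

    apply_position({"vehicle_id": "bus-1", "seq": 7, "lat": 59.91, "lon": 10.75})
    apply_position({"vehicle_id": "bus-1", "seq": 5, "lat": 59.90, "lon": 10.74})  # older, ignored
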
mperham, over 3 years ago

Sidekiq has a lot of code to deal with both of these issues.

Sidekiq does not guarantee ordering within a queue; that's a terrible, very expensive guarantee. Developers don't want total ordering of jobs within a queue; they want to know that Job A will fully execute before Job B. There might be 1000 other jobs in the queue that are completely independent of that ordering, yet we've screwed ourselves by forcing total queue ordering. Instead, Sidekiq Pro provides a workflow API, Sidekiq::Batch, which lets the developer author higher-level workflows for Job A -> Job B and provides the ordering guarantee that way.

For poison pills, we detect jobs that were running when a Sidekiq process died. If this happens multiple times, the job is sent to the dead-letter queue so the developer can deal with it manually. If it was part of a Batch, the workflow stalls until the developer fixes the issue and executes the job manually to resume the workflow.

PaulHoule, over 3 years ago

Ordering is an expensive property.

I built a system where:

    1. Sensor events were picked up by a Z-Wave device connected to Samsung SmartThings
    2. SmartThings would call an AWS Lambda function I wrote (SmartThings lives in AWS, so this is efficient)
    3. My Lambda function posts an event to an SQS queue
    4. My home server takes the event off SQS and posts it to RabbitMQ
    5. A queue listener takes events from RabbitMQ and takes an action

As long as I was using ordered SQS queues, I would sometimes get a 5-second delay before a light turned on. When I turned off ordering, the latency was perceptible but didn't make me want to jam the button multiple times.

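For context on that FIFO-versus-standard trade-off, a small boto3 sketch (queue URLs and payloads are placeholders): FIFO queues require a MessageGroupId, only order within a group, and need a deduplication ID unless content-based deduplication is enabled on the queue -- which is part of where the extra latency and cost come from.

    # Sketch: sending to an SQS standard queue vs. a FIFO queue with boto3.
    import json
    import boto3

    sqs = boto3.client("sqs")

    # Standard queue: best-effort ordering, at-least-once delivery, lowest latency.
    sqs.send_message(
        QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/sensor-events",
        MessageBody=json.dumps({"sensor": "door-1", "state": "open"}),
    )

    # FIFO queue: strict ordering per MessageGroupId, deduplication within a window,
    # at the cost of throughput and latency.
    sqs.send_message(
        QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/sensor-events.fifo",
        MessageBody=json.dumps({"sensor": "door-1", "state": "open"}),
        MessageGroupId="door-1",
        MessageDeduplicationId="door-1-open-1638400000",
    )
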
thingification, over 3 years ago

One way to deal with this is to divide up your event streams into small streams -- say, one per order. Those small streams may then be aggregated into a larger stream so that you can process events for all the orders together, for example.

If you hit a poison message, block just that smaller stream, not the aggregated larger stream. Once you fix the problem, reprocess the entire small stream starting from the poison event, or from the next event after it. The "entire" stream here might be just a handful of events.

Greg Young's Event Store (https://www.eventstore.com/) works this way (there's a $by_category projection that produces the aggregated streams).

Caveat: I haven't actually implemented this mechanism, because I've been able to get away without it, because we have some legacy event streams that aren't split up this way, and because nobody else has yet added support for it to the tools I'm using.

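A very rough sketch of the per-order-stream idea in plain Python (this is not EventStore's API; every name here is invented): a poison event parks only its own small stream, the rest keep flowing, and after a fix the parked tail is replayed.

    # Sketch: block only the small (per-order) stream that hit a poison event.
    from collections import defaultdict

    blocked = {}                   # stream_id -> the poison event that stopped the stream
    pending = defaultdict(list)    # stream_id -> events parked behind the poison event

    def dispatch(event):
        stream_id = event["stream_id"]        # e.g. "order-42"
        if stream_id in blocked:
            pending[stream_id].append(event)  # park behind the poison event
            return
        try:
            process(event)                    # hypothetical handler
        except Exception:
            blocked[stream_id] = event        # only this order stops; others keep flowing

    def unblock(stream_id, retry_poison=True):
        """Call after fixing the handler: replay the small stream from the poison event."""
        poison = blocked.pop(stream_id, None)
        replay = ([poison] if retry_poison and poison else []) + pending.pop(stream_id, [])
        for event in replay:
            dispatch(event)

    def process(event):
        print("processing", event)
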
rhodin, over 3 years ago

RabbitMQ has poison message handling in quorum queues [0]. They are also FIFO.

[0] https://www.rabbitmq.com/quorum-queues.html#poison-message-handling

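For reference, declaring such a queue from Python might look roughly like this (a sketch with pika; the exchange and queue names are placeholders): quorum queues accept an x-delivery-limit argument, and a message that exceeds it is dropped or dead-lettered instead of being redelivered forever.

    # Sketch: a RabbitMQ quorum queue whose poison messages are dead-lettered
    # after a fixed number of redeliveries.
    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    ch = conn.channel()

    ch.exchange_declare(exchange="orders.dlx", exchange_type="fanout", durable=True)
    ch.queue_declare(queue="orders.dead", durable=True)
    ch.queue_bind(queue="orders.dead", exchange="orders.dlx")

    ch.queue_declare(
        queue="orders",
        durable=True,                  # quorum queues must be durable
        arguments={
            "x-queue-type": "quorum",
            "x-delivery-limit": 5,     # after 5 redeliveries the message stops poisoning the queue
            "x-dead-letter-exchange": "orders.dlx",
        },
    )
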
politician, over 3 years ago

Typically, an early and easy thing to do with busted messages is to just shunt them off into a "dead-letter queue". That's just the name of another queue where messages are handled manually.

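A bare-bones sketch of that pattern with hypothetical in-memory queues: after a handful of failed attempts, the message is parked in a second queue for manual handling instead of being retried forever.

    # Sketch: shunt a repeatedly failing message to a dead-letter queue.
    import queue

    main_q, dead_letter_q = queue.Queue(), queue.Queue()
    MAX_ATTEMPTS = 3

    def consume_one():
        msg = main_q.get()
        try:
            handle(msg["body"])                      # hypothetical handler
        except Exception:
            msg["attempts"] = msg.get("attempts", 0) + 1
            if msg["attempts"] >= MAX_ATTEMPTS:
                dead_letter_q.put(msg)               # parked for manual inspection
            else:
                main_q.put(msg)                      # retry later

    def handle(body):
        print("handled", body)

    main_q.put({"body": "hello"})
    consume_one()
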
tlarkworthy, over 3 years ago

I remember trying to productionize an ordered service, and the SREs were banging on about messages-of-death. Their band-aid solution was isolated regions, i.e. DO NOT LET YOUR REGIONAL SERVICES COMMUNICATE. What they were worried about was global cascading failures if/when someone pushes a mistake to prod.

It's kind of a shitty solution to the problem, but there you have it; maybe it's the best that can be done. Roll out code changes gradually, region by region, and make sure a bug doesn't bring everything down.

darkr, over 3 years ago

Total ordering is rarely required, and even more rarely actually possible (without Lamport or atomic clocks, etc.).

For more common use cases, it is possible to provide the minimum guarantees required to reliably reconstruct a domain object throughout a distributed system while still leaving a fuck-ton of scope for concurrency, batching and high throughput through better partition key choice, informed by:

A: the maximum ordering guarantees the data source can provide

B: the minimum ordering guarantees required to reconstruct a domain object

eternalban, over 3 years ago

Introduce a processing log so you maintain order and don't lose any messages, even a poisoned one. Faulty messages are processed like any other message (write a 'did msg x: faulted' entry) and then go to a fault queue.

If you have interdependence between messages, you need a message ID scheme that expresses it. For example, a hierarchical message order can use a dot-delimited scheme. If a poison message is a parent, the subsequent messages can go to the fault queue as well.

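A sketch of that dot-delimited ID scheme (every name here is invented for illustration): each message is logged whether it succeeds or faults, and a message whose ancestor has faulted goes straight to the fault queue.

    # Sketch: dot-delimited message ids ("42", "42.1", "42.1.3") where a faulted
    # parent drags its descendants into the fault queue too.
    processing_log = []   # append-only: (msg_id, "ok" | "faulted")
    faulted = set()
    fault_queue = []

    def ancestors(msg_id):
        parts = msg_id.split(".")
        return {".".join(parts[:i]) for i in range(1, len(parts))}

    def handle(msg):
        msg_id = msg["id"]
        if faulted & (ancestors(msg_id) | {msg_id}):
            processing_log.append((msg_id, "faulted"))  # still logged, so nothing is lost
            fault_queue.append(msg)
            return
        try:
            process(msg)                                # hypothetical handler
            processing_log.append((msg_id, "ok"))
        except Exception:
            processing_log.append((msg_id, "faulted"))
            faulted.add(msg_id)
            fault_queue.append(msg)

    def process(msg):
        print("processing", msg)
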
helge9210, over 3 years ago

My understanding is that not losing any messages plus strict ordering amounts to "exactly once" delivery, which is not possible in the general case.

nickkell, over 3 years ago

That last part about a transactional outbox rings very true. I've been at a few places now where people expect the message bus to always be available.

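For completeness, a minimal sketch of a transactional outbox (table and column names are made up): the business write and the outgoing message commit in one local transaction, and a separate relay publishes from the outbox whenever the bus is actually reachable.

    # Sketch: transactional outbox -- the order and its event commit together,
    # so nothing depends on the message bus being up at write time.
    import json
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE orders (id TEXT PRIMARY KEY, total REAL);
        CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT, published INTEGER DEFAULT 0);
    """)

    def place_order(order_id, total):
        with db:  # one local transaction covers both writes
            db.execute("INSERT INTO orders (id, total) VALUES (?, ?)", (order_id, total))
            db.execute(
                "INSERT INTO outbox (payload) VALUES (?)",
                (json.dumps({"type": "order_placed", "id": order_id, "total": total}),),
            )

    def relay():
        """Run periodically; publishes pending outbox rows whenever the bus is available."""
        for row_id, payload in db.execute("SELECT id, payload FROM outbox WHERE published = 0").fetchall():
            publish_to_bus(payload)  # hypothetical broker call; may fail and be retried next run
            db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
        db.commit()

    def publish_to_bus(payload):
        print("publishing", payload)

    place_order("o-1", 99.0)
    relay()
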