You Cannot Have Exactly-Once Delivery

113 points by tylertreat about 10 years ago

17 comments

brianpgordon about 10 years ago

Well, you can have exactly-once delivery unless there are network partitions, which is not a particularly surprising limitation. As the author admits, it takes considerable cleverness in the implementation details in order to achieve that abstraction, but plenty of tools have done it.

"People often bend the meaning of “delivery” in order to make their system fit the semantics of exactly-once, or in other cases, the term is overloaded to mean something entirely different. State-machine replication is a good example of this. Atomic broadcast protocols ensure messages are delivered reliably and in order. The truth is, we can’t deliver messages reliably and in order in the face of network partitions and crashes without a high degree of coordination. This coordination, of course, comes at a cost (latency and availability), while still relying on at-least-once semantics."

This reminds me of the problem of reliable data transfer over an unreliable network. It's theoretically impossible, but TCP is still a practical, useful abstraction.

Animats about 10 years ago

Sure you can. What you can't have is, after a loss of communication, unambiguous knowledge about whether the other end got the last message.

When communication is reestablished, that issue can be sorted out. See "two-phase commit".

fsk about 10 years ago

Then how does Domino's get me a pizza within 30 minutes?

zheng about 10 years ago

Most systems I know which require this just do at-least-once on the sending side and dedupe on the receiving side. If you build this into your framework, then from the application's point of view you have exactly-once, barring unbounded partitions and the like.
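
A minimal sketch of that pattern, assuming every message carries a unique ID and the receiver can remember the IDs it has seen. The names `DedupingReceiver`, `send_at_least_once`, and `ack_lost` are hypothetical and only simulate a lossy link in memory; a real system would persist the seen-ID set.

```python
class DedupingReceiver:
    """Receiver that hands each message ID to the application at most once."""

    def __init__(self):
        self.seen_ids = set()        # IDs already applied (in-memory for the sketch)
        self.delivered = []          # payloads the application actually saw
        self.duplicates_dropped = 0  # redeliveries that were filtered out

    def receive(self, msg_id, payload):
        if msg_id in self.seen_ids:
            self.duplicates_dropped += 1
        else:
            self.seen_ids.add(msg_id)
            self.delivered.append(payload)
        return True  # always ack, duplicates included


def send_at_least_once(receiver, msg_id, payload, ack_lost, max_attempts=5):
    """Resend until an ack arrives; ack_lost(attempt) simulates a lost ack."""
    for attempt in range(1, max_attempts + 1):
        acked = receiver.receive(msg_id, payload)
        if acked and not ack_lost(attempt):
            return attempt
    # An unbounded partition shows up here as an error, not as a duplicate.
    raise TimeoutError("no ack within the retry budget")


receiver = DedupingReceiver()
for msg_id, payload in enumerate(["a", "b", "c"]):
    # Simulate the ack being lost on the first two attempts of every message.
    send_at_least_once(receiver, msg_id, payload, ack_lost=lambda attempt: attempt <= 2)

print(receiver.delivered, receiver.duplicates_dropped)
```

The sender transmits every message three times here, yet the application-facing list contains each payload once. The retry budget is where the "barring unbounded partitions" caveat becomes visible.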

jacques_chester about 10 years ago

My understanding of the FLP result is that it only applies to algorithms without timeouts.

That is because it amounts to saying that, without timeouts, you could in theory be waiting forever for a message to arrive. It's formally proved and whatnot, but it's not a super-surprising result when you restate it in common terms.

For message delivery, it seems that FLP says that you need to give up at-least-once, because at-least-once ostensibly requires an infinite number of retries. I'm not sure how it would apply to at-most-once.

Exactly-once is, so far as I can tell, a bit of a strawman. But it's a strawman that we all fall for, the first time we break stuff into different systems.

As a disclaimer, I am still wrapping my head around this stuff. Don't rely on me. Not even if your name is Salvatore.

zkhalique about 10 years ago

Is it just me or did the following alliteration catch someone's eye?

DD: I myself shared many of these misconceptions, so I try not to demean or dismiss
EE: but rather educate and enlighten, hopefully while sounding less preachy than that just did.
FF: I continue to learn only by following in the footsteps of others.

eclark about 10 years ago

No, you can't have exactly-once delivery. However, you can mitigate this if the datastore for the output of your message processing is the same as the datastore for the queue. With that and atomic mutations (de-queue and store the result in one step), practical solutions are possible for almost all edge cases.
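
A rough sketch of that idea, assuming the queue and the results both live in one SQLite database so a single transaction covers the dequeue and the result write. The table names `queue` and `results`, the `process_next` helper, and the upper-casing "work" are illustrative choices, not anything from the article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE queue (id INTEGER PRIMARY KEY, body TEXT NOT NULL);
    CREATE TABLE results (msg_id INTEGER PRIMARY KEY, output TEXT NOT NULL);
    INSERT INTO queue (id, body) VALUES (1, 'hello'), (2, 'world');
""")


def process_next(conn, crash_before_commit=False):
    """Dequeue one message and store its result in a single transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            row = conn.execute(
                "SELECT id, body FROM queue ORDER BY id LIMIT 1"
            ).fetchone()
            if row is None:
                return False  # queue is empty
            msg_id, body = row
            conn.execute(
                "INSERT INTO results (msg_id, output) VALUES (?, ?)",
                (msg_id, body.upper()),
            )
            conn.execute("DELETE FROM queue WHERE id = ?", (msg_id,))
            if crash_before_commit:
                raise RuntimeError("simulated crash")
    except RuntimeError:
        pass  # the result row and the dequeue were rolled back together
    return True


process_next(conn, crash_before_commit=True)  # crash: nothing is half-applied
after_crash = (
    conn.execute("SELECT COUNT(*) FROM queue").fetchone()[0],
    conn.execute("SELECT COUNT(*) FROM results").fetchone()[0],
)
while process_next(conn):  # retry until the queue is drained
    pass
results = conn.execute("SELECT msg_id, output FROM results ORDER BY msg_id").fetchall()
remaining = conn.execute("SELECT COUNT(*) FROM queue").fetchone()[0]
print(after_crash, results, remaining)
```

Because the dequeue and the result insert commit or roll back together, a crash leaves the message on the queue with no partial result. This only holds while the processing has no side effects outside the database.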

jermo about 10 years ago

Another great read on this subject: "Exactly-Once Delivery May Not Be What You Want" https://brooker.co.za/blog/2014/11/15/exactly-once.html

peterwwillis about 10 years ago

I hope the author isn't making an argument that network partitions are somehow the only consideration of whether a message is delivered. Even on a single-user, single-process machine, delivery can't be totally guaranteed.

'Exactly once' delivery sounds like 'guaranteed perfect one time delivery under all circumstances', which is a dumb way to rephrase an idea whose literal translation is to deliver something one time and not more or less.

In other words, of course you can deliver something exactly one time. Just not every time.

billpg about 10 years ago

Shameless plug: I wrote about this issue with a lot of APIs some time ago. When a connection is broken unexpectedly, the protocol simply has no way to recover the state of play and there's a risk of duplicate transactions popping up.

http://blog.hackensplat.com/2014/07/is-your-api-broken.html

rusanu about 10 years ago

Having spent 7 years of my life implementing an Exactly-Once-In-Order (EOIO) messaging system in SQL Server Service Broker[0] (SSB), I take some exception to the author's claim.

Here is how SSB achieves EOIO:

- The initiator establishes intent to communicate using the BEGIN DIALOG[1] statement (SSB dialogs are the equivalent of a durable, long-lived TCP session). This creates the necessary state in the database by creating a row in sys.conversation_endpoints, with initial send_sequence_number 0.
- The sender issues a SEND[2] statement. The message is assigned the current send_sequence_number (0), the sys.conversation_endpoints send_sequence_number is incremented to 1, and the message is inserted into sys.transmission_queue.
- After the SEND transaction commits, a background transmitter reads the message from sys.transmission_queue, connects to the destination, and delivers the message over the wire (TCP).
- The target reconstructs the message from the wire and, in a single transaction, creates a row in its sys.conversation_endpoints (receive_sequence_number is 0) and delivers the message into the destination queue.
- After the message delivery commits, the target constructs an acknowledgement message and sends it back to the initiator over the wire.
- The sender gets the acknowledgement of message 0 and deletes the message from sys.transmission_queue.
- The sender may retry delivery periodically if it does not receive the ack.
- The target will send back an ack immediately on receipt of a duplicate (message sequence number is less than the current receive_sequence_number).

What this protocol achieves is idempotency of delivery, hidden from the messaging application. The database WAL ensures stability in the presence of crashes. E.g., if the target crashes in the middle of enqueueing the message into the destination queue, then the entire partial processing of the enqueue is rolled back on recovery and the next retry from the sender will succeed. If the target crashes after processing is committed but before sending the ack, then on recovery the enqueue is successful and the next retry from the initiator will immediately be answered with an ack, allowing the initiator to delete the retried message and make progress (send the next message in sequence). Note that there is no two-phase commit involved.

Retries of unacknowledged messages occur for the dialog lifetime, which can be days, months, even years. Databases are good at keeping state for so long. SSB uses logical names for destinations (from 'foo' to 'bar') and routing can be reconfigured mid-flight (i.e. the location where 'bar' is hosted can be changed transparently). Long-lived state and routing allow for transparent reconfiguration of network topologies, handling downtime, and managing disaster (the target is lost and rebuilt from backups). Most of the time this is transparent to the SSB application.

Furthermore, the guarantees can be extended to application semantics as well. Applications dequeue messages using the RECEIVE[3] statement. In a single transaction the application would issue a RECEIVE to dequeue the next available message, look up the app state pertaining to the message, modify the state, send a response using SEND[2], and commit. Again the WAL guarantees consistency: after a crash everything is rolled back and the application would go again through exactly the same sequence (the response SEND cannot communicate anything on the wire until after commit, see above).

So EOIO is possible.

One has to understand the trade-offs implied. Something like SSB will trade off latency for durability. Applications need not worry about retries, duplicates, missing messages, routing etc. as long as they are capable of handling responses coming back hours (or maybe weeks) after the request was sent. And application processing of a message is idempotent (RECEIVE -> process -> crash -> rollback -> RECEIVE -> re-process) only as long as the processing is entirely database bound (updating state in some app tables, not making REST calls). Yet such apps are not unusual: they use some database to store state and communicate with some other app that also uses a database to store state. Unlike most messaging systems, SSB stores the messages in the database, thus achieving WAL consistency along with the app state. Many SSB applications are entirely contained in the database, code included. They use SSB activation[4] to react to incoming messages, without keeping any state in memory.

In SSB both the initiator and the target are monolithic, SMP systems (not distributed). Together the two form a distributed system. It trades off availability over partitioning, but one has to understand how this trade-off occurs. In case of partitioning (the target is unreachable) the application continues to be available locally (the SEND statement succeeds). If the network partitioning is not resolved over the lifetime of the dialog, then the application will see an error. If the partitioning is resolved, then message flow resumes and the application-layer responses start showing up in the queue. Again, activation and durable state make this easy to handle, as long as a latency of potentially days makes sense in the business. Shorter lifetimes (hours, minutes) are certainly possible and in such cases, if network partitioning is not resolved in time, the timeout error will occur sooner.

[0] https://msdn.microsoft.com/en-us/library/bb522893.aspx
[1] https://msdn.microsoft.com/en-us/library/ms187377.aspx
[2] https://msdn.microsoft.com/en-us/library/ms188407.aspx
[3] https://msdn.microsoft.com/en-us/library/ms186963.aspx
[4] https://technet.microsoft.com/en-us/library/ms171617.aspx
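
The sequence-number and acknowledgement steps described above can be sketched as an in-memory toy. This is not the Service Broker API: the `Initiator` and `Target` classes, the `ack_lost` hook, and the choice to leave out-of-order messages unacknowledged are assumptions made for illustration, and plain Python lists stand in for the transactional tables.

```python
from collections import deque


class Target:
    """Receiving endpoint: tracks the next expected sequence number."""

    def __init__(self):
        self.receive_sequence_number = 0
        self.queue = []  # destination queue read by the application

    def deliver(self, seq, body):
        """Return True if an ack should go back for this message."""
        if seq < self.receive_sequence_number:
            return True   # duplicate of an already enqueued message: re-ack
        if seq > self.receive_sequence_number:
            return False  # assumption: a gap is left unacked until retried
        # In the real system the enqueue and counter update share one transaction.
        self.queue.append(body)
        self.receive_sequence_number += 1
        return True


class Initiator:
    """Sending endpoint: keeps each message until it is acknowledged."""

    def __init__(self):
        self.send_sequence_number = 0
        self.transmission_queue = deque()

    def send(self, body):
        self.transmission_queue.append((self.send_sequence_number, body))
        self.send_sequence_number += 1

    def transmit(self, target, ack_lost=lambda seq, attempt: False, max_attempts=10):
        """Retry the oldest unacked message until its ack arrives."""
        while self.transmission_queue:
            seq, body = self.transmission_queue[0]
            for attempt in range(1, max_attempts + 1):
                acked = target.deliver(seq, body)
                if acked and not ack_lost(seq, attempt):
                    self.transmission_queue.popleft()  # acked: safe to delete
                    break
            else:
                raise TimeoutError("dialog lifetime expired without an ack")


initiator, target = Initiator(), Target()
for body in ["a", "b", "c"]:
    initiator.send(body)

# Simulate losing the first ack for message 1, which forces a retransmission.
initiator.transmit(target, ack_lost=lambda seq, attempt: seq == 1 and attempt == 1)
print(target.queue, target.receive_sequence_number)
```

Message 1 crosses the wire twice, but the second copy hits the duplicate branch and is only re-acknowledged, so the destination queue holds each message once and in order.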

ryanjshaw about 10 years ago

This topic seems to come up regularly, but I feel the discussions I see here are far too heavy to digest for the people who would benefit most from understanding the issues being presented (the OP's post is > 1,300 words).

I believe what it really comes down to is that people new to distributed processing think they want "exactly once delivery", but later they (hopefully) learn they really want "exactly once processing". For example:

> We send him a series of text messages with turn-by-turn directions, but one of the messages is delivered twice! Our friend isn’t too happy when he finds himself in the bad part of town.

This is easily resolved by amending each message with an ordinal indicator, e.g. "1. Head towards the McD", "2. Turn left at the traffic lights", etc. The receiver then dedupes and the instructions follow "exactly once processing". Processing messages exactly once is a "business rule" and the responsibility for doing so lies in the business logic of the application.

This example also brings up another typical point of confusion in building distributed systems: people actually want "ordered processing", not "ordered delivery". The physical receiving order does not matter: your friend will not attempt to follow instruction #2 without first following instruction #1. If instruction #2 is received first, your friend will wait for instruction #1.

It's also important to note that the desired processing order of the messages has nothing to do with the physical sending order: I could be receiving directions to two different places from two different people, and it doesn't matter what order they are sent or received, just that all the messages get to me! A great article covering these topics in more detail with other real world examples is "Nobody Needs Reliable Messaging" [1].

I think it is useful to try to understand why new distributed system builders run into these difficulties. I suspect they try to apply local procedure call semantics to distributed processing (a fool's errand), and message queue middleware works well enough that a naive "fire and forget" approach is the first strategy they attempt. When they subsequently lose their first message (hopefully before going into production), it's natural to think in terms of patching (e.g. distributed transactions, confirmation protocols, etc.) rather than to consider if the overall design pattern is appropriate.

Oddly, there is at least one very well designed solution that addresses these challenges - the Data Distribution Service (DDS) [2] - but I almost never see or hear about it at any of my clients.

[1] http://www.infoq.com/articles/no-reliable-messaging
[2] http://portals.omg.org/dds/
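
To make the "ordered processing, not ordered delivery" point concrete, here is a small sketch in which the receiver buffers numbered instructions and acts on them in ordinal order, ignoring duplicates. `OrderedProcessor` is a hypothetical name, and the sketch assumes ordinals start at 1 and come from a single sender.

```python
class OrderedProcessor:
    """Processes numbered messages exactly once, in ordinal order."""

    def __init__(self, first_ordinal=1):
        self.next_ordinal = first_ordinal
        self.pending = {}    # ordinal -> text, received but not yet processable
        self.processed = []  # instructions actually followed, in order

    def receive(self, ordinal, text):
        # Already processed or already buffered: this is a duplicate, drop it.
        if ordinal < self.next_ordinal or ordinal in self.pending:
            return
        self.pending[ordinal] = text
        # Follow every instruction that is now unblocked.
        while self.next_ordinal in self.pending:
            self.processed.append(self.pending.pop(self.next_ordinal))
            self.next_ordinal += 1


friend = OrderedProcessor()
arrivals = [
    (2, "Turn left at the traffic lights"),  # arrives first, has to wait
    (2, "Turn left at the traffic lights"),  # duplicate delivery
    (1, "Head towards the McD"),             # unblocks instruction 2
    (1, "Head towards the McD"),             # late duplicate
]
for ordinal, text in arrivals:
    friend.receive(ordinal, text)

print(friend.processed)
```

Delivery here is both duplicated and out of order, yet the friend follows each instruction once and in sequence. Directions from a second sender would need their own processor, for example one keyed by conversation.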

einarvollset about 10 years ago

But you can have it with probability 1...

dbenhur about 10 years ago

Life gets more pleasant when you internalize that consistency is not aligned with how the universe works and idempotence is your friend.

jv22222 about 10 years ago

Could the blockchain be used as a send-only-once system?

I mean, if you used the blockchain as a message queue it could be pretty reliable, no?

williamcotton about 10 years ago

Would a system that prevents double-spending of a digital currency be equivalent to exactly-once delivery?

coops about 10 years ago

The best you can do is two-phase commit.
