Exactly-once Semantics: How Kafka Does it

259 points, by listentojohan, almost 8 years ago

11 comments

sks · almost 8 years ago

Providing an API for building applications that have transactions, and help with idempotent producing, is really exciting. This will help lower a lot of the pain associated with building stream processing systems. Doing it with very little performance overhead is amazing.

I do feel that calling it "Exactly-once delivery with Kafka" is slightly misleading, as this requires the applications to be written in a certain way. The title makes it sound too general and borders on claiming something that is close to impossible. I don't want to be too critical here, as the author was very honest about what this means in the blog post. Regardless of the title, this is an amazing feature.
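To make the transactional piece of that concrete, here is a minimal in-memory sketch (all names hypothetical; this is not the real Kafka client API) of why committing the transformed output together with the new input offset in one atomic step closes the duplicate window:

```python
# Hypothetical in-memory model of a transactional consume-transform-produce
# loop: output records and the new consumer offset commit atomically, so a
# crash followed by a retry never re-emits already-committed results.
class Broker:
    def __init__(self):
        self.output_log = []       # records visible to downstream consumers
        self.committed_offset = 0  # input offset for our "consumer group"

    def commit_txn(self, records, new_offset):
        # All-or-nothing: either both take effect, or (on a crash before
        # this call) neither does.
        self.output_log.extend(records)
        self.committed_offset = new_offset

def process(broker, input_log, crash_before_commit=False):
    start = broker.committed_offset
    out = [x * 2 for x in input_log[start:]]  # the "transform" step
    if crash_before_commit:
        return  # neither the output nor the offset was committed
    broker.commit_txn(out, len(input_log))

broker = Broker()
inp = [1, 2, 3]
process(broker, inp, crash_before_commit=True)  # simulated failure
process(broker, inp)                            # retry after restart
assert broker.output_log == [2, 4, 6]           # no duplicated output
```

The retry simply re-reads from the last committed offset and re-emits inside a new transaction, so the earlier aborted attempt leaves no trace downstream.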
nicktelford · almost 8 years ago

These are some pretty huge improvements to what was previously the weakest link in Kafka's API: the Producer.

However, it's important to note that this can only provide you with exactly-once semantics provided that the state/result/output of your consumer is itself stored in Kafka (as is the case with Kafka Streams).

Once you have a consumer that, for example, makes non-idempotent updates to a database, there's the potential for duplication: if the consumer exits after updating the database, but before committing the Kafka offsets. Alternatively, it can lead to "message loss" if you use transactions on the database, and the application exits after the offsets were committed, but before the database transaction was committed.

The traditional solution to this problem is to provide your own offset storage for this consumer within the database itself, and update these offsets in the same transaction as the database modifications. However, I'm not certain that, combined with the new Producer API, this would provide exactly-once semantics.

Even if it doesn't, it's still a huge improvement, and significantly reduces the situations under which duplicate messages can be received.
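The "traditional solution" described above can be sketched with SQLite: keep the consumer's offsets in the same database as the results, and move both in one transaction, so a redelivered batch is detected and skipped. Table and column names here are made up for illustration:

```python
# Sketch: store consumer offsets alongside results so that applying a
# message and advancing the offset are a single atomic transaction.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE results (msg_offset INTEGER PRIMARY KEY, value TEXT);
    CREATE TABLE offsets (part INTEGER PRIMARY KEY, next_offset INTEGER);
""")
db.execute("INSERT INTO offsets VALUES (0, 0)")
db.commit()

def handle_batch(db, part, messages):
    """Apply a batch of (offset, value) messages exactly once."""
    start = db.execute(
        "SELECT next_offset FROM offsets WHERE part = ?", (part,)
    ).fetchone()[0]
    with db:  # one transaction: results and offset move together
        for off, value in messages:
            if off < start:
                continue  # already applied by an earlier transaction
            db.execute("INSERT INTO results VALUES (?, ?)", (off, value))
            db.execute(
                "UPDATE offsets SET next_offset = ? WHERE part = ?",
                (off + 1, part),
            )

batch = [(0, "a"), (1, "b")]
handle_batch(db, 0, batch)
handle_batch(db, 0, batch)  # redelivery: skipped, no duplicate rows
```

If the process dies mid-batch, the transaction rolls back, and the stored offset still points at the first unapplied message.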
united893 · almost 8 years ago

There are a lot of words in this article (and the design doc: http://goo.gl/EsdTXo), but it's missing some very simple key points:

1. State diagram: what messages are exchanged between producer and consumer (and how many round trips does it take to confirm a message)?

2. At what point does each side consider the message to have been delivered?

3. Has this been tested empirically? (i.e. set up a producer and a consumer, then partition/kill each side randomly to see if messages get lost)

The one thing I don't understand is the following. The two parties communicate by message passing. At some point the message will transition to a new state (i.e. delivered). That transition cannot happen on both sides at the same time. So how do you handle the failure of sending the last message? Do you stage messages until after the timeout period has passed?
addisonj · almost 8 years ago

This is pretty cool stuff. I have been working with Apache Flink, where the computational model is exactly-once (which is still hugely useful), but it still comes with caveats about duplicate messages downstream, which makes writing to a downstream DB or running multiple Flink processes tricky. Often this is handled by windowing data and producing complete snapshots of each window (i.e. you need to keep all the state for a given window) so that the downstream message is idempotent, and then aggregating windows together (which can be non-trivial for certain aggregations).

Once Flink (and other systems) support this, I think it will really be a game changer. It will allow things like sending an increment downstream rather than needing to keep all that state. Instead of mostly seeing stream processing as a thing we do for analytics use cases, it can really become the backbone for streaming applications.

Event Sourcing/CQRS has been an idea for a while, but in practice it is difficult to do, due to the inherent difficulty of dealing with message semantics and consistency (see https://data-artisans.com/blog/drivetribe-cqrs-apache-flink for a good write-up about such an app). The ability to independently optimize for both reads and writes, while also not having to make all messages idempotent, will in my mind make this feasible for a broad range of teams, without requiring a huge amount of work thinking up clever message formats or working through as many failure scenarios.

That being said, I fully expect this to bite hard when the caveats aren't understood (such as when interacting with external databases), and there are still other hard problems, like creating a consistent downstream view of an app in terms of business events rather than hooking into a database transaction log.

Still, I think these are the sort of solutions needed to make distributed systems easier, even though there are lots of caveats :)
wheaties · almost 8 years ago
Boy would I love to see a Jepsen test of this one.
jdennaho · almost 8 years ago

So it's exactly-once semantics, not exactly-once delivery. Adding a dedup step is not exactly-once delivery; that's being idempotent, it's exactly-once commit, and we've had that for years. Having clients request committed messages and keep track of their own progress is not exactly-once delivery either; it's exactly-once request. Exactly-once delivery has messages pushed to clients. As blogs have noted, this is not possible.
ckharmony · almost 8 years ago

https://techcrunch.com/2017/06/30/confluent-achieves-holy-grail-of-exactly-once-delivery-on-kafka-messaging-service/

There is a crazy claim of "solving the problem that has been unsolved for so many years" from Confluent CTO Neha Narkhede, as quoted below from the above article:

"It's kind of insane. It's been an open problem for so many years and Kafka has solved it — but how do we know it actually works?" she asked, echoing the doubts of the community.

This is solved only in the consume-transform-produce scenario, addressing Kafka Streams' exactly-once requirements. It is a nice feature for Kafka Streams, but it is better to avoid tall claims like the above.

Actually, the blog post also started with major claims, but it was made clear in later paragraphs how this is achieved, through de-duplicating messages in the broker, along with the limitations on the consumer side. The title of the blog post and the comments in the TechCrunch piece do not seem to be in good spirit.

The idempotent producer does not give any API that solves exactly-once for a producer fetching messages from external systems and sending them to topics.

I like Kafka and the design choices taken to keep things simple for users, but all these tall claims should have been avoided.
k2xl · almost 8 years ago

Does this just mean (in TL;DR form) that Kafka producers now generate an ID for each message they send, and Kafka deduplicates on the broker instead of requiring deduplication by the end consumer?
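Roughly, yes: each producer is assigned an ID and attaches a monotonically increasing sequence number to what it sends, and the broker discards anything it has already appended. A toy model of that idea (not the real wire protocol; names are illustrative):

```python
# Toy model of broker-side idempotence: the partition tracks the last
# sequence number seen per producer ID and drops retried duplicates.
class Partition:
    def __init__(self):
        self.log = []
        self.last_seq = {}  # producer_id -> last appended sequence number

    def append(self, producer_id, seq, record):
        if self.last_seq.get(producer_id, -1) >= seq:
            return False  # duplicate retry: already in the log
        self.log.append(record)
        self.last_seq[producer_id] = seq
        return True

p = Partition()
p.append("pid-1", 0, "hello")
p.append("pid-1", 0, "hello")  # network retry of the same batch
p.append("pid-1", 1, "world")
assert p.log == ["hello", "world"]  # the retry was dropped
```

This is why a producer retrying after a lost acknowledgement is safe: the second attempt carries the same sequence number and is simply ignored.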
boredandroid · almost 8 years ago

Dear Hacker News:

1. Please read the section entitled "Is this Magical Pixie Dust I can sprinkle on my app?" before making angry comments. Answer: for general consumer apps just consuming messages, no. However, Kafka's design, which lets the consumer control its position in the log, combined with this feature, which eliminates duplicates in the log, makes building end-to-end exactly-once messaging using the consumer quite approachable. For stream processing using Kafka's Streams API (https://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple/), where you are taking input streams, maintaining state, and producing output streams, the answer is that it actually kind of is like magic pixie dust: you change a single config and get exactly-once processing semantics on arbitrary code. Obviously you still need to get the resulting data out of Kafka, but when combined with any off-the-shelf Kafka connector that maintains exactly-once, you can get this for free. So for that style of app design you actually can get correct results end-to-end without needing to do any of the hard stuff.

2. Someone is going to come along and say "FLP means that exactly-once messaging is impossible!" or something else from a half-understood tidbit of distributed systems theory they picked up on a blog. Let me preempt that. FLP is about the impossibility of consensus in a fully asynchronous setting (e.g. no timeout-based failure detection). Of course, as you know, the vast majority of the systems you use in AWS or your own datacenter depend in deep ways on consensus. Kafka itself is a CP log, about as close a direct analog to consensus as you could ask for. Obviously Kafka and all these systems are "impossible" in the sense that if you make the network or other latency issues bad enough, you can make the system unavailable for writes. This feature doesn't change that at all; it just piggybacks on the existing consensus Kafka does. It doesn't violate any theorems in distributed systems theory: Kafka and any consensus-based system can't work in a fully asynchronous setting; Kafka was a CP system in the CAP sense prior to this feature, and this feature doesn't change that guarantee.

For those who want a deeper dive into how it all works, there is a longer write-up on the design here: https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging and the use for stream processing is described here: https://cwiki.apache.org/confluence/display/KAFKA/KIP-129%3A+Streams+Exactly-Once+Semantics
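For reference, the "single config" this refers to is, as of Kafka 0.11 and KIP-129, the Streams processing-guarantee setting (shown here as a properties fragment; the default is at-least-once):

```properties
# Kafka Streams: switch from the default at_least_once to exactly-once
# processing semantics (KIP-129)
processing.guarantee=exactly_once
```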
makr96 · almost 8 years ago

Is this similar to MQTT QoS 2?
pfarnsworth · almost 8 years ago

Sounds great, but I'll wait until Aphyr has done his worst to prove that it doesn't work.