Since I maintain a pretty large ETL (batch) application for a living, I am genuinely curious about this. How do you handle failure in event-processing systems? In batch, it's simple: if there is a record (event) that causes an unexpected failure (or the program fails for some other reason, for example it runs out of disk space), we just restart the batch.<p>But in event processing, unless you can afford to skip events, how do you deal with that sort of thing, especially if the processing needs to keep track of internal state between events?<p>I read about event-sourcing, which is kind of a solution to that, but add checkpoints and you have pretty much batch processing again.
"The Apache Flink project is a relative newcomer to the stream-processing space. Competing Open-Source platforms include Apache Spark, Apache Storm, and Twitter Heron."<p>Can someone explain why Apache are creating projects that compete with each other? Why not focus on one?
I'm the author of this Mux blog post and would love to take any questions or comments, as well as suggestions for future posts. Thank you for your interest!
I'd like to see more about how they used Flink, and less about their system architecture (which is covered in great detail, right up until the point where the data is processed with Flink).
We ditched Spark Structured Streaming for Flink for a Kafka consumer processing 3B events per day. It's been extremely stable so far, and costs half as much as the Spark cluster did.