From everything I've read, Kafka is a really bad fit for AWS. It is not tolerant of network partitions; they stated this in their own design document, where they present it as a CA system. In his Jepsen post on Kafka, Kyle Kingsbury backed this up with more data.<p>Given this, why do people deploy it to AWS? It seems like an invitation to disaster.
Curious whether Cap'n Proto or another zero-copy serialization format might've been a better choice than protobufs? Protobufs still need to parse the message; it's just that the code to do so is automatically generated for you. With Cap'n Proto you can read messages directly off the wire and save them, or mmap a file full of them and access them in place.<p>Most of the downsides of Cap'n Proto also don't apply here. Compressing with Snappy will elide all the zero-valued padding bytes. The format of an HTTP message is relatively stable, so you don't get a lot of churn in the message layout. HTTP doesn't have a lot of optional fields, so that's another potential source of Cap'n Proto bloat that doesn't apply to your use case.
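To make the contrast concrete, here's a rough sketch of the zero-copy read path, assuming the capnproto-java bindings and a made-up HttpMessage schema (neither is from the article). The point is that getRoot hands back readers over the underlying bytes; fields are pulled lazily from the buffer rather than deserialized into intermediate objects the way a protobuf parse would be:

    import org.capnproto.MessageReader;
    import org.capnproto.Serialize;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class CapnpReadSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical file of serialized messages; HttpMessage is an assumed
            // schema-generated class, not anything from the original post.
            try (FileChannel channel = FileChannel.open(
                    Paths.get("messages.capnp.bin"), StandardOpenOption.READ)) {
                MessageReader message = Serialize.read(channel);
                HttpMessage.Reader req = message.getRoot(HttpMessage.factory);
                // Reading a field just follows a pointer into the buffer;
                // nothing is copied until you actually touch it.
                System.out.println(req.getPath());
            }
        }
    }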
My lazy self always wonders how nice it would be if some of these infrastructure designs were always accompanied by a docker/fig configuration example, to be used as a starting point/proof of concept for people looking for similar solutions.<p>It obviously happens sometimes [1] [2], but it should be more common...<p>[1] <a href="http://alvinhenrick.com/2014/08/18/apache-storm-and-kafka-cluster-with-docker/" rel="nofollow">http://alvinhenrick.com/2014/08/18/apache-storm-and-kafka-cl...</a><p>[2] <a href="https://registry.hub.docker.com/u/ches/kafka/" rel="nofollow">https://registry.hub.docker.com/u/ches/kafka/</a>
We use Netty for transport in a similar scenario. We haven't stress-tested it at the limits mentioned, but wouldn't a write-behind cache handle a large volume of writes? Of course there will be a delay, but it is not hard to implement.
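Roughly the shape I mean, with a made-up WriteBehindBuffer and a generic sink standing in for whatever the slow downstream write is (Kafka producer, database, etc.); the batch size and queue bound are arbitrary:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.function.Consumer;

    // Minimal write-behind buffer: callers enqueue records cheaply and a
    // background thread drains them to the slow sink in batches. The delay
    // mentioned above is the time records sit in the queue before a flush.
    public class WriteBehindBuffer<T> {
        private final BlockingQueue<T> queue = new LinkedBlockingQueue<>(100_000);
        private final Consumer<List<T>> sink;   // the slow downstream write
        private final int batchSize;

        public WriteBehindBuffer(Consumer<List<T>> sink, int batchSize) {
            this.sink = sink;
            this.batchSize = batchSize;
            Thread flusher = new Thread(this::flushLoop, "write-behind-flusher");
            flusher.setDaemon(true);
            flusher.start();
        }

        /** Fast path: blocks only when the buffer is full (back-pressure). */
        public void write(T record) throws InterruptedException {
            queue.put(record);
        }

        private void flushLoop() {
            List<T> batch = new ArrayList<>(batchSize);
            try {
                while (true) {
                    batch.add(queue.take());              // wait for at least one record
                    queue.drainTo(batch, batchSize - 1);  // grab whatever else is ready
                    sink.accept(batch);                   // one bulk write downstream
                    batch.clear();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

Usage would be something like new WriteBehindBuffer<Event>(batch -> store.bulkInsert(batch), 500), where store.bulkInsert is whatever bulk write your backend exposes.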
One thing that isn't clear about Kafka or Kinesis: when you have multiple consumers for the same topic, how do they each get the data, and in what order? And what happens when a consumer dies? How do you handle consumers in your data pipeline?
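To make the question concrete, the pattern I'm asking about is Kafka's consumer groups: consumers sharing a group.id split a topic's partitions among themselves, ordering is only guaranteed within a partition, and if a consumer dies its partitions get rebalanced onto the survivors. A minimal sketch using the newer Java KafkaConsumer client (just an illustration; the article doesn't say which client they use, and the broker/topic names here are placeholders):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class PipelineConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");  // placeholder broker
            props.put("group.id", "my-pipeline");            // consumers sharing this id split the partitions
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("http-logs"));  // placeholder topic
                while (true) {
                    // Records arrive in order within each partition this instance owns;
                    // if the process dies, those partitions are reassigned to other group members.
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("partition=%d offset=%d value=%s%n",
                                record.partition(), record.offset(), record.value());
                    }
                }
            }
        }
    }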