Kafka vs. Redpanda performance – do the claims add up?

213 points by itunpredictable about 2 years ago

16 comments

agallego about 2 years ago

alex here, original author of redpanda.

It is hard to respond to a 6-part blog series, released all at once, on an HN thread.

- what we can deterministically show is data loss on apache kafka with no fsync() [shouldn't be a surprise to anyone] - stay tuned for an update here.
- the kafka partition model of one segment per partition could be optimized in both architectures
- the benefit for all of us is that all of these things will be committed to the OMB (open messaging benchmark) and will be on git for anyone interested in running it themselves.
- we welcome all confluent customers (since the post is from the field cto office) to benchmark against us and choose the best platform. this is how engineering is done. In fact, we will help you run it at no cost. Your hardware, your workload, head-to-head. We'll help you set it up with both.

... but let's keep the rest of the thread technical.

- log.flush.interval.messages=1 - this is something we took a stance on a long, long time ago, in 2019. As someone who has personally talked to hundreds of enterprises to date, most workloads in the world should err on the side of safety and flushing to disk (fsync()). Hardware is very good today and you no longer have to choose between safety and reasonable performance. This isn't the high latency you used to see on spinning disks.

dangoodmanUT about 2 years ago

This was a nice read! There are a few issues on both sides, some that others have mentioned and some that I have not seen yet:

For Redpanda:

1. I don't like that they did not include full disk performance, not sure if that was intentional but it feels like it... Seems like an obvious gap in their testing. Perhaps most of their workloads have records time out rather than get pushed out by bytes first, not sure.
2. Their benchmark was def selective, sure, but they sell via proof of performance for tested workloads IIUC, not via their posted benchmarks. The posted benchmarks just get them into the proof stage in a sales pipeline.

For Kafka (and Confluent, and this test):

1. Don't turn off fsync for Kafka if you leave it on with Redpanda, that's certainly not a fair test. Batching should be done on the client side anyway, as most packages already do by default. If you are worried about too many fsyncs degrading performance, batch harder on your clients. It's the better way to batch anyway.
2. If confluent cloud is using java 11, then I don't like that java 17 is used for this either. It's not a fair comparison seeing that most people will want it managed anyways, so it gives unrealistic expectations of what they can get.
3. Confluent charges a stupid amount of money.
4. The author works for Confluent, so I'm not convinced that this test would have been posted if they saw Redpanda greatly outperform Kafka.

With both:

1. Exactly once delivery is total marketing BS. At least Redpanda mentions you need idempotency, but you get exactly once behavior with full idempotency anyway. What you build should be prepared for this, not the infra you use IMO, as all you need is one external system to break this promise for the whole system to lose it.

I prefer Redpanda as I find it easier to run, and Redpanda actually cares about their users whether they are paid or not. Confluent won't talk to you unless you have a monthly budget of at least $10k, while Redpanda has extremely helpful people in their slack just waiting to talk to you.

Ultimately you don't just buy into the software, you buy into the team backing it, and I'd pick Redpanda easily, knowing that they can actually help me and care without needing to give them $10k.
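To make the "batch harder on your clients" point concrete, here is a minimal sketch of why client-side batching reduces the number of fsyncs when the broker flushes once per received batch. `BatchingProducer`, `CountingBroker`, and `max_batch_records` are hypothetical names for illustration, not the API of any real Kafka or Redpanda client, and real clients usually also flush on a time limit, which is left out here.

```python
class BatchingProducer:
    """Hypothetical client that buffers records and sends them in batches."""

    def __init__(self, send_batch, max_batch_records=100):
        self.send_batch = send_batch  # callable that takes a list of records
        self.max_batch_records = max_batch_records
        self.buffer = []

    def send(self, record):
        # Buffer the record; ship the batch once it is full.
        self.buffer.append(record)
        if len(self.buffer) >= self.max_batch_records:
            self.flush()

    def flush(self):
        # Send whatever is buffered, if anything.
        if self.buffer:
            self.send_batch(self.buffer)
            self.buffer = []


class CountingBroker:
    """Toy broker that performs one simulated fsync per received batch."""

    def __init__(self):
        self.fsyncs = 0
        self.records = 0

    def receive(self, batch):
        self.records += len(batch)
        self.fsyncs += 1  # assumption: the broker flushes once per batch


def fsyncs_for(total_records, max_batch_records):
    """Return (simulated fsync count, records stored) for a given batch size."""
    broker = CountingBroker()
    producer = BatchingProducer(broker.receive, max_batch_records)
    for i in range(total_records):
        producer.send(i)
    producer.flush()  # send the final partial batch
    return broker.fsyncs, broker.records
```

Under these assumptions the same number of records reaches the broker either way; only the number of flushes changes with the batch size.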

nemothekid about 2 years ago

> Issue #1 is that in Kafka’s server.properties file has the line log.flush.interval.messages=1 which forces Kafka to fsync on each message batch. So all tests, even those where this is not configured in the workload file will get this fsync behavior. I have previously blogged about how Kafka uses recovery instead of fsync for safety.

Respect to the Kafka team as Kafka is an incredible piece of software, but the Mongo guys got torched for eternity for pulling the same shenanigans.

globalreset about 2 years ago

> Issue #1 is that in Kafka’s server.properties file has the line log.flush.interval.messages=1 which forces Kafka to fsync on each message batch. So all tests, even those where this is not configured in the workload file will get this fsync behavior. I have previously blogged about how Kafka uses recovery instead of fsync for safety.

And then in this article it's explained how Kafka is actually unsafe:

> Kafka may handle simultaneous broker crashes but simultaneous power failure is a problem.

just against simultaneous node crashes (whole VM/machine).

I mean - sure, in practice running in different AZs, etc. will probably be good enough, but technically...

cortesoft about 2 years ago

> Redpanda end-to-end latency of their 1 GB/s benchmark increased by a large amount once the brokers reached their data retention limit and started deleting segment files. Current benchmarks are based on empty drive performance.

It seems really disingenuous to use empty drive performance, since anyone who cares about performance is going to care about continuous use.

tapoxi about 2 years ago

I'd like to see these on OpenJDK 11, since that's what Confluent is running on and the author makes a point of switching to 17 even though he works for Confluent.

In either case, Confluent Platform is ridiculously expensive and approached the cost (licensing alone) of our entire cloud spend. I'd love to see more run-on-k8s alternatives to CFK.

llama052 about 2 years ago

We really wanted to try redpanda, but operationally it does not appear to be very k8s native and in fact looks like it needs a lot of one-off hand-holding to get it working properly.

Hopefully that can get ironed out in the future. Until then we will stick with the Strimzi operator and kafka.

Also, Confluent is absolutely pricing themselves out of the market. We looked at their self-hosted confluent operator and they wanted something like $9k per node, when they do nothing but provide an operator. Insanity.

sitkack about 2 years ago

I'd like to see a baseline of fio and iperf3 for these same instances so we know how much raw performance is available for disk and network, alone and together.

Cloud instances have their own performance pathologies, esp. in the use of remote disks.

As for RP and Kafka performance, I'd love to see a parameter sweep over both configuration dimensions as well as workload. I know this is a large space, but it needs to be done to characterize the available capacity, latency and bandwidth.
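As a rough illustration of what such a parameter sweep could look like, the sketch below enumerates every combination of a few configuration and workload dimensions. The dimension names and values in `SWEEP` are made up for illustration; they are not taken from the benchmark under discussion or from OMB.

```python
import itertools

# Hypothetical sweep dimensions; names and values are illustrative only.
SWEEP = {
    "fsync_per_batch": [True, False],
    "partitions": [1, 100, 1000],
    "producer_rate_mb_s": [100, 500, 1000],
    "message_size_bytes": [1024, 16384],
}


def sweep_configs(dimensions):
    """Yield one dict per point in the Cartesian product of all dimensions."""
    keys = list(dimensions)
    for values in itertools.product(*(dimensions[k] for k in keys)):
        yield dict(zip(keys, values))


# 2 * 3 * 3 * 2 = 36 benchmark runs for this small grid.
configs = list(sweep_configs(SWEEP))
```

Even this small grid yields 36 runs per system, which shows how quickly the space grows as dimensions are added.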

gagejustins about 2 years ago

"I hope you come away with a new appreciation that trade-offs exist, there is no free lunch despite the implementation language or algorithms used. Optimizations exist, but you can’t optimize for everything. In distributed systems you won’t find companies or projects that state that they optimized for CAP in the CAP theorem. Equally, we can’t optimize for high throughput, low latency, low cost, high availability and high durability all at the same time. As system builders we have to choose our trade-offs, that single silver-bullet architecture is still out there, we haven’t found it yet."

skyde about 2 years ago

The author says "Redpanda incorrectly claim Kafka is unsafe because it doesn’t fsync - it is not true".

If you don't fsync the batch, it's possible the server would send a response to the client saying data was written successfully while the batch is still just in memory, and then the server loses power and never writes it to disk.

Maybe the author has a different definition of unsafe, but to me if it's not ACID it's unsafe!
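The failure scenario described above can be sketched with a toy model. This is a deliberately simplified single-node simulation: `ToyBroker` and its fields are hypothetical, and it does not model replication or log recovery, which is what the article argues Kafka relies on instead of fsync.

```python
class ToyBroker:
    """Single-node toy model: a page cache in front of a durable disk."""

    def __init__(self, fsync_before_ack):
        self.fsync_before_ack = fsync_before_ack
        self.page_cache = []  # written but not yet durable
        self.disk = []        # survives power loss

    def produce(self, record):
        # The write lands in memory first.
        self.page_cache.append(record)
        if self.fsync_before_ack:
            self.fsync()
        return True  # acknowledgement sent to the client

    def fsync(self):
        # Move everything in the page cache to durable storage.
        self.disk.extend(self.page_cache)
        self.page_cache.clear()

    def power_loss(self):
        # Anything still only in memory is gone.
        self.page_cache.clear()


def lost_after_power_loss(fsync_before_ack, records):
    """Return acknowledged records that are missing after a power loss."""
    broker = ToyBroker(fsync_before_ack)
    acked = [r for r in records if broker.produce(r)]
    broker.power_loss()
    return [r for r in acked if r not in broker.disk]
```

In this model, acknowledging before fsync loses every acknowledged record on power loss, while fsync-before-ack loses none. Whether replication across nodes makes that single-node loss acceptable is the point being argued in the thread.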

Alifatisk about 2 years ago

"...all this is really just benchmarketing, but as I stated before, if no-one actually tests this stuff out and writes about it, people will just start believing it. We need a reality check."

Well said

Dylan1312 about 2 years ago

The biggest point of contention here seems to be over whether kafka can still be considered durable/safe when fsync is disabled.

Seems like it'd be valuable to have a trusted third party like https://jepsen.io/ test it out! (not related, just a fan of their work)

fulafel about 2 years ago

A tangent, but how do distributed robustness properties in the face of communication hiccups compare between Redpanda and Kafka? E.g. with Raft apparently you can still fail in the presence of asymmetric network failures (like in https://blog.cloudflare.com/a-byzantine-failure-in-the-real-world/)

chalcolithic about 2 years ago

I wonder if there's an embedded equivalent for such systems? Something like fasterlog but more mature?

purpleblue about 2 years ago

TLDR: "I work at Confluent, the owners of Kafka, and I have determined through my tests that Redpanda's performance is greatly exaggerated."

I don't think we can get a less reliable or trustworthy set of performance tests than when someone's paycheck depends on the outcome of those tests. If Redpanda's performance were found to be better, would he really publish the test results?

globalreset about 2 years ago

EDIT: Thank you for the clarification. It is a fair 3 node vs 3 node benchmark.

Does this benchmark compare a 3 node Kafka cluster against a 3 node Redpanda cluster? It's unclear.
