It's funny how my bugbears from interacting with distributed async messaging (Kafka) are about 90 degrees orthogonal to the things described here:

(1) I've occasionally wanted to see *what the actual traffic is*. This takes extra software work (writing some kind of inspector tool to consume a sample message and produce a human-readable version of what's inside it).

(2) I sometimes see problems at the broker-partition or partition-consumer assignment level, and the tools for visualizing this are really messy.

For example, you have 200 partitions and 198 consumer threads -- by the pigeonhole principle, at least 2 threads own 2 partitions each. Randomly, 1% of your data processing takes twice as long, which can be very hard to see.

Or, for example, 10 of your 200 partitions are managed by broker B, which, for some reason, is mishandling messages -- so 5% of messages are being handled poorly, which may not show up in your metrics the way you expect. Breaking slowness down by partition, by owning consumer, and by managing broker is easy to forget to do when operating the system.

(3) Provisioning capacity for n-k availability (so that availability-zone-wide outages, as well as deployments/upgrades, don't hurt processing) can be tricky.

How many messages per second are arriving? What is the mean processing time per message? How many processors (partitions) do you need to keep up? How much *slack* do you have -- how much excess capacity is there above the typical message arrival rate, so that you can model how long the cluster will take to work through a backlog after an outage?

(4) Scaling up when the message arrival rate grows feels like a bit of a chore. You have to increase the number of partitions to handle the new messages ... but then you also have to remember to scale up every consumer. You did remember that, right? And you know you can't ever reduce the partition count, right?

(5) I often end up wondering what the processing latency is. You can approximate it by dividing the total backlog of unprocessed messages for an entire consumer group (unit "messages") by the message arrival rate (unit "arriving messages per second"), which gives you something with dimensionality of "seconds" and represents a quasi processing lag (there's a rough sketch of this arithmetic below). But the lag often differs per partition.

Better is to teach the application-level consumer library to emit a metric for how long processing took and how old the message it handled was -- then, as long as processing is still happening, you can measure delays. Both are messy metrics that require you to get, and stay, hands-on with the data to understand them.

(6) There's a complicated relationship between "processing time per message" and effective capacity -- an application change that makes a Kafka consumer slower may have no immediate effect on end-to-end lag SLIs, but it can increase the amount of parallelism needed to handle peak traffic, and that is tough to reason about.

(7) Planning for processing outages only ex post facto is always a pain. More than once I've heard teams say "this outage would be a lot shorter if we had built in a way to process newly arrived messages first", and I've even seen folks jury-rig LIFO by e.g. changing the topic name for newly arrived messages and using the previous topic as a backlog only.

I wonder if my clusters have just been too small?
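To make points (3) and (5) concrete, here's a rough sketch of the back-of-the-envelope arithmetic in Python. It isn't tied to any particular Kafka client; all the function names and numbers are made up for illustration, and it assumes a simple steady-state model (fixed arrival rate, fixed mean processing time, one in-flight message per consumer).

```python
# Back-of-the-envelope capacity/lag arithmetic for a consumer group.
# Hypothetical numbers only; assumes steady-state arrivals and one
# message in flight per consumer.

def required_consumers(arrival_rate_msgs_per_s: float,
                       mean_processing_time_s: float) -> float:
    """Consumers needed just to keep up with steady-state traffic."""
    return arrival_rate_msgs_per_s * mean_processing_time_s


def slack_ratio(num_consumers: int,
                arrival_rate_msgs_per_s: float,
                mean_processing_time_s: float) -> float:
    """Excess capacity above the typical arrival rate (1.0 = no headroom)."""
    capacity = num_consumers / mean_processing_time_s  # msgs/s the fleet can process
    return capacity / arrival_rate_msgs_per_s


def approximate_lag_seconds(total_backlog_msgs: float,
                            arrival_rate_msgs_per_s: float) -> float:
    """Backlog (messages) / arrival rate (messages/s) ~ quasi processing lag (s).
    This is a group-wide average; per-partition lag can differ a lot."""
    return total_backlog_msgs / arrival_rate_msgs_per_s


def drain_time_seconds(total_backlog_msgs: float,
                       num_consumers: int,
                       arrival_rate_msgs_per_s: float,
                       mean_processing_time_s: float) -> float:
    """Time to burn down a backlog after an outage while arrivals continue:
    backlog / (capacity - arrival rate)."""
    capacity = num_consumers / mean_processing_time_s
    surplus = capacity - arrival_rate_msgs_per_s
    if surplus <= 0:
        return float("inf")  # no slack: the backlog never shrinks
    return total_backlog_msgs / surplus


if __name__ == "__main__":
    # Example: 2,000 msgs/s arriving, 50 ms per message, 150 consumers,
    # and a 1-hour outage leaving a 7.2M-message backlog.
    rate, proc, consumers = 2000.0, 0.050, 150
    backlog = rate * 3600
    print(f"need >= {required_consumers(rate, proc):.0f} consumers to keep up")
    print(f"slack ratio: {slack_ratio(consumers, rate, proc):.2f}x")
    print(f"naive lag estimate: {approximate_lag_seconds(backlog, rate):.0f} s")
    print(f"drain time: {drain_time_seconds(backlog, consumers, rate, proc) / 60:.0f} min")
```

The point of the last function is the one that bites during incidents: with a 1.5x slack ratio in this example, an hour-long outage takes two more hours to drain, because only the surplus capacity above the ongoing arrival rate works on the backlog.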
The stuff here ("how can we afford to operate this at scale?") is super interesting, just not the reliability stuff I've worried about day-to-day.