It's funny how my bugbears from interacting with distributed async messaging (Kafka) are about 90 degrees orthogonal to the things described here:

(1) I've occasionally wanted to see *what the actual traffic is*. This takes extra software work (writing some kind of inspector tool to consume a sample message and produce a human-readable version of what's inside it).

(2) I sometimes see problems at the broker-partition or partition-consumer assignment level, and the tools for visualizing this are really messy.

For example, you have 200 partitions and 198 consumer threads -- by the pigeonhole principle, at least 2 threads own 2 partitions each. Randomly, 1% of your data processing takes twice as long, which can be very hard to see.

Or, for example, 10 of your 200 partitions are managed by broker B, which, for some reason, is mishandling messages -- so 5% of messages are being handled poorly, which may not show up in your metrics the way you expect. Breaking slowness down by partition, by owning consumer, and by managing broker is easy to forget to do when operating the system.

(3) Provisioning capacity for n-k availability (so that availability-zone-wide outages, as well as deployments/upgrades, don't hurt processing) can be tricky.

How many messages per second are arriving? What is the mean processing time per message? How many processors (partitions) do you need to keep up? How much *slack* do you have -- how much excess capacity is there above the typical message arrival rate, so that you can model how long the cluster will take to work through a backlog after an outage?

(4) Scaling up when the message arrival rate grows feels like a bit of a chore. You have to increase the number of partitions to handle the new messages ... but then you also have to remember to scale up every consumer. You did remember that, right? And you know you can't ever reduce the partition count, right?

(5) I often end up wondering what the processing latency is. You can approximate it by dividing the total backlog of unprocessed messages for an entire consumer group (unit "messages") by the message arrival rate (unit "arriving messages per second"), which gives you something with dimensionality of "seconds" and represents a quasi processing lag (there's a rough sketch of this arithmetic below). But the lag often differs per partition.

Better is to teach the application-level consumer library to emit a metric for how long processing took and how old the message it handled was -- then, as long as processing is still happening, you can measure delays. Both are messy metrics that require you to get, and stay, hands-on with the data to understand them.

(6) There's a complicated relationship between "processing time per message" and effective capacity -- an application change that makes a Kafka consumer slower may have no immediate effect on end-to-end lag SLIs, but it can increase the amount of parallelism needed to handle peak traffic, and that is tough to reason about.

(7) Planning for processing outages only ex post facto is always a pain. More than once I've heard teams say "this outage would be a lot shorter if we had built in a way to process newly arrived messages first", and I've even seen folks jury-rig LIFO by e.g. changing the topic name for newly arrived messages and using the previous topic as a backlog only.

I wonder if my clusters have just been too small?
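To make points (3) and (5) concrete, here's a rough sketch of the back-of-the-envelope arithmetic in Python. It isn't tied to any particular Kafka client; all the function names and numbers are made up for illustration, and it assumes a simple steady-state model (fixed arrival rate, fixed mean processing time, one in-flight message per consumer).

```python
# Back-of-the-envelope capacity/lag arithmetic for a consumer group.
# Hypothetical numbers only; assumes steady-state arrivals and one
# message in flight per consumer.

def required_consumers(arrival_rate_msgs_per_s: float,
                       mean_processing_time_s: float) -> float:
    """Consumers needed just to keep up with steady-state traffic."""
    return arrival_rate_msgs_per_s * mean_processing_time_s


def slack_ratio(num_consumers: int,
                arrival_rate_msgs_per_s: float,
                mean_processing_time_s: float) -> float:
    """Excess capacity above the typical arrival rate (1.0 = no headroom)."""
    capacity = num_consumers / mean_processing_time_s  # msgs/s the fleet can process
    return capacity / arrival_rate_msgs_per_s


def approximate_lag_seconds(total_backlog_msgs: float,
                            arrival_rate_msgs_per_s: float) -> float:
    """Backlog (messages) / arrival rate (messages/s) ~ quasi processing lag (s).
    This is a group-wide average; per-partition lag can differ a lot."""
    return total_backlog_msgs / arrival_rate_msgs_per_s


def drain_time_seconds(total_backlog_msgs: float,
                       num_consumers: int,
                       arrival_rate_msgs_per_s: float,
                       mean_processing_time_s: float) -> float:
    """Time to burn down a backlog after an outage while arrivals continue:
    backlog / (capacity - arrival rate)."""
    capacity = num_consumers / mean_processing_time_s
    surplus = capacity - arrival_rate_msgs_per_s
    if surplus <= 0:
        return float("inf")  # no slack: the backlog never shrinks
    return total_backlog_msgs / surplus


if __name__ == "__main__":
    # Example: 2,000 msgs/s arriving, 50 ms per message, 150 consumers,
    # and a 1-hour outage leaving a 7.2M-message backlog.
    rate, proc, consumers = 2000.0, 0.050, 150
    backlog = rate * 3600
    print(f"need >= {required_consumers(rate, proc):.0f} consumers to keep up")
    print(f"slack ratio: {slack_ratio(consumers, rate, proc):.2f}x")
    print(f"naive lag estimate: {approximate_lag_seconds(backlog, rate):.0f} s")
    print(f"drain time: {drain_time_seconds(backlog, consumers, rate, proc) / 60:.0f} min")
```

The point of the last function is the one that bites during incidents: with a 1.5x slack ratio in this example, an hour-long outage takes two more hours to drain, because only the surplus capacity above the ongoing arrival rate works on the backlog.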
The stuff here ("how can we afford to operate this at scale?") is super interesting, just not the reliability stuff I've worried about day-to-day.