This arch is how the big players do it at scale (e.g. Datadog, New Relic: the second data passes their edge, it lands in a Kafka cluster). Also, OTel components lack rate limiting(1), which makes it easy to overload your backend storage (S3).<p>Grafana has some posts on how they softened the S3 blow with memcached(2,3).<p>1. <a href="https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/6908">https://github.com/open-telemetry/opentelemetry-collector-co...</a>
2. <a href="https://grafana.com/docs/loki/latest/operations/caching/" rel="nofollow noreferrer">https://grafana.com/docs/loki/latest/operations/caching/</a>
3. <a href="https://grafana.com/blog/2023/08/23/how-we-scaled-grafana-cloud-logs-memcached-cluster-to-50tb-and-improved-reliability/" rel="nofollow noreferrer">https://grafana.com/blog/2023/08/23/how-we-scaled-grafana-cl...</a><p>I know the post is about telemetry data and my comments on Grafana are about logs, but the architecture bits still apply.
I've heard several times that Kafka was put in front of Elasticsearch clusters to handle traffic bursts. You can also use Redpanda, Pulsar, NATS, and other distributed queues.<p>One thing that is also very interesting with Kafka is that you can achieve exactly-once semantics without too much effort: keep track of the partition offsets in your own database and acknowledge them only once you are sure the data is safely stored in your DB (see the sketch below). That's what we did with our engine Quickwit; so far it's the most efficient way to index data into it.<p>One obvious drawback with Kafka is that it's one more piece to maintain... and it's not a small one.
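The pattern looks roughly like this. A minimal sketch, not Quickwit's actual code, with confluent-kafka and SQLite standing in for the real store (topic, table, and consumer-group names are made up):

```python
# Sketch: exactly-once indexing by committing Kafka offsets in the same DB
# transaction as the data itself. Topic/table names here are invented.
import sqlite3
from confluent_kafka import Consumer, TopicPartition

TOPIC, PART = "telemetry", 0          # hypothetical topic, single partition

db = sqlite3.connect("index.db")
with db:
    db.execute("CREATE TABLE IF NOT EXISTS docs (body TEXT)")
    db.execute("CREATE TABLE IF NOT EXISTS offsets (part INTEGER PRIMARY KEY, next_offset INTEGER)")

# Resume from the offset recorded in *our* DB, not the offset committed to Kafka.
row = db.execute("SELECT next_offset FROM offsets WHERE part = ?", (PART,)).fetchone()
start = row[0] if row else 0

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "indexer",
    "enable.auto.commit": False,      # Kafka never owns the position
})
consumer.assign([TopicPartition(TOPIC, PART, start)])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    # The document and the new position land in one transaction: either both
    # are stored or neither is, which is what gives exactly-once on replay.
    with db:
        db.execute("INSERT INTO docs (body) VALUES (?)", (msg.value().decode(),))
        db.execute("INSERT OR REPLACE INTO offsets (part, next_offset) VALUES (?, ?)",
                   (PART, msg.offset() + 1))
```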
If you have distributed concurrent data streams that exhibit coherent temporal events, then at some point you pretty much have to implement a queuing balancer.<p>One simply trades latency for capacity and eventual coherent data locality.<p>It's almost an arbitrary detail whether you use Kafka, RabbitMQ, or Erlang channels. If you can add smart application-layer predictive load balancing on the client, it is possible to cut burst traffic loads by an order of magnitude or two. Cost-optimized dynamic host scaling does not solve every problem.<p>Good luck out there =)
A similar idea [^1] has cropped up in the serverless OpenTelemetry world: collate OpenTelemetry spans in a Kinesis stream before forwarding them to a third-party service for analysis. This obviates the need for a separate collector, reduces forwarding latency, and removes the cold-start overhead of the AWS Distro for OpenTelemetry Lambda Layer.<p>[^1] <a href="https://x.com/donkersgood/status/1662074303456636929?s=20" rel="nofollow noreferrer">https://x.com/donkersgood/status/1662074303456636929?s=20</a>
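I haven't seen the code behind that tweet, but the span-to-Kinesis leg presumably looks something like a custom exporter. A rough sketch with the OpenTelemetry Python SDK and boto3 (class name, stream name, and the JSON serialization are my assumptions):

```python
# Sketch: push finished spans straight onto a Kinesis stream from the Lambda,
# skipping a local collector. Stream name and serialization are assumptions.
import boto3
from opentelemetry.sdk.trace.export import SpanExporter, SpanExportResult

kinesis = boto3.client("kinesis")

class KinesisSpanExporter(SpanExporter):
    def __init__(self, stream_name="otel-spans"):   # hypothetical stream
        self._stream = stream_name

    def export(self, spans) -> SpanExportResult:
        # put_records accepts at most 500 records per call; fine for a sketch.
        records = [
            {"Data": s.to_json().encode(),
             "PartitionKey": format(s.context.trace_id, "032x")}
            for s in spans
        ]
        kinesis.put_records(StreamName=self._stream, Records=records)
        return SpanExportResult.SUCCESS

    def shutdown(self) -> None:
        pass
```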
Seems like overkill, no? OTel collectors are fairly cheap; why add expensive Kafka into the mix? If you need to buffer, why not just dump to S3 or a similar data store as temporary storage?
I expect it would be far cheaper to scale up Tempo/Loki than it would be to even run an idle Kafka cluster. This feels like spending thousands of dollars to save tens of dollars.
Are there any client-side dynamic samplers that can target a maximum event rate? Burstiness with OTel has been a thorn in everything that uses it, in my experience, and it's frustrating.
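Something like this token-bucket sampler is what I have in mind. A rough sketch against the Python SDK's sampler interface (class name and the per-second budget are made up, and this only caps spans created by the local process):

```python
# Sketch: a client-side sampler that admits at most `max_per_second` spans via
# a token bucket and drops the rest. Class name and rate are illustrative.
import threading, time
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import Decision, Sampler, SamplingResult

class RateLimitingSampler(Sampler):
    def __init__(self, max_per_second: float):
        self._max = max_per_second
        self._tokens = max_per_second
        self._last = time.monotonic()
        self._lock = threading.Lock()

    def should_sample(self, parent_context, trace_id, name,
                      kind=None, attributes=None, links=None, trace_state=None):
        with self._lock:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at the burst size.
            self._tokens = min(self._max, self._tokens + (now - self._last) * self._max)
            self._last = now
            if self._tokens >= 1:
                self._tokens -= 1
                return SamplingResult(Decision.RECORD_AND_SAMPLE)
        return SamplingResult(Decision.DROP)

    def get_description(self):
        return f"RateLimitingSampler({self._max}/s)"

provider = TracerProvider(sampler=RateLimitingSampler(max_per_second=100))
```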
It'd be nice to have something simpler as an OTel processor. OTel could just dump events to local disk as sequential writes, then read them back, load permitting (rough sketch below).<p>I'm curious how long things stay in Kafka on average and in the worst case. If it's more than a few minutes, I imagine it lowers the quality of tail-based sampling.
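Roughly what I'm imagining, with a made-up spool path and send/has_capacity hooks:

```python
# Sketch: append events to a local file as sequential writes, then drain them
# in order when the backend has headroom. Path and hook names are invented.
import json, os

SPOOL = "/var/tmp/otel-spool.jsonl"   # hypothetical spool file

def spool(event: dict) -> None:
    # Sequential append: cheap even during bursts.
    with open(SPOOL, "a") as f:
        f.write(json.dumps(event) + "\n")

def drain(send, has_capacity) -> None:
    # Read events back in arrival order and forward them, load permitting.
    if not os.path.exists(SPOOL):
        return
    remaining = []
    with open(SPOOL) as f:
        for line in f:
            if has_capacity():
                send(json.loads(line))
            else:
                remaining.append(line)
    with open(SPOOL, "w") as f:
        f.writelines(remaining)
```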