OpenTelemetry at Scale: Using Kafka to handle bursty traffic

181 points by pranay01, over 1 year ago

9 comments

blinded, over 1 year ago
This architecture is how the big players do it at scale (e.g. Datadog, New Relic: the second traffic passes their edge, it lands in a Kafka cluster). Also, OTel components lack rate limiting (1), meaning it's super easy to overload your backend storage (S3).

Grafana has some posts on how they softened the S3 blow with memcached (2, 3).

1. https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/6908
2. https://grafana.com/docs/loki/latest/operations/caching/
3. https://grafana.com/blog/2023/08/23/how-we-scaled-grafana-cloud-logs-memcached-cluster-to-50tb-and-improved-reliability/

I know the post is about telemetry data and my comments on Grafana are about logs, but the architecture bits still apply.
francoismassot, over 1 year ago
I have heard several times of Kafka being put in front of Elasticsearch clusters to handle traffic bursts. You can also use Redpanda, Pulsar, NATS and other distributed queues.

One thing that is also very interesting with Kafka is that you can achieve exactly-once semantics without too much effort: keep track of the partition positions in your own database and acknowledge them only when you are sure the data is safely stored in your db. That's what we did with our engine Quickwit; so far it's the most efficient way to index data in it.

One obvious drawback with Kafka is that it's one more piece to maintain... and it's not a small one.
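
A rough sketch of the offset-tracking idea in the comment above. This is not Quickwit's code and it does not talk to Kafka: a partition is simulated as a list of (offset, body) pairs, SQLite from the Python standard library stands in for "your own database", and the table names and the functions open_store, next_offset and ingest_batch are hypothetical. The point it tries to show is that the records and the checkpoint are committed in the same transaction, so a redelivered batch is skipped rather than stored twice.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the (hypothetical) store: one table for data, one for checkpoints."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS docs (body TEXT)")
    db.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "partition_id INTEGER PRIMARY KEY, next_offset INTEGER NOT NULL)"
    )
    db.commit()
    return db


def next_offset(db, partition_id):
    """Return the first offset that has not been stored yet (0 if none)."""
    row = db.execute(
        "SELECT next_offset FROM checkpoints WHERE partition_id = ?",
        (partition_id,),
    ).fetchone()
    return row[0] if row else 0


def ingest_batch(db, partition_id, batch):
    """Store a batch of (offset, body) pairs and advance the checkpoint.

    Records below the stored checkpoint are treated as redeliveries and
    skipped. Data and checkpoint are written in one transaction, so a crash
    leaves either both or neither.
    """
    start = next_offset(db, partition_id)
    fresh = [(off, body) for off, body in batch if off >= start]
    if not fresh:
        return 0
    with db:  # commits on success, rolls back if an exception is raised
        db.executemany(
            "INSERT INTO docs (body) VALUES (?)", [(body,) for _, body in fresh]
        )
        db.execute(
            "INSERT OR REPLACE INTO checkpoints (partition_id, next_offset) "
            "VALUES (?, ?)",
            (partition_id, fresh[-1][0] + 1),
        )
    return len(fresh)
```

With a real consumer, the stored next_offset would presumably be used to seek on startup instead of relying on the broker's committed offsets; that part is left out here.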
Joel_Mckay, over 1 year ago
If you have distributed concurrent data streams that exhibit coherent temporal events, then at some point you pretty much have to implement a queuing balancer.

One simply trades latency for capacity and eventual coherent data locality.

It's almost an arbitrary detail whether you use Kafka, RabbitMQ, or Erlang channels. If you can add smart client application-layer predictive load balancing, then it is possible to cut burst traffic loads by an order of magnitude or two. Cost-optimized dynamic host scaling is not always a solution that solves every problem.

Good luck out there =)
chris_armstrong, over 1 year ago
A similar idea [^1] has cropped up in the serverless OpenTelemetry world: collate OpenTelemetry spans in a Kinesis stream before forwarding them to a third-party service for analysis. This obviates the need for a separate collector, reduces forwarding latency, and removes the cold-start overhead of the AWS Distro for OpenTelemetry Lambda layer.

[^1] https://x.com/donkersgood/status/1662074303456636929?s=20
bushbaba, over 1 year ago
Seems like overkill, no? OTel collectors are fairly cheap, so why add expensive Kafka into the mix? If you need to buffer, why not just dump to S3 or a similar data store as temporary storage?
nicognaw, over 1 year ago
SigNoz is too good at SEO.

Early on, when I looked up OTel and observability stuff, I always saw SigNoz articles on the first screen of results.
daurnimator, over 1 year ago
I expect it would be far cheaper to scale up Tempo/Loki than it would be to even run an idle Kafka cluster. This feels like spending thousands of dollars to save tens of dollars.
anacrolix, over 1 year ago
Are there any client-side dynamic samplers that can target a maximum event rate? In my experience, burstiness with OTel has been a thorn in the side of everything that uses it, and it's frustrating.
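
For illustration only, one way a sampler capped at a maximum event rate could look is a token bucket. The sketch below is an assumption about the shape of such a sampler, not an existing OpenTelemetry SDK class: the name RateLimitingSampler, its parameters and the should_sample method are hypothetical, and wiring it into a real SDK's sampler interface is left out. The clock is injectable so the behaviour can be checked without sleeping.

```python
import time


class RateLimitingSampler:
    """Token-bucket sampler (hypothetical): keeps at most ~max_per_second events.

    Tokens refill continuously at max_per_second; each kept event costs one
    token. `burst` caps how many tokens can accumulate while traffic is quiet.
    """

    def __init__(self, max_per_second, burst=None, clock=time.monotonic):
        self.rate = float(max_per_second)
        self.capacity = float(burst if burst is not None else max_per_second)
        self.tokens = self.capacity  # start with a full bucket
        self.clock = clock
        self.last = clock()

    def should_sample(self):
        """Return True if the event should be kept, False if it is dropped."""
        now = self.clock()
        # Refill according to elapsed time, never above the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A bucket like this bounds the outgoing rate during a burst but drops events blindly once the bucket is empty, so it does not by itself keep whole traces together.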
nijave, over 1 year ago
It'd be nice to have something simpler as an OTel processor. OTel could just dump events to local disk as sequential writes, then read them back, load permitting.

I'm curious how long things stay in Kafka on average and in the worst case. If it's more than a few minutes, I imagine it lowers the quality of tail-based sampling.
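
A minimal sketch of the "sequential writes to local disk, read back when load permits" idea, under heavy assumptions: it is not an OpenTelemetry Collector component, the DiskSpool class and its methods are hypothetical, events are plain JSON-serializable dicts, the read position lives only in memory, and the file is never truncated or rotated.

```python
import json
import os


class DiskSpool:
    """Append-only spool file (hypothetical): one JSON event per line."""

    def __init__(self, path):
        self.path = path
        self.read_pos = 0  # byte offset of the next unread event
        open(path, "ab").close()  # create the file if it does not exist

    def append(self, event):
        """Write one event at the end of the file (a sequential write)."""
        with open(self.path, "ab") as f:
            f.write(json.dumps(event).encode("utf-8") + b"\n")
            f.flush()
            os.fsync(f.fileno())  # push the write through to disk

    def drain(self, max_events):
        """Read back up to max_events unread events, oldest first."""
        out = []
        with open(self.path, "rb") as f:
            f.seek(self.read_pos)
            while len(out) < max_events:
                line = f.readline()
                if not line.endswith(b"\n"):
                    break  # end of file, or a partially written last line
                out.append(json.loads(line))
                self.read_pos = f.tell()
        return out
```

The caller decides how many events to drain per cycle, which is where "load permitting" would come in; calling fsync on every append is the simplest durable choice and would likely need batching in practice.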