A lot of words without any concrete proposals on how to solve the problem.

Telemetry is captured up front because you can't analyse after the fact what you never recorded. If you could solve that time-travel problem, people would capture less telemetry. I think the key is anomaly detection that captures the rare events, because 90% of telemetry is happy-path noise that doesn't give you any extra insight. But doing that anomaly detection cheaply and correctly is extremely hard.
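For a sense of what even a cheap version of that gate can look like, here is a minimal Python sketch. It is not real anomaly detection, just static thresholds plus random sampling of the happy path, and every field name and threshold is invented:

    import random

    def should_keep(event, slow_ms=1000, happy_sample_rate=0.01):
        """Keep anything that looks abnormal; sample the happy path heavily."""
        if event.get("status", 200) >= 500:        # errors are always kept
            return True
        if event.get("duration_ms", 0) > slow_ms:  # latency outliers too
            return True
        return random.random() < happy_sample_rate # ~1% of the happy case

    events = [
        {"status": 200, "duration_ms": 42},
        {"status": 503, "duration_ms": 17},
        {"status": 200, "duration_ms": 2400},
    ]
    kept = [e for e in events if should_keep(e)]   # the 503 and the slow 200 always survive

The hard part is exactly what this skips: knowing which thresholds matter, and catching the anomalies that don't show up as an error code or a slow request.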
I have a different take.

> Engineers have to pre-define and send all telemetry data they might need – since it's so difficult to make changes after the fact – regardless of the percentage chance of the actual need.

YES. Let them send all the data. The best place to solve for it is at ingestion.

There are typically five stages to this process:

Instrumentation -> Ingestion -> Storage -> Query (Dashboard) -> Query (Alerting)

Instrumentation is the wrong place to solve this.

Ingestion - Build pipelines that let you process this data, with tools like streaming aggregation and cardinality controls that can reshape it or act on anomalous patterns (a sketch of one such control follows below). This at least makes working with observability data dynamic, instead of always having to go back and change instrumentation.
Storage - Provide tiered data storage - blaze (2 hours), hot (1 month), cold (13 months) - with independent read paths.

This, in my opinion, has solved the bulk of the cost and re-work challenges associated with telemetry data.

I believe observability is the Big Data of today, without the Big Data tools! (Disclosure: I work at Last9.io and we have taken a similar approach to solving these challenges.)

https://last9.io/data-tiering/
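As an illustration of what cardinality controls at ingestion can mean in practice, here is a toy Python sketch (not Last9's actual pipeline; the per-label cap and the overflow bucket are assumptions): cap the number of distinct values a label may take and fold the overflow into one bucket, so storage and query stay cheap even when instrumentation sends everything.

    from collections import defaultdict

    MAX_VALUES_PER_LABEL = 1000
    seen = defaultdict(set)  # (metric, label) -> distinct values observed so far

    def control_cardinality(metric, labels):
        """Rewrite labels whose value set has exploded into a single overflow bucket."""
        out = {}
        for key, value in labels.items():
            values = seen[(metric, key)]
            if value not in values and len(values) >= MAX_VALUES_PER_LABEL:
                out[key] = "__overflow__"   # e.g. a user_id label gone wild
            else:
                values.add(value)
                out[key] = value
        return metric, out

    # a raw sample with an unbounded label arrives from instrumentation...
    metric, labels = control_cardinality("http_requests_total",
                                         {"path": "/cart", "user_id": "u-8812"})
    # ...and is stored and aggregated with the (possibly rewritten) labels.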
People put too much junk in their logs. Most logs are irrelevant. Companies have built businesses selling logging solutions, and they sponsor developer conferences to obscure this simple truth.

Just say no to the logging industrial complex.
My solution[1] to this problem is to do what they did in the Apollo Guidance Computer: log to a ring buffer and only flush it (to disk or wherever) on certain conditions.

1. https://www.komu.engineer/blogs/09/log-without-losing-context
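A minimal sketch of that pattern in Python, assuming a fixed capacity and a flush on ERROR (class name and defaults are mine, not from the linked post). The stdlib's logging.handlers.MemoryHandler is close, but it flushes when full instead of dropping the oldest records:

    import collections
    import logging
    import sys

    class RingBufferHandler(logging.Handler):
        """Hold records in a fixed-size ring; flush only when a trigger fires."""

        def __init__(self, capacity=512, flush_level=logging.ERROR, target=None):
            super().__init__()
            self.buffer = collections.deque(maxlen=capacity)  # old records fall off the end
            self.flush_level = flush_level
            self.target = target or logging.StreamHandler(sys.stderr)

        def emit(self, record):
            self.buffer.append(record)
            if record.levelno >= self.flush_level:
                self.flush()

        def flush(self):
            while self.buffer:
                self.target.handle(self.buffer.popleft())
            self.target.flush()

    # Debug chatter stays in memory; an ERROR dumps the context that led up to it.
    logging.basicConfig(handlers=[RingBufferHandler()], level=logging.DEBUG)
    log = logging.getLogger("worker")
    log.debug("step 1 ok")
    log.debug("step 2 ok")
    log.error("step 3 failed")  # flushes steps 1 and 2 along with the error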
it's not.

stuff logs into s3. learn to mine them in parallel using lambda or ec2 spot. grow tech or teams as needed for scale. never egress data and never persist data outside of the cheapest s3 tiers. expire data on some sane schedule.

data processing is fun, interesting, and valuable. it is core to understanding your systems.

if you can't do this well, there is probably a lot more you can't do well either. in that case, life is going to be very expensive.

it's ok to not do this well yet! spend some portion of your week doing this and you will improve quickly.
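a minimal sketch of the mining half in python with boto3, assuming date-partitioned, gzipped json-lines objects; the bucket, prefix, and field names are placeholders, and the same functions would drop into a lambda handler or a spot worker:

    import gzip
    import json
    from concurrent.futures import ThreadPoolExecutor

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "example-log-archive"    # placeholder bucket
    PREFIX = "app-logs/2024/06/01/"   # placeholder date-partitioned prefix

    def matching_lines(key, predicate):
        """Download one object and return the events the predicate selects."""
        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        hits = []
        for line in gzip.decompress(body).decode("utf-8").splitlines():
            event = json.loads(line)
            if predicate(event):
                hits.append(event)
        return hits

    def scan(predicate, workers=32):
        """List every object under the prefix and scan them in parallel."""
        keys = []
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
            keys.extend(obj["Key"] for obj in page.get("Contents", []))
        with ThreadPoolExecutor(max_workers=workers) as pool:
            for hits in pool.map(lambda k: matching_lines(k, predicate), keys):
                yield from hits

    # e.g. pull every 5xx on one path out of a day's logs
    errors = list(scan(lambda e: e.get("status", 0) >= 500 and e.get("path") == "/checkout"))

s3 lifecycle rules handle the "expire on a sane schedule" part, so nothing here ever needs a second copy of the data.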