
Do we need to store all that telemetry?

102 points by mklein123 about 1 year ago

15 comments

h1fra about 1 year ago
I understand the point, but I also advocate for the opposite: it's not cool for the planet for sure, but having all the data points for at least a couple of months is very useful on any large system, and 15+ months for metrics so you can compare with the year before.

I can't count the number of times users (or myself) discovered a bug after many weeks because something gradually failed over time. It also saves a lot of time to be able to pinpoint the exact day a behavior changed, so you can check the deploy of that day and quickly find the bug. Sometimes a trend is not obvious after a deploy but is clearly visible on the graph after a long period of time.

And for business intelligence, it's always when you badly need a metric that you realize you never tracked it.
jcgrillo about 1 year ago
Another facet of this is *how* do we store telemetry data? Fully indexed, instantaneously searchable seems to be the "default" these days, but who actually needs that?

I keep harping on this, but compressed UTF-8 text (or even worse, compressed JSON) is a horribly wasteful way to do it. See [1]. Putting a small amount of thought into storing telemetry data seems like it could yield incredible savings at scale.

[1] https://lists.w3.org/Archives/Public/www-logging/1996May/0000.html
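A minimal sketch of the contrast jcgrillo is drawing, assuming a hypothetical four-field access record; the field layout and helper names here are made up for illustration, not taken from the linked proposal:

```python
import json
import struct
import time

# Hypothetical fixed-width record: timestamp, status code, latency,
# and response size pack into 16 bytes, versus ~70 bytes as JSON.
RECORD = struct.Struct("<dHHl")  # f64 ts, u16 status, u16 latency_ms, i32 bytes

def encode_binary(ts: float, status: int, latency_ms: int, size: int) -> bytes:
    return RECORD.pack(ts, status, latency_ms, size)

def encode_json(ts: float, status: int, latency_ms: int, size: int) -> bytes:
    return json.dumps(
        {"ts": ts, "status": status, "latency_ms": latency_ms, "bytes": size}
    ).encode("utf-8")

now = time.time()
print(len(encode_binary(now, 200, 37, 5120)))  # 16 bytes, fixed width
print(len(encode_json(now, 200, 37, 5120)))    # ~70 bytes before compression
```

Compression narrows the gap but does not eliminate it, since JSON repeats every field name in every record.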
mountainriver about 1 year ago
Great post, the observability folks have gone off the rails in the last 5 years. I’ve seen it do more harm than good in terms of dev speed and ironically often make things less observable for the common path.
klabb3 about 1 year ago
Isn't the issue more that off-the-shelf solutions optimize for features and not cost? For instance, if I sell you an observability product, I want to show off all the cool realtime debugging features and such. And since there's a cost to having all these features available (retention, indexing, sampling), we end up paying for features we don't need. In a world of usage-based XaaS, there's very little incentive to be cost-effective. Arguably even a perverse incentive to waste resources.

I bet you a full dollar that both in-house and open source solutions, on average, are way more stingy with resources. As they should be.
blobcode about 1 year ago
> we can also add local storage of telemetry data in an efficient circular buffer. Typically, local storage is cheap and underutilized, allowing for "free" storage of a finite amount of historical data, that wraps automatically. Local storage provides the ability to "time travel" when a particular event is hit.

I think that this is a good idea when storage is a concern for high-volume logs / production. Persisting the buffer when high error rates / unusual system behavior is observed would be a cool idea.
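A minimal sketch of the quoted idea plus blobcode's persist-on-anomaly suggestion; the `TelemetryRing` class and the flush trigger are hypothetical:

```python
from collections import deque
from datetime import datetime, timezone

# Hypothetical sketch: keep recent telemetry in a bounded in-memory
# ring buffer, and only persist it when something anomalous happens.
class TelemetryRing:
    def __init__(self, capacity: int = 10_000):
        self.buffer = deque(maxlen=capacity)  # old entries wrap automatically

    def record(self, event: str) -> None:
        ts = datetime.now(timezone.utc).isoformat()
        self.buffer.append(f"{ts} {event}")

    def flush(self, path: str) -> None:
        # Called on high error rate / unusual behavior: "time travel"
        # by dumping the recent history that led up to the anomaly.
        with open(path, "w") as f:
            f.write("\n".join(self.buffer))

ring = TelemetryRing(capacity=5)
for i in range(8):
    ring.record(f"request {i} ok")
ring.flush("/tmp/incident-snapshot.log")  # holds only the last 5 events
```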
jauntywundrkind about 1 year ago
We've turned off logging & tracing on a bunch of our high-volume routes. Ideally I'd prefer we still sample them, at like 0.1% or whatnot, to give us some indicator, some chance of seeing anomalies. It just seems easier to gather & use this information than it is to go develop a suite of metrics that can register all issues.

OpenTelemetry recently-ish gained the Open Agent Management Protocol (OpAMP), which allows some runtime control over things generating telemetry. The ability to stay fairly low but then scale up as needed sounds tempting, but gee, it also sends shivers down my spine thinking of having such elastic demands on one's telemetry infrastructure, as engineers turn telemetry up as problems are occurring. https://opentelemetry.io/docs/specs/opamp/

The idea of having a local circular buffer sounds excellent to me. Being able to run local queries & aggregate would be sweet. Are there any open otel issues discussing these ideas?
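A minimal sketch of the 0.1% head sampling being described, using a plain random coin flip rather than any OpenTelemetry or OpAMP API; the decorator and print statements are hypothetical stand-ins for a real tracer:

```python
import random

SAMPLE_RATE = 0.001  # keep 1 in 1,000 requests on high-volume routes

def maybe_trace(handler):
    def wrapped(request):
        sampled = random.random() < SAMPLE_RATE
        if sampled:
            print(f"trace: handling {request}")  # stand-in for real tracing
        response = handler(request)
        if sampled:
            print(f"trace: finished {request}")
        return response
    return wrapped

@maybe_trace
def serve(request):
    return f"ok: {request}"

for i in range(5):
    serve(f"/hot/route?id={i}")  # roughly 1 in 1,000 calls emits a trace
```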
wrs about 1 year ago
We continue to recreate features of single-computer OSes in distributed systems. This seems like the dtrace/bpftrace of the microservices world.
gghffguhvc about 1 year ago
“A lot of telemetry doesn’t need to be stored for very long” is the attitude I take. Keeps costs down but gives good visibility.
zug_zug about 1 year ago
I think most places don't collect enough telemetry in the right formats.

It's also possible they collect too much in the wrong formats.

But the ability to vet a hypothesis (I bet our users are confused about feature X, which we can test by looking at how many times they go to page X, then Y, then X again in a 30-second window) in an hour versus 2 sprints is vastly underappreciated/underutilized.

I feel like this article paints with too broad a brush.
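As a sketch of how cheap that hypothesis check can be when events are in the right format, here is a hypothetical pass over flat (user, page, timestamp) tuples; the data, window, and function name are made up:

```python
events = [
    ("alice", "X", 100.0), ("alice", "Y", 110.0), ("alice", "X", 118.0),
    ("bob",   "X", 200.0), ("bob",   "Y", 290.0), ("bob",   "X", 295.0),
]

def confused_users(events, window: float = 30.0):
    # Group page visits per user, in time order.
    by_user: dict[str, list[tuple[str, float]]] = {}
    for user, page, ts in sorted(events, key=lambda e: e[2]):
        by_user.setdefault(user, []).append((page, ts))
    hits = set()
    for user, visits in by_user.items():
        for (p1, t1), (p2, _), (p3, t3) in zip(visits, visits[1:], visits[2:]):
            # X, then Y, then back to X, all inside the window
            if (p1, p2, p3) == ("X", "Y", "X") and t3 - t1 <= window:
                hits.add(user)
    return hits

print(confused_users(events))  # {'alice'}: bob's round trip took 95 seconds
```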
m3047 about 1 year ago
Agree with the article enough that I did something about it, which I call "Poor Fred's SIEM". The heart of it is a DNS proxy for Redis (https://github.com/m3047/rkvdns). However it's not targeted at environments where everything is in a "bubble" such that there are no ingress / egress costs. (Lookin' at you, Cloud.) Furthermore "control plane" is an important concept, and it's well understood in the industrial control world as the Purdue Model.

From a systems standpoint, do you need to have all resources stored centrally in order to do centralized reporting? No, of course not. Admittedly it's handy if bandwidth and storage are free. The alternative is distributed storage, with or without summarization at the edge (and aggregating from distributed storage for reporting).

Having it distributed does raise access issues: access needs to be controlled, and management of access needs to be managed. Philosophically the Cloud solutions sell centralized management, but federation is a perfectly viable option. The choice is largely dictated by organizational structure, not technology.

There is also a difference between diagnostic and evaluative indicators. Trying to evaluate from diagnostics causes fatigue because humans aren't built that way; evaluatives can and should be built from diagnostics. Diagnostics can't be built from evaluatives.

The logging/telemetry stack that I propose is:

1) Ephemeral logging at the limits of whatever observability you can build. E.g.: systemd journal with a small backing store, similar to a ring buffer.

2) Your compliance framework may require shipping some classes of events off of the local host, but I don't think any of them require shipping it to the cloud.

3) Build evaluatives locally in Redis.

4) Use DNS to query those evaluatives from elsewhere for ad hoc as well as historical purposes. This could be a centralized location, or it could be true federation where each site accesses all other sites' evaluatives.

I wouldn't put Redis on the internet, but I don't worry too much about DNS; and there are well-understood ways of securing DNS from tampering, unauthorized access, and even observation. By the way, DNS will handle hundreds or thousands of queries per second; you just have to build for it.
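A minimal sketch of step 3 above, assuming a local Redis server and the redis-py client; the key naming scheme and helper are made up, not part of rkvdns:

```python
import redis  # assumes redis-py and a Redis server on localhost

# Roll raw diagnostic events up into evaluative counters locally,
# so only the small evaluatives ever need to leave the host
# (in the stack above, queried remotely via rkvdns over DNS).
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def record_login_failure(host: str) -> None:
    key = f"evaluative:{host}:login_failures:1h"
    r.incr(key)          # aggregate counter, not a raw log line
    r.expire(key, 3600)  # evaluatives age out on their own

record_login_failure("web-01")
print(r.get("evaluative:web-01:login_failures:1h"))
```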
yetanotherdood about 1 year ago
> For 30 years how telemetry is produced has not changed: we define all of the data points that we need ahead of time and ship them out of the origin process, typically at large expense. If we apply the control plane / data plane split to observability telemetry production we can fundamentally change the status quo for the first time in three decades

Has Matt read any prior art in this field? https://research.google/pubs/monarch-googles-planet-scale-in-memory-time-series-database/
jedberg about 1 year ago
NO! You don't!

I couldn't agree with the author more. Keeping historical records of *business* metrics makes a ton of sense. But historical telemetry (CPU, memory, network, error logs) makes little sense.

If an issue occurs, *then* turn on telemetry around that issue until you track it down. If an issue occurs once and never again, did it really matter? This obviously does not apply to security; I'm just speaking of operational issues.

Keeping all of your application logs and telemetry forever is expensive, and I can't recall a single time when having more than a day's worth of history was ever useful in tracking down an operational issue.
binary132 about 1 year ago
In general I think many programmers have internalized the idea that it's best to waste as many computing resources as we can possibly afford, as long as it's not the bottleneck. Then, in the future, if and when it becomes the bottleneck, we'll have plenty of headroom to optimize and look like heroes for saving the millions of dollars we never had to spend in the first place. It's really insane at best, or genuinely a type of grift at worst.
thisislife2 about 1 year ago
"Data is the new oil" - if you don't collect your customer data and treat it as an asset, you are guilty of mismanagement. /s
murat124 about 1 year ago
YES you do. BUT with varying retention periods for each: a) environment, b) region, c) function, d) criticality, e) metric namespace/name, f) team, etc.

Nobody needs to retain metrics like CPU and memory for weeks, but I may want to see their numbers during an incident, or not long after it is over.
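A hypothetical sketch of that kind of per-dimension retention policy; the rules and day counts are invented for illustration:

```python
# Pick the retention period for a metric from the most specific
# matching rule; dimensions here are environment and namespace prefix.
RETENTION_DAYS = [
    # (environment, namespace prefix, days) -- first match wins
    ("prod", "business.", 455),  # ~15 months, for year-over-year compares
    ("prod", "system.",     7),  # CPU / memory: incidents only
    ("staging", "",          2),
    ("dev", "",              1),
]

def retention_for(environment: str, metric: str) -> int:
    for env, prefix, days in RETENTION_DAYS:
        if environment == env and metric.startswith(prefix):
            return days
    return 30  # default for anything unmatched

print(retention_for("prod", "system.cpu.utilization"))   # 7
print(retention_for("prod", "business.orders.created"))  # 455
```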