All you need is Wide Events, not "Metrics, Logs and Traces"

267 points by talboren about 1 year ago

32 comments

Osmose about 1 year ago
This isn't an unknown idea outside of Meta, it's just really expensive, especially if you're using a vendor and not building your own tooling. Prohibitively so, even with sampling.
wiseguyeh about 1 year ago
While I don't have an opinion on wide events (AKA spans) replacing logs, there are benefits to metrics that warrant their existence:

1. They're incredibly cheap to store. In Prometheus, a sample may cost you as little as 1 byte (ignoring series overheads). Because they're cheap, you can keep them for much longer and use them for long-term analysis of traffic, resource use, performance, etc. Most tracing vendors seem to cap storage at 1-3 months, while metric vendors can offer multi-year storage.

2. They're far more accurate than metrics derived from wide events in higher-throughput scenarios. While wide events are incredibly flexible, their higher storage cost means there's an upper limit on the sample rate. The sampled nature of wide events means that deriving accurate counts is far more difficult - metrics really shine in this role (unless you're operating over datasets with very high cardinality). The problem only gets worse when you mix tail sampling into the equation and add bias towards errors/slow requests in your data.
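The accuracy point in (2) is easy to demonstrate: counts from sampled wide events are estimates recovered by inverse-probability weighting, whereas a counter records the exact total. A minimal sketch, with invented field names:

```python
# Hypothetical sampled wide events: each kept event records the rate
# at which it survived sampling, so each one stands in for 1/rate
# original events (a Horvitz-Thompson style estimate).
events = [
    {"endpoint": "/checkout", "sampling_rate": 0.01},
    {"endpoint": "/checkout", "sampling_rate": 0.01},
    {"endpoint": "/home", "sampling_rate": 0.1},
]

def estimated_count(sampled_events):
    """Estimate the original event count from the sampled stream."""
    return sum(1.0 / e["sampling_rate"] for e in sampled_events)

print(estimated_count(events))  # 210.0 - an estimate with variance;
# a Prometheus counter would have recorded the exact total instead.
```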
fnordpiglet about 1 year ago
This is essentially Amazon Coral's service log format, except service logs include cumulative metrics between log events. This surfaces in CloudWatch Logs as metrics extraction and in Logs Insights as structured log queries. Meta's Scuba is like a janky imitation of that toolchain.

People point to Splunk and ELK, but they fail to realize that inverted-index-based solutions algorithmically can't scale to arbitrary sizes. I would rather point people to Grafana Loki and CloudWatch Logs Insights, with the compromises they entail, as the right model for "wide events" or structured-logging-based events and metrics. Their architectures allow you to scale at low cost to petabyte or even exabyte scale monitoring.
infogulch about 1 year ago
This thread has a lot of discussion about Wide Events / Structured Logs (same thing) being too big at scale, and that you should use metrics instead.

Why does it have to be an either/or thing? Couldn't you hook up a metrics extractor to the event stream and convert your structured logs to compact metrics in-process, before the expensive serde/encoding? With this, your choice doesn't have to affect the code - just write slogs all the time; if you want structured logs, output them, but if you only want metrics, switch to the metrics-extractor slog handler.

Further, has nobody tried writing structured logs to Parquet files and shipping out 1 MB blocks at once? Way less serde/encoding overhead, and the column-oriented layout compresses like crazy with the built-in dictionary and delta encodings.
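The Parquet idea is straightforward to prototype. A minimal sketch using pyarrow, assuming a process that buffers events in memory and flushes them as a block (all field names are invented for illustration):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A buffered batch of structured log events (wide events).
events = [
    {"ts": 1709200000, "service": "checkout", "status": 200, "dur_ms": 12},
    {"ts": 1709200001, "service": "checkout", "status": 500, "dur_ms": 340},
    {"ts": 1709200002, "service": "search", "status": 200, "dur_ms": 8},
]

# Columnar layout: repetitive fields (service names, status codes)
# compress extremely well under dictionary and delta encodings.
table = pa.Table.from_pylist(events)
pq.write_table(table, "events-000001.parquet", compression="zstd")
```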
jeffbee about 1 year ago
The isomorphism of traces and logs is clear. You can flatten a trace to a log, and you can perfectly reconstruct the trace graph from such a log. I don't see the unifying theme that brings metrics into this framework, though. Metrics feel fundamentally different, as a way to inspect the internal state of your program, not necessarily driven by exogenous events.

But I definitely agree with the theme of the article that leaving a big company can feel like you got your memory erased in a time machine mishap. Inside a FANG you might become normalized to logging hundreds of thousands of informational statements, per second, per core. You might have got used to every endpoint exposing thirty million metric time series. As soon as you walk out the door, some guy will chew you out about "cardinality" if you have 100 metrics.
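The flattening claim can be made concrete: if every log record carries its span ID and parent span ID, the trace graph falls out of a single pass. A sketch with invented IDs and span names:

```python
# Hypothetical trace flattened into log records: span id plus parent
# span id is enough information to rebuild the graph exactly.
log_lines = [
    {"span_id": "a", "parent_id": None, "name": "GET /checkout"},
    {"span_id": "b", "parent_id": "a", "name": "auth.verify"},
    {"span_id": "c", "parent_id": "a", "name": "db.query"},
    {"span_id": "d", "parent_id": "c", "name": "db.retry"},
]

def rebuild_trace(lines):
    """Reconstruct the span tree from flat log records."""
    children = {}
    for rec in lines:
        children.setdefault(rec["parent_id"], []).append(rec)

    def subtree(parent_id):
        return [
            {"name": rec["name"], "children": subtree(rec["span_id"])}
            for rec in children.get(parent_id, [])
        ]

    return subtree(None)

print(rebuild_trace(log_lines))
```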
mikpanko about 1 year ago
It took the world decades to develop widely accepted standards for working with relational data and SQL. I believe we are at the early stages of doing the same with event data and sequence analytics. It is starting to emerge simultaneously in many different fields:

- eng observability (traces at Datadog, Sumologic, etc)
- operational research (process mining at Celonis)
- product analytics (funnels at Amplitude, Mixpanel)

As with every new field, there are a lot of different and overlapping terms being suggested and explored at the same time.

We are trying to contribute to the field with a deep fundamental approach at Motif Analytics, including a purpose-built set of core sequence operations, rich flow visualizations, a pattern-matching query engine, and foundational AI models on event sequences [1].

Fun fact: the creators of Scuba turned it into a startup, Interana (acquired by Twitter), from whom we took a lot of inspiration for Motif's query engine.

[1] https://motifanalytics.com
timthelion about 1 year ago
At the company I work for, we send JSON to Kafka and subsequently to Elasticsearch, with great effect. That's basically 'wide events'. The magical thing about hooking up a bunch of pipelines with Kafka is that all of a sudden your observability/metrics system becomes an amazing API for extending systems with additional automations. Want to do something when a router connects to a network? Just subscribe to this Kafka topic here. It doesn't matter that the topic was originally intended just to log some events. We even created an open source library for writing and running these pipelines in Jupyter. Here's a super simple example: https://github.com/bitswan-space/BitSwan/blob/master/examples/Jupyter/Kafka2Kafka/main.ipynb

People tend to think Kafka is hard, but as you can see from the example, it can be extremely easy.
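The "observability pipeline as extension API" pattern is just another consumer on the same topic. A minimal sketch with the kafka-python client, assuming a hypothetical topic of router-connection events (topic and field names invented):

```python
import json

from kafka import KafkaConsumer  # kafka-python client

# Events originally emitted for logging, reused as an integration point.
consumer = KafkaConsumer(
    "network.router.connected",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # React to the "log" event, e.g. kick off provisioning automation.
    print(f"router {event.get('router_id')} connected; provisioning...")
```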
treflop about 1 year ago
We use wide events at work (or really "structured logging", or really "your log system has fields") and they are great.

But they aren't a replacement for metrics, because metrics are so god damn cheap.

And while I've never used a log system with traces, every logging setup I've ever used has had request/correlation IDs to generate a trace, because sometimes you just wanna look up a flow and see it without spending time digging through wide events/your log system. If you aren't looking up logs very often, then yeah, browsing through structured logs doesn't seem that bad, but do it often and it's just annoying…
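The correlation-ID trick amounts to a one-field filter plus a sort by time; a sketch over in-memory records (field names invented):

```python
# Hypothetical structured log entries, each carrying a correlation ID.
logs = [
    {"ts": 3, "request_id": "req-42", "svc": "payments", "msg": "charged"},
    {"ts": 1, "request_id": "req-42", "svc": "gateway", "msg": "received"},
    {"ts": 2, "request_id": "req-99", "svc": "gateway", "msg": "received"},
    {"ts": 2, "request_id": "req-42", "svc": "cart", "msg": "validated"},
]

def flow(request_id):
    """Reassemble a request's flow from its correlated log entries."""
    return sorted(
        (e for e in logs if e["request_id"] == request_id),
        key=lambda e: e["ts"],
    )

for entry in flow("req-42"):
    print(entry["ts"], entry["svc"], entry["msg"])
```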
zug_zug about 1 year ago
This person is simply misinformed. I worked at Meta and used Scuba, and it's like 6/10 (which makes it one of Meta's best tools).

A tool like Splunk can do everything Scuba can do and a million things it can't. Sumologic can too.

The reason that Splunk/Sumologic are so much better than Scuba is that they have open-ended query languages rather than this on-rails "only ever do one group-by". Just for example, if you wanted to dynamically extract a field at query time based on the difference of two values and group by that, that's something you can do trivially in Splunk/Sumo.

I could write a whole essay on the topic really, but the gist of it is that you need a full-scale, open-ended language for advanced querying, because 1% of the time you need to do weird stuff like count-by -> a second count-by.

What I will agree with is that traces/metrics do not inherently give you this ability, but traces absolutely *could* if there were a platform with a powerful enough query language for it (e.g. give me all requests that go through 4 services, have errors on service 3 but not 4, and are associated with userId 123 on service 1).
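That closing query is essentially a group-by-trace followed by a per-group predicate; a sketch over in-memory wide events (service and field names invented):

```python
from collections import defaultdict

# Hypothetical wide events, one per service hop, keyed by trace id.
events = [
    {"trace_id": "t1", "svc": "s1", "error": False, "user_id": "123"},
    {"trace_id": "t1", "svc": "s2", "error": False},
    {"trace_id": "t1", "svc": "s3", "error": True},
    {"trace_id": "t1", "svc": "s4", "error": False},
]

traces = defaultdict(list)
for e in events:
    traces[e["trace_id"]].append(e)

def matches(hops):
    """Through all 4 services, error on s3 but not s4, user 123 on s1."""
    by_svc = {h["svc"]: h for h in hops}
    return (
        {"s1", "s2", "s3", "s4"} <= set(by_svc)
        and by_svc["s3"]["error"]
        and not by_svc["s4"]["error"]
        and by_svc["s1"].get("user_id") == "123"
    )

print([tid for tid, hops in traces.items() if matches(hops)])  # ['t1']
```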
karmakaze about 1 year ago
What's the difference between a Wide Event and a structured log?
TeeWEE about 1 year ago
This is basically a metric with tags. The only difference is that a metric has a main unit it measures.

In the end, anything can be represented as a structured log.

A span is NOT what the OP calls a "system wide event". A span has a begin and end time; what he/she describes doesn't have that.

In the end, giving different kinds of instrumentation instruments a name makes sense, mainly for processing them / rendering them / alerting on them.
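In data-shape terms, the distinction is just one timestamp versus two; a sketch (type and field names are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class WideEvent:
    ts: float                      # a single point in time
    fields: dict = field(default_factory=dict)

@dataclass
class Span:
    start_ts: float                # spans carry a begin time...
    end_ts: float                  # ...and an end time
    fields: dict = field(default_factory=dict)

    @property
    def duration(self) -> float:
        return self.end_ts - self.start_ts
```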
tlarkworthy about 1 year ago
FWIW I think X-Ray has everything you need; it's just that AWS tooling doesn't give you much ability to aggregate over X-Ray bundles. I wrote a tool to help bulk load X-Ray samples into a local in-browser DuckDB and then slice and dice them in real-time interactive visualisations. It also includes the ability to generate a flamegraph over the selected traces. All this great data is already in an AWS account; we just need better tools to make use of it.

https://observablehq.com/@tomlarkworthy/x-ray-slurper
gorlami about 1 year ago
So, what is the closest thing in the open-source world to what the author describes? (Setting aside the question of whether it's right for you, which, of course, depends.)
swader999 about 1 year ago
This seems like event sourcing with a nice tool to inspect, filter and visualize the event stream. The sampling rate idea is a decent tactic I hadn't heard of.
jrockway about 1 year ago
I like logs. Unlike most people selling and using observability platforms, most of the software I write is run by other people. That means it can't send me traces and I can't scrape it for metrics, but I still have to figure out and fix their problems. To me, logs are the answer. Logs are easy to pass around, and you can put whatever you want in there. I have libraries for metrics and traces, and just parse them out of the logs when that sort of presentation would be useful. (Yes, we do sampling as well.)

I keep hearing that this doesn't scale. When I worked at Google, we used this sort of system to monitor our Google Fiber devices. They just uploaded their logs every minute (stored in memory, held in memory after a warm reboot thanks to a custom Linux kernel with printk_persist), and then my software processed them into metrics for the "fast query" monitoring systems. The most important metrics fed into alerts, but it didn't take very much time to just re-read all the logs if you wanted to add something new. Amazingly, the first version of this system ran on a single machine... 1 Go program handling 10,000 qps of log uploads and analysis. I eventually distributed it to survive machine and datacenter failures, but it ultimately isn't that computationally intensive. The point is, it kind of scales OK. Up to 10s of terabytes a day, it's something you don't even have to think about, except for the storage cost.

At some point it does make sense to move things into better databases than logs; you want to be alerted by your monitoring system that 99%-ile latency is high, then look in Jaeger for long-running traces, then take the trace ID and search your logs for it. If you start with logs, you have that capability. If you start with something else, then you just have "the program is broken, good luck" and you have to guess what the problem is whenever you debug. Ideally, your program would just tell you what's broken. That's what logs are.

One place where people get burned with logs is not being careful about what to log. Logs are the primary user interface for operators of your software (i.e. you during your oncall week), and that task deserves the attention that any other user interface task demands. People often start by logging too much, then get tired of "spam", and end up not logging enough. Then a problem occurs and the logs are outright misleading. (My favorite is event failures that are retried, but the retry isn't logged anywhere. You end up seeing "ERROR foobar attempt 1/3 failed" and have no way of knowing that attempt 2/3 succeeded a millisecond after that log line.)

For the gophers around, here's what I do for traces: https://github.com/pachyderm/pachyderm/blob/master/src/internal/log/span.go#L130 and metrics: https://github.com/pachyderm/pachyderm/blob/master/src/internal/meters/meters.go. If you have a pipeline for storing and retrieving logs (which is exactly the case for this particular piece of software), now you have metrics and traces. It's great! I just need to write the thing to turn a set of log files into a UI that looks like Jaeger and Prometheus ;) My favorite part is that I don't need to care about the cardinality of metrics; every RPC gets its own set of metrics. So I can write a quick jq program to figure out how much bandwidth the entire system is using, or I can look at how much bandwidth one request is using. (meters logs every X bytes, and log entries have timestamps.)

I think since we've added this capability to our system, incidents are most often resolved with "that's fixed in the next patch release" instead of multiple iterations of "can you try this custom build and take another debug dump". Very enjoyable.
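The "quick jq program" workflow translates directly into any language; a Python sketch with invented log fields, rolling meter entries up into per-RPC byte totals:

```python
import json
from collections import Counter

# Hypothetical meter-style log lines: a byte count emitted every X
# bytes, tagged with the RPC that produced it.
log_lines = [
    '{"ts": 1, "meter": "tx_bytes", "rpc": "GetFile", "value": 1048576}',
    '{"ts": 2, "meter": "tx_bytes", "rpc": "PutFile", "value": 524288}',
    '{"ts": 3, "meter": "tx_bytes", "rpc": "GetFile", "value": 1048576}',
]

# Derive per-RPC bandwidth after the fact: no cardinality budget is
# needed at write time, because the "metric" is just a log scan.
totals = Counter()
for line in log_lines:
    entry = json.loads(line)
    if entry.get("meter") == "tx_bytes":
        totals[entry["rpc"]] += entry["value"]

print(dict(totals))  # {'GetFile': 2097152, 'PutFile': 524288}
```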
goosejuice about 1 year ago
This sounds like a privacy nightmare as described, if there aren't guardrails. 'Dump everything'.

You can pretty easily achieve this with structured logging in GCP with their metrics explorer. Pretty cheaply, I might add. Sentry can also do a bit of this if you're on something like fly.io (they offer a year free).

I don't think either would completely replace tracing in a complex system for me. At least not in the contexts I've worked in.
mdavidn about 1 year ago
The SQL equivalent in the sampling-rate example should sum the inverse of the sampling rate:

SELECT SUM(1 / samplingRate) FROM AdImpressions WHERE IsTest = False
4ndrewl about 1 year ago
Wide events are fine until someone puts personally identifiable information (PII) in them. Then you're in a bit of a mess, as you've presumably taken PII out of an environment with one set of access controls and into a separate, different environment whose access controls were designed for a different purpose than the data requires.
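One common guardrail is a scrubbing step where events are emitted; a minimal sketch, with illustrative field names and patterns rather than a complete PII strategy:

```python
import re

# Hypothetical deny-list of identifying fields, plus a pattern for
# email-shaped strings hiding inside free-text values.
PII_KEYS = {"email", "phone", "full_name", "ip"}
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def scrub(event: dict) -> dict:
    """Redact known PII fields and email-like substrings."""
    clean = {}
    for key, value in event.items():
        if key in PII_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, str):
            clean[key] = EMAIL_RE.sub("[REDACTED]", value)
        else:
            clean[key] = value
    return clean

print(scrub({"user": "u1", "email": "a@b.com", "msg": "contact a@b.com"}))
```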
Scotrix about 1 year ago
We've been using the ELK stack for almost a decade to get somewhat-wide events with no sampling; at not-Meta scale, a few GB/day makes it absolutely affordable and super fast. Unfortunately, Kibana was a bit better/easier in the old versions than it is nowadays, but it's still pretty straightforward to get everything out of it.
veeralpatel979 about 1 year ago
Great article. Here is a Python notebook I created earlier to show how you can capture such wide events:

https://colab.research.google.com/drive/1Y65qXXogoDgOnXFBDyFsW2EPsJRUf8_J?usp=sharing
zemo about 1 year ago
Begging people to recognize that a person who sells a solution is going to view these problems through the lens of being rewarded for applying their solution to your problem, even if it's not appropriate.

> Yet, per my own experience it's still extremely hard to explain what does Charity meant by "logs are thrash", let alone the fact that logs and traces are essentially the same things. Why is everyone so confused?

Charity is not confused, Charity is *incentivized*. What she means by "logs are trash" is "I do not sell a logging product". (And, to be clear, I'm only naming Charity individually here because that's who the author named in their article.)

> When I was working at Meta, I wasn't aware that I was privileged to be using the best observability system ever.

The observability system that is appropriate for Meta is not necessarily appropriate for your project. Those tools are cool but also require a pretty serious investment to build and operate correctly. It's *very* easy to wade into a cardinality-explosion problem when tagging and indexing everything you can imagine, it's *very* easy to wade into problems regarding mixed retention policies when some events are important and others are less important, it's *very* easy to wade into a latency-sensitivity issue if you're building log/event collection infra that you never allow to lose data, etc. As it turns out, observability is a large topic.

The idea that there's one "best" way to do observability is a little ridiculous. Like, when I worked at Etsy, some of the data was literally money; when I worked at Jackbox Games, we made fart-joke games (Quiplash, Drawful, Fibbage, You Don't Know Jack, etc) and the infrastructure was nothing but pure cost. The observability needs of those two orgs were *phenomenally* different, because the products were different, the revenue models were different, the needs of the users were different, etc.

Also, this notion that "all you need is wide events" is the answer seems... really shallow. A data point is an unordered set of key-value pairs? That's how... a LOT of logging, metrics, and tracing infra expresses things at the level of an individual record/event. The difference is in the relationships between the keys and values, the relationships between the individual records, etc.

And "stop sampling" is just a bizarre marketing angle. If you have 1 million records or 10 million records and you get the same squiggly line out of analyzing them, congrats, you have inflated the size of the data that nobody ever looks at. There is only one party this benefits, and it's the one who charges you for the pipeline, which is *exactly* why people who sell a pipeline are *incentivized* to tell you that sampling is bad: if you are sampling, you are sending and storing and querying fewer data points, so they are charging you less money. They are getting *paid* to tell you that sampling is bad. Sampling is not good or bad, sampling is sampling. The reality is that in a lot of these systems, the vast majority of the information will never, ever be looked at or used. Whether or not that matters is *entirely* context dependent.
pranabgohain about 1 year ago
You could also use a unified, OTel-native platform like https://www.kloudmate.com instead of setting up Grafana, Prometheus, and Loki separately.
jhardy54 about 1 year ago
Is this not just structured logging? I'm wondering whether the author has used tracing tools much, or whether they're trying to understand modern observability purely through OpenTelemetry documentation.
npalli about 1 year ago
Man, this whole "All you need is XYZ" framing is turning out to be as irritating and overused as "XYZ considered harmful" from back in the day.
renewiltord about 1 year ago
Has anyone built an open-source version of this and written a blog post about it? Curious about the implementation, to see how you keep storage tight and querying still fast.
abhisgup about 1 year ago
New Relic supports this in the form of custom events. I have used it and it works, but it is very expensive. An alternative is to use ClickHouse directly.
jupp0r about 1 year ago
This looks like structured logging and piping those logs to Splunk, or am I missing something?
blinded about 1 year ago
OTel is working on their events spec: https://github.com/open-telemetry/community/issues/1688
ojkelly about 1 year ago
Observability as a shared concept has followed Agile and DevOps.

Something with a real meaning that enables a step change in development practices. Adoption is organic initially, because the pain it solves is very real.

But as awareness of the idea grows, it threatens established institutions and vendors, who must co-opt the concept and redefine it such that they are included.

If they can't be explicitly included (logs, metrics, traces)[0], then they at least make sure the definition becomes so vague and confused that they are not explicitly excluded[1].

Wide events and a good means to query them cover everything, but not if you as a vendor cannot store and query wide events.

[0] As the article notes, one of these is not like the others. [1] Is Scrum Agile? What do you mean a standup can't go for an hour? See also DevOps as a role.
guhcampos about 1 year ago
Incredible what you can do with infinite money!<p>For everyone else, more specific data structures, sampling and careful consideration of what to record are essential.
rekwah about 1 year ago
> just put it there, it might be useful later

> Also note that we have never mentioned anything about cardinality. Because it doesn't matter - any field can be of any cardinality. Scuba works with raw events and doesn't pre-aggregate anything, and so cardinality is not an issue.

This is how we end up with very large, very expensive data swamps.
nemo44x about 1 year ago
The best part of this post is where they quote a failed SaaS trying to explain why a successful SaaS is wrong. Anything for an edge, even if it's not useful.