Tracing: Structured logging, but better

263 points by pondidum, over 1 year ago

29 comments

zoogeny, over 1 year ago

One thing about logging and tracing is the inevitable cost (in real money).

I love observability probably more than most. And my initial reaction to this article is the obvious: why not both?

In fact, I tend to think more in terms of "events" when writing both logs and tracing code. How that event is notified, stored, transmitted, etc. is in some ways divorced from the activity. I don't care if it is going to stdout, or over UDP to an aggregator, or turning into trace statements, or ending up in Kafka, etc.

But inevitably I bump up against cost. For even medium-sized systems, the amount of data I would like to track gets quite expensive. For example, many tracing services charge for the tags you add to traces. So doing `trace.String("key", value)` becomes something I think about from a cost perspective. I worked at a place that had a $250k/year New Relic bill and we were avoiding any kind of custom attributes. Just getting APM metrics for servers and databases was enough to get to that cost.

Logs are cheap, easy, reliable and don't lock me in to an expensive service to start. I mean, maybe you end up integrating Splunk or perhaps self-hosting Kibana, but you can get 90% of the benefits just by dumping the logs into CloudWatch or even S3 for a much cheaper price.

layer8, over 1 year ago

> Log Levels are meaningless. Is a log line debug, info, warning, error, fatal, or some other shade in between?

I partly agree and disagree. In terms of severity, there are only three levels:

- info: not a problem
- warning: potential problem
- error: actual problem (operational failure)

Other levels like “debug” are not about severity, but about level of detail.

In addition, something that is an error in a subcomponent may only be a warning or even just an info on the level of the superordinate component. Thus the severity has to be interpreted relative to the source component.

The latter can be an issue if the severity is only interpreted globally. Either it will be wrong for the global level, or subcomponents have to know the global context they are running in to use the severity appropriate for that context. The latter causes undesirable dependencies on a global context. Meaning, the developer of a lower-level subcomponent would have to know the exact context in which that component is used, in order to choose the appropriate log level. And what if the component is used in different contexts entailing different severities?

So one might conclude that the severity indication is useless after all, but IMO one should rather conclude that severity needs to be interpreted relative to the component. This also means that a lower-level error may have to be logged again in the higher-level context if it’s still an error there, so that it doesn’t get ignored if e.g. monitoring only looks at errors on the higher-level context.

Differences between “fatal” and “error” are really nesting differences between components/contexts. An error is always fatal on the level where it originates.

fnordpiglet, over 1 year ago

Tracing is poor at very long-lived traces and at stream processing, and most tracing implementations are too heavy to run in computationally bound tasks except at a very coarse level. Logging is nice in that it has no context, no overhead, is generally very cheap to compose and emit, and, when it includes a transaction id and is done in a structured way, gives you most of what tracing does without all the other baggage.

That said, for the spaces where tracing works well, it works unreasonably well.

alkonaut, over 1 year ago

I like a log to read like a book if it’s the result of a task taking a finite time, such as an installation, a compilation, the loading of a browser page or similar. Users are going to look into it for clues about what happened, and they a) aren’t always related to those who wrote the tools and b) don’t have access to the source code or any special log analytics/querying tools.

That’s when you want a log, and that’s what the big traditional log frameworks were designed to handle.

A web backend/service is basically the opposite. End users don’t have access to the log, those who analyze it can cross-reference with system internals like source code or db state, and the log is basically infinite. In that situation a structured log and querying obviously win.

It’s honestly not even clear that these systems are that closely related.

mrkeen, over 1 year ago

> If you’re writing log statements, you’re doing it wrong.

I too use this bait statement.

Then I follow it up with (the short version):

1) Rewrite your log statements so that they're machine-readable.
2) Prove they're machine-readable by having the downstream services read them instead of the REST call you would have otherwise sent.
3) Switch out log4j for Kafka, which will handle the persistence & multiplexing for you.

Voila, you got yourself a reactive, event-driven system with accurate "logs".

If you're like me and you read the article thinking "I like the result but I hate polluting my business code with all that tracing code", well, now you can create an independent reader of your Kafka events which just focuses on turning events into traces.

crabbone, over 1 year ago

> The second problem with writing logs to stdout

Who on Earth does that? Logs are almost always written to stderr... In part to prevent other problems the author is talking about (e.g. mixing with the output generated by the application).

I don't understand why this has to be either/or... If you store the trace output somewhere you get a log... (let's call it an "un-annotated" log, since a trace won't have the human-readable message part). A trace is great when examining the application interactively, but if you use the same exact tool and save the results for later you get logs, with all the same problems the author ascribes to logs.

benreesman, over 1 year ago

As a historical critic of Rust-mania (and if I’m honest, kind of an asshole about it too many times, fail), I’ve recently bumped into stuff like tokio-tracing, eyre, tokio-console, and some others.

And while my historical gripes are largely still the status quo: stack traces in multi-threaded, evented/async code that actually show real line numbers? Span-based tracing that makes concurrent introspection possible by default?

I’m in. I apologize for everything bad I ever said and don’t care about whatever other annoying thing.

That’s the whole show. Unless it deletes my hard drive I don’t really care about anything else by comparison.

hardwaresofton, over 1 year ago

I think there's an alternate universe out there where:

- we collectively realized that logs, events, traces, metrics, and errors are actually all just logs
- we agreed on a single format that encapsulated all that information in a structured manner
- we built firehose/stream processing tooling to provide modern o11y creature comforts

I can't tell if that universe is better than this one, or worse.

jeffbee, over 1 year ago

This is a great article because everyone should understand the similarity between logging and tracing. One thing worth pondering though is the difference in cost. If I am not planning to centrally collect and index informational logs, free-form text logging is extremely cheap. Even a complex log line with formatted strings and numbers can be emitted in < 1µs on modern machines. If you are handling something like 100s or 1000s of requests per second per core, which is pretty respectable, putting a handful of informational log statements in the critical path won't hurt anyone.

Off-the-shelf tracing libraries, on the other hand, are pretty expensive. You have one additional mandatory read of the system clock, to establish the span duration, plus you are still paying for a clock read on every span event, if you use span events. Every span has a PRNG call, too. Distributed tracing is worthless if you don't send the spans somewhere, so you have to budget for encoding your span into JSON, msgpack, protobuf, or whatever. It's a completely different ball game in terms of efficiency.

perpil, over 1 year ago

I was recently musing about the 2 different types of logs:

1. application logs, emitted multiple times per request, which serve as breadcrumbs
2. request logs, emitted once per request, which include latencies, counters and metadata about the request and response

The application logs were useless to me except during development. However, the request logs I could run aggregations on, which made them far more useful for answering questions. What the author explains very well is that the problem with application logs is they aren't very human-readable, which is where visualizing a request with tracing shines. If you don't have tracing, creating request logs will get you most of the way there; it's certainly better than application logs. https://speedrun.nobackspacecrew.com/blog/2023/09/08/logging-for-scale.html
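
A minimal sketch of the "one request log per request" idea, assuming a plain JSON line as the output format. The `RequestLog` class and its field names (`counters`, `timings_ms`, `duration_ms`) are hypothetical and not taken from the linked post.

```python
import json
import time


class RequestLog:
    """Accumulates counters, timings and metadata, then emits one JSON line."""

    def __init__(self, **metadata):
        self.fields = dict(metadata)      # e.g. method, path, caller
        self.counters = {}
        self.timings_ms = {}
        self._start = time.perf_counter()

    def count(self, name, amount=1):
        # Increment a named counter (db queries, cache hits, retries...).
        self.counters[name] = self.counters.get(name, 0) + amount

    def record_timing(self, name, started_at):
        # started_at is a time.perf_counter() value taken before the work began.
        elapsed = time.perf_counter() - started_at
        self.timings_ms[name] = round(elapsed * 1000, 3)

    def emit(self, status):
        # Called exactly once, at the end of the request.
        entry = dict(self.fields)
        entry["status"] = status
        entry["counters"] = self.counters
        entry["timings_ms"] = self.timings_ms
        entry["duration_ms"] = round((time.perf_counter() - self._start) * 1000, 3)
        return json.dumps(entry, sort_keys=True)
```

Because every request produces a single record with the same shape, aggregations such as "p99 of duration_ms by path" become a query over one kind of line rather than a join across breadcrumbs.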

ducharmdev, over 1 year ago

Minor nitpick, but I wish this post started with defining what we mean by logging vs tracing, since some people use these interchangeably. The reader instead has to infer this from the criticisms of logging.

waffletower, over 1 year ago

There are logging libraries that include syntactically scoped timers, such as mulog (https://github.com/BrunoBonacci/mulog). While it is a great library, we preferred timbre (https://github.com/taoensso/timbre) and rolled our own logging timer macro that interoperates with it. It is more convenient to have such niceties in a Lisp, of course. Since we also have OpenTelemetry available, it would be easy to wrap traces around code form boundaries as well. Thanks OP for the idea!
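
For readers outside Clojure, a rough Python analogue of a scoped logging timer is a context manager. This is only a sketch: `log_timed` is a hypothetical helper, and it does not reproduce the API of mulog or timbre.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("timed")


@contextmanager
def log_timed(name, log=logger, **fields):
    """Log how long the enclosed block took, even if it raises."""
    start = time.perf_counter()
    outcome = "ok"
    try:
        yield
    except Exception:
        outcome = "error"
        raise
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        log.info("%s outcome=%s elapsed_ms=%.3f fields=%r",
                 name, outcome, elapsed_ms, fields)
```

Usage would look like `with log_timed("load-user", user_id=42): ...`, so the timer's scope is the indented block, much as a macro would scope it to a code form.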

goalieca, over 1 year ago

Logging is essential for security. I think tracing is wonderful and so are metrics. I see these as more of a triad for observability.

gazpacho, over 1 year ago

One big failing of OpenTelemetry's traces in particular is that attaching structured data to them is difficult. Most structured logs can be JSON, and for all of its faults, most things can be serialized to JSON. OpenTelemetry's attributes on traces are much more limited; they don't even support a null/None value! I wish they just accepted JSON-like data; it'd make it much easier to always use traces.
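
One common workaround is to flatten JSON-like data into dotted keys before attaching it to a span. The sketch below is pure Python and makes no OpenTelemetry calls; `flatten_attributes` and the `.is_null` marker convention are assumptions for illustration, based on the understanding that attribute values are limited to strings, booleans, numbers and homogeneous arrays of those.

```python
def flatten_attributes(value, prefix=""):
    """Flatten a JSON-like dict into dotted keys with primitive values.

    Expects a dict at the top level. None becomes '<key>.is_null': True,
    mixed or nested lists are expanded by index, empty lists are dropped,
    and unknown types fall back to str().
    """
    primitives = (str, bool, int, float)
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            child_key = f"{prefix}.{key}" if prefix else str(key)
            flat.update(flatten_attributes(child, child_key))
    elif isinstance(value, (list, tuple)):
        same_type = len({type(v) for v in value}) == 1
        if value and same_type and all(isinstance(v, primitives) for v in value):
            flat[prefix] = list(value)  # homogeneous array of primitives
        else:
            for index, child in enumerate(value):
                flat.update(flatten_attributes(child, f"{prefix}.{index}"))
    elif value is None:
        flat[f"{prefix}.is_null"] = True  # explicit stand-in for null
    elif isinstance(value, primitives):
        flat[prefix] = value
    else:
        flat[prefix] = str(value)
    return flat
```

The trade-off is that the original shape can no longer be recovered exactly from the flat keys, which is part of the complaint above.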

h1fra, over 1 year ago

Tracing is much more actionable but barely usable without a platform, which makes local programming dependent on a third party. Also, it requires passing context, or having a way to get back the context, in every function that requires it, which can be daunting.

On my side I have opted for mixed structured/text logging: a generic message that can be easily understood while glancing over logs, and a data object attached for more details.
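
The mixed structured/text style can be sketched with the standard `logging` module. The `MessagePlusDataFormatter` class and the `data` key passed through `extra` are hypothetical names chosen for this example, not a convention from any particular library.

```python
import json
import logging


class MessagePlusDataFormatter(logging.Formatter):
    """Render 'LEVEL message {json}' so lines scan well and stay parseable."""

    def format(self, record):
        # 'data' is set via logger.info(..., extra={"data": {...}}).
        data = getattr(record, "data", None)
        line = f"{record.levelname} {record.getMessage()}"
        if data:
            # default=str keeps odd values (dates, UUIDs) from breaking the log call.
            line += " " + json.dumps(data, sort_keys=True, default=str)
        return line
```

A call such as `log.info("payment failed", extra={"data": {"order_id": 42}})` then yields a line whose first part is readable at a glance and whose tail can be parsed when more detail is needed.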

vkoskiv, over 1 year ago

Nit to the author: 'rapala' seems like a mistranslation. It is the brand name of a company that makes fishing lures, as far as I can tell. It is not the Finnish word for "to bait", and is therefore only used to refer to that particular brand. I'm not sure what the purpose of the text in parentheses is here, but 'houkutella' would be the most apt translation in this case.

jauntywundrkind, over 1 year ago

What's most incredible to me is how close tracing feels in spirit to event-sourcing.

Here's this log of every frame of compute going on, plus data or metadata about the frame... but afaik we have yet to start using the same stream of computation for business processes as we do for its excellent observability.

koliber, over 1 year ago

Does this naive approach work for anyone to allow a log to be read like a trace:

1. At the start of a request, generate a globally unique traceId.
2. Pass this traceId through the whole call stack.
3. Whenever logging, log the traceId as a parameter.

Now you have a log with many of the pluses of a trace. The only additional cost to the log is the storage of the traceId on every message.

If you want to read a trace, search through your logs for "traceId: xyz123". If you use plain text storage you can grep. If you use some indexed storage, search for the key-value pair.

This way, you can retrieve something that looks like a trace from a log.

This does not solve all the issues named in the article. However, it is a decent tradeoff that I've used successfully in the past. Call it "poor man's tracing".

skybrian, over 1 year ago

How would a hobbyist programmer get started with tracing for a simple web app? Where do the traces end up and how do I query them? Can tracing be used in a development environment?

Context: the last thing I wrote used Deno and Deno Deploy.

spullara, over 1 year ago

It drives me insane that the standardized tracing libraries have you only report closed spans. What if it crashes? What if it stalls? Why should I keep open spans in memory when I can just write an end span event?
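
A minimal sketch of what separate start and end span events could look like, assuming newline-delimited JSON as the sink. The event names (`span_start`, `span_end`) and helper functions are hypothetical and are not part of any standard tracing library.

```python
import json
import time
import uuid


def span_start(write, name, trace_id, parent_id=None):
    """Emit a start event immediately, so the span is visible even if we crash."""
    span_id = uuid.uuid4().hex[:16]
    write(json.dumps({
        "event": "span_start", "trace_id": trace_id, "span_id": span_id,
        "parent_id": parent_id, "name": name, "ts": time.time(),
    }))
    return span_id


def span_end(write, span_id, trace_id):
    """Emit the matching end event; nothing has to be held in memory in between."""
    write(json.dumps({
        "event": "span_end", "trace_id": trace_id,
        "span_id": span_id, "ts": time.time(),
    }))


def open_spans(lines):
    """Return spans that started but never ended: crashed or stalled work."""
    started = {}
    for line in lines:
        event = json.loads(line)
        if event["event"] == "span_start":
            started[event["span_id"]] = event["name"]
        elif event["event"] == "span_end":
            started.pop(event["span_id"], None)
    return started
```

The cost moves to the reader, which has to pair start and end events, but a stalled or crashed operation shows up as an unmatched start instead of disappearing.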

andersrs, over 1 year ago

I have a side project that I run in Kubernetes with a postgres database and a few Go/Nodejs apps. Recommend me a lightweight otel backend that isn't going to blow out my cloud costs.

hosh, over 1 year ago

That’s weird. I use both logging and tracing where I can. And metrics.

While there are better tools for alerting, metrics, or aggregations, it helps a lot in debugging and troubleshooting.

marcus_holmes, over 1 year ago

I'm fundamentally uncomfortable with sending all my data to a third party.

The cool thing about logs is that they're just a text file and don't need to be sent over the internet to someone else. But yes, I've encountered some problems just using text logs and I'd like to solve them.

Is there an OpenTelemetry solution that is capable of being self-hosted (and preferably open source) that anyone recommends?

jasonjmcghee, over 1 year ago

I really enjoyed the content; it's a great article.

Note to author: all but the last code block have a very odd mixture of rather large font sizes (at least on mobile), varying line to line, that makes them pretty difficult to read.

Also, the link to "Observability Driven Development" was a blank slide deck AFAICT.

amelius, over 1 year ago

This is stuff that a debugger is supposed to do for you, for free.

This should not require code at the application level; it should be implemented at the tooling level.

lambda_garden, over 1 year ago

Couldn't this be injected into the runtime so that no code changes are required?

Perhaps really performance-critical stuff could have a "notrace" annotation.

killbot5000, over 1 year ago

Logs should go to stderr. I will die on this hill.

hello1234567, over 1 year ago

The person writing this came to know something that he didn't know earlier and decided to convert his light bulb moment into a blog post. Not bad, but he failed to understand that logs are the generalisation of the very thing he is talking about.

thegrizzlyking, over 1 year ago

Logs are mostly "Hi I reached this line of code, here is some metadata"