Monitoring demystified: A guide for logging, tracing, metrics

487 points by malechimp almost 5 years ago

12 comments

buro9 almost 5 years ago

A lot of excellent information in that blog post and linked from it... but if you're wondering where to start:

1. Write good logs... not too noisy when everything is running well, meaningful enough to let you know the key state or branch of code when things deviate from the good path. Don't worry about structured vs unstructured too much, just ensure you include a timestamp, file, log level, func name (or line number), and that the message will help you debug.

2. Instrument metrics using Prometheus; there are libraries that make this easy: https://prometheus.io/docs/instrumenting/clientlibs/ . Counts get you started, but you probably want to think in aggregation and to ask about the rate of things and percentiles. Use histograms for this: https://prometheus.io/docs/practices/histograms/ . Use labels to create a more complex picture, i.e. a histogram of HTTP request times with a label of HTTP method means you can see all reqs, just the POST, or maybe the HEAD and GET together, etc... and then create rates over time, percentiles, etc. Do think about the cardinality of label values: HTTP methods are good, but request identifiers are bad in high-traffic environments... labels should group, not identify.

Start with those things. Tracing follows good logging and metrics, as it takes a little more effort to instrument an entire system, whereas logging and metrics are valuable even when only small parts of a system are instrumented.

Once you've instrumented... Grafana Cloud offers a hosted Grafana, Prometheus metrics scraping and storage, and log tailing and storage (via Loki) https://grafana.com/products/cloud/ so you can see the results of your work immediately.

If it's a big project, you have a lot of options and I assume you know them already. This is when you start looking at Cortex and Thanos, Datadog and Loki, and tracing with Jaeger.
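
As a rough illustration of the logging advice in point 1, here is a minimal sketch using Python's standard `logging` module. The logger name `payments` and the `charge` function are made up for the example; the format string simply includes the fields listed above (timestamp, log level, file, line number, func name, message).

```python
import io
import logging

# Fields from the list above: timestamp, level, file, line number, function name, message.
LOG_FORMAT = "%(asctime)s %(levelname)s %(filename)s:%(lineno)d %(funcName)s %(message)s"


def build_logger(name, stream):
    """Return a logger that writes LOG_FORMAT-formatted records to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers if called twice
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    logger.addHandler(handler)
    logger.propagate = False  # keep the demo output out of the root logger
    return logger


def charge(logger, amount_cents):
    """Hypothetical business function: quiet on the good path, informative when it deviates."""
    if amount_cents <= 0:
        logger.warning("rejecting charge: non-positive amount_cents=%d", amount_cents)
        return False
    logger.debug("charge accepted amount_cents=%d", amount_cents)  # dropped at INFO level
    return True


buffer = io.StringIO()
log = build_logger("payments", buffer)
charge(log, 500)  # good path: nothing is written at INFO level
charge(log, -1)   # deviation: one WARNING line is written
print(buffer.getvalue(), end="")
```

Only the rejected charge produces a line, and that line says which function and which value caused it, which is roughly the "not too noisy, but meaningful when things deviate" balance described above.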

KaiserPro almost 5 years ago

A few things I have learnt along the way:

Logs are great, but only once you've identified the problem. If you are searching through logs to _find_ a problem, it's far too late.

Processing/streaming logs to get metrics is a terrible waste of time, energy and money. Spend that producing high quality metrics directly from the apps you are looking after/writing/decomming (example: don't use access logs to collect 4xx/5xx and make a graph; collate and push the metrics directly).

Raw metrics are pretty useless. They need to be manipulated into business goals: "service x is producing 3% 5xx errors" vs "% of visitors unable to perform action x".

Alerts must be actionable.

Alert rules must be based on sensible, clear-cut rules: "service x's response time is breaching its SLA", not "service x's response time is double its average for this time in May".
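
To make the "produce metrics directly" and "clear-cut alert rule" points a bit more concrete, here is a small sketch in plain Python. Everything in it is an assumption for illustration: the `RequestMetrics` class, the 1% error threshold, and the 0.5 s latency threshold are not from the comment, and a real service would normally hand these numbers to a metrics library rather than keep them in memory.

```python
from collections import Counter


class RequestMetrics:
    """Hypothetical in-process metrics, counted as requests are served, not parsed from access logs."""

    def __init__(self):
        self.by_status_class = Counter()  # e.g. {"2xx": 97, "5xx": 3}
        self.durations = []               # seconds, one entry per request

    def observe(self, status_code, duration_seconds):
        self.by_status_class[f"{status_code // 100}xx"] += 1
        self.durations.append(duration_seconds)

    def error_ratio(self):
        total = sum(self.by_status_class.values())
        return self.by_status_class["5xx"] / total if total else 0.0

    def slow_ratio(self, threshold_seconds):
        if not self.durations:
            return 0.0
        slow = sum(1 for d in self.durations if d > threshold_seconds)
        return slow / len(self.durations)


def sla_alerts(metrics, max_error_ratio=0.01, max_latency_seconds=0.5, max_slow_ratio=0.05):
    """Fire only on fixed, agreed thresholds (assumed values here), not on 'unusual for the season'."""
    alerts = []
    if metrics.error_ratio() > max_error_ratio:
        alerts.append("5xx ratio above SLA")
    if metrics.slow_ratio(max_latency_seconds) > max_slow_ratio:
        alerts.append("latency above SLA")
    return alerts


m = RequestMetrics()
for _ in range(97):
    m.observe(200, 0.1)
for _ in range(3):
    m.observe(503, 0.1)
print(sla_alerts(m))
```

The design choice worth noting is that the alert function compares against constants someone has agreed to, so whoever gets paged can tell immediately which promise is being broken.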

dig1 almost 5 years ago

The Art of Monitoring [1] covers most of this stuff in a unified manner.

You are introduced to some basics (push vs. pull monitoring), then it proceeds with simple system metrics collection (CPU, memory) via collectd, then goes on to log ingestion, and ends up extracting application-specific metrics from JVM and Python applications.

I highly recommend it, even for seasoned professionals.

[1] https://artofmonitoring.com/

gnufx almost 5 years ago

I never see an important system management principle brought up: If you get a user complaint (for some value of "user") and not an alert, you should fix the monitoring system so that you don't get another occurrence of it or related problems. Obviously that's within reason, depending on the circumstances; the effort might not be worth it.

secondcoming almost 5 years ago

We log extensively. Here are some of my thoughts on it:

- at least in C++, the requirement to be able to log from pretty much anywhere can lead to messy code that either passes a reference to your logger to all classes that might possibly need it, or you've got an extern global somewhere. Yuck.

- logging can enable laziness. Being able to log that something weird happened can be considered a sufficient substitute for proper testing.

- logs are only as useful as the info they contain. This can mean state needs to be passed around all over the place just so that it can all be eventually logged on one line (it saves your data team from having to do a 'join').

- if your logger doesn't support cycling log files it's useless. If something goes wrong you can easily fill a disk.
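
For the last point, a minimal sketch of what "cycling log files" can look like, using `RotatingFileHandler` from Python's standard library (the comment is about C++, so this only illustrates the idea). The 1 KB size limit, the two backups, and the `rotation_demo` logger name are arbitrary values chosen for the demo.

```python
import logging
import logging.handlers
import os
import tempfile

log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, "app.log")

# Roll over before the file reaches ~1 KB and keep at most 2 old files (assumed limits),
# so total disk use stays bounded however much gets logged.
handler = logging.handlers.RotatingFileHandler(log_path, maxBytes=1024, backupCount=2)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

logger = logging.getLogger("rotation_demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# Simulate something going wrong and logging in a tight loop.
for i in range(1000):
    logger.error("something weird happened, attempt=%d", i)

handler.close()
files = sorted(os.listdir(log_dir))
print(files)
print(sum(os.path.getsize(os.path.join(log_dir, f)) for f in files), "bytes on disk")
```

The oldest messages are discarded once the backup count is reached, which is the trade-off being accepted: a bounded disk footprint in exchange for losing old history.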

kasey_junk almost 5 years ago

It’s weird to see the stuff by Jay Kreps (of Kafka ~fame~) listed in the logs section. His writing is specifically _not_ about logs the observability tool, but logs the data structure such as you’d see at the heart of a database.

say_it_as_it_is almost 5 years ago

Is there an open source solution for processing streams of structured and unstructured logs and routing them onward? I see solutions for moving logs to Elastic or Kafka, but nothing for evaluating the log.

waihtis almost 5 years ago

> Logging is critical to detecting attacks and intrusions.

Yes, but not universally - and just collecting logs will not take you far. Logging everything and trying to approach security via ’collect all data’ is both expensive and inaccurate, and one of the major inefficiencies in modern cyber.

FrontAid almost 5 years ago

Recently, I was searching for a service which offers those functionalities on a very basic level. I tried several options and was really disappointed with all of them. The only one that I found to be usable was https://logdna.com/ . I've now been using it for a couple of weeks and it works OK. It offers logging, alerts, metrics/dashboards, and some other things. And all that at a reasonable price.

xondono almost 5 years ago

Am I the only one that can’t reach the “save and exit” privacy button on mobile?

It’s hard for me to think that this is not intentional when the “Accept all” is usable but the alternative isn’t...

notmalc almost 5 years ago

Nice

anderspitman almost 5 years ago

If you don't need all the fancy metrics, and just want something simple to keep an eye on your services, alert you if they fail, and automatically restart them, check out my stealthcheck service. It's all of 150 lines of free range, 0-dependency Go:

https://github.com/anderspitman/stealthcheck