One of my pet peeves is "The Useless Use of cat Award". Someone awarded it to me as a teenager in the late 90s and I've been sore ever since.<p>Yup, it's often a waste of resources to run an extra 'cat'. It really demonstrates that you don't have the usage of the command receiving the output completely memorized. You know, the thousand or so commands you might be piping it into.<p>But if you're doing a 'useless' use of cat, you're probably just doing it in an interactive session. You're not writing a script. (Or maybe you are, but even still, I bet that script isn't running thousands of times per second. And if it is, ok, time to question it.)<p>So you're wasting a few clock cycles. The computer does a few billion of those per second. By the time you've explained the 'useless' use of cat to someone, the time you wasted telling them why they're wrong is greater than the total time their lifetime usage of cat was ever going to waste.<p>There's a set of people who correct the same three homophones that get used incorrectly but don't know what the word 'homophone' is. (Har har, they're/their/there.) I lump the people who are so quick to chew someone out for using cat into the same batch as the people who do that. What if I just want to use cat because it makes my command easier to edit? I can press up, jump to the front of the line, and change it real quick.<p>Sorry. I did say it is a pet peeve.
My biggest quality of life improvement for understanding logs has been lnav (<a href="https://lnav.org/" rel="nofollow">https://lnav.org/</a>) -- does everything mentioned in this post in a single tool with interactive filtering and quick logical and time based navigation.
One thing I've done to identify infrequent log entries within a log file is to remove all numbers from the file and print a frequency count of each resulting line. Basically it just helps to disregard timestamps (not just at the beginning of the line), line numbers, etc.<p><pre><code> cat file.log | sed 's/[0-9]//g' | sort | uniq -c | sort -nr
</code></pre>
This has been incredibly helpful in quickly resolving outages more than once.
My tips:<p>1) Fuck grep, use ripgrep, especially if you have to scour an entire directory.<p>2) Get good with regex; seriously, it will shave hours off your searching.<p>3) For whatever application you are using, get to know how the logging is created. Find the methods where said logs are made, and understand why each log line exists.<p>4) Get good with piping into awk when you need some nice readable output.
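<p>For instance (a rough sketch with a made-up log layout, where the timestamp is the first field and /var/log/myapp/ is a placeholder directory):<p><pre><code> # search a whole directory, then cut the output down to timestamp + last field
 rg -IN 'timed out' /var/log/myapp/ | awk '{ print $1, $NF }'
</code></pre>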
Honestly, the most amazing thing I did with logs was learn how to do subtraction. Any time you have multiple instances of a thing and only some of them are bad, you can easily find the problem (if anyone bothered to log it) by performing bad - good.<p>The way you do this is by aggregating logs by fingerprints. Removing everything but punctuation is a generic approach to fingerprinting, but is not exactly human friendly. For Java, log4j can use class in your logging pattern, and that plus log level is usually pretty specific.<p>Once you have a fingerprint, the rest is just counting and division. Over a specific time window, count the number of log events, for every finger print, for both good and bad systems. Then score every fingerprint as (1+ # of bad events) / (1 + # of good events) and everything at the top is most strongly bad. And the more often its logged, the further up it will be. No more lecturing people about "correct" interpretations of ERROR vs INFO vs DEBUG. No more "this ERROR is always logged, even during normal operations".
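<p>A minimal sketch of the counting and scoring step (assuming plain-text good.log and bad.log covering the same window, with digit-stripping as the fingerprint):<p><pre><code> awk '
   { gsub(/[0-9]+/, ""); key = $0 }   # fingerprint: drop all digits
   FNR == NR { good[key]++; next }    # first file: good.log
   { bad[key]++ }                     # second file: bad.log
   END {
     for (k in bad)
       printf "%.2f\t%s\n", (1 + bad[k]) / (1 + good[k]), k
   }
 ' good.log bad.log | sort -rn | head
</code></pre>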
One thing I didn't see was how to use grep to view the lines before and after a match:<p><pre><code> grep regex /var/log/logfile -A5 #To view the next 5 lines
grep regex /var/log/logfile -B5 #To view the previous 5 lines
grep regex /var/log/logfile -C5 #To view the 5 lines before *and* after the match
</code></pre>
This is super handy to find out what happened just before a service crashed, for example.
Loosely related: a few years ago I wanted a simpler alternative to some of the more feature-full log viewers out there so I threw together a tiny (50kb) app that might be useful to some folks in here.<p>All it does is consistently colors the first field in a line from stdin so you can quickly see which log lines have the same first field.<p>I used it in combination with the parallel[0] command to prefix log lines by replica name when tailing logs across machines: <a href="https://github.com/jasisk/color-prefix-pipe" rel="nofollow">https://github.com/jasisk/color-prefix-pipe</a><p>[0]: <a href="https://www.gnu.org/software/parallel/" rel="nofollow">https://www.gnu.org/software/parallel/</a>
As much as I approve of a skillset for analyzing local logs, past a relatively small scale (10-20 systems) a decent central log aggregation like OpenSearch or ELK brings so much value, even on 1-3 nodes. It'd be one of the first changes I make to an infrastructure because it's so powerful.<p>And it's not just the log searching and correlation value. At work, the entire discussion "oh but we need access to all servers because of logs" just died when all logs were accessible via one web interface. I added log aggregation and suddenly only ops needed access to servers.<p>Designing that thing with accessibility and discoverability in mind is a whole other topic, though.
Glad to see lnav (<a href="https://lnav.org" rel="nofollow">https://lnav.org</a>) already getting some love in the comments. Hands-down the most reliable source of "thank you! I wish I'd known about this tool sooner!" responses, even from v experienced sysadmins / SREs.
A minor optimization is collapsing the grep -v, from this:<p><pre><code> cat file | grep -v THING1 | grep -v THING2 | grep -v THING3 | grep -v THING4
</code></pre>
to this:<p><pre><code> egrep -v 'THING1|THING2|THING3|THING4' file
</code></pre>
That gets rid of the cat and three greps. Both POSIX and GNU encourage grep -E to be used in preference to egrep.<p>A pcregrep utility also used to exist, if you want expansive perl-compatible regular expressions. This has been absorbed into GNU grep with the -P option.
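<p>With the encouraged spelling, the filter above becomes:<p><pre><code> grep -E -v 'THING1|THING2|THING3|THING4' file
</code></pre>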
> you’ll get overwhelmed by a million irrelevant messages because the log level is set to INFO<p>I know this happens, but I think it's because programmers are abusing INFO. In principle it's reserved for messages that are informative at a level sysadmins and a few others can make sense of and use. Unfortunately, abuse often leads to "we turned INFO off", making it much harder to diagnose things after the fact.
When I was training sysadmins back in the dark ages, one of the rules I taught was: know what good looks like in your logs. If you are scanning hundreds of lines of logging under duress to find a smoking gun, and you don't know the difference between what the logs normally show and what you are seeing, you'll waste a lot of time.<p>The corollary is that good-day logs should be minimal and "clean", i.e. not logging a lot, or logging nicely and predictably (which makes them easy to strip out via grep -v, etc.)
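<p>One way to make that stripping repeatable (a sketch; known_good.txt is a hypothetical file with one known-normal pattern per line that you curate on a good day):<p><pre><code> grep -v -f known_good.txt /var/log/myapp/app.log | less
</code></pre>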
> Often log lines will include a request ID.<p>Yes: always include a request ID in every request structure you create, include it in the response as well, and log it. It would seem like something obvious that everyone does by default, but no, apparently it's not so obvious.
I recently used clickhouse-local to do some log analysis on a lot of elastic load balancer logs (~10s of GBs) and it was spectacular.<p>In short, you can add clickhouse-local to a shell pipeline and then run SQL queries on the data. An example from the docs:<p><pre><code> $ ps aux | tail -n +2 | awk '{ printf("%s\t%s\n", $1, $4) }' \
     | clickhouse-local --structure "user String, mem Float64" \
           --query "SELECT user, round(sum(mem), 2) as memTotal
                    FROM table GROUP BY user ORDER BY memTotal DESC FORMAT Pretty"
</code></pre>
I wrote <a href="https://github.com/ljw1004/seaoflogs" rel="nofollow">https://github.com/ljw1004/seaoflogs</a> - an interactive filtering tool, for similar ends to what's described here. I wrote it because my team was struggling to analyze LSP logs (that's the protocol used by VSCode to communicate with language servers). But I made it general-purpose able to analyze more log formats too - for instance, we want to correlate LSP logs with server logs and other traffic logs.<p>(1) I wanted something where colleagues could easily share links in workplace chat with each other, so we could cooperatively investigate bugs.<p>(2) For LSP we're often concerned with responsiveness, and I thought the best way to indicate times when viewing a log is with whitespace gaps between log messages in proportion to their time gap.<p>(3) For LSP we have lots of interleaved activity going on, and I wanted to have visual "threads" connecting related logs.<p>(4) As the post and lnav say, interactivity is everything. I tried to take it a step further with (1) javascript, (2) playground-style updates as you type, (3) autocomplete which "learns" what fields are available from structured logs.<p>My tool runs all in the browser. (I spent effort figuring out how people can distribute it safely and use it for their own confidential logs too). It's fast enough up to about 10k lines of logs.
Some extra tips:<p><pre><code> Keep access logs, both when a service receives a request and finishes a request.
Record request duration.
Always rotate logs.
Ingest logs into a central store if possible.
Ingest exceptions into a central store if possible.
Always use UTC everywhere in infra.
Make sure all (semantic) lines in a log file contain a timestamp.
Include thread ids if it makes sense to.
It's useful to log unix timestamp alongside human readable time because it is trivially sortable.
Use head/tail to test a command before running it on a large log file.
</code></pre>
If you find yourself going to logs for time series data then it is definitely time to use a time series database. If you can't do that, at least write a `/private/stats` handler that displays in memory histograms/counters/gauges of relevant data.<p>Know the difference between stderr and stdout and how to manipulate them on the command line (2>/dev/null is invaluable, 2>&1 is useful), use them appropriately for script output.<p>Use atop, it makes debugging machine level/resource problems 10 fold easier.<p>Have a general knowledge of log files (sometimes /var/log/syslog will tell you exactly your problem, often in red colored text).<p>If you keep around a list of relevant hostnames:<p><pre><code> cat $hostname_list_file | xargs -P $parallelness -I XHOSTNAME ssh XHOSTNAME -- grep <request_id> <log_file>
</code></pre>
This needs to be used carefully and deliberately. This is the style of command that can test your backups. This style of command <i>has</i> caused multiple <i>_major_</i> outages. With it, you can find a needle in a haystack across an entire fleet of machines quickly and trivially. If you need to do more complex things, `bash -c` can be the command sent to ssh.<p>I've had an unreasonable amount of success opening up log files in vim and using vim to explore and operate on them. You can do command line actions one at a time (:!$bash_cmd), and you can trivially undo (or redo) anything to the logs. Searching and sorting, line jumping, page up/down, diffing, jumping to the top or bottom of the file, a status bar telling you how far you are into a file or how many lines it has without having to wc -l, etc.<p>Lastly, it's great to think of the command line in terms of map and reduce. `sed` is a mapping command, `grep` is a reducing command. Awk is frequently used for either mapping or reducing.
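<p>A toy illustration of that framing (made-up log format and field names): grep reduces the stream to the lines of interest, sed maps away the variable parts, and sort | uniq -c reduces again to counts:<p><pre><code> grep 'ERROR' app.log | sed 's/request_id=[0-9a-f]*/request_id=X/' | sort | uniq -c | sort -rn
</code></pre>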
Surprised to see that under the section "correlate between different systems" tracing isn't mentioned as an alternative approach. That's what tracing is: logging across different systems and getting that structure all stitched together for you.
“scroll through the log really fast”<p>This is a great point and probably underrated.<p>You may also benefit by scrolling sideways really fast.<p><a href="https://en.m.wikipedia.org/wiki/Superior_colliculus" rel="nofollow">https://en.m.wikipedia.org/wiki/Superior_colliculus</a>
Best tip I discovered: use emojis. They have colors and they are easy to spot.<p>For instance, if an API call is made, use the phone emoji; when there is a timeout, use a clock; when an order is dispatched, use a package...<p>When you have to go through a huge log file it is a life saver.
A few weeks ago I had a Windows installer that was silently failing when upgrading from an older version (installation from scratch worked without issues). And since Windows install logs aren't exactly easy to read, I was stumped, until I took an upgrade log from an older, working build, stripped all date information from both files, and compared them, checking every section that differed, until I found a line indicating that a colleague had forgotten about a limitation when dealing with MSPs (don't delete components on minor upgrades). I didn't throw any stones, though, as I've made the same mistake twice, one and two years ago...
One of my favorite tricks is to use a visual difftool.<p>Copy good log into left panel. Copy bad log into right panel. Quickly show which lines are new, which are missing and which are out of order. Obviously ignore the timestamps ;)
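<p>The same idea works from the command line too (a sketch, assuming bash process substitution and a leading timestamp that the sed expression can strip):<p><pre><code> diff <(sed 's/^[0-9T:. -]*//' good.log) <(sed 's/^[0-9T:. -]*//' bad.log)
</code></pre>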
One thing we did at my previous job was to add a "trace flag" to each account. Normally, nothing is logged about a transaction (other than that it happened), but if the trace flag is set, a lot of information is logged about the transaction. The trace flag is also propagated throughout the distributed system, so we could trace the action across the network.
One thing that's improved my log analysis is learning awk (well, actually Perl, but I think 95% of what I do with Perl I could also do with just awk). Often the most useful way to look at logs is statistically, and awk can quickly let you aggregate things like time-between-statements, counts, rates, state transition probabilities, etc. for arbitrary patterns.
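<p>For example, time-between-statements for a given pattern (a sketch, assuming an epoch-seconds timestamp in the first field; the pattern is a placeholder):<p><pre><code> awk '/connection reset/ { if (prev) print $1 - prev; prev = $1 }' app.log | sort -n | uniq -c
</code></pre>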
Also, sometimes log entries stretch across multiple lines, and the later lines won't have the identifier you are searching for. For example, Java stack traces. In that case, if you are stuck parsing unstructured logs, the simplest thing to do is to open the entire file and search for the timestamp from the first matching line.
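<p>When the identifier only appears on the first line of each record, awk can also print whole multi-line entries (a sketch, assuming each record starts with a line beginning with a YYYY- date; the request ID is a placeholder):<p><pre><code> awk '/^[0-9][0-9][0-9][0-9]-/ { keep = /some-request-id/ } keep' app.log
</code></pre>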
I feel like this is a huge anti-pattern. Use a hosted service that does all of this for you, and then you have a whole query language and can build alerts, graphs, etc. based on the results.<p>It's not super cheap, but it's 10x cheaper than wasting dev time in the terminal. (Sumo Logic and Splunk are the two I can vouch for.)
I found the histogram technique to be really helpful. Slight mod - I tend to sort reverse at the end of the pipeline (sort -rn); then |head is often more useful.<p>It's also good to have histograms by hour or day. I've hacked up scripts to do this but I should really make something better!
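<p>For per-hour buckets, a quick version (a sketch, assuming lines start with an ISO timestamp like 2023-04-01T12:34:56 and the file is already in time order):<p><pre><code> grep 'ERROR' app.log | cut -c1-13 | uniq -c
</code></pre>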
Re timing, logs like nginx access logs have their timestamp from when the request completed, not when the request came in. That's a significant difference for long duration (~10s+) requests, and matters when trying to correlate logs or metrics to a request.
I saw a tip a while back about not needing to keep adding "| grep -v stuff" and instead using something like "grep -v -e stuff -e stuff2". I remember getting it to work on Linux, but the last time I tried it on macOS I didn't have much luck.
We've added a log tailing feature into our product UI which also has a basic find/filter. It's been enormously useful for cases where something weird happens as you can immediately access the last few mins of logs.
If I had a nickel for every time I have used this pattern:<p><pre><code> ... | perl -ne 'print "$1 $2 ...\n" if /some-regex-with-capture-groups/' | ...
</code></pre>
If you are using Kubernetes, I highly recommend using <a href="https://github.com/stern/stern" rel="nofollow">https://github.com/stern/stern</a>
In addition to these, one of my favorites is a Perl one-liner that generates time deltas from a regex pattern of interest. Then I plot the deltas using gnuplot. It seriously helps to 'see' the events with timing in a chart and allows you to do a quick visual search for problem areas.
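<p>Something in that spirit (a sketch, assuming an epoch-seconds timestamp at the start of each line, a placeholder pattern, and a /dev/stdin that gnuplot can read):<p><pre><code> perl -ne 'if (/pattern of interest/ && /^(\d+)/) { print $1 - $p, "\n" if $p; $p = $1 }' app.log \
   | gnuplot -p -e "plot '/dev/stdin' with lines title 'seconds between events'"
</code></pre>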
Do we still use utilities like grep for searching logs? Or is that for when we cannot stream logs to tools like Splunk and Loggly and use their search services?