Some extra tips:<p><pre><code> Keep access logs, both when a service receives a request and finishes a request.
Record request duration.
Always rotate logs.
Ingest logs into a central store if possible.
Ingest exceptions into a central store if possible.
Always use UTC everywhere in infra.
Make sure all (semantic) lines in a log file contain a timestamp.
Include thread ids if it makes sense to.
It's useful to log a Unix timestamp alongside the human-readable time because it is trivially sortable.
Use head/tail to test a command before running it on a large log file.
</code></pre>
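Several of the tips above (UTC everywhere, a timestamp on every line, a Unix timestamp next to the human-readable one, an access-log line on receipt and on completion, request duration) can be sketched in shell. The function names `log_line` and `handle_request` and the `request_id=`/`event=` fields are made up for illustration, and `date +%s` is assumed to be available (GNU and BSD `date` support it; POSIX does not require it).

```shell
# Hypothetical helper: emit one log line prefixed with epoch seconds
# (trivially sortable) and an ISO-8601 UTC time (human readable).
log_line() {
    # One `date` call, so both timestamps describe the same instant.
    printf '%s %s\n' "$(date -u '+%s %Y-%m-%dT%H:%M:%SZ')" "$*"
}

# Hypothetical request wrapper: one access-log line when the request is
# received, one when it finishes, with the duration in whole seconds.
handle_request() {
    hr_request_id=$1
    hr_start=$(date +%s)
    log_line "request_id=$hr_request_id event=received"
    : # the real work would happen here
    hr_end=$(date +%s)
    log_line "request_id=$hr_request_id event=finished duration_s=$((hr_end - hr_start))"
}

handle_request req-1
```

Because the epoch comes first on every line, `sort -n` on logs merged from several hosts puts them in chronological order.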
If you find yourself going to logs for time series data, it is definitely time to use a time series database. If you can't do that, at least write a `/private/stats` handler that displays in-memory histograms/counters/gauges of relevant data.<p>Know the difference between stderr and stdout and how to manipulate them on the command line (2>/dev/null is invaluable, 2>&1 is useful), and use them appropriately for script output.<p>Use atop; it makes debugging machine-level/resource problems tenfold easier.<p>Have a general knowledge of the system log files (sometimes /var/log/syslog will tell you exactly what your problem is, often in red text).<p>If you keep around a list of relevant hostnames:<p><pre><code> cat $hostname_list_file | xargs -P $parallelness -I XHOSTNAME ssh XHOSTNAME -- grep <request_id> <log_file>
</code></pre>
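Since a fleet-wide fan-out is risky, one way to rehearse it is a dry run: put `echo` in front of `ssh` so xargs prints the commands it would run, and feed it only the first couple of hosts with `head`. The hostnames, request id, and log path below are made-up placeholders, and `grep -F` (fixed-string match) is an addition here, not part of the command above.

```shell
# Made-up inputs for illustration only.
hosts_file=$(mktemp)
printf '%s\n' web1.example.com web2.example.com web3.example.com > "$hosts_file"
request_id='req-12345'
log_file='/var/log/myservice/access.log'

# Dry run: `echo` makes xargs print each ssh command instead of running it,
# and `head -n 2` limits the rehearsal to the first two hosts.
preview=$(head -n 2 "$hosts_file" \
    | xargs -P 1 -I XHOSTNAME echo ssh XHOSTNAME -- grep -F -- "$request_id" "$log_file")
printf '%s\n' "$preview"
# prints:
# ssh web1.example.com -- grep -F -- req-12345 /var/log/myservice/access.log
# ssh web2.example.com -- grep -F -- req-12345 /var/log/myservice/access.log

rm -f "$hosts_file"
```

Dropping the `echo` and the `head` turns the preview into the real fleet-wide command.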
This needs to be used carefully and deliberately. This is the style of command that can test your backups. This style of command <i>has</i> caused multiple <i>_major_</i> outages. With it, you can find a needle in a haystack across an entire fleet of machines quickly and trivially. If you need to do more complex things, `bash -c` can be the command sent to ssh.<p>I've had an unreasonable amount of success opening log files in vim and using vim to explore and operate on them. You can run command-line actions one at a time (:!$bash_cmd), and you can trivially undo (or redo) anything you do to the logs. You also get things like searching and sorting, diffing, line jumping, page up/down, jumping to the top or bottom of the file, and a status bar that tells you how far you are into a file or how many lines it has without having to run wc -l.<p>Lastly, it's great to think of the command line in terms of map and reduce. `sed` is a mapping command; `grep` is a reducing command. Awk is frequently used for either mapping or reducing.
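<p>To make the map/reduce framing concrete, here is a small sketch on a made-up access log (the file contents and field layout are assumptions): `grep` narrows the lines, `sed` maps each remaining line to just its duration, and `awk` reduces the durations to a count and a sum. `head` is used first to check the early stages on a few lines before running over the whole file.

```shell
# Made-up sample log: epoch, UTC time, request id, event, optional duration.
log=$(mktemp)
cat > "$log" <<'EOF'
1700000000 2023-11-14T22:13:20Z request_id=a event=received
1700000002 2023-11-14T22:13:22Z request_id=a event=finished duration_s=2
1700000003 2023-11-14T22:13:23Z request_id=b event=received
1700000008 2023-11-14T22:13:28Z request_id=b event=finished duration_s=5
1700000009 2023-11-14T22:13:29Z request_id=c event=finished duration_s=3
EOF

# Try the grep + sed stages on the first three lines only.
head -n 3 "$log" | grep 'event=finished' | sed 's/.*duration_s=//'
# prints: 2

# Full pipeline: narrow (grep), map (sed), reduce (awk).
summary=$(grep 'event=finished' "$log" \
    | sed 's/.*duration_s=//' \
    | awk '{ n++; total += $1 } END { printf "count=%d total_s=%d\n", n, total }')
echo "$summary"
# prints: count=3 total_s=10

rm -f "$log"
```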