This is a handy list.<p>> 4:07pm The package install has failed as it can't resolve the repositories. Something is wrong with the /etc/apt configuration…<p>Cloud definitely has downsides, and isn't a fit for all scenarios, but in my experience it's great for situations like this. Instead of messing around trying to repair it, simply kill the machine, or take it out of the pool. Get a new one. The new machine and app likely come up clean. Incident resolves. Dig into the broken machine off the hot path.
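For example, if the box is in an EC2 Auto Scaling Group, something like this takes it out of rotation while keeping it around for the post-incident dig (the instance ID and group name are placeholders):

# move the suspect instance to standby; the ASG launches a replacement
# because desired capacity is not decremented
aws autoscaling enter-standby \
  --instance-ids i-0123456789abcdef0 \
  --auto-scaling-group-name web-asg \
  --no-should-decrement-desired-capacity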
Not all servers are containerized, but a significant number are, and they present their own challenges.<p>Unfortunately, many such tools in docker images will be flagged by automated security scanning tools in the "unnecessary tools that can aid an attacker in observing and modifying system behavior" category. Some of those (like having gdb) are valid concerns, but many are not.<p>To avoid that we keep some of these tools in a separate volume as (preferably) static binaries, or compile & install them with the mount path as the install prefix (for config files & libs). If there's a need to debug, we ask operations to mount the volume temporarily as read-only.<p>Another challenge: if a debug tool requires enabling a certain kernel feature, there are often questions/concerns about how that affects other containers running on the same host.
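One way to approximate the mount-on-demand idea with a throwaway sidecar container (the image name, container name, and tools path below are placeholders, not our actual setup):

# debug tools pre-built as static binaries under /opt/debug-tools on the host;
# mount them read-only and join the target container's pid/net namespaces
docker run --rm -it \
  --pid container:myapp \
  --net container:myapp \
  -v /opt/debug-tools:/opt/debug-tools:ro \
  busybox /bin/sh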
somewhat related: /rescue/* on every FreeBSD system since 5.2 (2004) — a single statically linked ~17MB binary combining ~150 critical tools, hardlinked under their usual names<p><a href="https://man.freebsd.org/cgi/man.cgi?rescue" rel="nofollow">https://man.freebsd.org/cgi/man.cgi?rescue</a>
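The nice part is that everything in there is statically linked, so it still runs when shared libraries are broken, e.g.:

# works even if a botched upgrade has wrecked /lib or /usr/lib
/rescue/sh
/rescue/ls /lib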
<a href="https://github.com/freebsd/freebsd-src/blob/main/rescue/rescue/Makefile">https://github.com/freebsd/freebsd-src/blob/main/rescue/resc...</a>
When I was at Netflix, Brendan and his team made sure that we had a fair set of debugging tools installed everywhere (bpftrace, bcc, working perf).<p>These were a lifesaver multiple times.
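For example, a classic bpftrace one-liner to see which processes are making the most syscalls (illustrative, not any Netflix-specific tooling):

# count syscalls by process name until Ctrl-C, then print the totals
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'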
I was surprised that `strace` wasn't on that list. That's usually one of my first go-to tools. It's so great, especially when programs return useless or wrong error messages.
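For the useless-error-message case, attaching to the process usually reveals the failing syscall and the real errno (the PID is a placeholder, and the %file syscall class needs a reasonably recent strace):

# follow children, print timestamps, attach to a running process
strace -f -tt -p 1234

# or trace only file-related syscalls of a fresh run
strace -f -e trace=%file ./myprog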
I always cover such tools when I interview people for SRE-type positions. Not so much about which specific commands the candidate can recall (although it always impresses when somebody teaches me about a new tool) but what's possible, what sort of tools are available and how you use them: that you <i>can</i> capture and analyze network traffic, syscalls, execution profiles and examine OS and hardware state.
In such a crisis, if installing tools is impossible, you can run many utils via Docker, such as:<p>Build a container with a one-liner:

docker build -t tcpdump - <<EOF
FROM ubuntu
RUN apt-get update && apt-get install -y tcpdump
CMD tcpdump -i eth0
EOF

Run attached to the host network:

docker run -dP --net=host moremagic/docker-netstat

Run system tools attached to read host processes:

for sysstat_tool in iostat sar vmstat mpstat pidstat; do
  alias "sysstat-${sysstat_tool}=docker run --rm -it -v /proc:/proc --privileged --net host --pid host ghcr.io/krishjainx/sysstat-docker:main /usr/bin/${sysstat_tool}"
done
unset -v sysstat_tool

Sure, yum install is preferred, but as long as Docker is available this is a viable alternative if you can manage the extra mapping needed. It probably wouldn't work with a rootless/podman setup.
Would these tools still be useful in a cloud environment, such as EC2?<p>Most dev teams I work with are actively reducing their actual managed servers, replacing them with either Lambda or Docker images running in K8s. I wonder if these tools are still useful for containers and serverless?
The list is great, but only for classical server workloads.<p>Usually not even a shell is available in modern Kubernetes deployments that take a security-first approach, with chiseled containers.<p>And by creating a debugging image, not only is the execution environment being changed, but deploying it might also require disabling security policies that enforce image scans.
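One middle ground, where cluster policy allows it, is an ephemeral debug container instead of a rebuilt image (the pod and container names here are placeholders):

# attach a throwaway busybox shell to a running pod, sharing the target
# container's process namespace so you can inspect it
kubectl debug -it mypod --image=busybox --target=app

It still changes what's running in the pod, though, so it tends to face the same policy questions.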
I use zfsbootmenu with hrmpf (<a href="https://github.com/leahneukirchen/hrmpf">https://github.com/leahneukirchen/hrmpf</a>). You can see the list of packages here (<a href="https://github.com/leahneukirchen/hrmpf/blob/master/hrmpf.packages">https://github.com/leahneukirchen/hrmpf/blob/master/hrmpf.pa...</a>). I usually build images based off this so the tools are all there; otherwise you'll need to ssh into zfsbootmenu and load the 2 GB separate distro. This is for a home server, though if I had a startup I'd probably set up a "cloud setup" and throw a bunch of servers somewhere. A lot of the time, for internal projects and even non-production client research, having your own cluster is a lot cheaper and easier than paying for a cloud provider. It also gets around the cases where you can't run k8s and need bare metal. I've advised some clients on this setup, with contingencies in case of catastrophic failure and, more importantly, testing those contingencies, but this is more so you don't have developers sitting idle than to prevent overnight outages. It's a lot cheaper than cloud solutions for non-critical projects, and while larger companies will look at the numbers closely, the advantage of a startup is that if something happened and devs couldn't work for an hour, they'd find a way to be productive locally, or you'd simply have them take the afternoon off (neither has happened).<p>I imagine the problems described happen on big-iron-type hardware clusters that are extremely expensive and where spare capacity isn't possible. I might be wrong, but especially with (sigh) AI setups with extremely expensive $30k GPUs and crazy bandwidth between planes that you buy from IBM for crazy prices (a hardware vendor on the line so quickly was a hint), you're way past the commodity-server cloud model. I have no idea what could go wrong with such equipment, where nearly every piece of hardware is close to custom built, but I'm glad I don't have to deal with it. Debugging that kind of hardware, which only a few huge pharma or research companies use, has to come down to really strange things.
Related to that, I recently learned about safe-rm, which lets you configure files and directories that can't be deleted.<p>This probably would have prevented a stressful incident 3 weeks ago.
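A minimal sketch, assuming the usual packaging where protected paths are listed one per line in /etc/safe-rm.conf and safe-rm is installed in place of rm (the path here is just an example):

# protect a directory, then the deletion gets refused instead of executed
echo "/var/lib/postgresql" | sudo tee -a /etc/safe-rm.conf
sudo rm -rf /var/lib/postgresql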
tmux, statically linked (musl) busybox with everything, lsof, ltrace/strace and a few more.
Under OpenBSD this is not an issue as you have systat and friends in base.
> and...permission errors. What!? I'm root, this makes no sense.<p>This is one of the reasons why I fight back as hard as I can against any "security" measures that restrict what root can do.