This is a handy list.<p>> 4:07pm The package install has failed as it can't resolve the repositories. Something is wrong with the /etc/apt configuration…<p>Cloud definitely has downsides, and isn't a fit for all scenarios, but in my experience it's great for situations like this. Instead of messing around trying to repair it, simply kill the machine, or take it out of the pool. Get a new one. The new machine and app likely come up clean. Incident resolves. Dig into the broken machine off the hot path.
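For example, if the box is in an EC2 Auto Scaling Group, something like this takes it out of rotation while keeping it around for the post-incident dig (the instance ID and group name are placeholders):

# move the suspect instance to standby; the ASG launches a replacement
# because desired capacity is not decremented
aws autoscaling enter-standby \
  --instance-ids i-0123456789abcdef0 \
  --auto-scaling-group-name web-asg \
  --no-should-decrement-desired-capacity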
Not all servers are containerized, but a significant number are, and they present their own challenges.<p>Unfortunately, many such tools in docker images will be flagged by automated security scanning tools in the "unnecessary tools that can aid an attacker in observing and modifying system behavior" category. Some of those (like having gdb) are valid concerns, but many are not.<p>To avoid that we keep some of these tools in a separate volume as (preferably) static binaries, or compile & install them with the mount path as the install prefix (for config files & libs). If there's a need to debug, we ask operations to mount the volume temporarily as read-only.<p>Another challenge: if a debug tool requires enabling a certain kernel feature, there are often questions/concerns about how that affects other containers running on the same host.
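One way to approximate the mount-on-demand idea with a throwaway sidecar container (the image name, container name, and tools path below are placeholders, not our actual setup):

# debug tools pre-built as static binaries under /opt/debug-tools on the host;
# mount them read-only and join the target container's pid/net namespaces
docker run --rm -it \
  --pid container:myapp \
  --net container:myapp \
  -v /opt/debug-tools:/opt/debug-tools:ro \
  busybox /bin/sh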
somewhat related: /rescue/* on every FreeBSD system since 5.2 (2004) — a single statically linked ~17MB binary combining ~150 critical tools, hardlinked under their usual names<p><a href="https://man.freebsd.org/cgi/man.cgi?rescue" rel="nofollow">https://man.freebsd.org/cgi/man.cgi?rescue</a>
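The nice part is that everything in there is statically linked, so it still runs when shared libraries are broken, e.g.:

# works even if a botched upgrade has wrecked /lib or /usr/lib
/rescue/sh
/rescue/ls /lib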
<a href="https://github.com/freebsd/freebsd-src/blob/main/rescue/rescue/Makefile">https://github.com/freebsd/freebsd-src/blob/main/rescue/resc...</a>
When I was at Netflix, Brendan and his team made sure that we had a fair set of debugging tools installed everywhere (bpftrace, bcc, working perf).<p>These were a lifesaver multiple times.
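For example, a classic bpftrace one-liner to see which processes are making the most syscalls (illustrative, not any Netflix-specific tooling):

# count syscalls by process name until Ctrl-C, then print the totals
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'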
I was surprised that `strace` wasn't on that list. That's usually one of my first go-to tools. It's so great, especially when programs return useless or wrong error messages.
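For the useless-error-message case, attaching to the process usually reveals the failing syscall and the real errno (the PID is a placeholder, and the %file syscall class needs a reasonably recent strace):

# follow children, print timestamps, attach to a running process
strace -f -tt -p 1234

# or trace only file-related syscalls of a fresh run
strace -f -e trace=%file ./myprog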
I always cover such tools when I interview people for SRE-type positions. Not so much about which specific commands the candidate can recall (although it always impresses when somebody teaches me about a new tool) but what's possible, what sort of tools are available and how you use them: that you <i>can</i> capture and analyze network traffic, syscalls, execution profiles and examine OS and hardware state.
In such a crisis, if installing tools is impossible, you can run many utils via Docker, such as:<p>Build a container with a one-liner:

docker build -t tcpdump - <<EOF
FROM ubuntu
RUN apt-get update && apt-get install -y tcpdump
CMD tcpdump -i eth0
EOF

Run attached to the host network:

docker run -dP --net=host moremagic/docker-netstat

Run system tools attached to read host processes:

for sysstat_tool in iostat sar vmstat mpstat pidstat; do
  alias "sysstat-${sysstat_tool}=docker run --rm -it -v /proc:/proc --privileged --net host --pid host ghcr.io/krishjainx/sysstat-docker:main /usr/bin/${sysstat_tool}"
done
unset -v sysstat_tool

Sure, yum install is preferred, but as long as Docker is available this is a viable alternative if you can manage the extra mapping needed. It probably wouldn't work with a rootless/podman setup.
Would these tools still be useful in a cloud environment, such as EC2?<p>Most dev teams I work with are actively reducing their actual managed servers, replacing them with either Lambda or Docker images running in K8s. I wonder if these tools are still useful for containers and serverless?
The list is great, but only for classical server workloads.<p>Usually not even a shell is available in modern Kubernetes deployments that take a security-first approach, with chiseled containers.<p>And by creating a debugging image, not only is the execution environment being changed, but deploying it might also require disabling security policies that enforce image scans.
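One middle ground, where cluster policy allows it, is an ephemeral debug container instead of a rebuilt image (the pod and container names here are placeholders):

# attach a throwaway busybox shell to a running pod, sharing the target
# container's process namespace so you can inspect it
kubectl debug -it mypod --image=busybox --target=app

It still changes what's running in the pod, though, so it tends to face the same policy questions.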
I use zfsbootmenu with hrmpf (<a href="https://github.com/leahneukirchen/hrmpf">https://github.com/leahneukirchen/hrmpf</a>). You can see the list of packages here (<a href="https://github.com/leahneukirchen/hrmpf/blob/master/hrmpf.packages">https://github.com/leahneukirchen/hrmpf/blob/master/hrmpf.pa...</a>). I usually build images based off this so the tools are all there; otherwise you'll need to ssh into zfsbootmenu and load the 2 GB separate distro. This is for a home server, though if I had a startup I'd probably set up a "cloud setup" and throw a bunch of servers somewhere. A lot of the time, for internal projects and even non-production client research, having your own cluster is a lot cheaper and easier than paying for a cloud provider. It also gets around the cases where you can't run k8s and need bare metal. I've advised some clients on this setup, with contingencies in case of catastrophic failure and, more importantly, testing those contingencies, but this is more so you don't have developers sitting idle than to prevent overnight outages. It's a lot cheaper than cloud solutions for non-critical projects, and while larger companies will look at the numbers closely, the advantage of a startup is that if something happened and devs couldn't work for an hour, they'd find a way to be productive locally, or you'd simply have them take the afternoon off (neither has happened).<p>I imagine the problems described happen on big-iron-type hardware clusters that are extremely expensive and where spare capacity isn't possible. I might be wrong, but especially with (sigh) AI setups with extremely expensive $30k GPUs and crazy bandwidth between planes that you buy from IBM for crazy prices (a hardware vendor on the line so quickly was a hint), you're way past the commodity-server cloud model. I have no idea what could go wrong with such equipment, where nearly every piece of hardware is close to custom built, but I'm glad I don't have to deal with it. Debugging that kind of hardware, which only a few huge pharma or research companies use, has to come down to really strange things.
Related to that, I recently learned about safe-rm, which lets you configure files and directories that can't be deleted.<p>This probably would have prevented a stressful incident 3 weeks ago.
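A minimal sketch, assuming the usual packaging where protected paths are listed one per line in /etc/safe-rm.conf and safe-rm is installed in place of rm (the path here is just an example):

# protect a directory, then the deletion gets refused instead of executed
echo "/var/lib/postgresql" | sudo tee -a /etc/safe-rm.conf
sudo rm -rf /var/lib/postgresql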
tmux, statically linked (musl) busybox with everything, lsof, ltrace/strace and a few more.
Under OpenBSD this is not an issue as you have systat and friends in base.
> and...permission errors. What!? I'm root, this makes no sense.<p>This is one of the reasons why I fight back as hard as I can against any "security" measures that restrict what root can do.