I dropped the production database thinking I was connected to the development environment. I actually remember thinking "Stupid MySQL, stop asking me to confirm, I know what I'm doing!" We had to restore from the nightly backup and the entire company lost a day's work.<p>At another company I took a minor marketing landing page offline when a massive burst of traffic came in (the page made some expensive API calls, so I assumed it was a DDoS attack). Turns out it was legitimate traffic: the marketing team had done a big ad spend to generate more leads and hadn't told me. All those expensive leads got 404s.
I updated an SSL certificate on a server used by a travel organisation. A few hours later we got an urgent call from the company saying their iOS app was failing in embarrassing ways. After investigations involving numerous parties all blaming each other, it turned out that the iOS app developers, for reasons known only to themselves, had hard-coded the old certificate's serial number into their verification code. We fixed the immediate problem by reverting to the old certificate which, luckily, was still valid for a few more weeks.
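For illustration only, a minimal Go sketch (hostnames and pinned values are hypothetical) of why pinning a certificate's serial number breaks on every renewal, whereas pinning the public-key (SPKI) hash survives reissues as long as the key pair is reused:

    // Hypothetical sketch: pinning a cert's serial number vs. pinning its
    // public-key (SPKI) hash. Serial numbers change on every reissue; the
    // SPKI hash only changes if the key pair itself changes.
    package main

    import (
        "crypto/sha256"
        "crypto/tls"
        "crypto/x509"
        "encoding/hex"
        "fmt"
    )

    // serialPinned is the fragile check: it breaks the moment the cert is renewed.
    func serialPinned(cert *x509.Certificate, pinnedSerial string) bool {
        return cert.SerialNumber.String() == pinnedSerial
    }

    // spkiPinned compares a SHA-256 of the SubjectPublicKeyInfo instead,
    // which survives renewals that reuse the same key pair.
    func spkiPinned(cert *x509.Certificate, pinnedHexHash string) bool {
        sum := sha256.Sum256(cert.RawSubjectPublicKeyInfo)
        return hex.EncodeToString(sum[:]) == pinnedHexHash
    }

    func main() {
        conn, err := tls.Dial("tcp", "example.com:443", &tls.Config{})
        if err != nil {
            panic(err)
        }
        defer conn.Close()
        leaf := conn.ConnectionState().PeerCertificates[0]
        fmt.Println("serial pin matches:", serialPinned(leaf, "123456789"))
        fmt.Println("spki pin matches:  ", spkiPinned(leaf, "expected-spki-sha256-hex"))
    }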
I've been involved in putting out a lot of production fires but, disappointingly, don't have many (or any) big ones that I recall causing. There are a few that I allowed to happen under shared team watch, e.g. rolling over 31- or 32-bit ids.<p>Maybe some database indexing changes that performed a lot worse for lots of users and had to be reverted. Certainly deploying some protocol incompatibilities, either inadvertently or out of sequence.<p>One surprising one was using a composite primary key for a miscellaneous table, then realizing that some downstream Go service was getting { "id": [1, 2], ... } from the upstream Ruby one (a sketch of this follows below). We needed to validate schemas on write rather than waiting for consumers to fail to parse them.<p>Disaster recovery stories are much more interesting, like Hollywood blockbusters. One of my faves is un-f*ing an OS/2 HPFS partition on the west coast over the phone using DOS Norton Utilities 'nu'. Luckily the client was IBM and they had lots of identically configured machines, so we just blasted the central drive shape definitions (in specific sectors at the start and middle of the drive) over from a neighbouring machine and ran checkdisk with the "recover anything that looks like a valid HPFS structure" option.
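Here's a minimal Go sketch of that composite-key surprise, assuming the upstream Ruby service started serialising the two-column primary key as a JSON array (field names are hypothetical):

    package main

    import (
        "encoding/json"
        "fmt"
    )

    // Record is what the downstream Go service expected: a scalar id.
    type Record struct {
        ID   int64  `json:"id"`
        Name string `json:"name"`
    }

    func main() {
        // What the upstream actually sent once the table used a composite key.
        payload := []byte(`{"id": [1, 2], "name": "example"}`)

        var rec Record
        if err := json.Unmarshal(payload, &rec); err != nil {
            // Only fails here, at read time, with "cannot unmarshal array into
            // Go struct field". Validating the payload against a schema at
            // write time would have surfaced the mismatch much earlier.
            fmt.Println("parse error:", err)
        }
    }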
In Windows' Hyper-V hypervisor management control panel for virtual machines, you can connect to other hosts and manage them as if you were on that hypervisor. It's not glaringly obvious that you're working on another machine if you're in a rush. Someone had named the test database server the exact same thing as the production database server in the console, and I was tasked with deleting the test database server as it was no longer needed. We were using shared admin credentials at the time so the session was still up on the test hypervisor. I right-clicked it, chose delete, and promptly deleted the production database server.
I assumed a ship's rudder could only turn 90 degrees in the autopilot I worked on. Watching a travel channel piece about a new cruise ship, I heard them discuss its new jets that rotate 180 degrees. They said they had to go back to port to deal with an issue with how far the jets could rotate.<p>I wasn't on that team when that issue came in. But lo and behold, a senior dev told me "just so you know, you can't assume a rudder only rotates 90 degrees". As he told me the story, I put 2+2 together that it was the same cruise ship I'd watched on TV.<p>Luckily, you always have manual failovers, and it was an easy fix. But it did at least put some egg on my face :).
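For what it's worth, a purely hypothetical Go sketch of how that kind of baked-in limit bites: clamping commands to ±90 degrees silently caps a jet that can actually swing further.

    package main

    import "fmt"

    // maxRudderDeg is the assumption that turned out to be wrong.
    const maxRudderDeg = 90.0

    // clampRudder limits a commanded angle to what the autopilot "knows" is possible.
    func clampRudder(cmdDeg float64) float64 {
        if cmdDeg > maxRudderDeg {
            return maxRudderDeg
        }
        if cmdDeg < -maxRudderDeg {
            return -maxRudderDeg
        }
        return cmdDeg
    }

    func main() {
        // A jet ordered to 150 degrees quietly gets 90 instead.
        fmt.Println(clampRudder(150))
    }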
Not exactly me, but I authorised it: some stupid 3rd-party app install that ended up draining the daily API allowance and nearly brought the system to a standstill... not fun at all.