I was working on an old old old "ERP" system written in D3 PICK. It's a database, programming language and OS all in one with roots in tracking military helicopter parts in the 1960s. I was working on it in the mid-2000s.<p>It had SQL-like syntax for manipulating data, but it was interactive. So you would SELECT the rows from the table that you wanted, then those rows would be part of your state. You would then do UPDATE or DELETE without any kind of WHERE, because the state had your filter from the previous SELECT.<p>It has a fun quirk though - if your SELECT matched no rows, the state would be empty. So SELECT foo WHERE 1=2 would select nothing.<p>UPDATE and DELETE are perfectly valid actions even without a state...<p>Working late one night, I ran a SELECT STKM WHERE something that matched nothing, then before I realised my state had no rows matched, I followed up with DELETE STKM.<p>Yep, the entire Stock Movements table for the last 20+ years of business was gone.<p>The nightly backup had not run, and I didn't want to lose an entire day of processing to roll back to the previous night.<p>I spent the entire night writing a program to recreate that data based on invoices, purchase orders, stocktake data, etc. I was able to recreate every record and got home about 9am. Lots of lessons learnt that night.
I wrap all of my production manipulations in a transaction, and commit only if the results are expected. Yes, it may take locks that block customer-facing transactions, so I have selects ready to go in the transaction to minimise this.<p>15 years and counting since wiping out a large production table and taking a day to restore from backup.
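For anyone who wants the "commit only if the results are expected" habit in script form, here is a minimal sketch using Python's built-in sqlite3 module. The `run_guarded` helper and the `orders` table are made up for illustration, and a real production database will have its own driver and locking behaviour, so treat this as the shape of the idea rather than a drop-in tool.

```python
import sqlite3


def run_guarded(conn, sql, params, expected_rows):
    """Execute one data-changing statement; commit only if the row count matches.

    Returns True if the change was committed, False if it was rolled back.
    """
    cur = conn.cursor()
    try:
        # sqlite3 implicitly opens a transaction before UPDATE/DELETE/INSERT.
        cur.execute(sql, params)
        if cur.rowcount == expected_rows:
            conn.commit()
            return True
        conn.rollback()  # touched more (or fewer) rows than planned: undo it
        return False
    except Exception:
        conn.rollback()
        raise


# Hypothetical table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO orders (status) VALUES (?)",
    [("open",), ("open",), ("closed",)],
)
conn.commit()

# WHERE clause forgotten: 3 rows are touched, 1 was expected, so nothing is committed.
committed = run_guarded(conn, "UPDATE orders SET status = 'void'", (), expected_rows=1)
```

The check here is only a row count. In practice you would probably also run a SELECT inside the same transaction and eyeball the result before committing.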
"All of your colleagues have done something dumb. Don't be afraid to tell us when you make a mistake. We all remember our first screw up and will be happy to help."<p>Never have truer words been spoken.<p>As I tell all the new juniors at work doing sysadmin type tasks, everyone has deleted the production database at least once. Mistakes will always happen, it's how you deal with them that defines how good you are at the end of the day.
I did something worse many years ago: I was working for a regional ISP and during a major incident, I had to reroute traffic through a different path. Under big pressure, I made the infamous Cisco mistake of typing "switchport trunk allowed vlan 50" instead of "switchport trunk allowed vlan add 50", and I locked out myself and all of our broadband customers. We had to call a DC technician and ask him to share a console through a local console server.<p>Lesson learned: even if you are under big pressure, take your time to plan and review the modification
15 minutes can save hours.
Early on when I'd first started making the transition from pure developer role working only on product to a platform role running the development environment, I was encountering problems with build scripts on CI servers leaving behind a bunch of dead symlinks. Tired of tracking them down manually, I wrote a nice script that automatically found all dead symlinks and deleted them.<p>It turned out, for some arcane reason I still don't understand, our production instance of Artifactory was running on top of Docker Compose with host path volume mounts, and somehow, symlinks that were not valid from the perspective of the host actually were valid from inside the container, and doing this on all of our servers broke Artifactory. For some even stupider reason, we weren't doing full filesystem-level snapshots at any regular interval (which we started doing after this), so instead I needed to enlist the help of the classic wizard ninja guy who had been acting as a mostly unsupervised one-man team for the past six years who had hacked all of this mess together, documented none of it, and was the only person on the planet who knew how to reconstruct everything.<p>This was probably still only the second-stupidest full on-prem lab outage I remember, behind the time the whole network stopped working and the only person who had been around long enough remembered they had trialed a temporary demo hardware firewall years earlier, management abandoned the evaluation effort, and it somehow remained there as the production firewall for years without ever being updated before finally breaking.
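A cleanup script like the one described can at least default to a dry run. The sketch below is not the original script. The function names are invented, and "dead" here only means "does not resolve from this host's point of view", which is exactly the assumption that failed with the container volume mounts.

```python
import os


def find_dead_symlinks(root):
    """Return sorted paths under root that are symlinks whose target does not resolve."""
    dead = []
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() checks the link itself; exists() follows it and is
            # False when the target is missing on this host.
            if os.path.islink(path) and not os.path.exists(path):
                dead.append(path)
    return sorted(dead)


def remove_dead_symlinks(root, dry_run=True):
    """List dead symlinks under root; delete them only when dry_run is False."""
    dead = find_dead_symlinks(root)
    if not dry_run:
        for path in dead:
            os.unlink(path)  # removes the link only, never the target
    return dead
```

Calling `remove_dead_symlinks("/some/path")` only reports candidates. Deleting requires passing `dry_run=False` explicitly, which leaves room to review the list first.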
A few months into my first job out of college, I brought down the main production server in the middle of the workday. It took us about an hour to recover. Afterward, I was very embarrassed and apologetic, but my boss just shrugged and said:<p>"You're not a real technology worker until you've brought the company down. Welcome."<p>Might not be the best words to live by, but it was exactly what I needed to hear at that time early in my career.
Back in the 90s I remember a work colleague asking 'can you rollback a drop table?', to which I replied 'no', and all the blood drained from his face in seconds. It's one of those things you've heard happens to people, but until you see it, you can't quite believe it.
My chosen database explorer is DBeaver. Horrible name but great app.<p>You can set colours for local/test/prod servers and a red colored tab will scream at you to be cautious. And with red color every edit will pop up an "are you sure?" question. And autocommit is off.<p>I sorta stopped making unrecoverable mistakes.
My manager had got me to look at backups... but for cheap.<p>I decided on Bacula - I had the clients installed on all the computers in the office, and it worked for some small tests.<p>My manager decided we would try this with a USB drive attached to one of the servers (somehow this didn't seem like a bad idea).<p>In the morning, very uncaffeinated, he sent me to the other site - an unmanned basement office with the servers.<p>Being uncaffeinated I forgot the door password and set off the alarm.<p>I had to go into the office and phone him with the alarm going to get the code to turn the alarm off.<p>OK, that was stressful but sorted out at least.<p>I plugged in the hard drive to the selected server and headed back.<p>Once I got back it turned out all the websites on that server had gone down - trying to send all the backups to this poor USB hard drive had overwhelmed the IO on that-era Linux server and the poor thing just froze.<p>Fairly soon after I was let go, and joined my friends at a much more fun company making mobile games.
At my first sysadmin job, at a call center, the call center reps used the same directory for all the users. And I'm working tickets to delete old user accounts...<p>The old grey-haired sysadmin backs up the directory so he can instantly restore it. Seems this happens all the time.<p>Whew.
The other learning I get from this story is "never hide your errors" or "own your errors as you own your victories".<p>If the author had decided to say nothing the problem would have been bigger - an unhappy boss and probably fired.
One day my colleague was wondering whether RegEdit used some private API for renaming keys or just copied + deleted them. "Try a rename with a big tree, see if it's still instantaneous" I helpfully proposed. But what's a really big tree? How about System\Windows? The rename completed instantaneously - "told you it's a private API" I said happily, just as the machine crashed in twenty ways at once.
I did worse.<p>We had a redis with sessions. Early on, someone decided every write to redis should also cause a write to S3 as backup. My first task was to get rid of this 4-digits a month extra cost in PUT requests. I decided to instead write all changed session objects into a set keyed by half-hour timestamps and then write only those sessions every 30 minutes. Unfortunately initially I used a KEYS to find the set corresponding to my half-hour stamp, not having read up exactly on what it does. It's not exactly advisable to do on a redis with a million or so objects. A later version of the archiver wrote the last emptied set to a stable key instead and then checked the set keys between then and now instead...
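The "check the set keys between then and now" approach can be sketched without touching Redis at all, since the key names are computable from timestamps. The key prefix and function names below are assumptions for illustration, not the original archiver. The point is that nothing needs to be discovered with KEYS, which walks the whole keyspace and blocks the server while it does.

```python
from datetime import datetime, timedelta

BUCKET = timedelta(minutes=30)


def bucket_start(ts):
    """Floor a datetime to the start of its half-hour bucket."""
    return ts.replace(minute=0 if ts.minute < 30 else 30, second=0, microsecond=0)


def bucket_key(ts):
    """Name of the set holding sessions changed in ts's bucket (hypothetical naming)."""
    return "changed-sessions:" + bucket_start(ts).strftime("%Y%m%dT%H%M")


def keys_to_archive(last_archived, now):
    """Keys of every completed bucket after last_archived's bucket.

    The bucket containing `now` is skipped because it is still being written to.
    """
    keys = []
    cursor = bucket_start(last_archived) + BUCKET
    current = bucket_start(now)
    while cursor < current:
        keys.append(bucket_key(cursor))
        cursor += BUCKET
    return keys
```

Each returned key can then be read and emptied directly. If key discovery really is needed, Redis also offers the incremental SCAN command as the usual alternative to KEYS.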
I was taught to:<p>1. Write your WHERE clause first
2. Return to the beginning of the line to finish writing the statement
3. Check your statement
4. If it looks good, then -- and only then -- add your closing semicolon<p>Having said that, once during my second week at a new company, I plugged in an ethernet cable to an APC UPS, so I could set up networking on it. It shut down production. Why? APC makes (for that model at least) proprietary ethernet cables for networking, and if you plug in a regular cable it does an autoshutdown...an engineer's attempt at marketing perhaps!? I did RTFM before, and after out of confusion, and there was no mention of this.
For some reason your title made me think of this classic IT Crowd<p><a href="https://youtu.be/Vywf48Dhyns" rel="nofollow noreferrer">https://youtu.be/Vywf48Dhyns</a>
I wish a DELETE or UPDATE only affected a single row by default (and perhaps even wouldn't commit if it would hit multiple rows), unless a keyword for MANY or something similar was added.<p>Aka DELETE ALL where x == y or DELETE MANY where x == y or perhaps you need an explicit limit for it to not be 1, so DELETE where x == y LIMIT ALL
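No SQL dialect I know of has that keyword, but the behaviour can be approximated in application code. This is a rough sketch with Python's sqlite3 module. `execute_limited`, `TooManyRowsError` and the `many` flag are invented names standing in for the proposed MANY keyword.

```python
import sqlite3


class TooManyRowsError(Exception):
    """Raised when a statement touches more rows than the caller allowed."""


def execute_limited(conn, sql, params=(), many=False):
    """Run a DELETE or UPDATE that may touch at most one row unless many=True.

    Returns the number of rows affected. If the limit is exceeded, the
    statement is rolled back and nothing is committed.
    """
    cur = conn.execute(sql, params)  # sqlite3 opens a transaction implicitly
    if not many and cur.rowcount > 1:
        conn.rollback()
        raise TooManyRowsError(
            f"{cur.rowcount} rows matched; pass many=True to allow this"
        )
    conn.commit()
    return cur.rowcount
```

With this, `execute_limited(conn, "DELETE FROM t WHERE x = ?", (1,))` fails loudly if more than one row matches, and the multi-row case has to be spelled out with `many=True`.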
A charming story which almost everyone can relate to! Only one of your rules will ever save you. “Don't run updates directly in the database console”. Whole methodologies are crafted around this rule/principle to not do development in production environment.
In the top bar of the site:<p>> Sorry! Subscriptions were broken last week, but are now working. If you tried to subscribe and ran into issues, please try again!<p>I wonder if a similar incident involving an “UPDATE subscriptions” query happened recently.
Decades ago my ex colleague was supposed to enter a command handwritten on a piece of paper saying "rm -rf /var/log/blah/*" which she typed in as "rm -rf /var / log/blah / *". Everyone knows it's awesome to insert white space to increase legibility. It was a production database server.
Roughly 10 years ago, I was working for a startup that offered a live conversational video service where you could also have hundreds (or eventually, thousands) of near-live watchers - with recording and later playback. The founder pitched the service to news orgs and celebrities. Anderson Cooper had a regular "show" there for a while, and we had a number of interviews with mostly 2nd-tier celebrities.<p>When the service started, they made the decision to not actually delete any content (delete just set a flag which disabled the content but didn't actually remove it).<p>Fast forward a year or so, and it became clear that a real delete was needed. So they had a junior engineer write up a sort of delayed sweep - delete all the videos with the delete flag set. But then, for some reason, they decided to put the implementation behind a delay. Something like "actually delete all soft-deleted videos, but don't start doing it until 30 days from now". However, unbeknownst to the team, there was a bug in the implementation that deleted everything, regardless of whether the 'delete' flag was set.<p>So one night, roughly a month later, all the content started disappearing from the site. One guy heroically tried to stop the process, but I think he was too late. The engineering director happened to be on a vacation down in South America somewhere and I think the founder fired him in a fit of pique. I managed to reclaim a small bit of content (basically the videos that were cached on the actual recording servers before they were uploaded to S3).<p>You can imagine the technical over-reaction:<p><pre><code> * Delete switched back to a soft delete
* Turned on S3 object versioning
* Started redundantly copying content onto a totally different hosting service
</code></pre>
This was fine (hah!) until we had to start taking down the inevitable child porn that always shows up on services like this - I got stuck with writing the takedown code and it took me forever to track down all the various tendrils of stuff.<p>As you might expect, we lost a ton of users over mass content deletion and the service never really rebounded. The company held on for a couple more years, pivoting a couple of times, but eventually folded.
pfft<p><pre><code> rmdir . /s /q
</code></pre>
only to notice that I am in the wrong folder, way up in the hierarchy.<p>And repeat the same error years later when the batch file failed to cd to the destination folder. Added<p><pre><code> if %CD% == %DESTDIR%
</code></pre>
to avoid that problem.
I switched companies, and moving from a MySQL cli to pgadmin is a godsend. Would still like a confirmation dialogue, but having to click a button seems less error prone than pressing enter too quickly.
These kinds of scenarios happen when money is "cheap", and highlight why the current recession, and coming 2nd great depression aren't really a bad thing.
Ah yes, the old lack of 'where' clause.<p>Did much the same, but only on my own system (thankfully!) and yes, I had an up-to-date backup MySQL dump on hand.
I had a thing recently where someone was updating some entries on a system I look after, decided the changes hadn't applied properly, and clicked "Roll back" to put it back to its original state.<p>Whatever had gotten into it, it rolled back to 2009. It rolled everything back, including user accounts.<p>No-one who worked there in 2009 still worked there, so no-one had a valid password any more.<p>Fortunately it was easy enough to copy the last-but-one backup over the top and lose the day before's config updates, and cure its Flowers for Algernon state, but it was a pretty hairy afternoon.