My DevOps horror stories, one sentence each:

- Somebody deployed new features on a Friday at 5pm.
- Fifteen hundred machines running mod_perl.
- Supporting Oracle - TWICE.
- It turns out your entire infrastructure depends on a single 8U Sun Solaris machine from 15 years ago, and nobody knows where it is.
- Troubleshooting a bug in a site, view source... and see SQL in the JS.
Temporarily mounted an NFS volume to a folder under /tmp.

Forgot about tmpwatch, a default entry in the RHEL cron table that clears out old temp files.

4AM the next morning: recursive deletion of anything with a change time older than n days.
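For anyone who hasn't been bitten by it, the mechanism amounts to roughly this (a sketch, not the exact vendor script; the retention window and time field vary by RHEL release):

    # Roughly what the stock tmpwatch cron job boils down to: recursively remove
    # anything under /tmp whose change time is older than ~10 days. With an NFS
    # volume mounted under /tmp, "anything" now includes the remote data.
    find /tmp -depth -mindepth 1 -ctime +10 -exec rm -rf {} +

Mounting scratch NFS somewhere outside /tmp sidesteps the whole class of problem.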
Tape Archive System: write a tape, restore it again, and run an md5sum against the original data. Then we know it can be restored correctly, and the original data is deleted.

Should be bulletproof, right?

Alas, the 'write to tape' scripts I'd inherited didn't warn if they couldn't load a tape into the drive.

There was a tape jammed in the drive, so the tape robot was refusing to load any new tapes, but kept on writing and restoring from the same tape over and over again.

Stupidly, we didn't do any 'check a tape from 3 weeks ago' spot checks for a while.

Lost quite a bit of data. We still have the md5sums though... *Still gets shivers thinking about it*
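The intended cycle, as a rough sketch (device names, slot numbers, and paths below are all made up); the fatal gaps were the missing load check and the missing spot checks of older tapes:

    # Hypothetical sketch of write -> restore -> verify -> delete.
    mtx -f /dev/sg1 load 12 || { echo "could not load a fresh tape" >&2; exit 1; }  # the check the inherited scripts skipped
    (cd /data/batch && find . -type f -exec md5sum {} +) > /tmp/batch.md5   # checksums of the original data
    tar -cf /dev/nst0 -C /data batch                      # write to tape
    mt -f /dev/nst0 rewind
    mkdir -p /restore && tar -xf /dev/nst0 -C /restore    # restore it again
    (cd /restore/batch && md5sum -c /tmp/batch.md5) && rm -rf /data/batch   # delete originals only if it verifies
    # And separately: every so often pull a tape written weeks ago and run the
    # same restore/md5sum check against it.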
We launched our brand new service into production pointing the backend at our dev instance, at the office. The entire internet showed up at our wee little DSL connection, effectively DDOS-ing our office. We had to leave, go to a cafe with public wifi to fix it.
My worst horror story was a full server room shutdown. We killed the servers, then the chillers, and then started work. About an hour after we started, we pulled our first floor tile to move some power cables. There was water under the floor! We spent the next few hours cleaning up all the water.

Apparently water kept flowing into the humidifier tray of the chiller, and the mechanical auto-shutoff never triggered. The pump didn't remove water from the tray because the power was off.

Facilities "fixed" the humidifier, but it happened again when that circuit was cut off for work elsewhere in the building. No one caught the water overflow, and it flowed down the conduits to the first floor. So we had flooding on two different floors from a single chiller.
"you can't have more than 64,000 objects in a folder in S3 - even though S3 doesn't have folders."
Is this for real, or are these stories made up? All documentation I've read about S3 suggests that it does not have any file count limitations. The timeline of Togetherville suggests that this story took place between 2008 and 2010. Did S3 have a limit back then that they lifted?
The customer.io story seems like a great example of why NOT to use budget providers like OVH and Hetzner for mission-critical applications.

You get what you pay for.
You can hardly be surprised when OVH or Hetzner go down; just consider the price. Putting every server in one location is just stupid... as always, the best way to fight downtime is to spread servers across multiple providers and DCs.
One thing I've learned: the real value of replication is how easy it makes handling strange events without getting stressed out.

It's 3AM. You're being paged with a high-latency alert in one datacenter. You run one command to drain traffic out of that datacenter. The latency graph starts looking normal again. You go back to bed at 3:05. Tomorrow morning, you look at the logs and figure out what went wrong.
Are there any open source load balancing solutions like what Amazon ELB does? Say, install the load balancer onto one or two Amazon VPSes and proxy traffic to third-party VPSes/dedicated servers (Linode, OVH, etc.). I wonder how feasible this approach is.
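HAProxy (or nginx) is the usual open-source answer here; a hedged sketch of the idea, with made-up backend addresses (keepalived/VRRP or DNS failover is what usually covers the "one or two" part so the balancer itself isn't a single point of failure):

    # Minimal illustrative setup: HAProxy on a small instance, round-robining
    # to backends at different providers. Addresses and timeouts are placeholders.
    cat >/etc/haproxy/haproxy.cfg <<'EOF'
    defaults
        mode http
        timeout connect 5s
        timeout client 30s
        timeout server 30s
    frontend public
        bind *:80
        default_backend app
    backend app
        balance roundrobin
        server linode1 203.0.113.10:80 check
        server ovh1 198.51.100.20:80 check
    EOF
    haproxy -c -f /etc/haproxy/haproxy.cfg && systemctl reload haproxy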
At least Amazon doesn't lose your servers.

http://www.informationweek.com/server-54-where-are-you/6505527
A few DevOps horror stories:

- Someone on the hardware team deleted several VMs that were being used as build machines; there were no backups. That wasted around 2 days to get things back to normal.
- During a show I volunteer for: a scissor lift drove over a just-run (several-hundred-foot) Ethernet line and severed it; they had to run a new line.
- PCs running Windows being set up as point-of-sale systems with static IPs on the internet and no firewall running. Disavowed all responsibility and left them on their own for that. They would have run them unpatched too without intervention.
- Someone checked a private key into the repository. Plan of action: obliterate it from all branches everywhere, delete it from all build drops (which contain source listings too), track down all build drop backups on tape and restore-delete-then-recreate them. Luckily I handed that job off to someone else.
Coworker says "I'm going to do some clean-up on the server." Two minutes later, "Oh crap." He had wiped out /var/lib. And tell you what, the server kept working. We didn't dare reboot it, though.

Another fun one was coming in one morning and cleaning up after somebody had used some foul PHP provisioning scripts on a customer system and had the unfortunate idea to use a function called "archive". Turned out the function didn't so much "archive" as "delete". Henceforth deletion, especially unintended deletion, was known as "shotgun archival".
The first two stories are notable in how they reflect the terrible practices of the teller.

"Our distributed application produces the same type of error after the same period of time in totally different data centers. We have no idea why, but moving data centers seems to help, so we just keep doing it. #YOLO"

"We've built a product on a data store and library we don't understand even the highest-level constraints of. That ignorance bit us in the ass at peak load. We patched over the problem and continue gleefully into the future. #YOLO"

These stories should be embarrassing, but they're seemingly being celebrated, or at least laughed about. Am I off base?
Mine was simple: I did a middle-mouse-button paste of "init 6" into a root window of our main Solaris server that hosted about 100 users, mid-day. Boss shrugged it off; stuff happens.

But that's because it was properly configured, so a reboot was smooth and didn't have any snags or affect other systems once back online. At another data center across the hall, if their main server needed to be rebooted (not accidentally!), it was 3 days of troubleshooting to get it back up. I learned that after the boss hired one of their admins - not surprisingly, a big mistake.
(worst) update table set column = 'blah'

WITHOUT a where clause (thank god for backups)

(2nd worst) delete from table where created < 'old_date'

WITHOUT an account filter (thank god again for backups)

Lesson learned: always take a backup and write the WHERE clause first.
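One habit that helps (a sketch; assumes PostgreSQL, and the table/column names are made up): run the destructive statement inside a transaction and look at the reported row count before anything becomes permanent.

    # psql prints "UPDATE <n>" / "DELETE <n>" for each statement, so the damage
    # is visible while it is still reversible. Table and columns are hypothetical.
    psql mydb <<'SQL'
    BEGIN;
    UPDATE accounts SET plan = 'blah' WHERE account_id = 42;
    ROLLBACK;  -- dry run: nothing committed; rerun with COMMIT once the count looks right
    SQL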
Let my cofounder near the backups. Whoops.

Had a friend who recently took down his NIC over SSH; he claimed he managed to get back in using some sort of serial-over-LAN magic, but I suspect he really just got someone on the other end to help.
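The classic safety net for that kind of change (a sketch; the interface name and the 5-minute window are assumptions) is to schedule the undo before you touch anything:

    # Schedule an automatic undo that survives the SSH session dying, then make
    # the risky change; if it cuts you off, the NIC comes back on its own.
    nohup sh -c 'sleep 300 && ip link set dev eth0 up' >/dev/null 2>&1 &
    ip link set dev eth0 down
    # If the change went fine and you still have access, kill the background job
    # to cancel the undo.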
TODO:
Monday Morning: T1 install will be complete.
Tuesday: Test/bootup period.
Wednesday: Sales start
Thursday: Sales continue, TV ad goes live
Friday: Champagne!

Reality:
Monday Morning: T1 did not get installed.
Tuesday: Emergency ISDN solution (stolen from Chiropractors next door)
Wednesday: Modem rack catches fire
Thursday: TV ad goes live
Friday: T1 goes live. Champagne.
In an early phase of MIT's EECS transition from Multics (going away, Honeywell sucks) to UNIX(TM) on MicroVAX IIs, i.e. some users, but not as many as later:

# kill % 1

Instead of %1. So I zapped the initializer, parent of everything else, logging everyone out without warning.

I had more than enough capital to avoid anything more than the deserved ribbing, but it was my Crowning Moment of Awesome devop lossage; harsh but minor screwups in the decade previous had trained me to be very careful.

I've avoided being handed the horrors of many other posters by primarily being a programmer. You full-timers earn my respect.

ADDED: Ah, one big consequential goof, related to my not being a full-time sysadmin but knowing more than anyone else in my startup. Buying a Cheswick and Bellovin style Gauntlet Firewall from TIS ... not realizing they'd just been bought by Network Associates, who promptly fired anyone who knew anything about supporting that product.... (At that time I didn't even know about iptables' predecessor, although given it was a Microsoft shop....)

I was fired from that job in part because I was the least worst sysadmin in the company, totally consumed with a big programming and database migration effort (Microsoft Jet -> DB2 -> DB2 on a real server), and gave opinions that others sometimes accepted and implemented without due diligence. E.g. I said "this is a competent ISP", not "you should also use their brand new email system" (which I didn't even know existed) ... visibility all the way up to the CEO is of course not always good....