科技回声 (Tech Echo)

A tech news platform built with Next.js, serving global tech news and discussion.


Devops Horror Stories

153 points · by stevenklein · over 11 years ago

26 comments

0xbadcafebee · over 11 years ago

My Devops horror stories, one sentence each:

- Somebody deployed new features on a Friday at 5pm.
- Fifteen hundred machines running mod_perl.
- Supporting Oracle - TWICE.
- It turns out your entire infrastructure is dependent on a single 8U Sun Solaris machine from 15 years ago, and nobody knows where it is.
- Troubleshooting a bug in a site, view source... and see SQL in the JS.
nailer · over 11 years ago

Temporarily mounted an NFS volume to a folder under /tmp.

Forgot about tmpwatch, a default entry in the RHEL cron table to clear out old temp files.

4AM the next morning: recursive deletion of anything with a change time older than n days.
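The failure mode above is easy to reproduce safely. A minimal sketch, with a scratch directory standing in for /tmp; the `find` invocation only approximates tmpwatch, which keys on atime by default (with flags for ctime/mtime), and all paths and the 7-day cutoff are illustrative:

```shell
# Stand-in for /tmp, plus a "temporarily" mounted volume under it.
demo=$(mktemp -d)
mkdir -p "$demo/nfs-mount"
touch -d '10 days ago' "$demo/nfs-mount/important.dat"
touch "$demo/fresh.txt"

# What the cron job effectively does: delete old files, recursing into
# everything below the watched tree -- including the forgotten mount.
find "$demo" -type f -mtime +7 -delete

ls "$demo"    # fresh.txt survives; important.dat on the mount is gone
```

The point is that the cleanup does not (and cannot) know which subdirectories are "really" temp files and which are a mounted filesystem.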
deckiedan · over 11 years ago

Tape archive system: write a tape, restore it again, and run md5sum against the original data. Then we know it can be restored correctly, and the original data is deleted.

Should be bulletproof?

Alas, the 'write to tape' scripts I'd inherited didn't warn if they couldn't load a tape into the drive.

There was a tape jammed in the drive, so the tape robot was refusing to load any new tapes, but kept on writing and restoring from the same tape over and over again.

Stupidly, we didn't do any 'check a tape from 3 weeks ago' spot checks for a while.

Lost quite a bit of data. We still have the md5sums, though... *Still gets shivers thinking about it*
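The trap is that the nightly check passed every time. A sketch of why, with plain files standing in for tapes; `write_tape`/`read_tape` are hypothetical stand-ins for the inherited scripts:

```shell
# The robot never swaps media, so every nightly archive is written to --
# and verified against -- the same jammed tape, silently overwriting
# whatever was "safely archived" the night before.
stuck_tape=$(mktemp)

write_tape() { printf '%s' "$1" > "$stuck_tape"; }  # always the same tape
read_tape()  { cat "$stuck_tape"; }

archive_and_verify() {
  write_tape "$1"
  [ "$(printf '%s' "$1" | md5sum)" = "$(read_tape | md5sum)" ] \
    && echo "archive verified OK"       # passes every single night
}

archive_and_verify "monday's data"      # original now deleted from disk
archive_and_verify "tuesday's data"     # verifies OK -- monday's is gone
read_tape                               # only tuesday's data survives
```

The missing control is exactly the one the poster names: periodically restoring a weeks-old tape, which compares against data a single jammed tape can no longer contain.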
splitrocket · over 11 years ago
We launched our brand new service into production pointing the backend at our dev instance, at the office. The entire internet showed up at our wee little DSL connection, effectively DDOS-ing our office. We had to leave, go to a cafe with public wifi to fix it.
caw · over 11 years ago

My worst horror story was a full server room shutdown. We killed servers, then the chillers, and then started work. About an hour after we started, we pulled our first floor tile to move some power cables. There was water under the floor! We spent the next few hours cleaning up all the water.

Apparently water kept flowing into the humidifier tray of the chiller, and the mechanical auto-shutoff never triggered. The pump didn't remove water from the tray because the power was off.

Facilities "fixed" the humidifier, but it still happened again when that circuit was cut off for work elsewhere in the building. No one caught the water overflow, and it flowed down the conduits to the first floor. So we had flooding on 2 different floors from a single chiller.
rdw · over 11 years ago

"You can't have more than 64,000 objects in a folder in S3 - even though S3 doesn't have folders." Is this for real, or are these stories made up? All the documentation I've read about S3 suggests that it does not have any file count limitations. The timeline of Togetherville suggests that this story took place between 2008 and 2010. Did S3 have a limit back then that was later lifted?
ch4ch4 · over 11 years ago

The customer.io story seems like a great example of why NOT to use budget providers like OVH and Hetzner for mission-critical applications.

You get what you pay for.
Fizzadar · over 11 years ago

You can hardly be surprised when OVH or Hetzner go down; just consider the price. Putting every server in one location is just stupid... as always, the best way to fight downtime is to spread servers across multiple providers & DCs.
jrockway · over 11 years ago

One thing I've learned: the real value of replication is how easy it makes handling strange events without getting stressed out.

It's 3AM. You're being paged about high latency in one datacenter. You run one command to drain traffic out of that datacenter. The latency graph starts looking normal again. You go back to bed at 3:05. You look at the logs and figure out what went wrong tomorrow morning.
cygwin98 · over 11 years ago

Are there any open source load balancing solutions like what Amazon ELB does? Say, install the load balancer onto one or two Amazon VPSes and proxy traffic to third-party VPS/dedicated servers: Linode, OVH, etc. I wonder how feasible this approach is?
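HAProxy (or nginx) is the usual open-source answer to the question above. A minimal configuration sketch of an ELB-like layer; every address, server name, and the /healthz path below is hypothetical:

```
# haproxy.cfg sketch: one cheap VPS terminating traffic and health-checking
# backends hosted at other providers (addresses and paths are made up).
defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend www
    bind *:80
    default_backend app_servers

backend app_servers
    balance roundrobin
    option httpchk GET /healthz
    server linode1 203.0.113.10:8080 check
    server ovh1    203.0.113.20:8080 check
```

Run one copy per balancer box and put DNS round-robin or a failover IP in front of the pair; otherwise the balancer itself becomes the new single point of failure.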
patmcguire · over 11 years ago

At least Amazon doesn't lose your servers.

http://www.informationweek.com/server-54-where-are-you/6505527
InclinedPlane · over 11 years ago

A few devops horror stories:

- Someone on the hardware team deleted several VMs that were being used as build machines; there were no backups. That wasted around 2 days getting things back to normal.
- During a show I volunteer for: a scissor lift drove over a just-run (several hundred foot) ethernet line and severed it; they had to run a new line.
- PCs running Windows being set up as point-of-sale systems with static IPs on the internet and no firewall running. Disavowed all responsibility and left them on their own for that. They would have run them unpatched too without intervention.
- Someone checked a private key into the repository. Plan of action: obliterate it from all branches everywhere, delete it from all build drops (which contain source listings too), track down all build drop backups on tape and restore-delete-then-recreate them. Luckily I handed that job off to someone else.
mercurial · over 11 years ago

Coworker says, "I'm going to do some clean-up on the server." Two minutes later: "Oh crap." He had wiped out /var/lib. And tell you what, the server kept working. We didn't dare reboot it, though.

Another fun one was coming in one morning and cleaning up after somebody had used some foul PHP provisioning scripts on a customer system and had the unfortunate idea to use a function called "archive". Turned out the function didn't so much "archive" as "delete". Henceforth deletion, especially unintended deletion, was known as "shotgun archival".
perlpimp · over 11 years ago
alt.sysadmin.recovery lives on! albeit in a web app. wonder if usenet is still alive...
nl · over 11 years ago

In a script: sudo chown -R apache:apache . /

Note that space? I didn't.
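The stray space turns "./" into two arguments, "." and "/", and -R happily recurses over both. A safe demonstration against scratch directories (chmod instead of the story's ownership change so it runs unprivileged, and all paths are made up):

```shell
# "project" is where the script meant to operate; "fake-root" plays the
# part of / that the stray space drags in.
scratch=$(mktemp -d)
mkdir -p "$scratch/project" "$scratch/fake-root/etc"
touch "$scratch/project/app.conf" "$scratch/fake-root/etc/passwd"
chmod 644 "$scratch/fake-root/etc/passwd"
cd "$scratch/project"

# intended:  chmod -R go-rwx ./
# typed:     chmod -R go-rwx . /        <- two arguments, not one
chmod -R go-rwx . "$scratch/fake-root"  # fake-root stands in for the real /

stat -c '%a' "$scratch/fake-root/etc/passwd"   # prints 600: "/" got hit too
```

Quoting or using an explicit "./" would not have saved this; only noticing the second argument would.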
redbad · over 11 years ago

The first two stories are notable in how they reflect the terrible practices of the teller.

"Our distributed application produces the same type of error after the same period of time in totally different data centers. We have no idea why, but moving data centers seems to help, so we just keep doing it. #YOLO"

"We've built a product on a data store and library we don't understand even the highest-level constraints of. That ignorance bit us in the ass at peak load. We patched over the problem and continue gleefully into the future. #YOLO"

These stories should be embarrassing, but they're seemingly being celebrated, or at least laughed about. Am I off base?
peterstjohn · over 11 years ago

"Oh, just use keys * to work out what's there."

"No, wait, don't…!"

<site down>
liquidcool · over 11 years ago

Mine was simple: I did a middle-mouse-button paste of "init 6" into a root window of our main Solaris server, which hosted about 100 users, mid-day. Boss shrugged it off; stuff happens.

But that's because it was properly configured, so a reboot was smooth and didn't hit any snags or affect other systems once back online. At another data center across the hall, if their main server needed to be rebooted (not accidentally!), it was 3 days of troubleshooting to get it back up. I learned that after the boss hired one of their admins - not surprisingly, a big mistake.
cpt1138 · over 11 years ago

(Worst) update table set column = 'blah' WITHOUT a where clause (thank god for backups).

(2nd worst) delete from table where created < 'old_date' WITHOUT an account (thank god again for backups).

Lesson learned: always back up, and write the WHERE clause first.
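Beyond writing the WHERE clause first, wrapping destructive statements in a transaction and checking the affected-row count before committing catches both mistakes above. A sketch with sqlite3 standing in for the production database; the table and values are made up:

```shell
db=$(mktemp)
sqlite3 "$db" "CREATE TABLE accounts(id INTEGER, status TEXT);
               INSERT INTO accounts VALUES (1,'active'),(2,'active'),(3,'old');"

# Dry run: do the UPDATE inside a transaction, inspect changes(), roll back.
# A count of 3 instead of 1 here is the missing-WHERE-clause alarm bell.
sqlite3 "$db" "BEGIN;
               UPDATE accounts SET status='archived' WHERE id=3;
               SELECT changes();   -- prints 1: only the row we meant
               ROLLBACK;"

sqlite3 "$db" "SELECT status FROM accounts WHERE id=3;"   # still 'old'
```

Once the dry run reports the expected count, rerun with COMMIT in place of ROLLBACK.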
allworknoplay · over 11 years ago
Let my cofounder near the backups. Whoops.<p>Had a friend who recently took down his nic over ssh; he claimed he managed to get back in using some sort of serial over lan magic but I suspect he really just got someone on the other end to help.
porker · over 11 years ago

Any dibs as to what happened for customer.io at Linode/Hetzner?
joeshred · over 11 years ago
service network stop
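Running that over the ssh session it carries is the classic self-lockout. The usual safety net is to arm the undo before the risky command, so a dropped session can't strand you. A sketch with a stand-in service function so it runs anywhere; on a real box the revert would be the matching start command, typically scheduled via at or a systemd timer:

```shell
# Stand-in that just logs, so the sketch is runnable without real networking.
netlog=$(mktemp)
service() { echo "service $1 $2" >> "$netlog"; }

( sleep 1 && service network start ) &   # safety net, armed FIRST
service network stop                     # the command that kills your session
wait                                     # session is gone in real life...
tail -n 1 "$netlog"                      # ...but the revert still fired
```

If you are still connected after the change, you kill the background job instead of letting the revert fire.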
iSnow · over 11 years ago

So they hosted at Hetzner and OVH, both extremely cheap hosters, and were surprised that things did not go smoothly?

Extremely professional.
neumann · over 11 years ago
Academic devops horror story in one word: Ruby.
fit2rule · over 11 years ago

TODO:
Monday morning: T1 install will be complete.
Tuesday: Test/bootup period.
Wednesday: Sales start.
Thursday: Sales continue, TV ad goes live.
Friday: Champagne!

Reality:
Monday morning: T1 did not get installed.
Tuesday: Emergency ISDN solution (stolen from the chiropractors next door).
Wednesday: Modem rack catches fire.
Thursday: TV ad goes live.
Friday: T1 goes live. Champagne.
hga · over 11 years ago

In an early phase of MIT's EECS transition from Multics (going away, Honeywell sucks) to UNIX(TM) on MicroVAX IIs, i.e. some users, but not as many as later:

# kill % 1

Instead of %1. So I zapped the initializer, parent of everything else, logging everyone out without warning.

I had more than enough capital to avoid anything more than the deserved ribbing, but it was my Crowning Moment of Awesome devop lossage; harsh but minor screwups in the decade previous had trained me to be very careful.

I've avoided being handed the horrors of many other posters by primarily being a programmer. You full-timers earn my respect.

ADDED: Ah, one big consequential goof, related to my not being a full-time sysadmin but knowing more than anyone else in my startup: buying a Cheswick-and-Bellovin-style Gauntlet Firewall from TIS... not realizing they'd just been bought by Network Associates, who promptly fired anyone who knew anything about supporting that product. (At that time I didn't even know about iptables' predecessor, although given it was a Microsoft shop....)

I was fired from that job in part because I was the least worst sysadmin in the company, totally consumed with a big programming and database migration effort (Microsoft Jet -> DB2 -> DB2 on a real server), and gave opinions that others sometimes accepted and implemented without due diligence. E.g. I said "this is a competent ISP", not "you should also use their brand new email system" (which I didn't even know existed)... visibility all the way up to the CEO is of course not always good....