Worst outage: Christmas 1998. Site is locked down, all changes prohibited due to recently instituted end-of-quarter/end-of-year freezes as part of "professionalizing" our services. I'm just out of cell phone range visiting family when a call comes in that the site is down.

Drive back to the family home in the Chicago suburbs, dial in on one phone line while calling into an open conference call on the other. Learn that the site started decaying a few hours earlier, but service management decided not to inform me until the actual outage commenced. Over the next few hours the site goes completely black; the AFS fileservers have lost their minds and won't accept connections from the httpd clients serving content.

We had been in the process of migrating off this old complex onto a newer one running DFS on newer hardware, so I hot-flipped the site to the stale content on the new complex. At least the site is up, sort of. I make minor tweaks to make it look more recent.

Returning to the "old" complex, we learn that although service changes had been prohibited, a manager in the service organization had decided to bypass pretty much all internal processes and pushed through a change to the routers. I don't remember the precise change; I think it had something to do with HSRP, or virtual MAC addresses. Whatever the change was, it totally hosed AFS, which was dependent on the very thing being changed. A normal review would have caught this, but since it was Christmas and the guy was in a hurry, no one caught it.

Over the next 24 hours, from Christmas Eve into Christmas Day, I (working from my parents' spare bedroom), my lead sysadmin (working from a cabin in the Rockies), and my lead webmaster (working from HIS parents' home in the UK) manage to resurrect the site from backups (the site itself was running out of datacenters in Columbus, OH and Schaumburg, IL).

The punchline: at the time, my second-line manager is the CIO. Over the entire outage I've kept him in the loop on what we were doing, who was helping, etc.

The following Monday he's on his regular call with Global Services, reviewing incidents, issues, etc. No one mentions that the corporate site had been down for, effectively, two days. Finally, he brings it up, causing a colossal bureaucratic shitstorm.

The end result? I'm reprimanded for a couple of minor infractions (a slap on the wrist, considering my sole motivation was getting the site back online). The sysadmin who worked through her vacation from a backwoods cabin? Fired. Not for the work she did to get the site back online, but because management felt she, too, had bypassed process and should have focused not on getting the site online but on keeping management informed (which, it turned out, they were; they had simply ignored the trouble tickets coming in).

The manager who approved and pushed through the "minor change" that took the site offline? Promoted.

Lesson learned: for all the talk about relying on the I/T professionals, they were just as apt to make colossal mistakes, but could fall back on "process" and bureaucracy to avoid accountability. I was advised that the next time the site was down, I should rely on the professionals to return it to service, and if it took multiple hours or days, so be it.