Due to an operator error, all compute nodes in us-east-1 were simultaneously rebooted. Some compute nodes are already back up, but due to very high load on the control plane, this is taking some time. We are dedicating all operational and engineering resources to getting this issue resolved, and will be providing a full postmortem on this failure once every compute node and customer VM is online and operational. We will be providing frequent updates until the issue is resolved.
It should go without saying that we're mortified by this. While the immediate cause was operator error, there are broader systemic issues that allowed a fat finger to take down a datacenter. As soon as we reasonably can, we will be providing a full postmortem of this: how this was architecturally possible, what exactly happened, how the system recovered, and what improvements we are making, and will make, to both the software and to operational procedures to ensure that this doesn't happen in the future (and that the recovery is smoother for failure modes of similar scope).
Mandatory DevOps Borat:

"To make error is human. To propagate error to all server in automatic way is #devops"

and my fav:
"Law of Murphy for devops: if thing can able go wrong, is mean is already wrong but you not have Nagios alert of it yet."
Joyent's messaging about "we're cloud, but with perfect uptime" was always broken.

It's mildly gross that the current messaging sounds like they're throwing a sysadmin under the bus. If fat fingers can down a data center, that's an engineering problem.

I care about an object store that never loses data and an API that always has an answer for me, even if it's saying things that I don't want to hear.

99.999 sounds stuck-in-the-90s.
The 'devops' automation I built at my last company (and am building at my current company) had monitoring fully integrated into the system automation.

That is, 'write'-style automation changes (as opposed to most 'remediation'-style changes) would only proceed, on a box-by-box basis, if the affected cluster didn't have any critical alerts coming in.

So, if I issued a parallel, rolling 'shut down the system' command to all boxes, it would only take down a portion of the boxes before automatically aborting because of critical monitoring alerts.

Parallelism was calculated from historical, manually approved load levels for each cluster, compared to current load levels. So the rollout runs faster if there's very low load on a cluster, and very slowly if there's high load on a cluster.

One way or another, most automation should automatically stop 'doing things' if critical alerts are coming in. Or, put another way, most automation should not be able to move forward unless it can verify that it has current alert data, and that none of that data indicates critical problems.
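A minimal sketch of that alert-gated pattern, assuming hypothetical hooks into monitoring and execution (get_critical_alerts, get_cluster_load, run_on_host are stand-ins, not real APIs) that would have to be wired to whatever Nagios/SSH tooling is actually in place:

```python
# Sketch of alert-gated rolling automation, not a real implementation.
# The three stubs below stand in for the monitoring/orchestration systems.
import time

APPROVED_LOAD = 0.60  # manually approved load ceiling for the cluster (example value)

def get_critical_alerts(cluster):
    """Return the list of current critical alerts, or None if monitoring is unreachable (stub)."""
    raise NotImplementedError("wire this to your monitoring system, e.g. Nagios")

def get_cluster_load(cluster):
    """Return current cluster load as a fraction of capacity (stub)."""
    raise NotImplementedError("wire this to your metrics system")

def run_on_host(host, command):
    """Apply the change to a single host (stub)."""
    raise NotImplementedError("wire this to your SSH/orchestration layer")

def batch_size(cluster, max_parallel=10):
    # More headroom under the approved load level -> bigger batches; high load -> one box at a time.
    headroom = max(APPROVED_LOAD - get_cluster_load(cluster), 0.0)
    return max(1, int(max_parallel * headroom / APPROVED_LOAD))

def rolling_change(cluster, hosts, command, settle_seconds=30):
    remaining = list(hosts)
    while remaining:
        # Refuse to move forward without current, clean alert data.
        alerts = get_critical_alerts(cluster)
        if alerts is None:
            raise RuntimeError("no current alert data; refusing to continue")
        if alerts:
            raise RuntimeError("aborting: critical alerts present: %r" % alerts)

        n = batch_size(cluster)
        batch, remaining = remaining[:n], remaining[n:]
        for host in batch:
            run_on_host(host, command)

        time.sleep(settle_seconds)  # give monitoring time to notice problems before the next batch
```

Run against a whole fleet, a bad 'shut down the system' command would take out at most one batch before the next alert check aborts the run.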
Let this be a lesson to Linux admins: re-alias shutdown -r now to something else on production servers. I once took down access to about 6000 servers because I ran the script to decommission servers on our jump box when I got the SSH windows confused.
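One way to act on that advice is a molly-guard-style wrapper installed ahead of the real shutdown in PATH, so a reboot typed into the wrong SSH window has to confirm the hostname first. The sketch below is illustrative, not a drop-in tool; the install location and the path to the real binary are assumptions that vary by distro:

```python
#!/usr/bin/env python3
# Illustrative molly-guard-style wrapper: install at e.g. /usr/local/sbin/shutdown
# so it shadows the real binary, and make the operator type the hostname before rebooting.
import os
import socket
import sys

REAL_SHUTDOWN = "/sbin/shutdown"  # assumed path to the real binary; adjust for your system

def main():
    hostname = socket.gethostname()
    # Only interactive sessions can be prompted; refuse anything else outright.
    if not sys.stdin.isatty():
        sys.exit("refusing non-interactive shutdown on %s" % hostname)
    answer = input("Type the hostname of the machine you intend to shut down: ")
    if answer.strip() != hostname:
        sys.exit("hostname mismatch; not shutting down %s" % hostname)
    # Hand off to the real shutdown with the original arguments.
    os.execv(REAL_SHUTDOWN, [REAL_SHUTDOWN] + sys.argv[1:])

if __name__ == "__main__":
    main()
```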
Joyent has been having some serious issues over the past month or two. I am not sure if it is growing pains, bad luck or what is happening, but we had already lost faith and trust in their Cloud prior to today. This is the nail in the coffin from our perspective. Moving on...
As I've always said, "You can never protect a system from a stupid person with root."

You can limit carnage and mitigate this type of thing, but you can't fully protect against sysadmins doing dumb things (unless you just hire great sysadmins).