Due to an operator error, all compute nodes in us-east-1 were simultaneously rebooted. Some compute nodes are already back up, but due to very high load on the control plane, this is taking some time. We are dedicating all operational and engineering resources to getting this issue resolved, and will be providing a full postmortem on this failure once every compute node and customer VM is online and operational. We will be providing frequent updates until the issue is resolved.
It should go without saying that we're mortified by this. While the immediate cause was operator error, there are broader systemic issues that allowed a fat finger to take down a datacenter. As soon as we reasonably can, we will be providing a full postmortem of this: how this was architecturally possible, what exactly happened, how the system recovered, and what improvements we are making, and will make, to both the software and to operational procedures to ensure that this doesn't happen in the future (and that the recovery is smoother for failure modes of similar scope).
Mandatory DevOps Borat:

"To make error is human. To propagate error to all server in automatic way is #devops"

and my fav:
"Law of Murphy for devops: if thing can able go wrong, is mean is already wrong but you not have Nagios alert of it yet."
Joyent's messaging about "we're cloud, but with perfect uptime" was always broken.

It's mildly gross that the current messaging sounds like they're throwing a sysadmin under the bus. If fat fingers can down a data center, that's an engineering problem.

I care about an object store that never loses data and an API that always has an answer for me, even if it's saying things that I don't want to hear.

99.999 sounds stuck-in-the-90s.
The 'devops' automation I built at my last company (and am building at my current company) had monitoring fully integrated into the system automation.

That is, 'write'-style automation changes (as opposed to most 'remediation'-style changes) would only proceed, on a box-by-box basis, if the affected cluster didn't have any critical alerts coming in.

So, if I issued a parallel, rolling 'shut down the system' command to all boxes, it would only take down a portion of the boxes before automatically aborting because of critical monitoring alerts.

Parallelism was calculated from historical, manually approved load levels for each cluster, compared to current load levels. So the rollout runs faster if there's very low load on a cluster, and very slowly if there's high load on a cluster.

One way or another, most automation should automatically stop 'doing things' if critical alerts are coming in. Or, put another way, most automation should not be able to move forward unless it can verify that it has current alert data, and that none of that data indicates critical problems.
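A minimal sketch of that alert-gated pattern, assuming hypothetical hooks into monitoring and execution (get_critical_alerts, get_cluster_load, run_on_host are stand-ins, not real APIs) that would have to be wired to whatever Nagios/SSH tooling is actually in place:

```python
# Sketch of alert-gated rolling automation, not a real implementation.
# The three stubs below stand in for the monitoring/orchestration systems.
import time

APPROVED_LOAD = 0.60  # manually approved load ceiling for the cluster (example value)

def get_critical_alerts(cluster):
    """Return the list of current critical alerts, or None if monitoring is unreachable (stub)."""
    raise NotImplementedError("wire this to your monitoring system, e.g. Nagios")

def get_cluster_load(cluster):
    """Return current cluster load as a fraction of capacity (stub)."""
    raise NotImplementedError("wire this to your metrics system")

def run_on_host(host, command):
    """Apply the change to a single host (stub)."""
    raise NotImplementedError("wire this to your SSH/orchestration layer")

def batch_size(cluster, max_parallel=10):
    # More headroom under the approved load level -> bigger batches; high load -> one box at a time.
    headroom = max(APPROVED_LOAD - get_cluster_load(cluster), 0.0)
    return max(1, int(max_parallel * headroom / APPROVED_LOAD))

def rolling_change(cluster, hosts, command, settle_seconds=30):
    remaining = list(hosts)
    while remaining:
        # Refuse to move forward without current, clean alert data.
        alerts = get_critical_alerts(cluster)
        if alerts is None:
            raise RuntimeError("no current alert data; refusing to continue")
        if alerts:
            raise RuntimeError("aborting: critical alerts present: %r" % alerts)

        n = batch_size(cluster)
        batch, remaining = remaining[:n], remaining[n:]
        for host in batch:
            run_on_host(host, command)

        time.sleep(settle_seconds)  # give monitoring time to notice problems before the next batch
```

Run against a whole fleet, a bad 'shut down the system' command would take out at most one batch before the next alert check aborts the run.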
Let this be a lesson to Linux admins: re-alias shutdown -r now to something else on production servers. I once took down access to about 6000 servers because I ran the script to decommission servers on our jump box when I got the SSH windows confused.
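One way to act on that advice is a molly-guard-style wrapper installed ahead of the real shutdown in PATH, so a reboot typed into the wrong SSH window has to confirm the hostname first. The sketch below is illustrative, not a drop-in tool; the install location and the path to the real binary are assumptions that vary by distro:

```python
#!/usr/bin/env python3
# Illustrative molly-guard-style wrapper: install at e.g. /usr/local/sbin/shutdown
# so it shadows the real binary, and make the operator type the hostname before rebooting.
import os
import socket
import sys

REAL_SHUTDOWN = "/sbin/shutdown"  # assumed path to the real binary; adjust for your system

def main():
    hostname = socket.gethostname()
    # Only interactive sessions can be prompted; refuse anything else outright.
    if not sys.stdin.isatty():
        sys.exit("refusing non-interactive shutdown on %s" % hostname)
    answer = input("Type the hostname of the machine you intend to shut down: ")
    if answer.strip() != hostname:
        sys.exit("hostname mismatch; not shutting down %s" % hostname)
    # Hand off to the real shutdown with the original arguments.
    os.execv(REAL_SHUTDOWN, [REAL_SHUTDOWN] + sys.argv[1:])

if __name__ == "__main__":
    main()
```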
Joyent has been having some serious issues over the past month or two. I am not sure if it is growing pains, bad luck or what is happening, but we had already lost faith and trust in their Cloud prior to today. This is the nail in the coffin from our perspective. Moving on...
As I've always said, "You can never protect a system from a stupid person with root."

You can limit carnage and mitigate this type of thing, but you can't fully protect against sysadmins doing dumb things (unless you just hire great sysadmins).