Joyent us-east-1 rebooted due to operator error

104 points by hypervisor almost 11 years ago
Due to an operator error, all compute nodes in us-east-1 were simultaneously rebooted. Some compute nodes are already back up, but due to very high load on the control plane, this is taking some time. We are dedicating all operational and engineering resources to getting this issue resolved, and will be providing a full postmortem on this failure once every compute node and customer VM is online and operational. We will be providing frequent updates until the issue is resolved.

11 comments

bcantrill almost 11 years ago
It should go without saying that we're mortified by this. While the immediate cause was operator error, there are broader systemic issues that allowed a fat finger to take down a datacenter. As soon as we reasonably can, we will be providing a full postmortem of this: how this was architecturally possible, what exactly happened, how the system recovered, and what improvements we are/will be making to both the software and to operational procedures to assure that this doesn't happen in the future (and that the recovery is smoother for failure modes of similar scope).
lukasm almost 11 years ago
Mandatory DevOps Borat

"To make error is human. To propagate error to all server in automatic way is #devops"

and my fav "Law of Murphy for devops: if thing can able go wrong, is mean is already wrong but you not have Nagios alert of it yet."
alrs almost 11 years ago
Joyent's messaging about "we're cloud, but with perfect uptime" was always broken.

It's mildly gross that the current messaging sounds like they're throwing a sysadmin under the bus. If fat fingers can down a data center, that's an engineering problem.

I care about an object store that never loses data and an API that always has an answer for me, even if it's saying things that I don't want to hear.

99.999 sounds stuck-in-the-90s.
Diederich almost 11 years ago
The 'devops' automation I made at my last company (and am building at my current company) had monitoring fully integrated into the system automation.

That is, 'write' style automation changes (as opposed to most 'remediation' style changes) would only proceed, on a box by box basis, if the affected cluster didn't have any critical alerts coming in.

So, if I issued a parallel, rolling 'shut down the system' command to all boxes, it would only take down a portion of all of the boxes before automatically aborting because of critical monitoring alerts.

The degree of parallelism was calculated based on historical but manually approved load levels for each cluster, compared to current load levels. So parallel runs go faster if there's very low load on a cluster, or very slowly if there's a high load on a cluster.

One way or another, most automation should automatically stop 'doing things' if there are critical alerts coming in. Or, put another way, most automation should not be able to move forward unless it can verify that it has current alert data, and that none of that data indicates critical problems.
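
The gating described in this comment can be sketched roughly as below. This is only an illustration under stated assumptions, not the commenter's actual system: `AlertSnapshot`, `AbortRollout`, `batch_size`, `rolling_apply`, and the `get_alerts` / `get_load` callbacks are hypothetical names, the headroom formula is one simple guess at how load could map to parallelism, and each batch is run sequentially to keep the sketch short.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class AlertSnapshot:
    fetched_at: float     # Unix timestamp when the alert data was collected
    critical_count: int   # number of critical alerts open for the cluster


class AbortRollout(Exception):
    """Raised when the rollout must stop; `done` lists hosts already changed."""

    def __init__(self, reason: str, done: List[str]):
        super().__init__(reason)
        self.done = done


def batch_size(approved_load: float, current_load: float, max_parallel: int) -> int:
    """Assumed formula: more headroom under the approved load, bigger batches."""
    if approved_load <= 0:
        return 1
    headroom = max(0.0, 1.0 - current_load / approved_load)
    return max(1, int(max_parallel * headroom))


def rolling_apply(
    hosts: Sequence[str],
    action: Callable[[str], None],
    get_alerts: Callable[[], AlertSnapshot],
    get_load: Callable[[], float],
    approved_load: float,
    max_parallel: int = 10,
    max_alert_age: float = 60.0,
    now: Callable[[], float] = time.time,
) -> List[str]:
    """Apply `action` in batches, re-checking monitoring before every batch."""
    done: List[str] = []
    remaining = list(hosts)
    while remaining:
        alerts = get_alerts()
        # Refuse to move forward on stale data: no current alerts, no changes.
        if now() - alerts.fetched_at > max_alert_age:
            raise AbortRollout("alert data is stale", done)
        if alerts.critical_count > 0:
            raise AbortRollout("critical alerts present", done)
        n = batch_size(approved_load, get_load(), max_parallel)
        batch, remaining = remaining[:n], remaining[n:]
        for host in batch:  # a real system would run the batch in parallel
            action(host)
            done.append(host)
    return done
```

With a check like this in front of every batch, a mistaken "reboot everything" would be expected to stop after the first batch or two, once the monitoring system starts reporting the hosts that went down.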
jameshart almost 11 years ago
DevOps means being able to take out an entire datacenter with a single keystroke...
shiftpgdn almost 11 years ago
Let this be a lesson to Linux admins. Re-alias shutdown -r now into something else on production servers. I once took down access to about 6000 servers because I ran the script to decommission servers on our jump box when I got the SSH windows confused.
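
One way to read "re-alias it into something else" is to put a confirmation step in front of the real command. The sketch below is a swapped-in variant of that idea, not what the commenter describes running: it asks the operator to type the hostname of the machine they think they are on, and only then calls `shutdown -r now`. The names `confirm_host` and `guarded_reboot`, and the choice to compare against the short hostname, are assumptions made for illustration.

```python
import socket
import subprocess
import sys
from typing import Callable, Optional


def confirm_host(typed: str, hostname: str) -> bool:
    """True only if the operator typed this machine's short hostname exactly."""
    return typed.strip() == hostname.split(".")[0]


def guarded_reboot(
    ask: Callable[[str], str] = input,
    hostname: Optional[str] = None,
    run: Callable[..., object] = subprocess.run,
) -> bool:
    """Reboot only after the operator proves they know which box this is."""
    host = hostname or socket.gethostname()
    # The prompt deliberately does not show the hostname.
    typed = ask("Type the hostname of the machine you intend to reboot: ")
    if not confirm_host(typed, host):
        print("Hostname mismatch, refusing to reboot.", file=sys.stderr)
        return False
    run(["shutdown", "-r", "now"], check=True)
    return True
```

An operator who has confused two SSH windows would type the name of the other machine, and the wrapper would refuse. It does nothing against a script that runs the real `shutdown` binary directly, so it limits mistakes at an interactive prompt rather than in automation.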
dharbin almost 11 years ago
salt '*' system.reboot
jordanthoms almost 11 years ago
Looks like the janitor needed somewhere to plug in the vacuum cleaner again...
devinegan almost 11 years ago
Joyent has been having some serious issues over the past month or two. I am not sure if it is growing pains, bad luck or what is happening, but we had already lost faith and trust in their Cloud prior to today. This is the nail in the coffin from our perspective. Moving on...
shanselman almost 11 years ago
"What's this button do?"
SEJeff almost 11 years ago
As I've always said, "You can never protect a system from a stupid person with root".

You can limit carnage and mitigate this type of thing, but you can't fully protect against sysadmins doing dumb things (unless you just hire great sysadmins)