TechEcho
Handling Human Error in the Datacenter

12 points by slackerIII almost 17 years ago

4 comments

lsc almost 17 years ago
A problem everywhere I've worked, especially in places where the rack is 'organically grown', is power cables getting accidentally pulled when other servers are added or removed.

On all servers that I have physical access to, I use zip ties on both ends of the power cable. You need a knife to unplug anything. One problem, at least, solved.

Generally, I categorize mistakes as 'mistakes of knowledge' (I did the wrong thing because I believed something that was incorrect) and 'mistakes of inattention' (I knew it was the wrong thing to do, but I wasn't paying attention and did it anyhow).

Generally, you don't make the same mistake of knowledge twice, so I don't worry about them much. They happen, but they only happen once. Learning, we call it.

Mistakes of inattention are much worse, in my opinion. Without further action, I will almost certainly repeat a mistake of inattention.

The idea is that every time you make a 'mistake of inattention' you put in place a procedure that will prevent the mistake.
sharjeel almost 17 years ago
Also, if your scripts have any dev-mode features for testing (such as cleaning up and regenerating database values, removing files, etc.), make sure that they cannot be executed on production, or that some sort of confirmation is required.

I had a script on my server that clustered stories from different news sources. The script also had some test methods that deleted all the clustered data and rebuilt it. I once accidentally ran the "cleanup method" on the prod server, and that created a disaster because somehow a cascaded deletion took place. I had to fall back on the replay log to get everything back, which took hours of effort under a lot of pressure. From then on, I placed a check in each of my scripts that asks for confirmation twice before executing any such test method on the prod server.
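
A double-confirmation guard of the kind described here could look roughly like the sketch below. It is not the script from the comment: the host list, the function names (`is_production`, `confirm_destructive`, `cleanup_clusters`), and the choice to make the operator retype the hostname are all assumptions made for illustration.

```python
import socket

# Hypothetical set of production hostnames; adjust to your environment.
PRODUCTION_HOSTS = {"prod-db-1", "prod-web-1"}


def is_production(hostname=None):
    """Return True if the given (or current) hostname is listed as production."""
    if hostname is None:
        hostname = socket.gethostname()
    return hostname in PRODUCTION_HOSTS


def confirm_destructive(action, hostname=None, ask=input):
    """Return True if `action` may run on this host.

    Off production this returns True without prompting. On production the
    operator must type the hostname correctly twice in a row.
    """
    if hostname is None:
        hostname = socket.gethostname()
    if not is_production(hostname):
        return True
    for attempt in (1, 2):
        reply = ask(
            f"[{attempt}/2] '{action}' on PRODUCTION host {hostname}. "
            "Type the hostname to continue: "
        )
        if reply.strip() != hostname:
            return False
    return True


def cleanup_clusters(delete_all, hostname=None, ask=input):
    """Run the destructive cleanup callback only after the guard passes."""
    if not confirm_destructive("cleanup_clusters", hostname, ask):
        return False
    delete_all()
    return True
```

Retyping the hostname, rather than answering "y", is a deliberate choice in this sketch: it is harder to do on autopilot, which is the failure mode the confirmation is meant to catch.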
sysop073 almost 17 years ago
For his last suggestion about coloring the terminal background, it might be easier to just color the name of the machine in the prompt, e.g.: http://i34.tinypic.com/5ecthx.jpg
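
As a rough sketch of the colored-hostname idea, the helper below builds a bash `PS1` string whose hostname is colored according to the machine's role. The role names, the color mapping, and the function name `bash_prompt` are assumptions for illustration; the screenshot linked above may well have been produced differently.

```python
# Hypothetical mapping of machine role to ANSI color code:
# bold red for production, bold yellow for staging, bold green for dev.
COLORS = {"prod": "1;31", "staging": "1;33", "dev": "1;32"}


def bash_prompt(role):
    """Build a bash PS1 value with the hostname colored by role."""
    code = COLORS.get(role, "0")  # unknown roles fall back to no styling
    # \[ and \] mark the enclosed bytes as non-printing, so bash still
    # computes the visible prompt width correctly when editing a line.
    # \u, \h, \w and \$ are bash's own escapes for user, host, cwd and
    # the prompt character.
    return rf"\u@\[\e[{code}m\]\h\[\e[0m\]:\w\$ "


if __name__ == "__main__":
    print(bash_prompt("prod"))
```

The printed string could then be assigned to `PS1` in a shell startup file on each machine, with the role chosen per host.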
a-priori almost 17 years ago
You should also look into software such as Puppet to reduce the amount of manual administration you have to do.