TechEcho

3 comments

existencebox · almost 8 years ago

I share this story from time to time whenever this question comes up. I'm probably a broken record at this point, but I've always thought it important to set expectations clearly for new devs by being open about my own failures; and after the recent reddit post it seems about time to braindump once again.

I deleted /etc on a live, user-facing production cluster once.

I wrote a script to determine the OS, settings, and a bunch of other bits, and then configure the node appropriately. I sanity checked it for BSD, Ubuntu, Debian, RHEL, all the machines I thought it would run on.

Turns out there was a Solaris cluster.

Long and the short: the software I was configuring installed differently on Solaris, and my script did not properly audit/validate. Upon not finding the right subdirectories when performing a traversal, it declared itself done while still sitting in /etc and nuked the entire dir.

The joking lesson I take from this is a quote my sysadmin mentor told me: "Don't miss."

Less glibly, and more actionably:

- Enumerate your edge cases and failure modes rigorously, both from a "what do I expect" and a "what if" perspective. (Kinda under this bucket: UNDERSTAND YOUR GODDAMN SPEC, AGGRESSIVELY; this is true both in ops and dev.)
- Write your code with the EXPECTATION that bits will fail, and have it self-audit.
- rm * is a big hammer. For all the press dd gets, rm * (and -rf) should be used with care and proper precaution, ESPECIALLY if automated. Have extra "mental flags" to give extra care if you see rm *'s and such in your code.
- PHASED ROLLOUTS.

I'm sure there are more learnings, but those are what come to mind at a thought.

To answer the latter half of your question, the repercussion (and remedy) was my boss going to me: "whelp, you get to send out an outage email, and learn how to rebuild a cluster" (not before calling the other sysadmins into the room, having a brief moment of "let's point and laugh" and then sharing their own explosions, some of which made mine pale in comparison :) )
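
For what it's worth, here is a rough Python sketch of the "self audit" and "rm is a big hammer" points. It is not the original script (that isn't shown anywhere in this thread); the names `PROTECTED`, `UnsafeDelete`, `audit_target`, and `guarded_rmtree` are made up for illustration, and the protected list is only an assumed example. The idea is that the delete refuses to run unless the target exists, sits strictly inside the directory the caller expected, and is not a well-known system path, and that it defaults to a dry run.

```python
import shutil
from pathlib import Path

# Assumed example denylist; a real one would be tailored to the fleet.
PROTECTED = {Path("/"), Path("/etc"), Path("/usr"), Path("/var"), Path("/home")}


class UnsafeDelete(Exception):
    """Raised when a delete target fails the audit."""


def audit_target(target: Path, allowed_root: Path) -> Path:
    """Return the resolved target, or raise if it is not safe to delete."""
    resolved = target.resolve()
    root = allowed_root.resolve()
    if resolved in PROTECTED:
        raise UnsafeDelete(f"{resolved} is a protected directory")
    if not resolved.is_dir():
        # A traversal that did not find what it expected should stop here,
        # not carry on from wherever it happens to be sitting.
        raise UnsafeDelete(f"{resolved} does not exist or is not a directory")
    if resolved == root or root not in resolved.parents:
        raise UnsafeDelete(f"{resolved} is not strictly inside {root}")
    return resolved


def guarded_rmtree(target: Path, allowed_root: Path, dry_run: bool = True) -> bool:
    """Delete target only after the audit passes. Returns True if deleted."""
    resolved = audit_target(target, allowed_root)
    if dry_run:
        print(f"[dry run] would remove {resolved}")
        return False
    shutil.rmtree(resolved)
    return True
```

Making `dry_run=True` the default means an automated caller has to opt in to the destructive path explicitly, which also gives a phased rollout something harmless to run on the first few nodes.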

itamarst · almost 8 years ago

An employee who dropped production on their first day is not at fault; it's the company's fault. I have a similar but not quite as bad story: deploying code that almost brought down our company's main customer. My fault, but the organization was at fault too (though to be fair, we had ops people who shut it off when it caused problems).

So two thoughts:

1. How bad the outcome is doesn't necessarily reflect how big a mistake was. Software is so complex that even small, hard-to-avoid mistakes can cause big problems... and sometimes big mistakes only cause trivial problems. So while big mistakes make good stories (and I'm sure people will post some), every mistake is worth learning from.

2. Most problems are, in the end, not an individual's fault. It's a whole system that failed. So don't just look for what you can do better, though that's important. Figure out where the system broke, and how to make the system better.

If you want to more deliberately learn from mistakes, Gary Klein's book "The Power of Intuition" is really useful.

(I am BTW writing a weekly email with mistakes I've made both programming and in my career. The story I mentioned above is the first email you'd get, and I just sent out the 41st, with plenty more mistakes to come. 20+ years of coding and still more mistakes to make! https://softwareclown.com if you're interested.)

jonrgrover · almost 8 years ago

I used inheritance rather than composition when writing a wrapper for DataTable (before extension methods existed) in C#. I fixed it later for future companies, but the original ran about ten times slower than it should have, and it killed the product. A little while later I offered to come back to the company to fix the mistake. It would only have taken 2 to 3 hours, but by then the product was dead.
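
For readers unfamiliar with the distinction, a minimal sketch of the composition approach follows, written in Python rather than C#. Everything in it is hypothetical: `Table` is a stand-in for a library type and does not mirror DataTable's actual API, and `TableWrapper` is just an illustrative name. The comment above does not say why the inheritance version was slow, so this shows only the shape of the alternative, not the cause of the slowdown.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


class Table:
    """Hypothetical stand-in for a library table type."""

    def __init__(self) -> None:
        self.rows: List[Row] = []

    def add_row(self, row: Row) -> None:
        self.rows.append(row)


class TableWrapper:
    """Wraps a Table by holding one (composition) instead of subclassing it."""

    def __init__(self, table: Table) -> None:
        self._table = table  # the wrapped object stays an implementation detail

    def add(self, **values: Any) -> None:
        # Delegate to the wrapped table rather than overriding its methods.
        self._table.add_row(dict(values))

    def column(self, name: str) -> List[Any]:
        # Convenience helper layered on top; missing values come back as None.
        return [row.get(name) for row in self._table.rows]

    def __len__(self) -> int:
        return len(self._table.rows)
```

Because the wrapper only exposes the handful of operations it needs, the wrapped type can be swapped or optimized later without touching callers.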

Ask HN: What was your worst technical mistake?
