Software Engineers: What was your biggest ever f*ck up?

17 points by fotoblur about 12 years ago

I just came across this story where a 'junior' engineer truncated his entire prod Users table (http://news.ycombinator.com/item?id=5292591). Every software engineer I've ever talked to has done something that was a major disaster. It would be great to read about your fails too! Also add what lesson you learned!

14 comments

codenut about 12 years ago

My biggest f'up, so far.

It happened in my third week as a junior developer at a very nice startup company. It's kind of my big dream to work at a startup.

It was a Friday morning and I was just starting my day at work (I was working remotely) when suddenly one of the cofounders sent out an email saying that our website was timing out. So I checked our Nagios to see if the website was receiving a large amount of traffic and, surprisingly, I could count the number of connections on my fingers. I was a new hire back then and our lead developer was on a flight back home. My other teammate was not yet online because he was in a different timezone and his workday had not started. So basically I was the only developer available at the time. When the cofounder figured out that I had no idea what was happening, he asked me to just shut down the server so that our customers would not be able to process erroneous transactions. The website was hosted on AWS EC2 and I could not find our Amazon login credentials (either I was too dumb at the time or too nervous, because later I found out that our lead dev had given them to us a week before), so I decided to shut down the server through the CLI, you know, shutdown -h now.

Then the other developer got online and asked me what happened. I told him everything, and he decided to power up the server so that he could investigate the issue. He logged in to the AWS console but he could not find the server. It turned out that the server's shutdown behavior was set to terminate. And yes, I had just destroyed/deleted the server that the website was using. To cut the story short, our lead developer came online and built a new server. But the timeout issue was still there. He found out that it was coming from a MySQL connection and the root cause was a select statement that was very slow. And guess who wrote that query. Yeah, it was me. A new release had been deployed the previous day and that query was used in one of the new features.

The website became operational the following day and everything came back to normal.

The next day I became emotional, and I was so depressed the following week that I handed in my resignation because I felt like I didn't deserve to work for their company. They tried to talk me out of leaving. The lead dev even said nice things to me (that I'm a good coder and that even he would have written the same kind of select query if it had been assigned to him). But my mind was too clouded and I made the very poor judgment to go through with my resignation. And here I am now, stuck in a corporate job, trying to figure things out and get my shit back together, hoping someday I can work at a startup again and not f'up.

drharris about 12 years ago

My first "real job" was at a company that developed equipment for radiological surveys for decommissioning efforts, and after a short time I was given the responsibility of developing a VB5/6 application that turned out to make a lot of money and gain us a lot of favor contract-wise. A few months in, I was assigned to our largest project (and the largest decommissioning project in the US), and traveled back and forth each week.

As someone on the go, I thought it was a good idea to keep the source code for that app on my flash drive (there was no GitHub back then). For 6 months I worked directly on that flash drive, adding new features to support the large project, and expanding the abilities of the application to gain us even more favor. One day, I plugged in the flash drive and Windows gave the warning that it was corrupt and needed to be formatted. Immediately my heart sank, and the drive was indeed dead. My last backup was about 3 months old, and didn't even include some resources like icons and graphics.

Long story short, I had to sit there for weeks and re-code everything I'd lost, using the latest release as a reference for what was missing. On the plus side, my design was probably better the second time around, but nobody was pleased that any new releases would be delayed by a month at least.

I now keep that flash drive, still in its corrupt state, as a permanent fixture on every desk I've worked at since. It's a constant reminder to not be stupid when it comes to time-expensive intellectual property.

danudey about 12 years ago

Ops story: I worked at a data centre which had an IP KVM attached to all of its machines. When you were logged in as 'admin', there was a mode you could toggle that would send all of your keystrokes to every server, but it still only displayed the one you were logged into, so there was no (clear?) visual indication that this was going to happen. A coworker hit Ctrl-Alt-Del to reboot a stuck server, and rebooted every non-Windows server in the data centre (and we only had one Windows server).

Every customer got some level of compensation, the noisy ones got a lot of it, and no one ever logged in as admin again other than to relabel servers in the server list.

mb_72 about 12 years ago

This will show my age but ... as a junior developer, I was responsible for generating the 'gold' floppy disk set for our application. The second disk of five held hundreds of small report template files, and without a post-build defrag of that disk, the install process for the second disk took a couple of hours instead of a few minutes. For one release - you guessed it - I forgot the defrag on the second disk. I passed the disks to another guy for a test install, and later in the day he test-passed the install set and sent it on for duplication. Hundreds of floppy-disk sets were sent out to clients later that week, and then we started getting many irate phone calls about the slow install process. Turns out the testing guy had missed the slow install because he inserted the second disk, then went out to lunch for a couple of hours, and assumed everything had completed quickly when he returned. Lesson learned - have a written checklist for generating installs / deployment (we didn't at that stage).

pindi about 12 years ago

When defining our initial data schema, we forgot to put a unique constraint on user email addresses. There ended up being quite a few duplicates, so before we added the constraint I had to write a query to remove the duplicate users. About 2/3 of our users didn't have an email listed, and my query failed to take that into account, so it wiped out all but one of those users.
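For illustration, here is a minimal sketch of a duplicate cleanup that leaves users without an email alone. It assumes the missing emails were stored as NULL and uses a hypothetical `users(id, email)` table in an in-memory SQLite database; the original schema, database, and query are not given in the comment, so none of these names come from it.

```python
import sqlite3

# Hypothetical schema and sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [
        ("a@example.com",),
        ("a@example.com",),  # duplicate of the first row
        (None,),             # users with no email listed
        (None,),
        ("b@example.com",),
        (None,),
    ],
)

# Keep the lowest id for each non-NULL email. The explicit
# "email IS NOT NULL" guard keeps rows without an email out of the
# delete, so they are not treated as duplicates of each other.
conn.execute(
    """
    DELETE FROM users
    WHERE email IS NOT NULL
      AND id NOT IN (
          SELECT MIN(id) FROM users
          WHERE email IS NOT NULL
          GROUP BY email
      )
    """
)

remaining = conn.execute("SELECT id, email FROM users ORDER BY id").fetchall()
```

Running the same statement as a SELECT first (or inside a transaction that is rolled back) is a cheap way to see how many rows it would touch before committing.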

fotoblur about 12 years ago

My biggest f*ck up: When I worked for a financial institution my manager gave me a production-level username and password to help me get through the mounds of red tape which usually prevented any real work from getting done. We were idealists at the time. Well, I ended up typing that password wrong, more than 3 times... shit, I locked the account. Apparently half of production's apps were using this same account to access various parts of the network. Essentially, I brought down half our infrastructure in one afternoon.

Lesson learned: Don't use the same account for half your production apps. Not really my fault :).

clamattack about 12 years ago

I've had my share of SQL messes but nothing critical (thankfully!). Probably the worst as far as effect goes was a while back in a low-paying dev job. I was under immense pressure to fix some thumbnails for an e-commerce site (as in, if this isn't done in 10 minutes, get your coat and get out). The shop I worked for was getting pressure from the client as they'd put it off for weeks at that point.

So.. I write a quick script to resize the master images and re-generate around 2,000 thumbnails. Except... I copy/paste the source path as the destination - and I mistype the 200px width as 20. Now we have a whole site with long thin product images and no originals to recover from! As in the linked story, no backups were in place and all work was done on production. Lost a week's wages over that, and had to manually re-add everything from a stack of CDs :)

Lesson learned? Don't let pressure force you into making bad decisions. I knew I really shouldn't be doing that but I was young & foolish.

Jeremy1026 about 12 years ago

I work at a medical office management company. We handle the billing, training, hiring, and IT for various medical offices in the area. Included in the IT portion, where I am, is the hub of the electronic medical records. One day while working on a new web application to tie into the EMR system I was fiddling with some SQL. After confirming I was logged into the development database I ran some select statements. I moved to a new query window in SQL Server Management Studio and ran a delete statement on a large (100,000,000+ records) table. I forgot to include a where clause, so the entire table was wiped. Which would have been no big deal because it was the development database and it'd be restored in the overnight copy, except that the 2nd query window was connected to the production database. Oops.

eddiemunster about 12 years ago

- We got a brand new shiny Xbox devkit (one of the silver ones, the only one in the studio), I plugged it in.. BOOM!... oh, it's an American devkit and I plugged it into a British power socket... ooopss..
- Doing a port of PS2 -> Gamecube, one guy asks me 'do we need this assert?' I go 'nah, it'll be fine'... cue a month later when we have an intermittent soak crash after several hours, which I find out would have been caught instantly by the assert I said was ok to remove... took some time to find :/

keefe about 12 years ago

I was under the gun for some client-facing deadline and I had a crash, so I had to rebuild my system. We had registration for our software and nobody was around to give me a key, so I commented out the authentication and call home (not normally in my part of the source tree), then promptly finished my work and committed the whole thing... got caught at the last round of QA, fortunately.

k1kingy about 12 years ago

I managed to code a pretty bad bug that went out and stopped a key module from working in a piece of software.

Funny thing is, it got through a code review, my own personal testing, and QA testing.

Once the problem came to light it was a very obvious quick fix though.

spoiler about 12 years ago

Spent 2 hours trying to fix a bug in the wrong place. I was getting syntax errors, because I typed fi instead of if, and I didn't even realise I typoed it.

tectonic about 12 years ago

rm -rf / some/specific/path
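A small sketch of why that one stray space matters, assuming the intended target was /some/specific/path (the comment does not say so, and the path is only a placeholder). It uses Python's shlex to split both command lines the way a POSIX shell would, without running anything:

```python
import shlex

# Assumed intended command: a single path argument.
intended = shlex.split("rm -rf /some/specific/path")

# Command as typed: the space after "/" splits the path into two
# arguments, and the first one is the filesystem root.
typed = shlex.split("rm -rf / some/specific/path")

print(intended)
print(typed)
```

With the space, rm is handed "/" as its own target and "some/specific/path" as a second, relative one.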

robomartin about 12 years ago

I wouldn't call this a disaster, but: Coding a somewhat complex embedded application entirely in assembler when it should have been done in C from the start. I knew better, but I got going on the project in assembler and didn't stop.

At first, maintaining and expanding functionality was not too hard. As time went by it became harder and harder.

The fix was to stop everything a couple of years after the product had started shipping and take three months to re-write it in C. After that, adding requested features and improving functionality was an absolute breeze.