I just finished spending two days with the Eclipse debugger and Wireshark to fix a crash that would happen on every 100th Spring RPC call I made. The fix turned out to be removing a single line of code.<p>This got me thinking that HN probably has a good trove of debugging war stories. What are your favorites? Bonus points if they happened to you.
Sysadmin debugging: Many years ago I was responsible for maintaining a qmail setup for SMTP. We had around 1500 accounts and a mail feed of some 80-90k mails a day, which we were happily chewing through on a 400 MHz P-II with some 256 MB of RAM. Suddenly, the performance of the system began dropping. We only served 2 mails every 2-5 seconds. Raising the number of sending qmail processes did not help.<p>At this point, your mail spool begins to fill quickly. You can't process mails and get them out of the spool, and with 80-90k mails a day combined with qmail's primitive handling of bodies, you run out of disk quickly.<p>The key observation was that the qmail disk-log service used plenty of CPU time. Restarting it didn't help. The next key observation was that most of that time was spent in the kernel. strace(1)! We were using blocking disk writes to sync log changes to disk. Hmmm. It turned out that the log file was so big that many indirections in the inodes were needed. Qmail logs an insane number of lines per mail, so this, combined with hefty disk fragmentation, killed the system. <i>Rotating</i> the log file made the system jump back to 20 processes sending, and we cleared the roughly 400k mails in the queue in a matter of 80 minutes.
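<p>For illustration only, a minimal sketch of size-based log rotation in Python, the general technique the fix above relied on. The helper name `rotate_if_large`, the timestamp suffix, and the threshold are assumptions made for this example; it is not the setup from the story.

```python
import os
import time


def rotate_if_large(log_path, max_bytes, now=None):
    """Rename log_path to a timestamped name once it reaches max_bytes.

    Returns the path of the rotated file, or None if nothing was rotated.
    `now` is an optional epoch timestamp (hypothetical parameter, handy for tests).
    """
    try:
        size = os.path.getsize(log_path)
    except FileNotFoundError:
        return None  # nothing to rotate yet
    if size < max_bytes:
        return None  # still small enough, leave it alone

    stamp = time.strftime("%Y%m%dT%H%M%S", time.gmtime(now))
    rotated = f"{log_path}.{stamp}"
    os.rename(log_path, rotated)

    # Create a fresh, empty log so writers that reopen by name start small.
    open(log_path, "a").close()
    return rotated
```

Something like `rotate_if_large("/var/log/example/current", 10 * 1024 * 1024)` (path hypothetical) could be run periodically. Note that a process which already holds the old file open keeps writing to the renamed file until it reopens the log, so a real setup also has to tell the logger to reopen, or use a logger that rotates on its own.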
Not my bug, but my favorite story is the Mars Pathfinder bug caused by priority inversion:<p><a href="http://research.microsoft.com/~mbj/Mars_Pathfinder/Mars_Pathfinder.html" rel="nofollow">http://research.microsoft.com/~mbj/Mars_Pathfinder/Mars_Path...</a>