Definitely the most difficult bug I have ever worked on was a device that would have its flash memory erased from time to time. With a couple of million of those in the field, only about one hundred would be affected every day, meaning the frequency was enough to cause a huge financial issue for the company (bricked devices to be destroyed and replaced, unhappy customers, a PR nightmare) yet not high enough to be observed in a lab environment.<p>We had to set up a couple dozen of these doing operations 24/7 just to be able to observe a single occurrence of the problem maybe once every week or two.<p>The device was built in a way that made it impossible to observe the physical lines between the CPU and the flash chip. This was intended as a security feature but made the whole debugging procedure extremely difficult.<p>The start of the problem could not be linked to any particular change in the software.<p>In the end we tracked the problem to a single decision made a year before the problem started showing up.<p>The decision was to use the UNLOCK BYPASS feature. UNLOCK is a special command sent to the flash chip that kind of validates that the message was not garbled. UNLOCK BYPASS turns off the use of this feature. This was done to improve write performance, as the UNLOCK command slows down writes. With UNLOCK turned off, the flash chip is more likely to interpret noise on its lines as a valid command.<p>This change did not immediately cause problems. Only later did another chip on the board start being used in a slightly different way, which generated much more noise. With the higher noise floor the flash chip would occasionally execute a command that was not sent by the CPU.<p>The irony here is that the rules for the construction of the device required that some of the signal lines be sandwiched in the inner layers of the PCB between two layers of signal lines, to prevent easy access to the inner lines. 
With a 4-layer PCB this ruled out the large ground planes that would have helped control the induced noise that was causing the issue.<p>Whenever we tried to replicate the problem on a set of development devices (devices with the communication lines available for probing), the problem would not show up due to the different layout of the PCB.<p>In the end, debugging the problem took about half a year.
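To make the UNLOCK argument concrete, here is a toy model of a command decoder fed with random bus noise. It is only a sketch under loud assumptions: the `ToyFlash` class, the `ERASE_CMD` byte and the address/data pairs are hypothetical, loosely inspired by the common AMD-style unlock cycles, and do not describe the actual chip from the story.

```python
import random

# Toy values only: loosely inspired by AMD-style unlock cycles,
# NOT a model of any specific flash chip.
UNLOCK_1 = (0x555, 0xAA)
UNLOCK_2 = (0x2AA, 0x55)
ERASE_CMD = 0x80  # hypothetical "destructive command" byte


class ToyFlash:
    def __init__(self, bypass):
        self.bypass = bypass            # True: commands need no unlock cycles
        self.unlock_step = 0            # how many unlock cycles were seen
        self.destructive_commands = 0   # how often an erase was accepted

    def bus_write(self, addr, data):
        if self.bypass or self.unlock_step == 2:
            # The chip is ready to accept a command byte on this write.
            if data == ERASE_CMD:
                self.destructive_commands += 1
            self.unlock_step = 0
        elif self.unlock_step == 0 and (addr, data) == UNLOCK_1:
            self.unlock_step = 1
        elif self.unlock_step == 1 and (addr, data) == UNLOCK_2:
            self.unlock_step = 2
        else:
            self.unlock_step = 0        # anything unexpected resets the sequence


def count_spurious(bypass, n_writes, seed=0):
    """Feed n_writes random (addr, data) pairs, i.e. pure noise, to the chip."""
    rng = random.Random(seed)
    flash = ToyFlash(bypass)
    for _ in range(n_writes):
        flash.bus_write(rng.randrange(0x1000), rng.randrange(0x100))
    return flash.destructive_commands


with_bypass = count_spurious(bypass=True, n_writes=100_000)
without_bypass = count_spurious(bypass=False, n_writes=100_000)
print(with_bypass, without_bypass)
```

In this toy, noise in bypass mode only has to hit one byte value, while with the unlock cycles enabled it has to hit two exact address/data pairs in a row first, which is many orders of magnitude less likely. Real noise is not uniformly random, so the numbers mean nothing beyond illustrating the direction of the effect.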
Two bug stories of limited usefulness:<p>Worked on a tool that used 3D trajectories in an analysis. Some of the output looked strange, but not exactly incorrect (there was no obviously correct answer). Looking at the trajectories used in the analysis, we started thinking some of them had to be wrong. We isolated the "most" incorrect ones and dug into the code. Looking at the x, y, z components pointed us to a few functions. We found a function call with a typo. Instead of f(x,y,z) it was called with f(x,y,y). That one took a day or two to figure out.<p>Working on a tool that plotted satellite trajectories as part of results visualization, there was a strange jump in the orbit plot. Orbits before and after were fine. We eventually narrowed down the jump to crossing a specific date and time, and not at an obvious boundary (some day in the middle of the year if I remember correctly). There was no obvious reason for it to occur (no errors in our calculations or the input data). Eventually, we discovered that a leap second had been added on that date. The libraries we were relying on did not include that leap second since it had only been added recently, but the input data did. That was... frustrating. If I recall correctly, leap seconds are no longer added to time information (thankfully).
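The leap-second mismatch can be reproduced in miniature. This is a sketch, not the tool from the story: `elapsed_seconds` and the hand-written leap tables are hypothetical helpers, and the 2015 date is just one real leap second picked for illustration (Python's own `datetime` arithmetic ignores leap seconds entirely).

```python
from datetime import datetime, timezone

# Instants immediately after a real inserted leap second (23:59:60 UTC).
LEAP_2012 = datetime(2012, 7, 1, tzinfo=timezone.utc)
LEAP_2015 = datetime(2015, 7, 1, tzinfo=timezone.utc)


def elapsed_seconds(start, end, leap_table):
    """Elapsed SI seconds between two UTC datetimes, given a leap table."""
    naive = (end - start).total_seconds()             # ignores leap seconds
    leaps = sum(1 for t in leap_table if start < t <= end)
    return naive + leaps


start = datetime(2015, 6, 30, 23, 59, 0, tzinfo=timezone.utc)
end = datetime(2015, 7, 1, 0, 1, 0, tzinfo=timezone.utc)

# A library whose table predates the 2015 leap second...
stale = elapsed_seconds(start, end, [LEAP_2012])
# ...versus input data that already accounts for it.
fresh = elapsed_seconds(start, end, [LEAP_2012, LEAP_2015])

print(stale, fresh)  # the two disagree by exactly one second
```

One second sounds harmless, but if the satellite were in low Earth orbit, moving at several kilometres per second, a one-second disagreement between library and data would show up as a visible jump exactly at that date.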
> 2. Stabilize, Isolate, and Minimize<p>I feel this section is a bit neglected in terms of clarity and emphasis; it brushes on a lot of advanced details but loses sight of the basics important to a novice... Anecdotally, but with more than one sample, the most common rudimentary mistake I see is someone attempting to isolate a bug upside-down, i.e. poking around in small portions of code that do not comprise the whole system involved in the symptom, without any prior reason to be confident that the portion of code is involved.<p>In 99% of cases, starting with the whole system involved in reliably reproducing the symptom and then bisecting, or using some guided divide-and-conquer technique, will get to the interesting parts of the code fastest, and with a confidence that frees you to focus on elusive bugs. Yes, this is not guaranteed to work, e.g. notoriously difficult bugs with multiple disparate factors, or timing issues, will evade this technique, but you would still attempt it before resorting to more challenging methods.<p>Perhaps most teachers fail to emphasize this explicitly because it seems so obvious to them and seems implicit in the word "isolate", but if I had to pick one single concept for novices, it would be this one.
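For novices, the bisecting step might look something like the sketch below. The `first_bad` helper and the numbered "changes" are hypothetical. It assumes you already have a reliable reproduction (`is_bad`) and that once the bug appears it stays in every later change.

```python
def first_bad(n, is_bad):
    """Find the smallest i in range(n) for which is_bad(i) is True.

    Assumes is_bad is False up to some point and True from there on,
    and that is_bad(n - 1) is True (the whole system shows the symptom).
    Returns (index, number_of_checks).
    """
    lo, hi = 0, n - 1          # invariant: the first bad index is in [lo, hi]
    checks = 0
    while lo < hi:
        mid = (lo + hi) // 2
        checks += 1
        if is_bad(mid):
            hi = mid           # bug is at mid or earlier
        else:
            lo = mid + 1       # bug was introduced after mid
    return lo, checks


# Hypothetical history of 1000 changes where change 637 introduced the bug.
index, checks = first_bad(1000, lambda i: i >= 637)
print(index, checks)
```

The point is the cost: about ten reproductions to pin down one change out of a thousand, and each answer is trustworthy because every check ran the whole system rather than a fragment someone guessed at.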
One of my favorite types of bugs in C is what my professor called 'unlucky bugs'<p>A bug in the code that rearranging your code can either fix or create. It's possible if you have a static array initialized to the wrong length.<p>Changing the arrangement offsets the initialization segment of memory, which is why this is possible.
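Since the real thing is undefined behavior in C and depends on the compiler and linker, here is only a toy simulation of the idea in Python. The flat `memory` bytearray, the `run` helper and the variable names are all hypothetical; it mimics one possible outcome of writing past a too-short array, nothing more.

```python
def run(layout):
    """Lay variables out back to back, then write 5 bytes into a 4-byte 'buf'."""
    offsets, pos = {}, 0
    for name, size in layout:
        offsets[name] = pos
        pos += size
    memory = bytearray(pos + 1)          # one spare byte keeps the toy in range
    memory[offsets["counter"]] = 42      # a value the program relies on
    for i in range(5):                   # bug: buf only has room for 4 bytes
        memory[offsets["buf"] + i] = 0xFF
    return memory[offsets["counter"]]


# Same buggy code, two arrangements of the same variables.
unlucky = run([("buf", 4), ("counter", 1), ("padding", 1)])
lucky = run([("buf", 4), ("padding", 1), ("counter", 1)])
print(unlucky, lucky)
```

In the first arrangement the stray byte lands on `counter` and the program misbehaves; in the second it lands on unused padding and the bug is still there but invisible. That is why an unrelated rearrangement can appear to "fix" or "create" it.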
If you're interested in a deeper dive into this, Andreas Zeller's Udacity course is excellent: <a href="https://www.udacity.com/course/software-debugging--cs259" rel="nofollow">https://www.udacity.com/course/software-debugging--cs259</a><p>Despite the name, it's not a "how to drive PDB/GDB/JDB/etc." course, but focuses on the higher level concepts of how to identify bugs and build tools that automate the debugging process.
One of the best debugging tools is to consider the application layers. Item 2 in the list "Stabilize, Isolate, and Minimize" touches on this, but to be more specific: consider your system as a call-stack or as an application stack. Can you find the last place in the system where things work as expected?
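A rough sketch of that question as code, with made-up layers: each layer gets a cheap sanity check on its output, and you walk the stack until the first check fails. The `last_good_layer` helper, the three layers and the deliberately planted conversion bug are all hypothetical.

```python
def last_good_layer(layers, value):
    """Feed value through the layers in order.

    Returns (last layer whose output passed its check,
             first layer whose output failed, or None if all passed).
    """
    last_good = None
    for name, step, check in layers:
        value = step(value)
        if not check(value):
            return last_good, name
        last_good = name
    return last_good, None


layers = [
    # (name, what the layer does, what "works as expected" means here)
    ("parse", lambda s: float(s), lambda v: isinstance(v, float)),
    # Planted bug: the factor should be 5 / 9, not 9 / 5.
    ("to_celsius", lambda f: (f - 32) * 9 / 5, lambda c: -90 <= c <= 60),
    ("format", lambda c: f"{c:.1f} C", lambda s: s.endswith(" C")),
]

good, bad = last_good_layer(layers, "98.6")
print(good, bad)
```

Here the walk reports that `parse` is the last place where things work as expected and `to_celsius` is the first where they don't, which narrows the search to one layer before any code is read.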
I needed a link to something like this while writing the JavaScript debugging tutorial for Chrome DevTools. I think I initially tried to start the tutorial off with conceptual information like this “How To Debug” article but eventually scrapped it because the conceptual preamble was longer than the tutorial itself.
Steve Litt has a Universal Troubleshooting Process which he has been promoting since 1996:<p><a href="http://www.troubleshooters.com/tuni.htm" rel="nofollow">http://www.troubleshooters.com/tuni.htm</a>
> How to Debug<p>Don't.<p>Write tests, assertions and logs to understand what happened when something went wrong. Do not ever 'debug'; it is a waste of your time.