Reminds me of the time I wrote a physical simulation engine back in grad school and there was a "minus" sign error. Of course, the error was rare enough that we didn't notice it until after the code was used in a real production environment. Tracking down one minus sign in several hundred thousand lines is a pain. Not to mention the uneasy feeling you get after you solve it: "How was everything ever working correctly before!? What else did we overlook?"
I'm not completely satisfied by the explanation. I still have that uneasy feeling you get when you've fixed a bug but an unsolved mystery remains: "Also, I still don't know why not all consoles connected to that PC froze."
I'm reminded of this story of the folks who worked on LEO hunting down a similarly difficult-to-find bug that was eventually found to be caused by an unrelated external machine: the manager's elevator. <a href="https://www.youtube.com/watch?v=Lrn24SdW64I&t=2m50s" rel="nofollow">https://www.youtube.com/watch?v=Lrn24SdW64I&t=2m50s</a>
I once spent an afternoon tracking down a "bug" where sales tax wasn't being calculated in LedgerSMB, only to find out I had set the tax rate to 0 in the tax interface... OK, it was working as intended. I felt pretty sheepish too.
They could have solved that bug with one developer in ten minutes by just telling the PS3 to generate a core dump and running addr2line.exe on the callstacks in the core dump report.<p>And the report places the blame on the server instead of their code. Clearly it's their code's fault for making blocking socket calls on the main thread.
This looks like an interesting bug. I wonder if there are more bugs like this on the website side, such as analytics tools giving you false or misleading information, or even monitoring or performance tools doing the same?