I'm sure it's interesting, but it requires sign-up or sign-in, so I won't be reading it (from there ... I may find an alternate source).<p>Edit:<p>Is it reasonable to post the entire answer here?<p>================================================<p><a href="https://www.quora.com/What-is-the-longest-amount-of-time-you-have-spent-fighting-a-code-bug/answer/Andrew-McGregor-12" rel="nofollow">https://www.quora.com/What-is-the-longest-amount-of-time-you...</a><p>================================================<p>Andrew McGregor, Performance Measurement Lead at Fastly (2019-present)<p>Originally Answered: What is the longest amount of time you spent fighting a code bug?<p>Six years, with eight engineers. What’s more, we found the same bug in Windows, MacOS, FreeBSD and Linux, for about six or seven devices. In the case of the Linux and FreeBSD examples we could fix, the change to fix it required changing two characters in the source code.<p>The bug goes like this:<p>Wi-Fi has something called ad-hoc mode, which is very rarely used these days (probably because this bug is still out there). It allows a group of Wi-Fi devices to form a network together, without an access point, and is really quite cool.<p>We were building large outdoor networks using ad-hoc mode, and we found that after around six weeks of uptime, randomly one device would start to be very slow. The slowness would be contagious; after that first device, every reboot would have a chance of being slow when it came back up, until the whole network would be slow and we would have to switch off all the devices, and all our laptops and test gear that had ever joined the network, and cold-start the whole thing. This was massively inconvenient, as some of the devices were at the top of 45 meter lighting poles in a railway yard where we had to make special arrangements to get access to the power switches…<p>We searched for this bug for years. 
We found dozens of other bugs, and fixed them; some of those fixes have become standard parts of the Linux Wi-Fi stack. We changed to new hardware twice, one of them with chips where we collaborated with the designers during development of the hardware.<p>We discovered many things:<p>* There was a minimum time before which this could not happen.<p>* Wi-Fi tracks the time since the network started; even before the bug showed as performance problems, ours would be claiming to be sixty thousand years old, and getting older by about two thousand years a day.<p>* This is done with a time variable called the TSF that is in units of 802.11 TU, each 1.024 milliseconds, since the time the network was set up.<p>* The slow nodes would be unable to receive for up to 90% of the time, but could transmit fine and were always received properly even by another slow node.<p>* Wi-Fi devices at the time were terrible at selecting good transmitter settings, and we could do much better at that; we fixed that problem, and while it was not stuck slow the network got ten times faster, but this fix actually made the slow node problem worse; the slow nodes were much slower, and the contagion spread faster.<p>One day we got so tired of this problem, we decided that we were going to sit in a conference room with all our kernel developers together, put the source code on the projector screen, and read it all together.<p>The TSF is formally a 64 bit number, but is handled in various places in 24, 32, and 48 bit suffixes, with code having to determine the missing bits.<p>We started with the file that defined the basic data structures of the Wi-Fi stack. We got a few dozen lines into that file, and spotted a line of code that I now can’t find, but it defined the type of variable that would be used to handle time values. And it said that the TSF would be a 32 bit integer. And we all looked at that line of code, and eventually I said “u32 TSF? Wonder if the arithmetic is all correct on that…”. 
We went and looked at every place it was used, and couldn’t figure out if it was or not.<p>So we decided to do the obvious thing, and change it to a 64 bit integer. Then we rebuilt our code and rebooted the network, which took a good week to do.<p>Three months later, the network was still fine and we declared we had fixed it.<p>We tested every Wi-Fi device we could lay our hands on, and about 3/4 of them had the same bug. The ones that could run different operating systems, mostly Apple laptops, sometimes had the bug in two or three operating systems. We reported this problem to everyone we could find: Apple, Microsoft, four chip manufacturers, and so on.<p>It turned out that there were quite a lot of implementations that were much worse: instead of using a 32 bit number, they had used 24 bits, and then their ad-hoc mode networks would fail after 4 hours and 46 minutes…<p>But if you wonder why we have Bluetooth for so many things when Wi-Fi could do just as well or better… this bug is the reason, I believe. Wi-Fi just wasn’t reliable in ad-hoc mode during the critical period of time, and Bluetooth became the way to do these things.
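<p>================================================<p>For anyone wanting to check the numbers: the 4 hours and 46 minutes figure is what you get if a 24 bit counter ticks once per 802.11 TU of 1024 microseconds. Here is a rough sketch of that arithmetic. It is not the driver code from the story (the author says he can no longer find that line); the function names are made up for illustration, and it assumes the counter ticks in whole TUs.

```python
# Hypothetical illustration only: assumes a counter that ticks once per TU.
TU_SECONDS = 1024e-6  # one 802.11 time unit (TU) is 1024 microseconds


def wrap_seconds(bits, tick_seconds=TU_SECONDS):
    """Seconds until an unsigned counter of `bits` bits wraps back to zero."""
    return (1 << bits) * tick_seconds


def elapsed_ticks(earlier, later, bits):
    """Ticks from `earlier` to `later`, computed modulo 2**bits.

    This stays correct across a single wrap, provided the real gap
    is smaller than 2**bits ticks.
    """
    return (later - earlier) & ((1 << bits) - 1)


def format_duration(seconds):
    """Render a duration as days, hours, and minutes (truncated)."""
    days, rem = divmod(int(seconds), 86400)
    hours, rem = divmod(rem, 3600)
    return f"{days}d {hours}h {rem // 60}m"


print(format_duration(wrap_seconds(24)))  # 0d 4h 46m
print(format_duration(wrap_seconds(32)))  # 50d 21h 40m

# A timestamp taken just before a 32 bit wrap and one taken just after:
before, after = 0xFFFFFFF0, 0x00000010
print(after - before)                     # naive subtraction goes negative
print(elapsed_ticks(before, after, 32))   # 32
```

Under the same assumption a 32 bit counter wraps after about 51 days, which is in the same ballpark as the "around six weeks" in the story. The last two lines show the kind of arithmetic that is easy to get wrong with a truncated counter: a plain subtraction across the wrap goes badly wrong, while a modular one does not.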