One of my favorites:<p>I was working on a large project for a wafer fab company, and occasionally the compiler would crash during full builds with SIGILL (illegal instruction, for those who aren’t familiar with the signal). Compiler bugs are never fun, and this was particularly vexing because it was so inconsistent.<p>It took me a while, but eventually I got around to thinking: What could cause the compiler to execute an illegal instruction? What could cause an illegal instruction at all?<p>I removed the outer case from my computer, and sure enough, all of the fans had died. The CPU was overheating during intense, long-running builds. Replaced the fans and the “bug” went away!<p>*This is my first comment since I created my account in 2009. I hope I did it right! ;-)
Not a hardware bug, but in embedded I ran into a fun one early into my first job. I set up a CI pipeline that took a PR number and used it as the build number in a MAJOR.MINOR.BUILD scheme for our application code. CI pipeline done, everything worked hunky-dory for a while, project continued on. A few months later, our regression tests started failing seemingly randomly. A clue to the issue was that closing the PR and opening a new one with the exact same changes would cause tests to pass. I don’t remember exactly what paths I went down in investigation, but the build number ended up being one of them. Taking the artifacts and testing them manually, build number 100 failed to boot and failed regression, build 101 passed. Every time.
Our application was stored at (example) flash address 0x8008000 or something. The linker script stored the version information in the first few bytes so the bootloader could read the stored app version; then came the reset vector and some more static information before getting to the executable code. Well, it turns out the bootloader wasn’t reading the reset vector - it was jumping to the first address of the application flash and executing the data. The firmware version at the beginning of the app was being executed as instructions. For many values of the firmware version, the instructions the data represented were just garbage - ADD r0 to r1 or something - and the rest of the static data before the first executable code also didn’t happen to have any side effects, but SOMETIMES the build number would be read as an instruction that would send the micro off into lala land: hard fault or some other illegal operation.
Fixed the bootloader to dereference the reset vector as a pointer to a function and moved on!
Early 90s, doing the first implementation of scheduler activations in a real kernel on a real machine. There's an occasional bug that shows up, we think it's a race condition or something. After lots and lots of debugging and thinking, end up in the debugger approaching a line where we think the bug manifests (not caused, but manifests). Looks something like this:<p><pre><code> int g = 2;
if (g) {
printf ("yes\n");
} else {
printf ("no\n");
}
</code></pre>
Obviously most of the time we see "yes", but every once in a while we see "no". Even in the debugger, using stepi, we hit the conditional, we confirm with the debugger that g is indeed non-zero. Totally impossible for the conditional to ever print "no", right?<p>------------<p>Well, when you're writing a re-entrant kernel context switch (as scheduler activations requires), you'd better damn well remember to restore ALL the registers on the processor, in particular the one that stores the result of a recent compare instruction.<p>We had skimped on this tiny step (one extra instruction in the context switch code, IIRC); the kernel is interrupted after the compare instruction but before the jump; scheduler activations dictates switching to a new thread; when we come back to the original thread, the apparent result of the comparison is reversed, and we print "no".<p>At least the paper got an award at Usenix that year :)
Another favorite:<p>Once upon a time, we got a panicked email from a customer whose OmniOutliner file would no longer open. He’d written a novel in it and was understandably keen to not lose his work.<p>Sure enough, when we opened his file with the debugger attached, it crashed immediately. Curiously, the crash was deep inside Apple’s XML parsing code, which we used indirectly by saving the file in their XML-variant of a property list.<p>Looking at the file in a text editor, we eventually found a funny-looking character where there should’ve been an angle bracket (an opening or closing bracket of an XML element). Inspecting it in a hex editor revealed that the difference between the actual character and what it should’ve been was precisely one bit.<p>How on Earth could that happen?! A bit more sleuthing (haha) uncovered more of these aberrations, and it didn’t take long before we realized that they occurred at regular intervals.<p>We patched it up, emailed it back to the customer, and suggested he check his RAM. He soon replied, thanking us but then asking, “How did you know I had bad RAM from my novel?!”
This was my worst bug, in the JVM(!) back in 2002: <a href="https://www.artima.com/forums/flat.jsp?forum=121&thread=10119" rel="nofollow">https://www.artima.com/forums/flat.jsp?forum=121&thread=1011...</a><p>> under JDK1.4.1 once 2036 files are open any subsequent opens will delete the file that was supposed to be opened.<p>Obviously this is bad.<p>It was worse to debug. "Opening files" includes opening Java class files or JARs, so we'd see a system with some class files or JARs missing and spent ages trying to work out why deployment was failing.<p>Then I saw class files disappear in front of me while I was using the system. That was one of the biggest WTF moments of my career. I assumed someone else was on the computer, then I assumed a virus, then hardware corruption.<p>It didn't occur to us for a long time to think the JVM would delete files instead of opening them.<p>Here's the reference in the Java bug database: <a href="https://bugs.java.com/bugdatabase/view_bug.do?bug_id=4779905" rel="nofollow">https://bugs.java.com/bugdatabase/view_bug.do?bug_id=4779905</a>
My most memorable hardware bug was nowhere near as hard as this, but I'll never forget it.<p>Intel was trying to sell the i960 and sent us a dev board with that CPU. Nobody in the company could get it to boot up. It would power up but nothing would show up on the serial port. Eventually it was my turn to look, and for some reason I happened to notice a pullup <i>capacitor</i> on the UART VCC. I looked at the schematics and indeed it was there. A simple jumper to bypass it (back in those days we had big, manly components; none of that surface mount shit) and hey: the serial console responded. It had booted up just fine, but was mute.<p>After that we could do development, but it was immediately clear to me that the 960 was DoA. It's not like we were the first to get that board!
I found a silicon bug in a memory chip once.<p>The chip was supposed to read out a unique ID, but instead read out all zeros. Doubly weird, because it was a flash chip. You’d expect a blank flash chip to spit out all 0xff, not all 0x00.<p>I ran it past the lead EE, and the lead software engineer, and the chip co FAEs, and they all said I must have done something wrong.<p>But they all came back later having repro’ed my demo.<p>Two months of kicking it up the chip co later, I got a nice note from the CEO of that chip company saying “Thanks for the bugfix” - with a bottle of Dom Perignon.<p>That was a cool career highlight.
My recent weird "bug" was when I installed a new Linux distro, just last week, to get away from weird graphical issues with KDE (switched to PopOS for hardware support).<p>On boot, my mouse started moving really erratically. I would try to move it and it would just jump all around the screen, but only with my Razer mouse, not my Logitech one.<p>Great, I think, I traded display issues for mouse driver issues. But it was weird, because it was fine during the live USB session.<p>I spent a bit of time debugging inputs etc., thinking maybe it was a weird driver issue.<p>I suddenly remembered in my HS days when the school ordered new mousepads which had bright yellow lines on them from some logo, making them incompatible with the "new" laser mice.<p>It was some cat hair on the sensor :D
I was working on an Rx drug pricing system right out of college. I couldn’t always get my price calculations to match what a major insurance carrier came up with, and the contract clearly stated the formula. Turned out the big carrier had a bug in their calculations that surfaced only under a specific set of circumstances. I felt very proud of myself for figuring out their bug and did a detailed write-up and submitted it to the carrier. Their response was “yeah we know, we’re not going to fix it though”. That floored me, but I was right out of college and pretty naive hah.
I once (about 10y ago) experienced hardware that got tired. A customer replaced the usual hard disks with shiny new Seagate SMR drives, because they had more storage capacity. Funny thing is that they could not handle the sustained 100MB/s we were feeding them. So after about 20 minutes they started slowing down and after half an hour they stopped working for about 20 minutes and then they were fine again. Obviously the customer complained about our storage product and forgot to mention this small fact. Once we figured it out we had good laugh.
I ran into this one the other day[1] - similarly amusing iOS / macOS debugging (by the guy that wrote much of the modern objc runtime)<p>[1] <a href="http://www.sealiesoftware.com/blog/archive/2010/09/01/Dr_Gregory_Parker_Department_of_Diagnostic_Engineering.html" rel="nofollow">http://www.sealiesoftware.com/blog/archive/2010/09/01/Dr_Gre...</a>
This story is fascinating in a lot of ways, but one which jumps out at me is: I don’t think the <i>particular</i> pre-“aha!” wondering about timing would ever occur to me in the domains I’ve worked. I guess <i>maybe</i> I’d discover it in the repro isolation process because that elimination is often very illuminating (it’s basically how I taught myself to program!), but it wouldn’t ever come to mind unless I was staring at it while debugging.<p>Say what you want about the ills of high level abstractions, but not having to think about the implementation details of clock sync all the way down to the metal is a pretty nice convenience when you can afford it.
One of mine:<p>I was implementing a TCP split proxy (using Adam Dunkels' lwIP stack) on a custom SoC with a 16-way multi-core (ARM+MIPS ISA mishmash) for the data plane. Memory was divided into different regions, each with a specific set of policies. I had gotten my single-core proxy working and then added a mutex to the TCP control block to parallelize my code across all the cores. Testing resulted in a fatal crash. After rolling back the checkins one by one, I narrowed down the problem to the load-link/store-conditional (LL/SC) instructions used to implement the mutex. Now I was stuck with no clue as to why executing these instructions resulted in a crash. Cue me cursing everything about the chip in my cubicle. One of the senior engineers who had been there during the design of the SoC, and hence knew its quirks, heard my lamentation, came over, took a look, and promptly solved the problem. Remember the different policies for the different memory regions I mentioned earlier? It turns out that I had placed my TCP control block, and hence the mutex in it, in a region of memory where LL/SC instructions were inadmissible, thus causing the crash. Moving that data structure to a different region of memory solved the problem.<p>Lesson learned: When working on a custom SoC, take nothing for granted, not even hardware instructions.
Having spent the better part of 30 years working on/with/around embedded systems, I can't even count how many bugs I've bumped into that were hiding in between software and hardware. Or between software and compiler/tools/OS. Or between hardware and spooky RF black magic.
My favorite bug this month was while setting up a development environment with the AVR-ICE.<p>I tried to save some company money by not buying the (optional) case and programming cable assembly -- figured I could just use another, not-$80 SWD cable (I also 3D-printed a case and a SOT-23-6 programming adapter).<p>After much cursing and hair pulling, I noticed that the header for the SWD cable was installed upside-down on the PCB. So the red wire on the ribbon cable was pin 10 instead of pin 1. In their defense, they did correctly indicate this on the solder mask; I just didn't see it through the (opaque) case.<p>My best guess as to why the cable assembly costs $80 is that they again reverse the pin order on it to silently fix the bug on the PCB instead of just shipping a standard cable.<p>It turned out to be worth the engineering time to deal with the bug, but not by as much as I hoped. It's a pretty neat product despite this bug, definitely more modern than the venerable STK500 that I used previously (which itself had been converted to a USB device after the level converter failed).
Worked on a chart-generating service in Java some 20 years ago. At that time IBM released their JVM. Upon first tests it worked perfectly, and significantly faster than Sun's JVM. After testing it further, making tens of thousands of charts, we deployed it to production. However, in production it would stop mysteriously after some time. Added a lot of logging; there were no issues in our code. After a while I realized it failed somewhere after 65536 charts were made! That was pretty suspicious. There was nothing in our code that would overflow some 16-bit counter, it worked under the other JVM, and the crash was not a Java exception. If I remember correctly it was not even a crash at all; the entire process would freeze.<p>It turned out to be a problem with that specific IBM JVM. We created a new thread for each chart, and that JVM froze after 65536 created threads! Moral of the story: if you already test with tens of thousands of requests, make sure you go past 64k.
A decade ago I worked with sortation devices for mail-order companies, and one of our clients reported that they sometimes had issues with items being sorted wrong, but were unable to reproduce it. They used trays for sorting, and each tray had a barcode with a unique ID.<p>I spent a LONG time looking at logs until I ended up enabling debug logging, and because the site was on a 1200 baud modem I had the client burn the logs to a DVD and ship them to us.<p>I ended up writing a piece of Perl code to parse the logs and insert them into a MySQL database where I could then trace the individual sorter trays by ID, and by some obscure miracle of sleep deprivation and too much coffee, I managed to find a correlation.<p>Turns out the bug only showed up when a tray had been used for sorting inbound items, then reused for sorting outbound items, and then used for sorting inbound items again (not outbound, which would have reset it).<p>The fix was traced to a single line in an if/else statement.<p>Time to fix: around 1 hour, including tests.<p>Time to find the bug: around 300 hours.<p>On something more relevant to the article: I used to write operating systems for mobile phones, and we spent A LONG time debugging an issue where our brand new display driver was acting up.<p>After attaching a Lauterbach debugger we finally managed to track it down to the compiler.<p>Turns out:<p><pre><code> int i = 1+2+3;
</code></pre>
would mean i=3 in the code, as the compiler only considered the first two operands in the expression.<p>Another fun feature of that compiler was that when you incremented a heap pointer past the end of a memory page, it would forget to increment the page pointer, meaning it simply wrapped to 0 and the memory you referenced was nowhere near what you'd expect :)
I once did some low-level GPU programming on a project aimed at the Samsung Galaxy S8. It was a phone case with extra features like an iris and fingerprint scanner, connected via the USB port.<p>It would work perfectly on our test phone and occasionally crash on other phones. Long story short, we narrowed it down to crashing on phones with a specific SoC that was used in other parts of the world.<p>For some reason, when you copied an image straight from the phone camera (used to recognize and align eyes compared to the infrared iris camera) to the GPU and tried to access it, it would segfault on the non-western SoC. The data wasn't initialized yet.<p>My (hurried, we were releasing next month) fix was to add a rsdebug("This fixes a crash!\0"); to the code. The extra delay to go to the kernel and back avoided the race condition almost all the time. Someone later fixed my code from 99.99% stable to 100%, but I was on another project by that time, so I have no idea what they did.
My most horrible hardware "bug", the one that drove me up the wall, was a long-forgotten wireless keyboard stuffed in a closet, acting up with my PC whenever the cat decided to visit. The distance was just right that it was very intermittent.
> As a programmer, you learn to blame your code first, second, and third... and somewhere around 10,000th you blame the compiler. Well down the list after that, you blame the hardware.<p>I wish this were the case. The average programmer blames whatever library/third-party/etc. they're using, then somewhere around the 10,000th they might blame their own code.<p>(I run a third-party service and everything is always my fault, even syntax errors.)
A friend of mine had learned to write Linux kernel drivers. We had to write tests for hardware from a big-name manufacturer. We had a USB CD-ROM drive that inexplicably failed some test. We contacted the engineers, and they were very responsive about the firmware; we also performed many mechanical tests. My friend decided to flex his driver-writing muscles and spent a whole day modifying the Linux kernel to carefully investigate, bit by bit, what was being sent from the CD-ROM. After a very long investigation he categorically said: "It can only be the cable."<p>It was the cable indeed. And we could have discovered that with much less effort.
Oh man.<p>I was writing the motor controller code for a new submersible robot my PhD lab was building. We had bought one of the very first compact PCI boards on the market, and it was so new we couldn't find any cPCI motor controller cards, so we bought a different format card and a motherboard that converted between compact PCI bus signals and the signals on the controller boards. The controller boards themselves were based around the LM629, an old but widely used motor controller chip.<p>To interface with the LM629 you have to write to 8-bit registers that are mapped to memory addresses and then read back the result. The 8-bit part is important, because some of the registers are read- or write-only, and reading or writing a register that cannot be read from or written to throws the chip into an error state.<p>LM629s are dead simple, but my code didn't work. It. Did. Not. Work. The chip kept erroring out. I had no idea why. It's almost trivially easy to issue 8-bit reads and writes to specific memory addresses in C. I had been coding in C since I was fifteen years old. I banged my head against it for two weeks.<p>Eventually we packed up the entire thing in a shipping crate and flew to Minneapolis, the site of the company that made the cards. They looked at my code. They thought it was fine.<p>After three days the CEO had pity on us poor grad students and detailed his highly paid digital logic analyst to us for an hour. He carted in a crate of electronics that were probably worth about a million dollars. Hooked everything up. Ran my code.<p>"You're issuing a sixteen-bit read, which is reading both the correct read-only register and the next adjacent register, which is write-only," he said.<p>I showed him in my code where the read in question was very clearly a *CHAR*. 8 bits.<p>"I dunno," he said. "I can only say what the digital logic analyzer shows, which is that you're issuing a sixteen-bit read."<p>Eventually, we found it.
The Intel bridge chip that did the bus conversion had a known bug, which was clearly documented in an 8-point footnote on page 79 of the manual: 8-bit reads were translated to 16-bit reads on the cPCI bus, and then the 8 most significant bits were thrown away.<p>In other words, a hardware bug. One that would only manifest in these <i>very</i> specific circumstances.<p>We fixed it by taking a razor knife to the bus address lines and shifting them to the right by one, and then taking the least significant line and mapping it all the way over to the left, so that even and odd addresses resolved to completely different memory banks. Thus, reads to odd addresses resolved to addresses way outside those the chip was mapped to, and it never saw them. Adjusted the code to the (new) correct address range. Worked like a charm.<p>But I feel bad for the next grad student who had to work on that robot. "You are not expected to understand this."
Here's a couple of my most fun ones.<p>1. Mac build crashes with "illegal instruction" due to AVX512 instruction that the Mac CPU doesn't support. Problem is though that the AVX512 code is in its own file, and this particular function is only called if AVX512 is supported by the CPU. So this code should never even run, and in fact it doesn't, so what gives?<p>Turns out that the AVX file is compiled with -mavx512f (sensible enough), and that this file includes a header that defines:<p><pre><code> const float SQUARE_ROOT_OF_2 = (float)sqrt(2.0f);
</code></pre>
Turns out that GCC compiles this to code including AVX512 instructions, which get executed completely bypassing the "if AVX512 is supported" check.<p>Fix: Change the constant to a numeric value.<p>2. Code crashes oddly on shutdown. Debugging shows destructors run twice.<p>Turns out the project is split into a large number of libraries one of which is 'shared', and includes very general purpose stuff like logging. 'shared' then gets linked into other libraries, which get linked into the resulting binary.<p>When linking statically this has the fun result that libfoo links to libshared, libbar links to libshared, and then libfoo and libbar make up the binary. Now there are two copies of libshared that end up in the binary, and this results in static variables being constructed and destructed twice.
Software but a nice bug: A really long time ago I worked at a company creating a nice portal application in ASP.NET (version 1.1) for a client. It was cool to build. The client did not follow our guide on how to install the application: we told them it should run on a separate machine, and they just crammed it together with 7 other web applications. Since the portal had a login feature where people could edit their resumes, it was quite sensitive. At a certain point we got a call that people were seeing each other's resumes. We used most desktops in our office to simulate the issue and wrote scripts to simulate different users. It took weeks to finally reproduce it.
As it turned out it was not our problem but a bug in ASP.NET... we had lots of calls with multiple offices of Microsoft. At some point we heard nothing back from them. We wrote our own state manager to avoid the issue but that also did not solve it.
A few months later .NET 2.0 came out. One of the items in the release notes was a fix for a bug where too many requests on an ASP.NET server (IIS) would make the http.dll (not sure of the name) serve a cached version of the previous request..
We lost the account and 100K of work that was never paid, and we almost went to court on this one...
> As a programmer, you learn to blame your code first, second, and third... and somewhere around 10,000th you blame the compiler. Well down the list after that, you blame the hardware.<p>So, first - in many settings, the hardware is more likely to be the source of the problem than your compiler; the question is what has more churn - the compiler code or the chip you run on.<p>But regardless - the compiler is much higher than the 10,000th item on the blame list. Even mature, popular compilers have bugs! Hell, they have many known, open bugs! The subtle ones, which don't manifest easily, can stay open for quite a long time. See:<p><a href="https://gcc.gnu.org/bugzilla/" rel="nofollow">https://gcc.gnu.org/bugzilla/</a><p>and:<p><a href="https://bugs.llvm.org/" rel="nofollow">https://bugs.llvm.org/</a><p>I personally have encountered and even filed several of them, and it's not like I was trying. Some of these were even the result of "Why does my code not work?" questions on StackOverflow.<p>One tip, though: play one compiler against another when you begin suspecting your compiler, or the hardware. The buggy behavior will often be different. And of course run multiple times to check for variation in behavior, like the author did.
My favorite one:
The company I once worked for used an outdated version of SQLite (3.8.6) in one of their products. The databases got bigger and bigger, and in a very big project one of the "already known to be slow" queries took more than an hour on my laptop, making the tool unusable.
On a quiet day, I was able to save the temporary table used as part of the process and run the problematic query against it in an isolated fashion.<p>The query returned an extremely high number of results, and when I discovered this I questioned my SQL-fu, my sanity, and my trust in computers.<p>I found that we were hit by a bug that had been fixed 6 years before I discovered it (<a href="https://sqlite.org/src/info/6f2222d550f5b0ee7ed" rel="nofollow">https://sqlite.org/src/info/6f2222d550f5b0ee7ed</a>). SQLite's query planner assumed that a field with a NOT NULL constraint can never be null, which isn't the case for the right-hand table in a LEFT JOIN.<p>I fixed it by adding a not-null check in the query, and later by updating the library. After that the 1-hour query ran in ~700 ms.
Not a "hard" bug but a useful lesson in any case. I worked on a set of stress tests for a major middleware product and came into the office on a Monday morning to check the 72-hour over-weekend runs. We were getting close to release date and things were settling down so I wasn't expecting anything major. Except they'd ALL failed. It took us far longer than I'd care to admit to figure out what had gone wrong - I wasn't working on it non-stop but I definitely remember it taking quite some time. I think it was a colleague who figured it out later that week.<p>Anyway, what had happened was that our Perl test harness was tracking time elapsed in the 72-hour run as seconds since the Unix epoch, but was <i>comparing</i> them using the lexicographical order operator (lt versus <). Everything worked until the time ticked over from 999,999,999 seconds to 1,000,000,000.<p>I just looked up those timestamps to check my memory, and I can now see why fixing it wasn't our top priority that week... the 999999999/1000000000 transition happened the weekend before 9/11.
Many years ago I was working on a device driver for a position sensor. After deployment the customer complained that every time they started another process on the monitoring machine where the driver was installed the position sensor readings registered a slight movement of the object the position sensor was attached to. The object was about 20m away from the monitoring machine and weighed many tons (20?). After hearing the report I remarked that it looked like the first ever documented case of telekinesis. They were not amused.
After a cross-continent trip to the customer site to instrument everything it turned out that due to the physics of the sensor (ultrasound waves traveling in a metal rod, reflecting from a magnet) the exact reading was slightly sensitive to the time elapsed from the previous reading, which in turn depended on the CPU load on the host machine. I fixed that in the driver by making sure the probing of the sensor happened in fixed time intervals, independent of the sensor reading frequency from the user space.
My weirdest was a server that would randomly stop responding to traffic. Debugging for multiple days (including full factory resets) only to figure out that the clip on the network cable had broken and it would disconnect occasionally depending on air flow through the rack. The link light would stay on, so there was no way to tell by looking at it :(
Working in embedded systems nowadays, it's funny to think of a time when a hardware designer would claim it's impossible for it to be a HW bug.<p>Perhaps it was rarer back then, but these days cross-talk is carefully addressed, and my HW designer friends have nightmares about these kinds of issues slipping through.
One of my worst ones was a compiler bug for a PLC which would cause a floating-point operation that underflowed to become NaN instead of 0.0 (which is very common if you are writing set-point tracking code!) and then throw out the loop*, so the controller would reach set point and then slowly start drifting as it accumulated error, but only for the rest of the "line"** of that code. So if you split your calculations across multiple lines then you were fine, but if you tried to group your operations sensibly then it no longer worked.<p>* PLCs do the loop for you; you only write the body<p>** it was some bastardised version of ladder logic (itself a bastard representation of code) with functional blocks, so "line" = rung. I no longer work with PLCs.
I have one!<p>My memory is pretty crap, and I've been around the block a few times so I'm not stating that this truly is my worst bug but it was ... bad.<p>I was working in embedded, developing part of the control software for ... something. The microcontroller had a vendor-developed C compiler, with some extensions. It was a 16-bit chip more or less, so addressing large areas of memory was complicated. The flash was larger than 64 KB, so in order to write all of it you had to use more than 16 bits.<p>Luckily, the vendor compiler had an extension like in the DOS days, where you could add the proprietary "far" modifier to a pointer in order to signal that you wanted lots of range. Like this:<p><pre><code> unsigned char far *ptr = FLASH_BASE;
</code></pre>
or something. Imagine then my surprise when I was looping over the flash (I think I was computing a CRC to validate software integrity, or something) and<p><pre><code> ptr++;
</code></pre>
simply failed to reach all of it. I read the generated code, and the compiler was emitting 16-bit arithmetic, completely ignoring the fantastic "far" modifier. I changed it to something like<p><pre><code> ptr = (unsigned char *) ((uint32_t) ptr + 1);
</code></pre>
and got the proper code, and it worked.<p>At that point, I was like "whoa, I found a compiler bug, gonna report it!" and sent off the details to our field applications engineer from the vendor.<p>...<p>Who came back with "yeah, we know, but we choose this behavior since it gives better performance" or something along those lines.<p>That just completely killed my trust in that vendor, and any interest in working with them again. As a person who has been writing code for close to 40 years, that kind of attitude just blows my mind, and really makes me upset. You're <i>supposed</i> to trust the compiler, a compiler bug should be <i>rare</i>. Correctness is important, these here programming things are hard enough without having the compiler lie to you.<p>Gosh, it makes me upset even now just thinking about it. Heh.<p>*Goes back to hugging open source compilers.*
My hardest bug ever was also hardware related.
Back in 1998, I was working on a game called "Trucks".
When playing over the network, I noticed that the game would sometimes desynchronize.<p>To understand the problem, I had to save tons of logs and manually compare them in order to find out what happened.
After a <i>large</i> effort, I discovered that some floating-point values were different.
Then, I realized that some of our computers were Pentium with the FDIV bug.
I faced a similar quantum bug in my teen years. I was very much into Android custom ROMs and flashing phones. I got my hands on a Galaxy Y, a cheap Android phone running Gingerbread. While flashing the phone, the flash process always failed after around 20-30%. I suspected a loose cable connection and tried again. The subsequent flashes failed even earlier, around 5%. So I waited for a while and tried again. The same loop started: the first flash fails around 20%, and subsequent flashes fail around 5%.<p>During this, I noticed the phone got hotter than I was used to. I suspected the motherboard might be faulty. Then a random idea struck me and I put the phone in the freezer, wrapped in a cloth. After an hour or so, I started the flash process again. The phone stayed wrapped in the cloth beside the table to keep it cool during the process. And lo and behold, the process completed without a hitch.<p>In later years I realised it was surely due to the cheap and substandard flash memory that Samsung shipped in some markets compared to their western counterparts.
I've seen things you people wouldn't believe... JVMs leaking memory on Ericsson and Motorola dumbphones... I watched devs work without debuggers or console. All those moments will be lost in time, like tears in rain... time to retire.
Outside of really weird shit™, the hardest part of software development is working with broken systems. Just today, I was trying to do something that is critical for a large chunk of our software working. I had to:<p>- Realize the API wasn't working in strange ways<p>- Talk to the team, who are unhelpful<p>- Try to figure out what is wrong, but fail<p>- Try alternative ways to do what we need to do<p>- Come to a lot of dead ends, or working solutions that were not viable<p>- Discuss more with our teams<p>- Eventually realize what needed to be done to get the correct output (API is confirmed broken in strange ways)<p>- Implement this, just to continue to do what I am actually trying to accomplish<p>I'd like to see ChatGPT do that :)
I've never seen such annoying ads on any website: the ad size changes every ~30 seconds which rearranges the text flow of the article completely and I get lost.