It’s been more than a decade since I worked in games, but for my entire games career, any and all use of C++ exceptions was strictly disallowed due to other horror stories. I was still bitten in a very similar way by someone’s C++ copy constructor trickery - a stack-corrupting crash that only happened in a release build after playing the game for a while. Like the author, for me this was one of the hardest bugs I ever had to track down, and I ended up writing a tiny release-mode debugger that logged call stacks in order to do it. Once I was able to catch the corruption (after several days of debugging during a crunch weekend), someone on my team noticed the stomped values looked like floating point numbers, and pretty quickly we figured out it was coming from the matrix class trying to be too clever with its reference counting, IIRC. There’d been a team of around a dozen people trying to track this down during overtime, so it hit me once we fixed it that someone’s cute idea that took maybe 10 seconds to write cost several tens of thousands of dollars to fix.
Back in the day I used to consult for Sony Ericsson. We had like 5000 engineers writing C code that ran as a single executable in a single address space(!). Memory corruption was rampant. So rampant in fact that when we finally got an MMU it took months before we could turn it on in release builds, because there were so many memory corruption bugs even in the released product. The software just wouldn’t work unless it could overwrite memory here and there.
Related (and hilarious): <a href="https://scholar.harvard.edu/files/mickens/files/thenightwatch.pdf" rel="nofollow">https://scholar.harvard.edu/files/mickens/files/thenightwatc...</a><p>> What is despair? I have known it—hear my song. Despair is when you’re debugging a kernel driver and you look at a memory dump and you see that a pointer has a value of 7. THERE IS NO HARDWARE ARCHITECTURE THAT IS ALIGNED ON 7. Furthermore, 7 IS TOO SMALL AND ONLY EVIL CODE WOULD TRY TO ACCESS SMALL NUMBER MEMORY. Misaligned, small-number memory accesses have stolen decades from my life.<p>All James Mickens' USENIX articles are fun (for a very specific subset of computer scientist - the kind that would comment on this thread). <a href="https://mickens.seas.harvard.edu/wisdom-james-mickens" rel="nofollow">https://mickens.seas.harvard.edu/wisdom-james-mickens</a>
> The project was quite big (although far from the largest ones); it took 40 minutes to build on my machine.<p>A bit tangential, but I've been crying about the insane Unity project build times for years now, and about how they've taken zero steps to fix them and are instead trying their hardest to sell you cloud builds. Glad to see them having to suffer through what they're inflicting on us for once!<p>Regardless, very good writeup, and yet another reason to never ever under any conditions use exceptions.
This kind of error is a rite of passage in WIN32 programming. For example, to do nontrivial i/o on Windows you have to create an OVERLAPPED object and give it to ReadFile() and WriteFile(), which will return a pending status code and write back to your OVERLAPPED object once the i/o has completed. Usually it makes the most sense to put that object on the stack. So if you return from your function without making sure WIN32 is done with that object, you're going to end up with bugs like this one. You have to call GetOverlappedResult() to do that. That means no throwing or returning until you do. Even if you call CancelIoEx() beforehand, you still need to call the result function. When you mix all that up with your WaitForMultipleObjects() call, it ends up being a whole lot of if statements you could easily get wrong if the ritual isn't burned into your brain.<p>UNIX system calls never do this. The kernel won't keep references to pointers you pass them and write to them later. It just isn't in the DNA. The only exceptions I can think of would be clone(), which is abstracted by the POSIX threads runtime, and Windows-inspired non-standard event i/o system calls like epoll.
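To make the lifetime rule concrete, here is a minimal, hypothetical sketch of the pattern described above (Windows-only; `h` is assumed to be a HANDLE opened with FILE_FLAG_OVERLAPPED, and most error handling is omitted):

```c
#include <windows.h>

BOOL read_block(HANDLE h, void *buf, DWORD len, DWORD *out)
{
    OVERLAPPED ov = {0};           /* lives on this stack frame */
    ov.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
    if (!ov.hEvent) return FALSE;

    BOOL ok = ReadFile(h, buf, len, out, &ov);
    if (!ok && GetLastError() == ERROR_IO_PENDING) {
        /* The kernel now holds a pointer to `ov`. We must NOT return
         * (or throw) until the i/o has been reaped -- otherwise the
         * kernel later writes into a dead stack frame, exactly the
         * class of corruption the article describes. */
        ok = GetOverlappedResult(h, &ov, out, TRUE /* wait */);
    }

    /* Even after CancelIoEx(h, &ov), GetOverlappedResult() still has
     * to be called before `ov` may go out of scope. */
    CloseHandle(ov.hEvent);
    return ok;
}
```

The point of the sketch is only the discipline: no early exit of any kind between issuing the pending i/o and reaping it.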
After trying and failing over several days to track down a squirrely segfault in a C project about 15 years ago, I taught myself Valgrind in order to debug the issue.<p>Valgrind flagged an "invalid write", which I eventually hunted down as a fencepost error in a dependency which overwrote its allocated array by one byte. I recall that it wrote "1" rather than "2", though, haha.<p>> <i>Lesson learnt, folks: do not throw exceptions out of asynchronous procedures if you’re inside a system call!</i><p>The author's debugging skills are impressive and significantly better than mine, but I find this an unsatisfying takeaway. I yearn for a systemic approach to either prevent such issues altogether or to make them less difficult to troubleshoot. The general solution is to move away from C/C++ to memory safe languages whenever possible, but such choices are of course not always realistic.<p>With my project, I started running most of the test suite under Valgrind periodically. That took half an hour to finish rather than a few seconds, but it caught many similar memory corruption issues over the next few years.
Raymond Chen faced a similar situation here: <a href="https://devblogs.microsoft.com/oldnewthing/20240927-00/?p=110320" rel="nofollow">https://devblogs.microsoft.com/oldnewthing/20240927-00/?p=11...</a><p>The problem boils down to usage of stack memory after the memory is given to somebody else.
Scary. I assume standard memory corruption detection tools would also have trouble finding this, as the write is coming from outside the application itself…
Very wild bug. I feel like this is some kind of a worst-case "exceptions bad" lesson, but I've only been doing systems level programming for a couple of years so I'm probably talking out my ass.
I don’t know windows programming and this was a very interesting (nightmare-ish) post.<p>I had a few questions I asked ChatGPT to understand better:
<a href="https://chatgpt.com/share/677411f9-b8a0-8013-8724-8cdff8dc4d3c" rel="nofollow">https://chatgpt.com/share/677411f9-b8a0-8013-8724-8cdff8dc4d...</a><p>Very interesting insights about low level programming in general
> The fix was pretty straightforward: instead of using QueueUserAPC(), we now create a loopback socket to which we send a byte any time we need to interrupt select().<p>This is an absolutely standard trick that is known as "the self-pipe trick". I believe DJB created and named it. It is used for turning APIs not based on file handles/descriptors into events for an event loop based on file handles/descriptors, especially for turning signals into events for select(2)/poll(2)/epoll/kqueue/...
<i>WSPSelect</i>: 'Twas I who wrote "2" to your stack! And I would've gotten away with it, too, if it weren't for that meddling kernel debugger!
A small request: please stop using automatic translation for blog posts or documentation.<p>Especially when I still have English set as the second priority language.
Would memory safe languages avoid these kinds of problems? It seems like a good example of a nightmare bug from memory corruption - 5 days to fix and the author alludes to it keeping them up at night is a pretty strong motivation to avoid memory unsafety IMO.
Everyone here is going on about exceptions being bad, but let's talk about QueueUserAPC(). Yeah, let's throw an asynchronous interrupt to some other thread that might be doing, you know, anything!<p>In the Unix world we have this too, and it's called signals, but every piece of documentation about signals is sure to say "in a signal handler, almost nothing is safe!". You aren't supposed to call printf() in a signal handler. Throwing exceptions is unthinkable.<p>I skimmed the linked QueueUserAPC() documentation page and it says none of this. Exceptions aren't the hand grenade here (though sure, they're nasty) — QueueUserAPC() is.
The second piece of code I wrote for pay was an FFI around a C library, which had callbacks to send incremental data back to the caller. I didn’t understand why the documented examples would re-acquire the object handles every iteration through the loop, so I dropped that. And everything seemed to work until I got to the larger problems, and then I was getting mutually exclusive states in data that was marked as immutable, in some of the objects. I pulled my hair out over this for days.<p>What ended up happening was that if the GC ran inside the callback, the objects the native code could see could move, so the next block of code was smashing the heap by writing to the wrong spots. All the small inputs finished before a GC was triggered and looked fine, but larger ones went into undefined behavior. So dumb.
TLDR: Kernel wrote memory back to a pointer provided by the user-mode program, as it was supposed to do. Unfortunately, it was a dangling pointer (Use-after-free)<p>When the Kernel does the memory write, user-mode memory debuggers don't see it happen.
Set a hardware breakpoint and you'll know immediately. That's what he eventually did, but he should have done so sooner.<p>Beyond that, cancelling an operation is always tricky business with object lifetimes because of the asynchronicity. My approach is to always design my APIs with synchronous cancel semantics, which is sometimes tricky to implement. Many common libraries don't do it right.