It’s been more than a decade since I worked in games, but for my entire games career, any and all use of C++ exceptions was strictly disallowed due to other horror stories. I was still bitten in a very similar way by someone’s C++ copy constructor trickery - a stack-corrupting crash that only happened in a release build after playing the game for a while. Like the author, for me this was one of the hardest bugs I ever had to track down, and I ended up writing a tiny release-mode debugger that logged call stacks in order to do it. Once I was able to catch the corruption (after several days of debugging during a crunch weekend), someone on my team noticed the stomped values looked like floating point numbers, and pretty quickly we figured out it was coming from the matrix class trying to be too clever with its reference counting, IIRC. There’d been a team of around a dozen people trying to track this down during overtime, so it hit me once we fixed it that someone’s cute idea that took maybe 10 seconds to write cost several tens of thousands of dollars to fix.
Back in the day I used to consult for Sony Ericsson. We had like 5000 engineers writing C code that ran as a single executable in a single address space(!). Memory corruption was rampant. So rampant in fact that when we finally got an MMU it took months before we could turn it on in release builds, because there were so many memory corruption bugs even in the released product. The software just wouldn’t work unless it could overwrite memory here and there.
Related (and hilarious): <a href="https://scholar.harvard.edu/files/mickens/files/thenightwatch.pdf" rel="nofollow">https://scholar.harvard.edu/files/mickens/files/thenightwatc...</a><p>> What is despair? I have known it—hear my song. Despair is when you’re debugging a kernel driver and you look at a memory dump and you see that a pointer has a value of 7. THERE IS NO HARDWARE ARCHITECTURE THAT IS ALIGNED ON 7. Furthermore, 7 IS TOO SMALL AND ONLY EVIL CODE WOULD TRY TO ACCESS SMALL NUMBER MEMORY. Misaligned, small-number memory accesses have stolen decades from my life.<p>All James Mickens' USENIX articles are fun (for a very specific subset of computer scientist - the kind that would comment on this thread). <a href="https://mickens.seas.harvard.edu/wisdom-james-mickens" rel="nofollow">https://mickens.seas.harvard.edu/wisdom-james-mickens</a>
> The project was quite big (although far from the largest ones); it took 40 minutes to build on my machine.<p>A bit tangential, but I've been crying about the insane Unity project build times for years now, and about how they've taken zero steps to fix them and are instead trying their hardest to sell you cloud builds. Glad to see them having to suffer through what they're inflicting on us for once!<p>Regardless, very good writeup, and yet another reason to never ever under any conditions use exceptions.
This kind of error is a rite of passage in WIN32 programming. For example, to do nontrivial i/o on Windows you have to create an OVERLAPPED object and give it to ReadFile() and WriteFile(), which will return a pending status code and write back to your OVERLAPPED object once the i/o has completed. Usually it makes the most sense to put that object on the stack. So if you return from your function without making sure WIN32 is done with that object, you're going to end up with bugs like this one. You have to call GetOverlappedResult() to do that. That means no throwing or returning until you do. Even if you call CancelIoEx() beforehand, you still need to call the result function. When you mix all that up with your WaitForMultipleObjects() call, it ends up being a whole lot of if statements you could easily get wrong if the ritual isn't burned into your brain.<p>UNIX system calls never do this. The kernel won't keep references to pointers you pass them and write to them later. It just isn't in the DNA. The only exceptions I can think of would be clone(), which is abstracted by the POSIX threads runtime, and Windows-inspired non-standard event i/o system calls like epoll.
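To make the lifetime rule concrete, here is a minimal, hypothetical sketch of the pattern described above (Windows-only; `h` is assumed to be a HANDLE opened with FILE_FLAG_OVERLAPPED, and most error handling is omitted):

```c
#include <windows.h>

BOOL read_block(HANDLE h, void *buf, DWORD len, DWORD *out)
{
    OVERLAPPED ov = {0};           /* lives on this stack frame */
    ov.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
    if (!ov.hEvent) return FALSE;

    BOOL ok = ReadFile(h, buf, len, out, &ov);
    if (!ok && GetLastError() == ERROR_IO_PENDING) {
        /* The kernel now holds a pointer to `ov`. We must NOT return
         * (or throw) until the i/o has been reaped -- otherwise the
         * kernel later writes into a dead stack frame, exactly the
         * class of corruption the article describes. */
        ok = GetOverlappedResult(h, &ov, out, TRUE /* wait */);
    }

    /* Even after CancelIoEx(h, &ov), GetOverlappedResult() still has
     * to be called before `ov` may go out of scope. */
    CloseHandle(ov.hEvent);
    return ok;
}
```

The point of the sketch is only the discipline: no early exit of any kind between issuing the pending i/o and reaping it.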
After trying and failing over several days to track down a squirrely segfault in a C project about 15 years ago, I taught myself Valgrind in order to debug the issue.<p>Valgrind flagged an "invalid write", which I eventually hunted down as a fencepost error in a dependency which overwrote its allocated array by one byte. I recall that it wrote "1" rather than "2", though, haha.<p>> <i>Lesson learnt, folks: do not throw exceptions out of asynchronous procedures if you’re inside a system call!</i><p>The author's debugging skills are impressive and significantly better than mine, but I find this an unsatisfying takeaway. I yearn for a systemic approach to either prevent such issues altogether or to make them less difficult to troubleshoot. The general solution is to move away from C/C++ to memory safe languages whenever possible, but such choices are of course not always realistic.<p>With my project, I started running most of the test suite under Valgrind periodically. That took half an hour to finish rather than a few seconds, but it caught many similar memory corruption issues over the next few years.
Raymond Chen faced a similar situation here: <a href="https://devblogs.microsoft.com/oldnewthing/20240927-00/?p=110320" rel="nofollow">https://devblogs.microsoft.com/oldnewthing/20240927-00/?p=11...</a><p>The problem boils down to usage of stack memory after the memory is given to somebody else.
Scary. I assume standard memory corruption detection tools would also have trouble finding this, as the write is coming from outside the application itself…
Very wild bug. I feel like this is some kind of a worst-case "exceptions bad" lesson, but I've only been doing systems level programming for a couple of years so I'm probably talking out my ass.
I don’t know windows programming and this was a very interesting (nightmare-ish) post.<p>I had a few questions I asked ChatGPT to understand better:
<a href="https://chatgpt.com/share/677411f9-b8a0-8013-8724-8cdff8dc4d3c" rel="nofollow">https://chatgpt.com/share/677411f9-b8a0-8013-8724-8cdff8dc4d...</a><p>Very interesting insights about low level programming in general
> The fix was pretty straightforward: instead of using QueueUserAPC(), we now create a loopback socket to which we send a byte any time we need to interrupt select().<p>This is an absolutely standard trick that is known as "the self-pipe trick". I believe DJB created and named it. It is used for turning APIs not based on file handles/descriptors into events for an event loop based on file handles/descriptors, especially for turning signals into events for select(2)/poll(2)/epoll/kqueue/...
<i>WSPSelect</i>: 'Twas I who wrote "2" to your stack! And I would've gotten away with it, too, if it weren't for that meddling kernel debugger!
A small request: please stop using automatic translation for blog posts or documentation.<p>Especially when I still have English set as the second priority language.
Would memory safe languages avoid these kinds of problems? It seems like a good example of a nightmare bug from memory corruption - 5 days to fix and the author alludes to it keeping them up at night is a pretty strong motivation to avoid memory unsafety IMO.
Everyone here is going on about exceptions being bad, but let's talk about QueueUserAPC(). Yeah, let's throw an asynchronous interrupt to some other thread that might be doing, you know, anything!<p>In the Unix world we have this too, and it's called signals, but every piece of documentation about signals is sure to say "in a signal handler, almost nothing is safe!". You aren't supposed to call printf() in a signal handler. Throwing exceptions is unthinkable.<p>I skimmed the linked QueueUserAPC() documentation page and it says none of this. Exceptions aren't the hand grenade here (though sure, they're nasty) — QueueUserAPC() is.
The second piece of code I wrote for pay was an FFI around a C library, which had callbacks to send incremental data back to the caller. I didn’t understand why the documented examples would re-acquire the object handles every iteration through the loop, so I dropped that. And everything seemed to work until I got to the larger problems, and then I was getting mutually exclusive states in data that was marked as immutable, in some of the objects. I pulled my hair out over this for days.<p>What ended up happening was that if the GC ran inside the callback, the objects the native code could see could move, so the next block of code was smashing the heap by writing to the wrong spots. All the small inputs finished before a GC was triggered and looked fine, but larger ones went into undefined behavior. So dumb.
TLDR: Kernel wrote memory back to a pointer provided by the user-mode program, as it was supposed to do. Unfortunately, it was a dangling pointer (Use-after-free)<p>When the Kernel does the memory write, user-mode memory debuggers don't see it happen.
Set a hardware breakpoint and you'll know immediately. That's what he eventually did, but he should have done so sooner.<p>Beyond that, cancelling an operation is always tricky business with object lifetimes because of the asynchronicity. My approach is to always design my APIs with synchronous cancel semantics, which is sometimes tricky to implement. Many common libraries don't do it right.