Debugging memory corruption: who the hell writes "2" into my stack? (2016)

369 points by pierremenard 5 months ago

25 comments

dahart 5 months ago
It’s been more than a decade since I worked in games, but for my entire games career, any and all use of C++ exceptions was strictly disallowed due to other horror stories. I was still bitten in a very similar way by someone’s C++ copy constructor trickery - a crash with a stack corruption that only happened in a release build after playing the game for a while. Like the author, for me this was one of the hardest bugs I ever had to track down, and I ended up writing a tiny release mode debugger that logged call stacks in order to do it. Once I was able to catch the corruption (after several days of debugging during a crunch weekend), someone on my team noticed the stomp values looked like floating point numbers, and pretty quickly we figured out it was coming from the matrix class trying to be too clever with its reference counting, IIRC. There’d been a team of around a dozen people trying to track this down during overtime, so it suddenly hit me once we fixed it that someone’s cute idea that took maybe 10 seconds to write cost several tens of thousands of dollars to fix.
bjornsing 5 months ago
Back in the day I used to consult for Sony Ericsson. We had like 5000 engineers writing C code that ran as a single executable in a single address space(!). Memory corruption was rampant. So rampant in fact that when we finally got an MMU it took months before we could turn it on in release builds, because there were so many memory corruption bugs even in the released product. The software just wouldn’t work unless it could overwrite memory here and there.
jacinda 5 months ago
Related (and hilarious): https://scholar.harvard.edu/files/mickens/files/thenightwatch.pdf

> What is despair? I have known it - hear my song. Despair is when you’re debugging a kernel driver and you look at a memory dump and you see that a pointer has a value of 7. THERE IS NO HARDWARE ARCHITECTURE THAT IS ALIGNED ON 7. Furthermore, 7 IS TOO SMALL AND ONLY EVIL CODE WOULD TRY TO ACCESS SMALL NUMBER MEMORY. Misaligned, small-number memory accesses have stolen decades from my life.

All James Mickens' USENIX articles are fun (for a very specific subset of computer scientist - the kind that would comment on this thread). https://mickens.seas.harvard.edu/wisdom-james-mickens
alexvitkov 5 months ago
> The project was quite big (although far from the largest ones); it took 40 minutes to build on my machine.

A bit tangential, but I've been crying about the insane Unity project build times for years now, and about how they've taken zero steps to fix them and are instead trying their hardest to sell you cloud builds. Glad to see them having to suffer through what they're inflicting on us for once!

Regardless, very good writeup, and yet another reason to never ever under any conditions use exceptions.
jart 5 months ago
This kind of error is a rite of passage with WIN32 programming. For example, to do nontrivial i/o on Windows you have to create an OVERLAPPED object and give it to ReadFile() and WriteFile(), which will return a pending status code and write back to your OVERLAPPED object once the i/o has completed. Usually it makes the most sense to put that object on the stack. So if you return from your function without making sure WIN32 is done with that object, you're going to end up with bugs like this one. You have to call GetOverlappedResult() to do that. That means no throwing or returning until you do. Even if you call CancelIoEx() beforehand, you still need to call the result function. When you mix all that up with your WaitForMultipleObjects() call, it ends up being a whole lot of if statements you could easily get wrong if the ritual isn't burned into your brain.

UNIX system calls never do this. The kernel won't keep references to pointers you pass it and write to them later. It just isn't in the DNA. The only exceptions I can think of would be clone(), which is abstracted by the POSIX threads runtime, and Windows-inspired non-standard event i/o system calls like epoll.
rectang 5 months ago
After trying and failing over several days to track down a squirrely segfault in a C project about 15 years ago, I taught myself Valgrind in order to debug the issue.

Valgrind flagged an "invalid write", which I eventually hunted down as a fencepost error in a dependency which overran its allocated stack array by one byte. I recall that it wrote "1" rather than "2", though, haha.

> Lesson learnt, folks: do not throw exceptions out of asynchronous procedures if you’re inside a system call!

The author's debugging skills are impressive and significantly better than mine, but I find this an unsatisfying takeaway. I yearn for a systemic approach to either prevent such issues altogether or to make them less difficult to troubleshoot. The general solution is to move away from C/C++ to memory safe languages whenever possible, but such choices are of course not always realistic.

With my project, I started running most of the test suite under Valgrind periodically. That took half an hour to finish rather than a few seconds, but it caught many similar memory corruption issues over the next few years.
mhogomchungu 5 months ago
Raymond Chen faced a similar situation here: https://devblogs.microsoft.com/oldnewthing/20240927-00/?p=110320

The problem boils down to using stack memory after that memory has been given to somebody else.
saagarjha 5 months ago
Scary. I assume standard memory corruption detection tools would also have trouble finding this, as the write is coming from outside the application itself…
lionkor 5 months ago
Very wild bug. I feel like this is some kind of a worst-case "exceptions bad" lesson, but I've only been doing systems level programming for a couple of years so I'm probably talking out my ass.
hrtk 5 months ago
I don’t know Windows programming and this was a very interesting (nightmare-ish) post.

I had a few questions I asked ChatGPT to understand better: https://chatgpt.com/share/677411f9-b8a0-8013-8724-8cdff8dc4d3c

Very interesting insights about low level programming in general.
cryptonector 5 months ago
> The fix was pretty straightforward: instead of using QueueUserAPC(), we now create a loopback socket to which we send a byte any time we need to interrupt select().

This is an absolutely standard trick that is known as "the self-pipe trick". I believe DJB created and named it. It is used for turning APIs not based on file handles/descriptors into events for an event loop based on file handles/descriptors, especially for turning signals into events for select(2)/poll(2)/epoll/kqueue/...
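
For readers who haven't seen the trick, here is a minimal sketch of the idea in Python, not the code from the article. It assumes a socket pair stands in for the loopback socket, and the names `wait_for_data_or_wakeup`, `data_sock`, `wakeup_recv` and `demo` are all hypothetical. The waiting thread includes one end of the wakeup pair in its select() set, and any other thread interrupts it by sending a single byte to the other end.

```python
import select
import socket
import threading


def wait_for_data_or_wakeup(data_sock, wakeup_recv, timeout=5.0):
    """Block in select() until data arrives or another thread wakes us up."""
    readable, _, _ = select.select([data_sock, wakeup_recv], [], [], timeout)
    if wakeup_recv in readable:
        wakeup_recv.recv(1)  # drain the wakeup byte so the next select() blocks again
        return "interrupted"
    if data_sock in readable:
        return "data"
    return "timeout"


def demo():
    # Hypothetical stand-ins: data_a/data_b play the role of a real network
    # connection; wakeup_recv/wakeup_send are the "self-pipe".
    data_a, data_b = socket.socketpair()
    wakeup_recv, wakeup_send = socket.socketpair()
    try:
        # Another thread interrupts the select() by sending a single byte.
        waker = threading.Timer(0.1, wakeup_send.send, args=(b"\x00",))
        waker.start()
        result = wait_for_data_or_wakeup(data_a, wakeup_recv)
        waker.join()
        return result
    finally:
        for s in (data_a, data_b, wakeup_recv, wakeup_send):
            s.close()
```

The point of the design is that the interruption arrives as ordinary readable data, so nothing runs asynchronously on the waiting thread's stack and there is nothing to unwind out of the middle of a system call.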
danaris 5 months ago
WSPSelect: 'Twas I who wrote "2" to your stack! And I would've gotten away with it, too, if it weren't for that meddling kernel debugger!
cyberax 5 months ago
A small request: please stop using automatic translation for blog posts or documentation.

Especially when I still have English set as the second priority language.
AshleysBrain 5 months ago
Would memory safe languages avoid these kinds of problems? It seems like a good example of a nightmare bug from memory corruption - 5 days to fix and the author alludes to it keeping them up at night is a pretty strong motivation to avoid memory unsafety IMO.
rramadass 5 months ago
A story for the ages. That is some hardcore debugging involving everything, viz. userland, system calls, the kernel, disassembly, etc.
glandium 5 months ago
I wonder if Time Travel Debugging would have helped narrow it down.
tomsmeding 5 months ago
Everyone here is going on about exceptions bad, but let's talk about QueueUserAPC(). Yeah, let's throw an asynchronous interrupt to some other thread that might be doing, you know, anything!

In the Unix world we have this too, and it's called signals, but every piece of documentation about signals is sure to say "in a signal handler, almost nothing is safe!". You aren't supposed to call printf() in a signal handler. Throwing exceptions is unthinkable.

I skimmed the linked QueueUserAPC() documentation page and it says none of this. Exceptions aren't the hand grenade here (though sure, they're nasty); QueueUserAPC() is.
DustinBrett 5 months ago
It was just a dream, there's no such thing as 2.
hinkley 5 months ago
The second piece of code I wrote for pay was an FFI around a C library, which had callbacks to send incremental data back to the caller. I didn’t understand why the documented examples would re-acquire the object handles every iteration through the loop, so I dropped them. And everything seemed to work until I got to the larger problems, and then I was getting mutually exclusive states in data that was marked as immutable in some of the objects. I pulled my hair out over this for days.

What ended up happening is that if the GC ran inside the callback, then the objects the native code could see could move, and so the next block of code was smashing the heap by writing to the wrong spots. All the small inputs finished before a GC was called and looked fine, but larger ones went into undefined behavior. So dumb.
pantalaimon 5 months ago
> Jemand hatte das Gemächt meines Wächters angefasst - und es war definitiv kein Freund.

That auto-translation is something else
explosion-s 5 months ago
> Somebody had been touching my sentinel’s privates - and it definitely wasn’t a friend

Gotta love programmers out of context
hun3 5 months ago
(2016)
diekhans 5 months ago
Nicely written (and executed). Worse than my worst memory corruption.
Dwedit 5 months ago
TLDR: The kernel wrote memory back to a pointer provided by the user-mode program, as it was supposed to do. Unfortunately, it was a dangling pointer (use-after-free).

When the kernel does the memory write, user-mode memory debuggers don't see it happen.
mgaunard 5 months ago
Set a hardware breakpoint and you'll know immediately. That's what he eventually did, but he should have done so sooner.

Then obviously, cancelling an operation is always tricky business with respect to lifetimes, because of the asynchronicity. My approach is to always design my APIs with synchronous cancel semantics, which is sometimes tricky to implement. Many common libraries don't do it right.
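
To make "synchronous cancel semantics" concrete, here is a rough Python sketch under stated assumptions; it is not any real library's API. The class `BufferFill` and its `cancel()` method are hypothetical: a worker thread keeps writing into a caller-owned buffer, and `cancel()` returns only once the worker has exited, so the caller can reuse or release the buffer right after the call.

```python
import threading


class BufferFill:
    """Hypothetical async operation that writes into a caller-owned buffer."""

    def __init__(self, buffer):
        self._buffer = buffer
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def _run(self):
        # Keep writing "2" into the caller's buffer until asked to stop.
        i = 0
        while not self._stop.is_set():
            self._buffer[i % len(self._buffer)] = 2
            i += 1

    def cancel(self):
        """Synchronous cancel: returns only after the worker has exited,
        so no write to the buffer can happen after this call."""
        self._stop.set()
        self._thread.join()

    def is_running(self):
        return self._thread.is_alive()


def demo():
    buffer = bytearray(8)
    op = BufferFill(buffer)
    op.cancel()
    # The buffer is exclusively ours again; clear it and check it stays clear.
    for i in range(len(buffer)):
        buffer[i] = 0
    return bytes(buffer) == bytes(8) and not op.is_running()
```

An asynchronous cancel would return right after setting the stop flag, leaving a window in which the worker can still write into memory the caller believes it owns, which is the shape of the bug in the article.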