“Unexplainable” core dump (2011)

193 points by curling_grad over 2 years ago

14 comments

Izkata over 2 years ago

> Our code and compilers are constantly changing, and the problem disappeared as suddenly as it appeared ... only to happen again 2 years later in a completely unrelated executable.

It does not encourage me how much this sounds like the short story "Coding Machines". The original post even happened right about 2 years after the short story was posted, and then, per that comment, the problem recurred after another 2 years.

https://www.teamten.com/lawrence/writings/coding-machines/


dekhn over 2 years ago

One of the best bugs I've seen had a description fairly similar to this. A hot routine run at scale (floating point math for ads ML training) failed at a rate of about 0.000000001. It turned out to be a very obscure bug in the context switching code in the Linux kernel: the FP registers weren't being restored properly.

In another one, the debugging was aided by the fact that the developers had ensured that everything was accessed through const pointers, so it wasn't their code corrupting their memory.


throwawaylinux over 2 years ago

In this CPU, AMD first introduced the "stack engine". This is likely a bug in that feature. It has a speculative stack address delta register in the front-end that is updated directly with push/pop instructions, and that delta is dispatched with the stack memory uop to be added to the original stack address register when doing address generation in the load/store units.

The delta has to be small, because you don't want big adders in the front end and because the delta has to be sent down with the push/pop memory uops. That means it can overflow or underflow, at which point it has to be reset by sending a synchronize operation to the back-end to update the original stack register (Agner has a better description).

So the delta register is probably 10-12 bits on Barcelona, and this bug is probably a corner case where the stack register update is happening, hence 1024 bytes off. Perhaps there is a window where a uop can get the old delta value but the new base value (or vice versa) when a sync operation is concurrent.

Setting that MSR value possibly disables that stack engine feature entirely, or it could be that it disables some aggressive and complicated detail where the bug is, e.g., allowing stack operations to run concurrently while flush operations are in progress.

It's not a coincidence there just happens to exist a way to disable this at runtime. The way processors are designed means that everything must be able to be observed, debugged, and fixed in the field. That means everything has to have fine-grained ability to control, disable, enable safer fallback paths, and even engage additional logic to reduce the state space in some cases (e.g., serialize pipeline while a particular operation occurs).

Usually these bug fixes decrease performance (except in cases where a performance bug is found and the fix actually increases performance), so you want the switches to be very fine-grained. So it's possible they fixed the stack engine bug without disabling it entirely.

It would be like shipping software and providing support and bug fixes for it for the next 5-10 years without patching the software, only updating the config file. It's quite amazing. For every one of these issues that hits the field, there will be many found internally during the internal hardware bring-up and verification (which will be ongoing for at least part of the life of the CPU).
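The stale-delta hypothesis in this comment can be illustrated with a toy model in C. This is only a sketch of the guess being made, not AMD's actual design: the struct, the function names, and the 1024-byte sync threshold are all hypothetical. It shows that if a stack uop combines the base value from after a synchronize with the delta value from just before it, the resulting address is off by exactly the folded delta.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model only. Names, widths, and the sync threshold are hypothetical. */
enum { DELTA_LIMIT = 1024 };

struct stack_engine {
    uint64_t base;   /* stack pointer as known to the back end */
    int32_t  delta;  /* small speculative offset kept in the front end */
};

/* Address a stack uop should use: back-end base plus front-end delta. */
uint64_t effective_sp(const struct stack_engine *se)
{
    return se->base + (uint64_t)(int64_t)se->delta;
}

/* Model one push. Returns 1 if the delta hit its limit and was folded
   into the base (the "synchronize" step), 0 otherwise. */
int push_bytes(struct stack_engine *se, int32_t bytes)
{
    se->delta -= bytes;
    if (se->delta <= -DELTA_LIMIT) {
        se->base += (uint64_t)(int64_t)se->delta;
        se->delta = 0;
        return 1;
    }
    return 0;
}

/* Push 8 bytes at a time until a sync fires, then compare the correct
   address with one built from the NEW base and the STALE pre-sync delta. */
uint64_t stale_delta_error(void)
{
    struct stack_engine se = { 0x7fff0000u, 0 };
    int32_t stale_delta = 0;
    int synced = 0;

    while (!synced) {
        stale_delta = se.delta - 8;   /* delta value just before the fold */
        synced = push_bytes(&se, 8);
    }
    assert(se.delta == 0);            /* the fold reset the front-end delta */

    uint64_t correct = effective_sp(&se);
    uint64_t racy = se.base + (uint64_t)(int64_t)stale_delta;
    return correct - racy;            /* how far off the racy address is */
}
```

Under these assumptions `stale_delta_error()` returns the size of the folded delta, which is how an offset of a round power of two such as 1024 bytes could arise.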


jalbertoni over 2 years ago

Reminds me of what we dubbed "the cosmic ray incident."

During college, we were arranged in groups of 2 or 3 to do some pair programming for the more complicated exercises.

I had attached the debugger and had set a variable to 99, so the loop would execute one more time and we could test if our changes would work.

Went through the instructions step by step, and suddenly we got a segmentation fault. Between setting it to 99 and it accessing some data later on, the value changed to 107.

Quite a bit of confusion ensued. I made a backup of the executable before recompiling. Running them again, both the new version and the backup worked perfectly. The files matched bit for bit. To this day we have no idea what caused that bit flip.
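The two values in this story do differ by exactly one bit, which is consistent with calling it a bit flip. A minimal sketch in C that checks this; the function names are made up for illustration:

```c
#include <assert.h>

/* Mask of the bit positions that differ between two values. */
unsigned flipped_mask(unsigned before, unsigned after)
{
    return before ^ after;
}

/* Count how many bit positions differ between two values. */
int differing_bits(unsigned before, unsigned after)
{
    unsigned diff = flipped_mask(before, after);
    int count = 0;

    while (diff != 0) {
        diff &= diff - 1;   /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* Values from the anecdote: the debugger wrote 99, the program read 107. */
void check_anecdote(void)
{
    /* 99 = 0b1100011 and 107 = 0b1101011, so only bit 3 (value 8) changed. */
    assert(flipped_mask(99u, 107u) == 8u);
    assert(differing_bits(99u, 107u) == 1);
}
```

Calling `check_anecdote()` passes silently, confirming that a single flipped bit is enough to turn 99 into 107.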

ksherlock over 2 years ago

Another good source of "impossible" bugs is overclocking.

https://devblogs.microsoft.com/oldnewthing/20050412-47/?p=35923


dekhn over 2 years ago


m-ee over 2 years ago

I had something similar to some of the things described here, with a dumber cause. Started getting hardfaults in my firmware after a context switch. Thought bad array access, incorrect FPU settings, misbehaving interrupt. All dead ends.

Realized that it was only happening when a particular motor was running, then further isolated it to when the motor was running at higher duty cycles. I had tested the motor in isolation but at low speeds, so I didn't see issues until I tried to run the full application. Turns out the EE had royally screwed up the current sense circuit, and the ADC on the MCU was seeing voltages lower than -1V, well beyond the absolute max ratings for the chip.

I guess ST doesn't make guarantees about what happens when you violate those ratings, but a corrupt stack pointer is not what I would have expected.

ericbarrett over 2 years ago

This is great. Reminds me of a crash I saw early in my career. It was a null-pointer exception, except it occurred right after confirming the address was non-null. This was on a single core with a non-preemptible kernel. So the processor just took the wrong branch! There was simply no other explanation.


throwaway2037 over 2 years ago

I recognize this StackOverflow user name. I have read many of their answers. They are usually excellent.

See more here: https://stackoverflow.com/users/50617/employed-russian

ndemir over 2 years ago

This reminded me of a problem that I saw 17-18 years ago. There was a file that we were trying to download, but the download kept getting interrupted. When we tried to download the file on another computer, it was fine. Long story short: we changed the ethernet card of that computer and it started downloading the file. I don't remember all the details, but it was probably something wrong with the ethernet card driver.

deathanatos over 2 years ago

I think the closest I ever came to this was when I stumbled into a bug in git.

We were asking git to do something utterly trivial, like clone, and it was segfaulting. We installed the debug symbols or something (as I recall you can install these separately in Debian) and started trying to see where the function was crashing. The crash was in a parser, and the code was doing,

    char *foo = strstr(input, "something constant");

and segfaulting later (but not chasing null!), during the first use of foo. I figured that input must, therefore, be a bad pointer, since the string literal is by definition fine, and the only way to get a bad pointer out of strstr was to give it bad input in the first place.

So, in gdb, I print the input pointer. It's valid. "Damn" I think to myself "we've gotten unlucky, and whatever triggers the cascade of UB hasn't happened this run. Freakin' heisenbug." So, I told gdb to just continue running the program: it crashed. Same backtrace: dereferencing foo triggered a segfault, and not with a null pointer.

If the input pointer to strstr is valid, how can the output be a bad pointer? strstr is documented as,

    These functions return a pointer to the beginning of the located substring, or NULL if the substring is not found.

So, what gives?

I attempted to debug strstr … that was almost a mistake. strstr is, unfortunately, heavily optimized. It might have been this one: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/multiarch/strstr-sse2-unaligned.S;h=c6aa8f45a60d175826c3d4e6eb483495a2cbf424;hb=f2698954ff9c2f9626d4bcb5a30eb5729714e0b0

I'm okay with a debugger, but I'm not going to be able to follow where vectorized code is going wrong. Almost stupidly, I told gdb to (f)inish strstr's execution, at which point gdb printed the return value: it was a valid pointer!

"Odd. Is this run going to succeed?" I continue execution: it crashes.

I restart the debugging, and run to the completion of strstr: the pointer is correct, but again, crash. How. The segfaulting address isn't the pointer strstr is returning, either, and nothing modifies foo after the strstr.

I single step out of strstr, and immediately print foo. The resulting pointer is wrong. strstr is working fine … but what, the assignment operator is broken? At this point, I'm sure I'm crazy. I disassemble the code, and lo and behold, the disassembly is trivially wrong. Forgive me as I can't amd64 unless I'm looking at it, but the disassembly is something like,

    call strstr
    cltq
    <store foo>

We look the odd instruction up; it sign extends eax into rax. That's … not a valid operation on a supposed char pointer … there's not even technically a value in eax at this point. strstr returned a pointer in rax. eax is just the lower 32 bits of that pointer, and sign extending that back into rax makes no sense. It's like the compiler thought the return from strstr was a signed int … and that's where it hits me: C is shooting us in the foot again.

> and if no declaration is visible for this identifier, the identifier is implicitly declared exactly as if, in the innermost block containing the function call, the declaration
>
>     extern int identifier();
>
> appeared.

The default return type for an undeclared function, in C89, is int. (I'm actually not sure what C11 says about this. The C89 rule appears to be gone, but there doesn't seem to be anything in its place, which is bizarre. "Undeclared" does not appear, except for an unrelated footnote.)

Slap in the proper include for strstr, recompile. No segfault, disassembly is correct.

Upgrade to latest version, disassemble: disassembly is correct. Someone else fixed the bug in the meantime. Should have just tried upgrading in the first place…

(After typing this up: had to dig up the debugging session. It was strchr, not strstr. Potato, potato. Someday … this should be a blog post…)
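The effect of that stray `cltq` can be reproduced without relying on an implicit declaration at all. The sketch below simulates it on plain 64-bit integers: keep the low 32 bits of the "returned pointer" and sign-extend them, as the compiler does when it believes the callee returned `int`. The function names and the example addresses are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Simulate `cltq` applied to a 64-bit return value that the compiler
   believes is a 32-bit int: keep eax, sign-extend it into rax. */
uint64_t as_if_returned_int(uint64_t rax)
{
    /* Converting an out-of-range value to int32_t is implementation-defined
       in C; GCC and Clang wrap it modulo 2^32, which matches the hardware. */
    int32_t eax = (int32_t)(uint32_t)rax;
    return (uint64_t)(int64_t)eax;
}

/* Two hypothetical pointer values, one high and one low. */
void check_examples(void)
{
    /* A high address is mangled: upper half lost, then sign-extended. */
    assert(as_if_returned_int(0x00007ffff7a12345ULL) == 0xfffffffff7a12345ULL);

    /* An address below 2 GiB survives unchanged, so the bug stays hidden. */
    assert(as_if_returned_int(0x0000000000601040ULL) == 0x0000000000601040ULL);
}
```

The second case is why this class of bug can sit unnoticed until pointers start landing in high memory. Building with GCC's `-Werror=implicit-function-declaration` turns the missing declaration into a compile error instead.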

markus_zhang over 2 years ago

Sounds fun, wish I had the talent to do some serious debugging.

jeffrallen over 2 years ago

If you are lucky, you'll debug a CPU bug once in a lifetime.

If you're really lucky, you'll never see one at all!

jacooper over 2 years ago

Debugging that must've been a PITA for sure.
