TechEcho
A bug that doesn’t exist on x86: Exploiting an ARM-only race condition

291 points by stong1 over 3 years ago

18 comments

anyfoo over 3 years ago
Heh, 10 years ago I gave a presentation about how easily folks used to x86 can trip up when dealing with ARM's weaker memory model. My demonstration then was with a naive implementation of Peterson's algorithm.[1]

I have a feeling that we will see a sharp rise in stories like this, now that ARM finds itself in more places that were previously mostly occupied by x86, and all the subtle race conditions that x86's memory model forgave actually start failing, in equally subtle ways.

[1] The conclusion for this particular audience was: Don't try to avoid synchronization primitives, or even invent your own. They were neither system-level nor high-performance programmers, so they had that luxury.
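
For illustration only, here is a minimal sketch of Peterson's algorithm for two threads written with std::atomic and the default sequentially consistent ordering. It is not the code from the presentation mentioned above; the names PetersonLock and run_peterson_demo are hypothetical. A naive version with plain variables would lack the ordering guarantees that the atomics provide here.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Peterson's algorithm for exactly two threads with ids 0 and 1.
// Every atomic operation uses the default std::memory_order_seq_cst,
// so the store to turn_ cannot be reordered with the following load
// of the other thread's flag.
class PetersonLock {
public:
    void lock(int self) {
        assert(self == 0 || self == 1);
        const int other = 1 - self;
        interested_[self].store(true);   // announce intent to enter
        turn_.store(other);              // give priority to the other thread
        while (interested_[other].load() && turn_.load() == other) {
            std::this_thread::yield();   // wait while the other thread has priority
        }
    }

    void unlock(int self) { interested_[self].store(false); }

private:
    std::atomic<bool> interested_[2] = {{false}, {false}};
    std::atomic<int> turn_{0};
};

// Hypothetical demo: two threads increment a plain counter under the lock.
long run_peterson_demo(int iterations) {
    PetersonLock lock;
    long counter = 0;  // protected by the lock, deliberately not atomic
    auto worker = [&](int self) {
        for (int i = 0; i < iterations; ++i) {
            lock.lock(self);
            ++counter;
            lock.unlock(self);
        }
    };
    std::thread a(worker, 0);
    std::thread b(worker, 1);
    a.join();
    b.join();
    return counter;
}
```

Calling run_peterson_demo(n) should return 2 * n if mutual exclusion holds. In practice a std::mutex is the better choice, which matches the conclusion in the footnote.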
beebmam over 3 years ago
Like quantum physics, memory ordering is deeply unintuitive (on platforms like ARM). Unlike quantum physics, which is an unfortunate immutable fact of the universe, we got ourselves into this mess and we have no one to blame but ourselves for it.

I'm only somewhat joking. People need to understand these memory models if they intend to write atomic operations in their software, even if they aren't currently targeting ARM platforms. In this era, it's absurdly easy to change an LLVM compiler to target aarch64, and it will happen for plenty of software that was written without ever considering the differences in atomic behavior on this platform.
vitus over 3 years ago
I spent some time trying to figure out why the lock-free read/write implementation is correct under x86, assuming a multiprocessor environment.

My read of the situation was that there's already potential for a double-read / double-write between when the spinlock returns and when the head/tail index is updated.

Turns out that I was missing something: there's only one producer thread, and only one consumer thread. If there were multiple of either, then this code would be more fundamentally broken.

That said: IMO the use of `new` in modern C++ (as is the case in the writer queue) is often a code smell, especially when std::make_unique would work just as well. Using a unique_ptr would obviate the first concern [0] about the copy constructor not being deleted.

(If we used unique_ptr consistently here, we might fix the scary platform-dependent leak in exchange for a likely segfault following a nullptr dereference.)

One other comment: the explanation in [1] is slightly incorrect:

> we receive back Result* pointers from the results queue rq, then wrap them in a std::unique_ptr and jam them into a vector.

We actually receive unique_ptrs from the results queue, then because, um, reasons (probably that we forgot that we made this a unique_ptr), we're wrapping them in another unique_ptr, which works because we're passing a temporary (well, prvalue in C++17) to unique_ptr's constructor -- while that looks like it might invoke the deleted copy-constructor, it's actually an instance of guaranteed copy elision. Also a bit weird to see, but not an issue of correctness.

[0] https://github.com/stong/how-to-exploit-a-double-free#0-internal-data-structures

[1] https://github.com/stong/how-to-exploit-a-double-free#2-receive-results
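
A small sketch of the redundant wrap described above, under the assumption that the queue hands back std::unique_ptr by value. Result, pop_result, and collect are hypothetical stand-ins, not the article's code. Because the argument is a prvalue of the same type, C++17 constructs the object in place; before C++17 the move constructor would be selected. Either way the deleted copy constructor is never used.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-in for the article's result type.
struct Result {
    int value;
};

// Hypothetical stand-in for popping from the results queue:
// returns a unique_ptr by value, so the call expression is a prvalue.
std::unique_ptr<Result> pop_result(int value) {
    return std::make_unique<Result>(Result{value});
}

std::vector<std::unique_ptr<Result>> collect(int n) {
    assert(n >= 0);
    std::vector<std::unique_ptr<Result>> results;
    for (int i = 0; i < n; ++i) {
        // Redundant wrap: unique_ptr constructed from a unique_ptr prvalue.
        // This compiles because no copy takes place.
        results.push_back(std::unique_ptr<Result>(pop_result(i)));
        // The equivalent, simpler form would be:
        //   results.push_back(pop_result(i));
    }
    return results;
}
```

Both forms leave the vector as the sole owner of each Result, so the extra wrap is only a readability issue.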
PaulDavisThe1st over 3 years ago
Either I'm not understanding something that I thought I understood very well, or TFA's authors don't understand something that they think they understand very well.

Their code is unsafe even on x86. You cannot write a single-writer, single-reader FIFO on modern processors without the use of memory barriers.

Their attempt to use "volatile" instead of memory barriers is not appropriate. It could easily cause problems on x86 platforms in just the same way that it could on ARM. "volatile" does not mean what you think it means; if you're using it for anything other than interacting with hardware registers in a device driver, you're almost certainly using it incorrectly.

You must use the correct memory barriers to protect the read/write of what they call "head" and "tail". Without them, the code is just wrong, no matter what the platform.
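
To make the role of barriers concrete, here is a minimal sketch of message passing with explicit fences in standard C++. It is an illustration and not the article's code; pass_message, payload, and ready are hypothetical names. The release fence keeps the payload write ahead of the flag store, and the acquire fence keeps the payload read after the flag load.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hands one value from a producer thread to a consumer thread.
int pass_message(int value) {
    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};  // flag that publishes the payload
    int received = -1;

    std::thread producer([&] {
        payload = value;
        // Release fence: the payload write cannot move below the flag store.
        std::atomic_thread_fence(std::memory_order_release);
        ready.store(true, std::memory_order_relaxed);
    });

    std::thread consumer([&] {
        while (!ready.load(std::memory_order_relaxed)) {
            std::this_thread::yield();
        }
        // Acquire fence: the payload read cannot move above the flag load.
        std::atomic_thread_fence(std::memory_order_acquire);
        received = payload;
    });

    producer.join();
    consumer.join();
    assert(ready.load());
    return received;
}
```

Replacing `ready` with a volatile bool and dropping the fences would be a data race under the C++ memory model, regardless of the target CPU.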
pcwalton over 3 years ago
Lock-free programming is really tough. There are really only a few patterns that work (e.g. Treiber stack). Trying to invent a new lock-free algorithm, as this vulnerable code demonstrates, almost always ends in tears.
reitzensteinm over 3 years ago
For those interested in memory ordering, I have a few posts on my blog where I build a simulator capable of understanding reorderings and analyze examples with it:

https://www.reitzen.com/post/temporal-fuzzing-01/
https://www.reitzen.com/post/temporal-fuzzing-02/

Next up are some lock-free queues, although I haven't gotten around to publishing them!
Azsy over 3 years ago
Have I told you about our lord and savior Rust?

Anyway, https://github.com/tokio-rs/loom is used by any serious library doing atomic ops/synchronization, and it blew me away with how fast it can catch most bugs like this.
0xfaded over 3 years ago
My first-gen Threadripper occasionally deadlocks in futex code within libgomp (the GNU implementation of OpenMP). Eventually I gave up and concluded it was either a hardware bug or a bug that incorrectly relies on the atomic behaviour of Intel CPUs. I eventually switched to using clang with its own OpenMP implementation and the problem magically disappeared.
silisili over 3 years ago
> Nowadays, high-performance processors, like those found in desktops, servers, and phones, are massively out-of-order to exploit instruction-level parallelism as much as possible. They perform all sorts of tricks to improve performance.

Relevant quote from Jim Keller: You run this program a hundred times, it never runs the same way twice. Ever.
agalunar over 3 years ago
Great write-up!

There may be a typo in section 3:

> It will happily retire instruction 6 before instruction 5.

If memory serves, although instructions can execute out-of-order, they retire in-order (hence the "re-order buffer").
gpderetta over 3 years ago
The best part is that the original code is not safe even on x86, as the compiler can still reorder non-volatile accesses to the backing_buf around the volatile accesses to head and tail. Compiler barriers before the volatile stores and after the volatile reads are required [1]. It would still be very questionable code, but it would at least have a chance to work on its intended target.

tl;dr: just use std::atomic.

[1] It is of course possible they are actually present in the original code and just omitted from the explanation for brevity.
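
As a rough sketch of the "just use std::atomic" advice, here is a single-producer, single-consumer ring buffer whose indices are atomics with acquire/release ordering. It is not the article's queue: SpscQueue, try_push, try_pop, and sum_through_queue are hypothetical names, and the choice of which index belongs to the producer is an assumption.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

// Bounded queue for exactly one producer thread and one consumer thread.
template <typename T, std::size_t Capacity>
class SpscQueue {
    static_assert(Capacity >= 2, "one slot is always kept empty");

public:
    // Called only by the producer.
    bool try_push(const T& item) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % Capacity;
        if (next == head_.load(std::memory_order_acquire)) {
            return false;  // full
        }
        buf_[tail] = item;
        // Release: the slot write becomes visible before the new tail does.
        tail_.store(next, std::memory_order_release);
        return true;
    }

    // Called only by the consumer.
    bool try_pop(T& out) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire)) {
            return false;  // empty
        }
        out = buf_[head];
        // Release: the slot read completes before the slot is handed back.
        head_.store((head + 1) % Capacity, std::memory_order_release);
        return true;
    }

private:
    T buf_[Capacity];
    std::atomic<std::size_t> head_{0};  // next slot to read
    std::atomic<std::size_t> tail_{0};  // next slot to write
};

// Hypothetical demo: push 1..count through the queue and sum on the other side.
long sum_through_queue(int count) {
    assert(count >= 0);
    SpscQueue<int, 8> queue;
    long sum = 0;
    std::thread producer([&] {
        for (int i = 1; i <= count; ++i) {
            while (!queue.try_push(i)) std::this_thread::yield();
        }
    });
    std::thread consumer([&] {
        for (int i = 0; i < count; ++i) {
            int v = 0;
            while (!queue.try_pop(v)) std::this_thread::yield();
            sum += v;
        }
    });
    producer.join();
    consumer.join();
    return sum;
}
```

The acquire/release pairs constrain both the compiler and the CPU, so the same source is intended to behave the same on x86 and ARM, which volatile alone does not guarantee.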
secondcoming over 3 years ago
There is a proposal (possibly accepted) to deprecate 'volatile' in C++.

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p1152r0.html
half-kh-hacker over 3 years ago
this slaps. I always see perfect blue a few places above us!
cookiewill over 3 years ago
Is it normal for the .got.plt section to be writable rather than read-only?
amelius over 3 years ago
Does the race condition exist when emulating x86 on Apple M1?
sydthrowaway over 3 years ago
Any good references on the low-level details of ARMv8+?
im3w1l over 3 years ago
And Windows on ARM will (does already?) run x86 binaries with weaker memory ordering than they were written for. So this could be a real thing soon.
drcongo over 3 years ago
Nice try Intel.