Heh, 10 years ago I gave a presentation about how easily folks used to x86 can trip up when dealing with ARM's weaker memory model. My demonstration then was a naive implementation of Peterson's algorithm.[1]<p>I have a feeling that we will see a sharp rise in stories like this, now that ARM finds itself in more places that were previously mostly occupied by x86, and all the subtle race conditions that x86's memory model forgave actually start failing, in equally subtle ways.<p>[1] The conclusion for this particular audience was: don't try to avoid synchronization primitives, and don't invent your own. They were neither systems-level nor high-performance programmers, so they had that luxury.
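For anyone curious, a minimal sketch of the kind of Peterson's demo I mean (not my original slides). With plain variables or relaxed atomics, the store to our flag and the load of the other thread's flag can be reordered, and both threads can end up in the critical section; seq_cst atomics, as below, restore the algorithm -- and strictly speaking even x86 needs the full fence that seq_cst implies here, because TSO still allows that store->load reordering.

    #include <atomic>
    #include <cstdio>
    #include <thread>

    // Peterson's algorithm for two threads, sketch. Weaken any of the
    // seq_cst orderings below and mutual exclusion is no longer guaranteed.
    std::atomic<bool> flag[2] = {{false}, {false}};
    std::atomic<int>  turn{0};
    int shared_counter = 0;  // protected by lock()/unlock()

    void lock(int me) {
        int other = 1 - me;
        flag[me].store(true, std::memory_order_seq_cst);
        turn.store(other, std::memory_order_seq_cst);
        // Wait until the other thread is not interested, or it is our turn.
        while (flag[other].load(std::memory_order_seq_cst) &&
               turn.load(std::memory_order_seq_cst) == other) { /* spin */ }
    }

    void unlock(int me) { flag[me].store(false, std::memory_order_seq_cst); }

    int main() {
        auto worker = [](int id) {
            for (int i = 0; i < 1000000; ++i) {
                lock(id);
                ++shared_counter;  // with weaker orderings this count comes up short
                unlock(id);
            }
        };
        std::thread a(worker, 0), b(worker, 1);
        a.join(); b.join();
        std::printf("%d\n", shared_counter);  // expect 2000000
    }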
Like quantum physics, memory ordering is deeply unintuitive (on platforms like ARM). Unlike quantum physics, which is an unfortunate immutable fact of the universe, we got ourselves into this mess and we have no one to blame but ourselves for it.<p>I'm only somewhat joking. People need to understand these memory models if they intend to write atomic operations in their software, even if they aren't currently targeting ARM platforms. In this era, it's absurdly easy to retarget an LLVM-based compiler to aarch64, and it will happen for plenty of software that was written without ever considering the differences in atomic behavior on that platform.
I spent some time trying to figure out why the lock-free read/write implementation is correct under x86, assuming a multiprocessor environment.<p>My read of the situation was that there's already potential for a double-read / double-write between when the spinlock returns and when the head/tail index is updated.<p>Turns out that I was missing something: there's only one producer thread, and only one consumer thread. If there were multiple of either, then this code would be more fundamentally broken.<p>That said: IMO the use of `new` in modern C++ (as is the case in the writer queue) is often a code smell, especially when std::make_unique would work just as well. Using a unique_ptr would obviate the first concern [0] about the copy constructor not being deleted.<p>(If we used unique_ptr consistently here, we might fix the scary platform-dependent leak in exchange for a likely segfault following a nullptr dereference.)<p>One other comment: the explanation in [1] is slightly incorrect:<p>> we receive back Result* pointers from the results queue rq, then wrap them in a std::unique_ptr and jam them into a vector.<p>We actually receive unique_ptrs from the results queue, then because, um, reasons (probably that we forgot that we made this a unique_ptr), we're wrapping them in another unique_ptr, which works because we're passing a temporary (well, prvalue in C++17) to unique_ptr's constructor -- while that looks like it might invoke the deleted copy-constructor, it's actually an instance of guaranteed copy elision. Also a bit weird to see, but not an issue of correctness.<p>[0] <a href="https://github.com/stong/how-to-exploit-a-double-free#0-internal-data-structures" rel="nofollow">https://github.com/stong/how-to-exploit-a-double-free#0-inte...</a><p>[1] <a href="https://github.com/stong/how-to-exploit-a-double-free#2-receive-results" rel="nofollow">https://github.com/stong/how-to-exploit-a-double-free#2-rece...</a>
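In case the copy-elision point is unclear, the shape of it is roughly the following, with made-up names (pop_result() is a hypothetical stand-in for reading from the results queue, not the repo's actual API):

    #include <memory>
    #include <vector>

    struct Result { int value = 0; };

    // Hypothetical stand-in for popping from the results queue: it already
    // returns a std::unique_ptr<Result> by value (a prvalue at the call site).
    std::unique_ptr<Result> pop_result() {
        return std::make_unique<Result>();
    }

    int main() {
        std::vector<std::unique_ptr<Result>> results;
        // The "extra" wrapping: the prvalue returned by pop_result() initializes
        // the unique_ptr directly (guaranteed copy elision in C++17), so the
        // deleted copy constructor never enters the picture; push_back then
        // moves it into the vector.
        results.push_back(std::unique_ptr<Result>(pop_result()));
        // Equivalent, and less likely to raise eyebrows:
        results.push_back(pop_result());
    }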
Either I'm not understanding something that I thought I understood very well, or TFA's authors don't understand something that they think they understand very well.<p>Their code is unsafe even on x86. You cannot write a single-writer, single-reader FIFO on modern processors without the use of memory barriers.<p>Their attempt to use "volatile" instead of memory barriers is not appropriate. It could easily cause problems on x86 platforms in just the same way that it could on ARM. "volatile" does not mean what you think it means; if you're using it for anything other than interacting with hardware registers in a device driver, you're almost certainly using it incorrectly.<p>You must use the correct memory barriers to protect the reads and writes of what they call "head" and "tail". Without them, the code is just wrong, no matter what the platform.
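For contrast, a sketch of the standard way the index handoff is done with std::atomic (this is the general pattern, not the repo's code): the release store on the index publishes the slot write, and the acquire load on the other side is what makes that write visible in the right order, on x86 and ARM alike.

    #include <atomic>
    #include <cstddef>

    // Minimal single-producer/single-consumer ring buffer sketch.
    // One slot is left unused so that head_ == tail_ means "empty".
    template <typename T, size_t N>
    class SpscQueue {
        T buf_[N];
        std::atomic<size_t> head_{0};  // written only by the producer
        std::atomic<size_t> tail_{0};  // written only by the consumer
    public:
        bool push(const T& v) {
            size_t h = head_.load(std::memory_order_relaxed);
            size_t next = (h + 1) % N;
            if (next == tail_.load(std::memory_order_acquire))
                return false;                              // full
            buf_[h] = v;                                   // write the slot first...
            head_.store(next, std::memory_order_release);  // ...then publish it
            return true;
        }
        bool pop(T& out) {
            size_t t = tail_.load(std::memory_order_relaxed);
            if (t == head_.load(std::memory_order_acquire))
                return false;                              // empty
            out = buf_[t];                                 // read the slot...
            tail_.store((t + 1) % N, std::memory_order_release);  // ...then free it
            return true;
        }
    };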
Lock-free programming is really tough. There are really only a few patterns that work (e.g. Treiber stack). Trying to invent a new lock-free algorithm, as this vulnerable code demonstrates, almost always ends in tears.
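For reference, the textbook Treiber stack is roughly the sketch below -- and even this "simple" pattern deliberately leaks nodes on pop, because reclaiming them safely needs hazard pointers or epochs, and a naive delete reintroduces the ABA problem. Which is kind of the point about tears.

    #include <atomic>
    #include <utility>

    // Textbook Treiber stack sketch: push/pop are lock-free via CAS on the head
    // pointer. Nodes are intentionally never freed in pop(); doing so naively
    // would bring back ABA and use-after-free.
    template <typename T>
    class TreiberStack {
        struct Node { T value; Node* next; };
        std::atomic<Node*> head_{nullptr};
    public:
        void push(T v) {
            Node* n = new Node{std::move(v), nullptr};
            n->next = head_.load(std::memory_order_relaxed);
            // Retry until head atomically swings from n->next to n.
            while (!head_.compare_exchange_weak(n->next, n,
                                                std::memory_order_release,
                                                std::memory_order_relaxed)) {}
        }
        bool pop(T& out) {
            Node* n = head_.load(std::memory_order_acquire);
            // Retry until head atomically swings from n to n->next.
            while (n && !head_.compare_exchange_weak(n, n->next,
                                                     std::memory_order_acquire,
                                                     std::memory_order_acquire)) {}
            if (!n) return false;
            out = std::move(n->value);
            // Node intentionally leaked; see comment above.
            return true;
        }
    };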
For those interested in memory ordering, I have a few posts on my blog where I build a simulator capable of understanding reorderings and analyze examples with it:<p><a href="https://www.reitzen.com/post/temporal-fuzzing-01/" rel="nofollow">https://www.reitzen.com/post/temporal-fuzzing-01/</a>
<a href="https://www.reitzen.com/post/temporal-fuzzing-02/" rel="nofollow">https://www.reitzen.com/post/temporal-fuzzing-02/</a><p>Next up are some lock-free queues, although I haven't gotten around to publishing them!
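If anyone wants a first example to throw at a tool like that, the classic store-buffering litmus test is a good candidate: with relaxed atomics (or, at the hardware level, plain stores and loads) both threads can read 0, which is the canonical "program order lied to me" surprise. A quick sketch:

    #include <atomic>
    #include <cstdio>
    #include <thread>

    // Store-buffering (SB) litmus test. With memory_order_relaxed the outcome
    // r1 == 0 && r2 == 0 is allowed, and it can show up on real hardware
    // (x86 included, thanks to store buffers). With seq_cst it is forbidden.
    std::atomic<int> x{0}, y{0};
    int r1, r2;

    int main() {
        for (int i = 0; i < 100000; ++i) {
            x.store(0); y.store(0);
            std::thread t1([] {
                x.store(1, std::memory_order_relaxed);
                r1 = y.load(std::memory_order_relaxed);
            });
            std::thread t2([] {
                y.store(1, std::memory_order_relaxed);
                r2 = x.load(std::memory_order_relaxed);
            });
            t1.join(); t2.join();
            if (r1 == 0 && r2 == 0)
                std::printf("both reads saw 0 on iteration %d\n", i);
        }
    }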
Have I told you about our lord and savior Rust?<p>Anyways, <a href="https://github.com/tokio-rs/loom" rel="nofollow">https://github.com/tokio-rs/loom</a> is used by any serious library doing atomic ops/synchronization, and it blew me away with how fast it can catch most bugs like this.
My first-gen Threadripper occasionally deadlocks in futex code within libgomp (the GNU implementation of OpenMP). Eventually I gave up and concluded it was either a hardware bug or a software bug that incorrectly relies on the atomic behaviour of Intel CPUs. I eventually switched to Clang with its own OpenMP implementation, and the problem magically disappeared.
> Nowadays, high-performance processors, like those found in desktops, servers, and phones, are massively out-of-order to exploit instruction-level parallelism as much as possible. They perform all sorts of tricks to improve performance.<p>Relevant quote from Jim Keller: You run this program a hundred times, it never runs the same way twice. Ever.
Great write-up!<p>There may be a typo in section 3:<p>> It will happily retire instruction 6 before instruction 5.<p>If memory serves, although instructions can execute out-of-order, they retire in-order (hence the "re-order buffer").
The best part is that the original code is not safe even on x86, as the compiler can still reorder the non-volatile accesses to backing_buf around the volatile accesses to head and tail. Compiler barriers before the volatile stores and after the volatile reads are required [1]. It would still be very questionable code, but it would at least have a chance to work on its intended target.<p>tl;dr: just use std::atomic.<p>[1] it is of course possible they are actually present in the original code and just omitted from the explanation for brevity
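To make the compiler-barrier point concrete, a hypothetical sketch of the shape such a patch-up would take (GCC/Clang asm syntax; the struct, names, and sizes are made up, not the repo's code). The empty asm with a "memory" clobber only stops the compiler from moving the buffer access across the volatile index access -- it emits no CPU fence, the code is still a data race as far as the standard is concerned, and ARM still needs real barriers. Hence the tl;dr.

    struct Queue {
        int backing_buf[256];
        volatile unsigned head = 0;  // next slot to write; bumped by the producer
        volatile unsigned tail = 0;  // next slot to read; bumped by the consumer
    };

    void produce(Queue& q, int value) {
        unsigned h = q.head;
        while (h - q.tail == 256) { /* spin: queue full */ }
        asm volatile("" ::: "memory");       // after the volatile read: don't hoist the buffer write above it
        q.backing_buf[h % 256] = value;
        asm volatile("" ::: "memory");       // before the volatile store: don't sink the buffer write below it
        q.head = h + 1;
    }

    int consume(Queue& q) {
        unsigned t = q.tail;
        while (q.head == t) { /* spin: queue empty */ }
        asm volatile("" ::: "memory");       // after the volatile read: don't hoist the buffer read above it
        int value = q.backing_buf[t % 256];
        asm volatile("" ::: "memory");       // before the volatile store: don't sink the buffer read below it
        q.tail = t + 1;
        return value;
    }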
There is a proposal, P1152 "Deprecating volatile" (parts of which were adopted for C++20), to deprecate many uses of 'volatile' in C++.<p><a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p1152r0.html" rel="nofollow">http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p115...</a>