Heh, 10 years ago I gave a presentation about how easily folks used to x86 can trip up when dealing with ARM's weaker memory model. My demonstration then was a naive implementation of Peterson's algorithm.[1]<p>I have a feeling that we will see a sharp rise in stories like this, now that ARM finds itself in more places that were previously mostly occupied by x86, and all the subtle race conditions that x86's memory model forgave actually start failing, in equally subtle ways.<p>[1] The conclusion for this particular audience was: don't try to avoid synchronization primitives, and don't invent your own. They were neither systems-level nor high-performance programmers, so they had that luxury.
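For anyone curious, a minimal sketch of the kind of Peterson's demo I mean (not my original slides). With plain variables or relaxed atomics, the store to our flag and the load of the other thread's flag can be reordered, and both threads can end up in the critical section; seq_cst atomics, as below, restore the algorithm -- and strictly speaking even x86 needs the full fence that seq_cst implies here, because TSO still allows that store->load reordering.

    #include <atomic>
    #include <cstdio>
    #include <thread>

    // Peterson's algorithm for two threads, sketch. Weaken any of the
    // seq_cst orderings below and mutual exclusion is no longer guaranteed.
    std::atomic<bool> flag[2] = {{false}, {false}};
    std::atomic<int>  turn{0};
    int shared_counter = 0;  // protected by lock()/unlock()

    void lock(int me) {
        int other = 1 - me;
        flag[me].store(true, std::memory_order_seq_cst);
        turn.store(other, std::memory_order_seq_cst);
        // Wait until the other thread is not interested, or it is our turn.
        while (flag[other].load(std::memory_order_seq_cst) &&
               turn.load(std::memory_order_seq_cst) == other) { /* spin */ }
    }

    void unlock(int me) { flag[me].store(false, std::memory_order_seq_cst); }

    int main() {
        auto worker = [](int id) {
            for (int i = 0; i < 1000000; ++i) {
                lock(id);
                ++shared_counter;  // with weaker orderings this count comes up short
                unlock(id);
            }
        };
        std::thread a(worker, 0), b(worker, 1);
        a.join(); b.join();
        std::printf("%d\n", shared_counter);  // expect 2000000
    }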
Like quantum physics, memory ordering is deeply unintuitive (on platforms like ARM). Unlike quantum physics, which is an unfortunate immutable fact of the universe, we got ourselves into this mess and we have no one to blame but ourselves for it.<p>I'm only somewhat joking. People need to understand these memory models if they intend to write atomic operations in their software, even if they aren't currently targeting ARM platforms. In this era, it's absurdly easy to retarget an LLVM-based compiler to aarch64, and it will happen for plenty of software that was written without ever considering the differences in atomic behavior on that platform.
I spent some time trying to figure out why the lock-free read/write implementation is correct under x86, assuming a multiprocessor environment.<p>My read of the situation was that there's already potential for a double-read / double-write between when the spinlock returns and when the head/tail index is updated.<p>Turns out that I was missing something: there's only one producer thread, and only one consumer thread. If there were multiple of either, then this code would be more fundamentally broken.<p>That said: IMO the use of `new` in modern C++ (as is the case in the writer queue) is often a code smell, especially when std::make_unique would work just as well. Using a unique_ptr would obviate the first concern [0] about the copy constructor not being deleted.<p>(If we used unique_ptr consistently here, we might fix the scary platform-dependent leak in exchange for a likely segfault following a nullptr dereference.)<p>One other comment: the explanation in [1] is slightly incorrect:<p>> we receive back Result* pointers from the results queue rq, then wrap them in a std::unique_ptr and jam them into a vector.<p>We actually receive unique_ptrs from the results queue, then because, um, reasons (probably that we forgot that we made this a unique_ptr), we're wrapping them in another unique_ptr, which works because we're passing a temporary (well, prvalue in C++17) to unique_ptr's constructor -- while that looks like it might invoke the deleted copy-constructor, it's actually an instance of guaranteed copy elision. Also a bit weird to see, but not an issue of correctness.<p>[0] <a href="https://github.com/stong/how-to-exploit-a-double-free#0-internal-data-structures" rel="nofollow">https://github.com/stong/how-to-exploit-a-double-free#0-inte...</a><p>[1] <a href="https://github.com/stong/how-to-exploit-a-double-free#2-receive-results" rel="nofollow">https://github.com/stong/how-to-exploit-a-double-free#2-rece...</a>
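In case the copy-elision point is unclear, the shape of it is roughly the following, with made-up names (pop_result() is a hypothetical stand-in for reading from the results queue, not the repo's actual API):

    #include <memory>
    #include <vector>

    struct Result { int value = 0; };

    // Hypothetical stand-in for popping from the results queue: it already
    // returns a std::unique_ptr<Result> by value (a prvalue at the call site).
    std::unique_ptr<Result> pop_result() {
        return std::make_unique<Result>();
    }

    int main() {
        std::vector<std::unique_ptr<Result>> results;
        // The "extra" wrapping: the prvalue returned by pop_result() initializes
        // the unique_ptr directly (guaranteed copy elision in C++17), so the
        // deleted copy constructor never enters the picture; push_back then
        // moves it into the vector.
        results.push_back(std::unique_ptr<Result>(pop_result()));
        // Equivalent, and less likely to raise eyebrows:
        results.push_back(pop_result());
    }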
Either I'm not understanding something that I thought I understood very well, or TFA's authors don't understand something that they think they understand very well.<p>Their code is unsafe even on x86. You cannot write a single-writer, single-reader FIFO on modern processors without the use of memory barriers.<p>Their attempt to use "volatile" instead of memory barriers is not appropriate. It could easily cause problems on x86 platforms in just the same way that it could on ARM. "volatile" does not mean what you think it means; if you're using it for anything other than interacting with hardware registers in a device driver, you're almost certainly using it incorrectly.<p>You must use the correct memory barriers to protect the reads and writes of what they call "head" and "tail". Without them, the code is just wrong, no matter what the platform.
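For contrast, a sketch of the standard way the index handoff is done with std::atomic (this is the general pattern, not the repo's code): the release store on the index publishes the slot write, and the acquire load on the other side is what makes that write visible in the right order, on x86 and ARM alike.

    #include <atomic>
    #include <cstddef>

    // Minimal single-producer/single-consumer ring buffer sketch.
    // One slot is left unused so that head_ == tail_ means "empty".
    template <typename T, size_t N>
    class SpscQueue {
        T buf_[N];
        std::atomic<size_t> head_{0};  // written only by the producer
        std::atomic<size_t> tail_{0};  // written only by the consumer
    public:
        bool push(const T& v) {
            size_t h = head_.load(std::memory_order_relaxed);
            size_t next = (h + 1) % N;
            if (next == tail_.load(std::memory_order_acquire))
                return false;                              // full
            buf_[h] = v;                                   // write the slot first...
            head_.store(next, std::memory_order_release);  // ...then publish it
            return true;
        }
        bool pop(T& out) {
            size_t t = tail_.load(std::memory_order_relaxed);
            if (t == head_.load(std::memory_order_acquire))
                return false;                              // empty
            out = buf_[t];                                 // read the slot...
            tail_.store((t + 1) % N, std::memory_order_release);  // ...then free it
            return true;
        }
    };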
Lock-free programming is really tough. There are really only a few patterns that work (e.g. Treiber stack). Trying to invent a new lock-free algorithm, as this vulnerable code demonstrates, almost always ends in tears.
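For reference, the textbook Treiber stack is roughly the sketch below -- and even this "simple" pattern deliberately leaks nodes on pop, because reclaiming them safely needs hazard pointers or epochs, and a naive delete reintroduces the ABA problem. Which is kind of the point about tears.

    #include <atomic>
    #include <utility>

    // Textbook Treiber stack sketch: push/pop are lock-free via CAS on the head
    // pointer. Nodes are intentionally never freed in pop(); doing so naively
    // would bring back ABA and use-after-free.
    template <typename T>
    class TreiberStack {
        struct Node { T value; Node* next; };
        std::atomic<Node*> head_{nullptr};
    public:
        void push(T v) {
            Node* n = new Node{std::move(v), nullptr};
            n->next = head_.load(std::memory_order_relaxed);
            // Retry until head atomically swings from n->next to n.
            while (!head_.compare_exchange_weak(n->next, n,
                                                std::memory_order_release,
                                                std::memory_order_relaxed)) {}
        }
        bool pop(T& out) {
            Node* n = head_.load(std::memory_order_acquire);
            // Retry until head atomically swings from n to n->next.
            while (n && !head_.compare_exchange_weak(n, n->next,
                                                     std::memory_order_acquire,
                                                     std::memory_order_acquire)) {}
            if (!n) return false;
            out = std::move(n->value);
            // Node intentionally leaked; see comment above.
            return true;
        }
    };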
For those interested in memory ordering, I have a few posts on my blog where I build a simulator capable of understanding reorderings and analyze examples with it:<p><a href="https://www.reitzen.com/post/temporal-fuzzing-01/" rel="nofollow">https://www.reitzen.com/post/temporal-fuzzing-01/</a>
<a href="https://www.reitzen.com/post/temporal-fuzzing-02/" rel="nofollow">https://www.reitzen.com/post/temporal-fuzzing-02/</a><p>Next up are some lock-free queues, although I haven't gotten around to publishing them!
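If anyone wants a first example to throw at a tool like that, the classic store-buffering litmus test is a good candidate: with relaxed atomics (or, at the hardware level, plain stores and loads) both threads can read 0, which is the canonical "program order lied to me" surprise. A quick sketch:

    #include <atomic>
    #include <cstdio>
    #include <thread>

    // Store-buffering (SB) litmus test. With memory_order_relaxed the outcome
    // r1 == 0 && r2 == 0 is allowed, and it can show up on real hardware
    // (x86 included, thanks to store buffers). With seq_cst it is forbidden.
    std::atomic<int> x{0}, y{0};
    int r1, r2;

    int main() {
        for (int i = 0; i < 100000; ++i) {
            x.store(0); y.store(0);
            std::thread t1([] {
                x.store(1, std::memory_order_relaxed);
                r1 = y.load(std::memory_order_relaxed);
            });
            std::thread t2([] {
                y.store(1, std::memory_order_relaxed);
                r2 = x.load(std::memory_order_relaxed);
            });
            t1.join(); t2.join();
            if (r1 == 0 && r2 == 0)
                std::printf("both reads saw 0 on iteration %d\n", i);
        }
    }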
Have I told you about our lord and savior Rust?<p>Anyways, <a href="https://github.com/tokio-rs/loom" rel="nofollow">https://github.com/tokio-rs/loom</a> is used by any serious library doing atomic ops/synchronization, and it blew me away with how fast it can catch most bugs like this.
My first-gen Threadripper occasionally deadlocks in futex code within libgomp (the GNU implementation of OpenMP). Eventually I gave up and concluded it was either a hardware bug or a software bug that incorrectly relies on the atomic behaviour of Intel CPUs. I eventually switched to Clang with its own OpenMP implementation, and the problem magically disappeared.
> Nowadays, high-performance processors, like those found in desktops, servers, and phones, are massively out-of-order to exploit instruction-level parallelism as much as possible. They perform all sorts of tricks to improve performance.<p>Relevant quote from Jim Keller: You run this program a hundred times, it never runs the same way twice. Ever.
Great write-up!<p>There may be a typo in section 3:<p>> It will happily retire instruction 6 before instruction 5.<p>If memory serves, although instructions can execute out-of-order, they retire in-order (hence the "re-order buffer").
The best part is that the original code is not safe even on x86, as the compiler can still reorder the non-volatile accesses to backing_buf around the volatile accesses to head and tail. Compiler barriers before the volatile stores and after the volatile reads are required [1]. It would still be very questionable code, but it would at least have a chance to work on its intended target.<p>tl;dr: just use std::atomic.<p>[1] it is of course possible they are actually present in the original code and just omitted from the explanation for brevity
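To make the compiler-barrier point concrete, a hypothetical sketch of the shape such a patch-up would take (GCC/Clang asm syntax; the struct, names, and sizes are made up, not the repo's code). The empty asm with a "memory" clobber only stops the compiler from moving the buffer access across the volatile index access -- it emits no CPU fence, the code is still a data race as far as the standard is concerned, and ARM still needs real barriers. Hence the tl;dr.

    struct Queue {
        int backing_buf[256];
        volatile unsigned head = 0;  // next slot to write; bumped by the producer
        volatile unsigned tail = 0;  // next slot to read; bumped by the consumer
    };

    void produce(Queue& q, int value) {
        unsigned h = q.head;
        while (h - q.tail == 256) { /* spin: queue full */ }
        asm volatile("" ::: "memory");       // after the volatile read: don't hoist the buffer write above it
        q.backing_buf[h % 256] = value;
        asm volatile("" ::: "memory");       // before the volatile store: don't sink the buffer write below it
        q.head = h + 1;
    }

    int consume(Queue& q) {
        unsigned t = q.tail;
        while (q.head == t) { /* spin: queue empty */ }
        asm volatile("" ::: "memory");       // after the volatile read: don't hoist the buffer read above it
        int value = q.backing_buf[t % 256];
        asm volatile("" ::: "memory");       // before the volatile store: don't sink the buffer read below it
        q.tail = t + 1;
        return value;
    }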
There is a proposal, P1152 "Deprecating volatile" (parts of which were adopted for C++20), to deprecate many uses of 'volatile' in C++.<p><a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p1152r0.html" rel="nofollow">http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p115...</a>