
ARM and Lock-Free Programming

229 points by markdog12, over 4 years ago

17 comments

runeks, over 4 years ago
I appreciate learning more about what, exactly, ARM's "weaker memory model" constitutes. It's clearer to me after reading this article.

I wonder how much the gain in performance of e.g. Apple's M1 chip, compared to an x86 CPU, can be attributed to this weaker constraint. Given that the M1 can outperform an x86 CPU even when emulating x86 code, perhaps it's not much.

Also, I suspect programming languages that are immutable by default will gain a larger advantage using ARM's weaker memory model, as the compiler can more often safely let the CPU perform reordering (due to not having to wait for a mutable variable being updated until it can execute a subsequent line of code which depends on this updated variable).
Comment #25263461 not loaded
Comment #25265755 not loaded
Comment #25264217 not loaded
Comment #25263299 not loaded
dataflow, over 4 years ago
If you want to dig deeper, watch these videos: https://herbsutter.com/2013/02/11/atomic-weapons-the-c-memory-model-and-modern-hardware/
Comment #25263331 not loaded
Comment #25275501 not loaded
cesaref, over 4 years ago
The most annoying aspect of lock-free programming is the amount of 'roll your own' that goes on, and having to inspect this stuff to ensure it's got a chance of working on the platforms you care about.

I totally get that this technique is inappropriate for many situations where a lock works well enough, but I work in a field (audio) where accidental use of locks and system calls is in general a larger problem than incorrect lock-free code.
Comment #25263397 not loaded
jnwatson, over 4 years ago
This is common knowledge in embedded circles, where weak memory models are the norm.

This article doesn't address a different kind of memory weakness in *some* ARM implementations (and I presume other architectures) that you can run into when sharing memory between processes.

The data cache is virtually tagged and virtually indexed, which means that the cache is keyed with the domain (essentially the process number) plus the virtual address. This means that the MMU doesn't have to be consulted for cache lookups. However, if two different processes map the same physical memory, the writing process must execute a flush instruction and the reading process must execute an invalidate instruction.
Comment #25268086 not loaded
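A rough illustrative sketch of the flush-then-invalidate protocol jnwatson describes, assuming two processes sharing one buffer. The dcache_flush_range and dcache_invalidate_range functions are placeholder names invented here, not a real API; on real hardware they would map to platform-specific cache-maintenance instructions or kernel services.

```cpp
// Hedged sketch only: dcache_flush_range/dcache_invalidate_range are
// hypothetical placeholders, not a real library API.
#include <cstddef>
#include <cstring>

// Placeholder: write dirty cache lines covering [addr, addr+len) back to memory.
void dcache_flush_range(const void* addr, std::size_t len) { (void)addr; (void)len; }

// Placeholder: discard any (possibly stale) cache lines covering [addr, addr+len).
void dcache_invalidate_range(const void* addr, std::size_t len) { (void)addr; (void)len; }

// Writer process: fill the shared buffer, then flush so the data actually
// reaches physical memory (the reader's virtually tagged cache lines are
// keyed differently and will not see the writer's dirty lines).
void writer(char* shared, std::size_t len) {
    std::memset(shared, 0x42, len);
    dcache_flush_range(shared, len);
}

// Reader process: invalidate first so the subsequent reads miss the cache
// and fetch the freshly written data from memory.
char reader(const char* shared, std::size_t len) {
    dcache_invalidate_range(shared, len);
    return shared[0];
}
```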
alvarelle, over 4 years ago
> ... std::atomic<bool>. This tells the compiler not to elide reads and writes of this variable, ...

If I'm not mistaken, this is not true. The compiler is still allowed to elide reads and writes on atomic variables (for example, merging two consecutive writes, or removing unused reads).
Comment #25264623 not loaded
Comment #25266603 not loaded
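For context, a minimal sketch (names invented here) of the flag-publication pattern the quoted article text is about, using std::atomic<bool> with explicit release/acquire ordering. As alvarelle notes, this does not make atomics immune to optimization: the standard still permits, for example, coalescing two consecutive stores to `ready`, as long as the ordering guarantees are preserved.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                    // plain, non-atomic data
std::atomic<bool> ready{false};     // publication flag

void producer() {
    payload = 42;                                   // write the data first
    ready.store(true, std::memory_order_release);   // then publish it
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) {
        // spin until the producer publishes
    }
    assert(payload == 42);  // guaranteed once the acquire load observes true
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}
```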
xmodem, over 4 years ago
Aside - calling it now - but the M1's x86 memory model mode is a temporary thing that'll go away in a few years along with Rosetta 2, once Apple is happy with the level of universal binary support.
Comment #25265829 not loaded
Comment #25265546 not loaded
JoeAltmaier, over 4 years ago
Very educational.

I take exception to the common assertion that 'code is broken' if it doesn't work everywhere, for every architecture. Code deployed for a particular purpose on a particular machine with a particular toolchain can work great and be 'correct' for that environment. I work in such environments all the time. And taking the time/performance hit seeking ill-conceived perfection wastes the client's time and money.

But for the vast majority of cloud-deployed or open-source projects, it's certainly wise to err on the side of 'perfect'.
Comment #25266553 not loaded
vbezhenar, over 4 years ago
What I'm afraid of is that with the introduction of ARM desktops, there will be subtle bugs which did not manifest on x86. Those bugs are hard to find in old legacy multi-million-LoC codebases with lots of spaghetti code nobody bothered to refactor, so those bugs will haunt users for years. That's why I'm reluctant to move to this architecture, at least for a few years, even if it's superior in theory.

Thankfully ARM phones have been a thing for a long time, so many libraries should work well.
fulafel, over 4 years ago
Piggybacking on this question for people interested in the topic:

Which everyday high-level data structures that you use are built on lock-free implementations?

Clojure's maps/vectors (which are persistent data structures) come to mind for me, for one.
Comment #25263043 not loaded
Comment #25263395 not loaded
Comment #25262936 not loaded
Comment #25274042 not loaded
Comment #25263077 not loaded
Comment #25265022 not loaded
Comment #25263039 not loaded
Comment #25263129 not loaded
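One low-level answer to the question above is the textbook Treiber stack, the kind of compare-and-swap loop many higher-level lock-free containers reduce to. Here is a minimal C++ sketch with invented names; to stay short it never reclaims popped nodes, which sidesteps the ABA and memory-reclamation problems a production implementation must solve.

```cpp
#include <atomic>
#include <optional>

template <typename T>
class TreiberStack {
    struct Node {
        T value;
        Node* next;
    };
    std::atomic<Node*> head_{nullptr};

public:
    void push(T value) {
        Node* n = new Node{std::move(value), head_.load(std::memory_order_relaxed)};
        // Retry until our node is spliced onto the current head.
        while (!head_.compare_exchange_weak(n->next, n,
                                            std::memory_order_release,
                                            std::memory_order_relaxed)) {
        }
    }

    std::optional<T> pop() {
        Node* n = head_.load(std::memory_order_acquire);
        // Retry until the current head is unlinked (or the stack is empty).
        while (n && !head_.compare_exchange_weak(n, n->next,
                                                 std::memory_order_acquire,
                                                 std::memory_order_acquire)) {
        }
        if (!n) return std::nullopt;
        return std::move(n->value);  // n is deliberately leaked in this sketch
    }
};
```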
jhallenworld, over 4 years ago
I thought barriers were required on x86 also. There are a few cases where ordering is not guaranteed. [Also the consumer should have a barrier between the test and the data read.]

https://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/

The thing that's nice about x86 is that DMA is cache coherent; this is true for backwards compatibility with very old hardware.
Comment #25265952 not loaded
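A sketch of the classic store-buffer litmus test the linked post walks through (names invented here): with relaxed atomics, each thread's later load may be satisfied before its own earlier store becomes globally visible, so r1 == 0 && r2 == 0 is a legal outcome even on x86. Making these accesses seq_cst (the default) forces the compiler to emit the necessary barrier instructions on both x86 and ARM.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

void thread1() {
    x.store(1, std::memory_order_relaxed);
    r1 = y.load(std::memory_order_relaxed);  // may effectively move before the store above
}

void thread2() {
    y.store(1, std::memory_order_relaxed);
    r2 = x.load(std::memory_order_relaxed);  // may effectively move before the store above
}

int main() {
    std::thread a(thread1), b(thread2);
    a.join();
    b.join();
    // With relaxed ordering, "r1=0 r2=0" is a permitted result, even on x86.
    std::printf("r1=%d r2=%d\n", r1, r2);
}
```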
andy_threos_io, over 4 years ago
Lock-free programming is essential in high-performance networking (10G+ network connections) and data communications. Most of the implementations use ring buffers. There are multiple-producer/multiple-consumer implementations as well.

But it's not a real issue for the new M1 minis, as their networking is down to 1 Gbit (from the previous mini's 10G).
Comment #25267411 not loaded
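For illustration, a minimal single-producer/single-consumer ring buffer of the kind such networking code tends to use (SpscRing is an invented name, a sketch rather than any particular library). Each index is written by exactly one side, so acquire/release on the indices is the only synchronization required.

```cpp
#include <atomic>
#include <cstddef>
#include <optional>

template <typename T, std::size_t Capacity>  // Capacity must be a power of two
class SpscRing {
    T buf_[Capacity];
    std::atomic<std::size_t> head_{0};  // consumed count, written only by the consumer
    std::atomic<std::size_t> tail_{0};  // produced count, written only by the producer

public:
    bool push(const T& v) {                       // producer thread only
        std::size_t t = tail_.load(std::memory_order_relaxed);
        if (t - head_.load(std::memory_order_acquire) == Capacity)
            return false;                         // full
        buf_[t & (Capacity - 1)] = v;
        tail_.store(t + 1, std::memory_order_release);  // publish the filled slot
        return true;
    }

    std::optional<T> pop() {                      // consumer thread only
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return std::nullopt;                  // empty
        T v = buf_[h & (Capacity - 1)];
        head_.store(h + 1, std::memory_order_release);  // release the slot for reuse
        return v;
    }
};
```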
KingMachiavelli, over 4 years ago
Do Rust and its borrow checker make this easier/implicit?

If compiled code will have to work on both RISC and CISC for the near future, does the burden of doing lock-free C++ by identifying all the memory conditions outweigh the learning curve of Rust?

I don't know Rust beyond dead-simple projects, just curious.
Comment #25262902 not loaded
Comment #25263096 not loaded
Comment #25263725 not loaded
Comment #25263919 not loaded
marcan_42, over 4 years ago
This reminds me of some lock-free code I fixed recently (by introducing a very peculiar kind of lock):

https://github.com/Ardour/ardour/blob/master/libs/pbd/pbd/rcu.h

The goal of the code is to allow a producer thread to update a pointer to an object in memory, which may be used lazily by consumer threads. Consumers are allowed to use stale versions of the object, but must not see inconsistent state. Obsolete objects must eventually be freed, but only after all consumers are done using them. And here's the catch: consumers are real-time code, so they must never block on a lock, nor are they allowed to free or allocate memory on their own.

The design is based around Boost's shared_ptr, so that had to stay. But the original code was subtly broken: the object is passed around by passing a pointer to a shared_ptr (so double indirection there) which gets cloned, but there was a race condition where the original shared_ptr might be freed by the producer while the consumer is in the process of cloning it.

My solution ended up being to introduce a "lopsided spinlock". Consumers aren't allowed to block, but producers are. So consumers can locklessly signal, via an atomic variable, that they are currently inside their critical section. Then the producer can atomically (locklessly) swap the pointer at any time, but *must* spin (is this a half-lock?) until no consumers are in the critical section before freeing the old object. This ensures that any users of the old pointer have completed their clone of the shared_ptr, and therefore freeing the old one will no longer cause a use-after-free. Finally, the producer holds on to a reference to the object until it can prove that no consumers remain, to guarantee that the memory deallocation always happens in producer context. shared_ptr then ensures that the underlying object lives on until it is no longer needed.

It's kind of weird to see a half-spinlock like this, where one side has a "critical section" with no actual locking, and the other side waits for a "lock" to be released but then doesn't actually lock anything itself. But if you follow the sequence of operations, it all works out and is correct (as far as I can tell).

For various reasons this code has to build with older compilers, hence can't use C++ atomics. That's why it's using glib ones.

(Note: I'm not 100% sure there are no hidden allocations breaking the realtime requirement on the reader side; I went into this codebase to fix the race condition, but I haven't attempted to prove that the realtime requirement was properly respected in the original code; you'd have to carefully look at what shared_ptr does behind the scenes.)
Comment #25265798 not loaded
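A much-simplified sketch of the "lopsided spinlock" idea described above, not the actual pbd/rcu.h code: one real-time consumer instead of many, std::shared_ptr and std::atomic instead of Boost and glib, and invented names throughout.

```cpp
#include <atomic>
#include <memory>

template <typename T>
class LopsidedBox {
    std::shared_ptr<T> current_;            // owned and replaced by the producer only
    std::atomic<T*> raw_{nullptr};          // what the consumer actually dereferences
    std::atomic<bool> in_critical_{false};  // lock-free "I'm currently reading" flag

public:
    explicit LopsidedBox(std::shared_ptr<T> init)
        : current_(std::move(init)), raw_(current_.get()) {}

    // Consumer (real-time) side: never blocks, never allocates or frees.
    // The returned pointer is only valid until reader_exit().
    T* reader_enter() {
        in_critical_.store(true, std::memory_order_seq_cst);
        // The store->load pair here (and in update) is exactly the StoreLoad
        // case discussed elsewhere in the thread, hence seq_cst on both sides.
        return raw_.load(std::memory_order_seq_cst);
    }
    void reader_exit() {
        in_critical_.store(false, std::memory_order_release);
    }

    // Producer side: may block. Publish the new object, then spin (the
    // "half-lock") until the consumer is out of its critical section before
    // dropping the old reference, so the old object is only ever freed here.
    void update(std::shared_ptr<T> next) {
        raw_.store(next.get(), std::memory_order_seq_cst);
        while (in_critical_.load(std::memory_order_seq_cst)) {
            // spin: the consumer may still be using the old pointer
        }
        current_ = std::move(next);  // old object destroyed in producer context
    }
};
```

The trade-off mirrors the one marcan_42 describes: the producer may spin, the consumer never does, and object destruction is confined to the producer's context.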
bullen, over 4 years ago
As long as this code runs fine on ARM we're going to be ok:

http://move.rupy.se/file/atomic.txt
mlindner, over 4 years ago
This guy is a member of Antifa. I'm not a fan of people reposting articles by known left-wing terrorist organizations on hacker news.
signa11, over 4 years ago
Previously: https://news.ycombinator.com/item?id=25255113
Comment #25262593 not loaded
ncmncm, over 4 years ago
The importance and value of lock-free programming is *massively* overrated.

I routinely improve performance and also reliability of systems by deleting lock-free queues. The secret to both is reduced coupling, which lock-free methods do nothing to help with. Atomic operations invoke hardware mechanisms that, at base, are hardly different from locks. While they won't deadlock by themselves, they are so hard to get right that failures equally as bad as deadlocking are hard to avoid.

So, replace threads with separate processes that communicate via ring buffers, and batch work to make interaction less frequent. With interaction infrequent enough, time spent synchronizing, when you actually do it, becomes negligible.
Comment #25264023 not loaded
Comment #25264508 not loaded