The author has an implicit definition of "faster" that it's important to be aware of.<p>The main use of spinlocks that I'm aware of is minimising <i>latency</i> in inter-processor communication.<p>That is, if you have a worker task that is waiting for a supervisor task to tell it to do something, then to minimise the time between the supervisor giving the order and the worker getting to work, use a spinlock.<p>For this to really work, you need to implement both tasks as threads pinned to dedicated cores, so they won't be preempted. You will burn a huge amount of CPU doing this, so it won't be "faster" from a throughput point of view. But the latency will be as low as it's possible to go.
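A minimal sketch of that pattern, with names of my own choosing rather than anything from the article (core pinning is elided because the affinity call is platform-specific):<p><pre><code> use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    // Supervisor publishes a "command" sequence number; the worker spins on it.
    // In a real setup both threads would be pinned to dedicated cores via the
    // platform's affinity API (sched_setaffinity etc.), which is omitted here.
    let cmd = Arc::new(AtomicU64::new(0));

    let worker = {
        let cmd = Arc::clone(&cmd);
        thread::spawn(move || {
            let mut seen = 0;
            while seen < 3 {
                let cur = cmd.load(Ordering::Acquire);
                if cur != seen {
                    // Reaction latency is roughly one cache-line transfer.
                    seen = cur;
                    println!("worker: saw command {}", seen);
                } else {
                    std::hint::spin_loop(); // burn the core; emits PAUSE on x86
                }
            }
        })
    };

    for i in 1..=3u64 {
        cmd.store(i, Ordering::Release); // supervisor "gives the order"
        thread::sleep(std::time::Duration::from_millis(1));
    }
    worker.join().unwrap();
}
</code></pre>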
This experiment is a bit weird. If you look at <a href="https://github.com/matklad/lock-bench" rel="nofollow">https://github.com/matklad/lock-bench</a>, this was run on a machine with 8 logical CPUs, but the test is using 32 threads. It's not that surprising that spin locks do badly when you run 4x as many threads as there are CPUs.<p>I did a quick test on my Mac using 4 threads instead. At "heavy contention" the spin lock is actually 22% faster than parking_lot::Mutex. At "extreme contention", the spin lock is 22% slower than parking_lot::Mutex.<p>Heavy contention run:<p><pre><code> $ cargo run --release 4 64 10000 100
    Finished release [optimized] target(s) in 0.01s
     Running `target/release/lock-bench 4 64 10000 100`
Options {
    n_threads: 4,
    n_locks: 64,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 2.822382ms  min 1.459601ms  max 3.342966ms
parking_lot::Mutex   avg 1.070323ms  min 760.52µs    max 1.212874ms
spin::Mutex          avg 879.457µs   min 681.836µs   max 990.38µs
AmdSpinlock          avg 915.096µs   min 445.494µs   max 1.003548ms

std::sync::Mutex     avg 2.832905ms  min 2.227285ms  max 3.46791ms
parking_lot::Mutex   avg 1.059368ms  min 507.346µs   max 1.263203ms
spin::Mutex          avg 873.197µs   min 432.016µs   max 1.062487ms
AmdSpinlock          avg 916.393µs   min 568.889µs   max 1.024317ms
</code></pre>
Extreme contention run:<p><pre><code> $ cargo run --release 4 2 10000 100
    Finished release [optimized] target(s) in 0.01s
     Running `target/release/lock-bench 4 2 10000 100`
Options {
    n_threads: 4,
    n_locks: 2,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 4.552701ms  min 2.699316ms  max 5.42634ms
parking_lot::Mutex   avg 2.802124ms  min 1.398002ms  max 4.798426ms
spin::Mutex          avg 3.596568ms  min 1.66903ms   max 4.290803ms
AmdSpinlock          avg 3.470115ms  min 1.707714ms  max 4.118536ms

std::sync::Mutex     avg 4.486896ms  min 2.536907ms  max 5.821404ms
parking_lot::Mutex   avg 2.712171ms  min 1.508037ms  max 5.44592ms
spin::Mutex          avg 3.563192ms  min 1.700003ms  max 4.264851ms
AmdSpinlock          avg 3.643592ms  min 2.208522ms  max 4.856297ms</code></pre>
Modern user-mode pthread mutexes use the futex kernel syscall [1] to implement the lock, but avoid the syscall in the non-contended case, so they can be very fast, acquiring the lock entirely in user-mode code. I'm not sure whether Rust's mutex API is a wrapper around the pthread mutex or calls the futex syscall directly.<p>Basically, the user-mode mutex lock is implemented as:<p><pre><code> // In user mode: if the lock word is free (0), atomically set it to 1 and we're done
 while !atomic_compare_and_swap(&lock, 0, 1)
     // Lock is held: ask the kernel to sleep us while the word still reads 1;
     // the unlocker will set it back to 0 and call futex_wake
     futex_wait(&lock, 1) // syscall to kernel
</code></pre>
When futex_wait() returns, the unlocking thread has typically set the flag back to 0 and called futex_wake, and the kernel wakes my thread so it can check again. However, another thread may have grabbed the lock in the meantime, so I need to loop back and re-check; the atomic CAS is what actually acquires the lock.<p>[1] <a href="https://github.com/lattera/glibc/blob/master/nptl/pthread_mutex_lock.c#L168" rel="nofollow">https://github.com/lattera/glibc/blob/master/nptl/pthread_mu...</a><p>Edit: atomic_compare_and_swap can compile down to a single CMPXCHG instruction, so acquiring an uncontested lock is very fast.
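For the curious, here is a compilable Linux-only sketch of that two-state scheme in Rust, using the libc crate; real implementations (glibc, parking_lot) additionally track whether anyone is actually waiting, so the uncontended unlock can skip the futex_wake syscall as well:<p><pre><code> use std::sync::atomic::{AtomicU32, Ordering};

pub struct FutexLock {
    state: AtomicU32, // 0 = unlocked, 1 = locked
}

impl FutexLock {
    pub const fn new() -> Self {
        Self { state: AtomicU32::new(0) }
    }

    // Thin wrapper over the futex syscall; the kernel treats the AtomicU32 as
    // a plain u32 word.
    fn futex(&self, op: libc::c_int, val: u32) {
        unsafe {
            libc::syscall(
                libc::SYS_futex,
                &self.state as *const AtomicU32,
                op,
                val,
                std::ptr::null::<libc::timespec>(),
            );
        }
    }

    pub fn lock(&self) {
        // Fast path: CAS 0 -> 1 entirely in user space, no syscall.
        while self
            .state
            .compare_exchange(0, 1, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            // Slow path: sleep in the kernel as long as the word still reads 1.
            self.futex(libc::FUTEX_WAIT, 1);
        }
    }

    pub fn unlock(&self) {
        self.state.store(0, Ordering::Release);
        // Wake at most one sleeper (a waiter count would let us skip this
        // syscall when nobody is sleeping).
        self.futex(libc::FUTEX_WAKE, 1);
    }
}
</code></pre>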
This is pretty much the conclusion in this game-related post on spinlocks:
<a href="https://probablydance.com/2019/12/30/measuring-mutexes-spinlocks-and-how-bad-the-linux-scheduler-really-is/" rel="nofollow">https://probablydance.com/2019/12/30/measuring-mutexes-spinl...</a>
> Second, the uncontended case looks like
> Parking_lot::Mutex avg 6ms min 4ms max 9ms<p>This estimate is way too high for the uncontested mutex case. On a modern Linux/Xeon system using GCC, an uncontested mutex lock/unlock is well under 1 <i>microsecond</i>.<p>I have a lot of experience here from writing low-latency financial systems. The hot path we use is littered with uncontested mutex lock/unlock, and the whole path still runs under 20 microseconds. (With the vast majority of that time unrelated to mutex acquisition.)<p>The benchmark used in the blog post must be spending the vast majority of its time in some section of code that has nothing to do with lock/unlock.
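If you want to sanity-check the uncontended cost on your own machine, a rough measurement (methodology only; results will vary by hardware and OS) looks like this:<p><pre><code> use std::sync::Mutex;
use std::time::Instant;

fn main() {
    let m = Mutex::new(0u64);
    let iters: u64 = 10_000_000;

    let start = Instant::now();
    for _ in 0..iters {
        // Single thread, so every acquisition takes the uncontended fast path.
        *m.lock().unwrap() += 1;
    }
    let elapsed = start.elapsed();

    println!(
        "{:.1} ns per uncontended lock/unlock",
        elapsed.as_nanos() as f64 / iters as f64
    );
}
</code></pre>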
This absolutely makes sense in userspace. The most important escape hatch for a spinning lock is being able to yield to the scheduler instead of burning CPU time on the core, but from userspace that yield is itself a syscall, which defeats the purpose of using a spinlock in the first place.
How about a new opcode that waits until a read of a memory address returns a given value? That would allow implementing a power-efficient spinlock.<p>x86 gets part of the way there: MONITOR/MWAIT waits for a write to a monitored address (but is privileged on most parts), and what user code normally gets is PAUSE, a hint that makes the spin-wait loop itself cheaper rather than actually waiting on a value: <a href="https://www.felixcloutier.com/x86/pause" rel="nofollow">https://www.felixcloutier.com/x86/pause</a><p>Edit: related post from 2018 <a href="https://news.ycombinator.com/item?id=17336853" rel="nofollow">https://news.ycombinator.com/item?id=17336853</a>
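From Rust, the usual way to emit PAUSE is std::hint::spin_loop(); here is a hedged sketch of a bounded spin that falls back to yielding (the 100-iteration bound is an arbitrary choice of mine):<p><pre><code> use std::sync::atomic::{AtomicBool, Ordering};

// Try to take a simple test-and-set flag: spin briefly with PAUSE, then start
// yielding the timeslice instead of burning the whole core.
fn acquire(flag: &AtomicBool) {
    let mut spins = 0;
    while flag
        .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
        .is_err()
    {
        if spins < 100 {
            std::hint::spin_loop(); // lowers to PAUSE on x86, YIELD on ARM
            spins += 1;
        } else {
            std::thread::yield_now(); // give the scheduler a chance (a syscall)
        }
    }
}

fn release(flag: &AtomicBool) {
    flag.store(false, Ordering::Release);
}

fn main() {
    let flag = AtomicBool::new(false);
    acquire(&flag);
    release(&flag);
}
</code></pre>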
Note that all of the locks tested here are unfair, which is why they all show very high waiting variance. Until recently many mutex implementations aimed for fairness, which made them much slower than spinlocks in microbenchmarks like this.
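For reference, the textbook fair spinlock is a ticket lock, which hands the lock out in FIFO order; a sketch (not one of the implementations benchmarked above):<p><pre><code> use std::sync::atomic::{AtomicUsize, Ordering};

// Each locker takes a ticket and spins until "now serving" reaches it, so the
// lock is granted in FIFO order instead of to whoever wins the cache-line race.
pub struct TicketLock {
    next_ticket: AtomicUsize,
    now_serving: AtomicUsize,
}

impl TicketLock {
    pub const fn new() -> Self {
        Self {
            next_ticket: AtomicUsize::new(0),
            now_serving: AtomicUsize::new(0),
        }
    }

    pub fn lock(&self) {
        let my_ticket = self.next_ticket.fetch_add(1, Ordering::Relaxed);
        while self.now_serving.load(Ordering::Acquire) != my_ticket {
            std::hint::spin_loop(); // a real one would back off proportionally
        }
    }

    pub fn unlock(&self) {
        // Only the holder advances the counter, handing off to exactly the
        // next ticket in line.
        self.now_serving.fetch_add(1, Ordering::Release);
    }
}

fn main() {
    let lock = TicketLock::new();
    lock.lock();
    lock.unlock();
}
</code></pre>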
This comes around every so often, and it isn't very interesting in that the best mutexes basically spin once or a few times and then fall back to blocking. It isn't a true pure-spinlock vs. pure-blocking-lock (mutex/futex) fight.<p>I think the Linux futex could be implemented through the vDSO (can somebody correct me on this?), which would eliminate the worst of the syscall costs.<p>His benchmark is weird, but maybe I'm reading it wrong:<p>* Instead of reducing the thread count to reduce contention, he appears to increase the number of locks available. This is still a bad scenario for spinlocks, since they still interact badly with the scheduler (they use a large amount of CPU time, get evicted from the run queue, and need to be rescheduled).<p>* Also, I didn't see him pin any of the threads, so all those threads will end up sharing some CPUs, since the OS needs to do some work on them too.<p>* And Rust can be a little hard to read, but it seems he packed his locks on the same cache line? I don't see any padding in his AmdSpinlock struct. That would be a huge blow for the spinlock because of false sharing: he still gets all the cache-coherence traffic because of it (see the padding sketch below).<p>The worst cases for the spinlock come from not understanding scheduling costs and the cache thrashing that can occur.<p>What they call the AmdSpinlock (basically just a regular, sane spinlock that tries to prevent cache thrashing) has its best performance with a low number of threads assigned to different cores under the same L3 segment.<p>(Does anybody know if AMD's new microarchitecture moved away from the MOESI/directory-based cache towards Intel's MESIF/snoop model?)<p>The MOESI model might have performed better in the worst-case scenario here, since it doesn't need to write the cache line back and can just forward the dirty line around, as it keeps track of who owns it.<p>And if you run under a MESIF-based cache and can keep your traffic local to your L3 segment, you are backstopped there and never need to go anywhere else.<p>A spinlock is a performance optimization and should be treated as one. You need intimate knowledge of the architecture you are running on and the resources being used.<p>(edit: answered my own question; apparently the vDSO is still pretty limited in what it exports, so no, it isn't happening at this time from what I can tell.)
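The padding fix for that false-sharing point is a one-attribute change; a sketch, assuming a 64-byte cache line:<p><pre><code> use std::mem::size_of;
use std::sync::atomic::AtomicBool;

// Without the alignment attribute, many 1-byte lock words get packed into the
// same 64-byte cache line, so threads hammering *different* locks still fight
// over one line (false sharing).
#[repr(align(64))]
struct PaddedSpinlock {
    locked: AtomicBool,
}

fn main() {
    let locks: Vec<PaddedSpinlock> = (0..8)
        .map(|_| PaddedSpinlock { locked: AtomicBool::new(false) })
        .collect();
    // Each element now occupies a full line: 64 bytes, not 1.
    println!("per-lock size: {} bytes", size_of::<PaddedSpinlock>());
    println!("lock count: {}", locks.len());
}
</code></pre>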
Not an expert here.<p>In a spin lock, the lock state is checked in a tight loop by all waiters, using atomic reads. Each check needs the current value of the lock word, so under contention the cache line holding it keeps bouncing between cores through the coherence protocol, and I would expect that overhead to grow with the number of cores. On NUMA, moving the line between sockets is more expensive still.<p>Hence, spin locks pay an additional cost of loading and evaluating the lock word on every spin, as opposed to a mutex waiter that sleeps and gets woken up when the lock is released (more like a callback).
TFA makes the point that modern "mutex" implementations actually spin first and only fall back to a heavy kernel wait if there is contention. So the title is click-baity. Mutexes <i>are</i> slower than spinlocks. The "faster than spinlocks" mutexes in this article are really spinlocks that fall back to mutexes.<p>Then the benchmark uses spinlocks in situations that spinlocks aren't great for. And, surprise, spinlocks are slower than spinlocks-with-mutexes.<p>Spinlocks are great in situations such as:<p><pre><code> * There are far more resources than threads
 * The probability of actually having to spin is small,
   ideally because the time spent holding the lock is a few instructions
 * When you can't use optimistic concurrency*
</code></pre>
* because perhaps the number of memory locations to track is too complicated for my poor brain and I can't be arsed to remember how TLA+ works<p>There are plenty of linked-list implementations, for example, that use optimistic concurrency. At that point you've got yourself a thread-safe message queue, and that might be better than mutexes too.
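The optimistic pattern in its smallest form, as a sketch (a shared counter stands in for the linked-list cases): read the current value with no lock held, compute the update off to the side, then commit only if nothing changed, retrying otherwise:<p><pre><code> use std::sync::atomic::{AtomicU64, Ordering};

fn optimistic_add(counter: &AtomicU64, delta: u64) -> u64 {
    loop {
        let current = counter.load(Ordering::Relaxed); // optimistic read, no lock held
        let proposed = current.wrapping_add(delta);    // compute the new value privately
        match counter.compare_exchange_weak(
            current,
            proposed,
            Ordering::AcqRel,
            Ordering::Relaxed,
        ) {
            Ok(_) => return proposed, // nobody raced us: the commit stuck
            Err(_) => continue,       // someone else changed it first: retry
        }
    }
}

fn main() {
    let counter = AtomicU64::new(0);
    println!("{}", optimistic_add(&counter, 5)); // prints 5
}
</code></pre>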
Coming as a ruby developer who dabbles in async rust in my free time, these posts/threads/links have been my best read of 2020 so far. My CS curriculum barely covered real-world lock usage, much less the ties to linux/windows schedulers + modern CPU caching. Big thanks all around to these internet commentators.
As the person who added a bunch of functionality to spin-rs to make it roughly match the std API: yes, you should not use spinlocks in scheduled code.<p>That said, I see why this happens; Rust makes it annoying. I want lots of code to work in no-std so that we have a robust embedded and kernel ecosystem. It would be really nice to abstract over the locking implementation with type parameters, but that requires "higher-kinded types", which Rust doesn't have. The alternative is relying on carefully coordinated Cargo features, which is much more flaky and hard to audit (e.g. with the type system). Given that, I am not sure people over-use spin locks.
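The workaround the ecosystem has mostly converged on is to be generic over a small "raw lock" trait rather than over the mutex type constructor itself; this is a sketch in the spirit of the lock_api crate, not spin-rs's actual API:<p><pre><code> use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicBool, Ordering};

// The data structure is generic over the raw lock, so no-std code can plug in
// a spinlock while hosted code plugs in an OS-backed lock.
pub trait RawLock: Default {
    fn acquire(&self);
    fn release(&self);
}

#[derive(Default)]
pub struct RawSpin(AtomicBool);

impl RawLock for RawSpin {
    fn acquire(&self) {
        while self.0.swap(true, Ordering::Acquire) {
            std::hint::spin_loop();
        }
    }
    fn release(&self) {
        self.0.store(false, Ordering::Release);
    }
}

pub struct Lock<R: RawLock, T> {
    raw: R,
    data: UnsafeCell<T>,
}

// Safe to share across threads because `with` only hands out the &mut T while
// the raw lock is held.
unsafe impl<R: RawLock + Sync, T: Send> Sync for Lock<R, T> {}

impl<R: RawLock, T> Lock<R, T> {
    pub fn new(value: T) -> Self {
        Self { raw: R::default(), data: UnsafeCell::new(value) }
    }

    pub fn with<U>(&self, f: impl FnOnce(&mut T) -> U) -> U {
        self.raw.acquire();
        let result = f(unsafe { &mut *self.data.get() });
        self.raw.release();
        result
    }
}

fn main() {
    let counter: Lock<RawSpin, u32> = Lock::new(0);
    counter.with(|n| *n += 1);
    println!("{}", counter.with(|n| *n));
}
</code></pre>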
Thread scheduling (or waking up cores) is slow.
Because of this, mutexes will look better in dumb benchmarks: the contending threads keep going to sleep, while the single successful owner gets practically uncontended access.
Most mutex implementations spin for a bit before actually blocking.<p>Also, there are better spinlock implementations, such as speculative spinlocks and queued locks.
Fun explanation of what a Mutex is: <a href="https://stackoverflow.com/questions/34524/what-is-a-mutex" rel="nofollow">https://stackoverflow.com/questions/34524/what-is-a-mutex</a>