Please note that I replied to this blog post when it was published. You may be interested in reading the counterarguments. The author of the original post acknowledged that at least one of his claims was wrong, but I don't know if they modified the post later. Anyway, we have gained some time perspective, too: Redlock has been used massively over the last 10 years, and I have never heard of its properties failing in practice: that is, no incidents so far.<p><a href="http://antirez.com/news/101" rel="nofollow">http://antirez.com/news/101</a>
I did a simple Paxos implementation a while ago and turned it into educational code, well documented and accompanied by a talk and a workshop. May it be useful for anyone wanting to learn about distributed locking from first principles.<p>Code: <a href="https://github.com/danrl/skinny">https://github.com/danrl/skinny</a><p>Talks: <a href="https://youtu.be/nyNCSM4vGF4" rel="nofollow">https://youtu.be/nyNCSM4vGF4</a> and <a href="https://youtu.be/gAGGPaFDfwE" rel="nofollow">https://youtu.be/gAGGPaFDfwE</a>
This is very interesting, thank you for this.<p>I am trying to implement efficient mutual exclusion in Java, without blocking and without any mutexes other than the communication mutexes I use to provide thread safety. (I think message passing still needs some form of thread safety to transfer data, such as a mutex or a lock-free algorithm, since you cannot have concurrent reads and writes to the same variable without problems.)<p>If the solution is proven safe in the small, across threads, it should be safe across machines and in distributed systems. I might be wrong on this. But scheduling is a different solution than locks: instead of allowing everything to run at its own behest, without external control, we run things, wait for them to finish, and never schedule two incompatible things to run at the same time (mutual exclusion). That way we avoid the race conditions that arise when two things decide to run simultaneously.<p>I effectively want epoll for business domain objects and infrastructure state changes. This is the space occupied by HashiCorp Consul, Redis locks, and Chubby, the Google lock service.<p>Imagine being able to register for arbitrary changes to business objects and chain together complicated behaviours based on efficient epoll-style reactive rules, with fork/join, mutual exclusion, and retries all handled efficiently.<p>I've gathered I can use scheduling, similar to an operating system scheduler, to decide when to run mutually exclusive operations.<p>I am inspired by the epoll API, where you register the changes you're interested in and react to them when they happen.<p>I can protect everything with a single communication lock and then implement scheduling myself. The scheduler detects when a task has finished all its dependencies and then schedules the next thing to be executed; a rough sketch of this idea is below.<p>Any thoughts or ideas on distributed and multithreaded scheduling would be greatly appreciated.
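To make the single-lock scheduling idea above concrete, here is a minimal Java sketch. All names are made up for illustration; this is an assumption about how such a scheduler could look, not production code: tasks declare dependencies, everything is protected by one communication lock, and a task is started only once all of its dependencies have finished.

    import java.util.*;
    import java.util.concurrent.*;

    // Sketch: a scheduler protected by a single "communication" lock.
    // A task runs only after all of its declared dependencies have finished,
    // so two incompatible tasks are never scheduled at the same time.
    class SingleLockScheduler {
        private final Object lock = new Object();              // the single communication lock
        private final Map<String, Runnable> tasks = new HashMap<>();
        private final Map<String, Set<String>> deps = new HashMap<>();
        private final Set<String> done = new HashSet<>();
        private final Set<String> running = new HashSet<>();
        private final ExecutorService pool = Executors.newFixedThreadPool(4);

        void addTask(String name, Runnable body, String... dependencies) {
            synchronized (lock) {
                tasks.put(name, body);
                deps.put(name, new HashSet<>(Arrays.asList(dependencies)));
            }
        }

        // Start every task whose dependencies have all completed.
        void schedule() {
            synchronized (lock) {
                for (Map.Entry<String, Set<String>> e : deps.entrySet()) {
                    String name = e.getKey();
                    if (done.contains(name) || running.contains(name)) continue;
                    if (!done.containsAll(e.getValue())) continue;   // not ready yet
                    running.add(name);
                    Runnable body = tasks.get(name);
                    pool.submit(() -> {
                        body.run();
                        synchronized (lock) {
                            running.remove(name);
                            done.add(name);
                        }
                        schedule();                                   // a task finished: re-check readiness
                    });
                }
            }
        }
    }

Reacting to state changes (the epoll-style part) would then amount to calling schedule() whenever something of interest completes or changes.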
FWIW there was a reply to that post here: <a href="http://antirez.com/news/101" rel="nofollow">http://antirez.com/news/101</a>
I also wrote about distributed locking a while ago, particularly about implementing one on Google Cloud Storage. <a href="https://www.joyfulbikeshedding.com/blog/2021-05-19-robust-distributed-locking-algorithm-based-on-google-cloud-storage.html" rel="nofollow">https://www.joyfulbikeshedding.com/blog/2021-05-19-robust-di...</a><p>This is useful because cloud storage is very cheap and serverless. It certainly beats running a Redis or PostgreSQL instance.<p>In my research and implementation I took care of the problems described in this article (as well as the problems I encountered in other distributed lock implementations).<p>Clients freezing, outright crashing, or disappearing is definitely a major problem. So timeouts are a must. But you can't <i>just</i> have a timeout, because you don't know whether a client is arbitrarily delayed or whether it has disappeared. Furthermore, some workloads have an unknown completion time, so you can't pick <i>any</i> sensible timeout.<p>I address this by encouraging clients to behave as follows:<p>1. Make the timeout relatively short, but regularly <i>refresh</i> your lease to let others know that you're still alive.<p>2. Your lease may at any time be taken over by another client that thinks you've disappeared. So you must regularly check whether your lease on the lock is still valid. This checking must be done way, way before your timeout expires, in order to account for the possibility that you can be arbitrarily delayed at any moment (GC pauses, scheduling pauses, network delays, etc). I recommend checking at an interval of at most 1/8 of the timeout. Smaller intervals are better, at the expense of more overhead.<p>3. Attempt to split up your work into a long-running, discardable preparation phase and a short-running commit phase. Verify the validity of your lock lease right before committing (in addition to the periodic verification). A rough sketch of this client-side pattern is at the end of this comment.<p>This is of course still not 100% foolproof. There is still a risk that two clients run at the same time, but this risk can be arbitrarily reduced by reducing the verification interval and by optimizing for point 3. If your workload cannot tolerate <i>any</i> risk of two clients running at the same time, no matter how small the risk, then pick an infinite timeout. But I think for most workloads, the benefits of not having to administer the lock (not having to manually break its lease because a client has crashed) outweigh the risk.<p>My algorithm has a Ruby implementation, as well as a third-party Go implementation.<p>Ruby: <a href="https://github.com/FooBarWidget/distributed-lock-google-cloud-storage-ruby">https://github.com/FooBarWidget/distributed-lock-google-clou...</a><p>Go: <a href="https://github.com/FooBarWidget/distributed-lock-google-cloud-storage-ruby/issues/2">https://github.com/FooBarWidget/distributed-lock-google-clou...</a>
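Here is a rough Java sketch of points 1 to 3 above. The LockClient interface is hypothetical, standing in for whatever lock library you use (it is not the API of my Ruby or Go implementations): refreshLease() extends the lease, leaseStillHeld() checks that nobody has taken it over.

    import java.util.concurrent.*;

    // Assumed interface, for illustration only.
    interface LockClient {
        void refreshLease();
        boolean leaseStillHeld();
    }

    class LeasedWorker {
        static void runUnderLock(LockClient lock, long timeoutMs,
                                 Runnable prepare, Runnable commit) {
            // Check well before the lease can expire (1/8 of the timeout).
            long checkIntervalMs = Math.max(1, timeoutMs / 8);
            ScheduledExecutorService refresher = Executors.newSingleThreadScheduledExecutor();
            try {
                // 1. Regularly refresh the lease to signal we're still alive,
                // 2. and regularly verify that we still hold it.
                ScheduledFuture<?> heartbeat = refresher.scheduleAtFixedRate(() -> {
                    lock.refreshLease();
                    if (!lock.leaseStillHeld()) {
                        throw new IllegalStateException("lost the lock lease, aborting");
                    }
                }, 0, checkIntervalMs, TimeUnit.MILLISECONDS);

                // 3. Long-running, discardable preparation phase.
                prepare.run();

                // Verify the lease right before the short commit phase.
                if (heartbeat.isDone() || !lock.leaseStillHeld()) {
                    throw new IllegalStateException("lease no longer valid, discarding prepared work");
                }
                commit.run();
            } finally {
                refresher.shutdownNow();
            }
        }
    }

Even with this discipline, as noted above, a sufficiently long pause between the final check and the commit can still let two clients overlap; the check interval and the size of the commit phase only shrink that window, they don't eliminate it.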