
How to do distributed locking (2016)

244 points by yusufaytas 7 months ago

10 comments

jojolatulipe 7 months ago
At work we use Temporal and ended up using a dedicated workflow and signals to do distributed locking. Working well so far and the implementation is rather simple, relying on Temporal’s facilities to do the distributed parts of the lock.
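The commenter doesn't share code, but a rough sketch of the idea, using Temporal's Python SDK (temporalio), might look like the following. The workflow class, signal names, and queueing policy are assumptions made for illustration, not the commenter's actual implementation.

```python
# Hypothetical sketch of a lock workflow with Temporal's Python SDK (temporalio).
# Names and structure are illustrative assumptions, not the commenter's code.
from collections import deque
from typing import Optional

from temporalio import workflow


@workflow.defn
class LockWorkflow:
    """One workflow instance acts as the lock for a single resource."""

    def __init__(self) -> None:
        self.waiters: deque = deque()      # requester ids queued for the lock
        self.holder: Optional[str] = None  # requester currently holding the lock

    @workflow.signal
    def acquire(self, requester_id: str) -> None:
        # Signal handlers run inside the workflow's single-threaded event loop,
        # so appending here needs no extra synchronization.
        self.waiters.append(requester_id)

    @workflow.signal
    def release(self, requester_id: str) -> None:
        if self.holder == requester_id:
            self.holder = None

    @workflow.run
    async def run(self) -> None:
        while True:
            # Hand the lock to the next waiter whenever it is free.
            await workflow.wait_condition(
                lambda: self.holder is None and bool(self.waiters)
            )
            self.holder = self.waiters.popleft()
            # A production version would also notify the new holder (e.g. via an
            # activity or query) and reclaim the lock on a timeout if it is never
            # released.
```

Temporal's durable workflow state is what carries the "distributed" part here: the lock queue survives worker crashes without any separate store.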
eknkc 7 months ago
I tend to use PostgreSQL for distributed locking. As in, even if the job is not db related, I start a transaction and obtain an advisory lock, which stays held until the transaction is released, either by the app itself or due to a crash or something.

Felt pretty safe about it so far, but I just realised I never check whether the db connection is still ok. If this is a db-related job and I need to touch the db, fine: some query will fail on the connection and my job will fail anyway. Otherwise I might have already lost the lock and not be aware of it.

Without fencing tokens, atomic ops and such, I guess one needs a two-stage commit on everything for absolute correctness?
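For reference, a minimal sketch of that pattern with psycopg2 (the lock key and job function are made-up placeholders). `pg_advisory_xact_lock` blocks until the lock is granted and releases it automatically when the transaction ends, including when the connection dies, which is exactly the failure mode the commenter mentions never checking for.

```python
# Minimal sketch of transaction-scoped advisory locking in PostgreSQL (psycopg2).
# LOCK_KEY and do_the_job() are hypothetical placeholders.
import psycopg2

LOCK_KEY = 42  # any application-chosen 64-bit integer identifying the resource


def run_with_lock(dsn: str) -> None:
    conn = psycopg2.connect(dsn)
    try:
        with conn:                      # opens a transaction, commits/rolls back on exit
            with conn.cursor() as cur:
                # Blocks until the advisory lock is granted; it is released
                # automatically when the transaction ends (commit, rollback,
                # or the connection dropping).
                cur.execute("SELECT pg_advisory_xact_lock(%s)", (LOCK_KEY,))
                do_the_job()            # the work guarded by the lock


    finally:
        conn.close()


def do_the_job() -> None:
    ...
```

As the commenter notes, if the connection drops mid-job the lock is gone but the Python code keeps running, so this is only airtight when the guarded work itself goes through the same connection.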
antirez 7 months ago
I suggest reading the comment I left back then in this blog post's comments section, and the reply I wrote on my blog.

Btw, things to note in random order:

1. Check my comment under this blog post. The author had missed a *fundamental* point in how the algorithm works, then based his rejection of the algorithm on the remaining, weaker points.

2. It is not true that you can't wait an approximately correct amount of time with modern computers and APIs. GC pauses are bounded and monotonic clocks work. These are acceptable assumptions.

3. To critique the auto-release mechanism in itself, because you don't want to expose yourself to the fact that there is a potential race, is one thing. To critique the algorithm against its goals and its system model is another thing.

4. Over the years Redlock has been used in a huge number of use cases with success, because if you pick a timeout which is much larger than A) the time to complete the task and B) the random pauses you can have in normal operating systems, race conditions are very hard to trigger, and the other failures in the article have, AFAIK, never been observed. Of course, if you pick a very small auto-release timeout and the task may easily take that long, you have committed a design error, but that's not about Redlock.
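For readers unfamiliar with the auto-release mechanism being debated, here is a single-node sketch of it. This is not the full Redlock algorithm (which repeats the same acquire against a majority of independent Redis nodes); the key name, TTL, and task are assumptions for illustration.

```python
# Single-node sketch of a Redis lock with auto-release (not full Redlock).
# Key name, TTL, and do_task() are illustrative assumptions.
import uuid

import redis

# Release only if we still hold the lock, so we never delete someone else's lock.
RELEASE_SCRIPT = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
else
    return 0
end
"""


def with_lock(r: redis.Redis, key: str, ttl_ms: int) -> bool:
    token = str(uuid.uuid4())
    # SET key token NX PX ttl_ms: acquire only if the key does not already exist,
    # with an expiry that auto-releases the lock if we crash or stall.
    if not r.set(key, token, nx=True, px=ttl_ms):
        return False  # someone else holds the lock
    try:
        do_task()     # ttl_ms should be much larger than this is expected to take
    finally:
        r.eval(RELEASE_SCRIPT, 1, key, token)
    return True


def do_task() -> None:
    ...
```

Point 4 above is about choosing `ttl_ms` generously relative to the task time and realistic OS pauses, so the auto-release race is extremely unlikely to be hit in practice.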
anonzzzies 7 months ago
I am updating my low-level and algorithms knowledge; what are good books on this? (I have the one written by the author.) I am looking to build something for fun, but everything is either a toy or very complicated.
egcodes 7 months ago
Once I wrote a blog post on distributed locking using this resource. Here it is: https://medium.com/sahibinden-technology/an-easy-integration-of-distributed-lock-4b19a704ce49
dataflow 7 months ago
> The lock has a timeout (i.e. it is a lease), which is always a good idea (otherwise a crashed client could end up holding a lock forever and never releasing it). However, if the GC pause lasts longer than the lease expiry period, and the client doesn't realise that it has expired, it may go ahead and make some unsafe change.

Hold on, this sounds absurd to me.

First, if your client *crashes*, then you don't need a timed lease on the lock to detect this in the first place. The lock would get released by the OS or supervisor, whether there are any timeouts or not. If both of *those* crash too, then the connection would eventually break, and the network system should then detect that (via network resets or timeouts, lack of heartbeats, etc.) and invalidate all your connections before releasing any locks.

Second, if the problem is that your client is *buggy* and thus holds the lock *too long* without crashing, then shouldn't some kind of supervisor detect that and kill the client (e.g., by the OS terminating the process) before releasing the lock for everybody else?

Third, if you *are* going to have locks with timeouts to deal with corner cases you can't handle like the above, shouldn't they notify the actual program somehow (e.g., by throwing an exception, raising a signal, terminating it, etc.) instead of letting it happily continue execution? And shouldn't those cases wait for some kind of verification that the program was notified before releasing the lock?

The whole notion that timeouts should somehow permit program execution to continue with ordinary control flow sounds like the root cause of the problem, and nobody is even batting an eye at it? Is there an obvious reason why this makes sense? I feel I must be missing something here... what am I missing?
hoppp 7 months ago
I did distributed locking with Deno, and Deno KV hosted by Deno Deploy.

It's using FoundationDB, a distributed db. The Deno instances running on local devices all connect to the same Deno KV to acquire the lock.

But using Postgres, a SELECT FOR UPDATE also works, though the database is not distributed.
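Not the commenter's code, just a minimal illustration of the second approach mentioned (SELECT ... FOR UPDATE), assuming a hypothetical locks table with one pre-inserted row per resource:

```python
# Sketch of row-level locking with SELECT ... FOR UPDATE (psycopg2).
# The "locks" table and resource names are hypothetical; one row per resource
# is assumed to have been inserted up front.
import psycopg2


def run_exclusively(dsn: str, resource: str) -> None:
    conn = psycopg2.connect(dsn)
    try:
        with conn:  # transaction; the row lock is released on commit/rollback
            with conn.cursor() as cur:
                # Blocks until no other transaction holds this row locked.
                cur.execute(
                    "SELECT name FROM locks WHERE name = %s FOR UPDATE",
                    (resource,),
                )
                critical_section()


    finally:
        conn.close()


def critical_section() -> None:
    ...
```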
jroseattle 7 months ago
We reviewed Redis back in 2018 as a potential solution for our use case. In the end, we opted for a less sexy solution (not Redis) that never failed us, no joke.

Our use case: handing out a ticket (something with an identifier) from a finite set of tickets in a campaign. It's something akin to Ticketmaster allocating seats in a venue for a concert. Our operation was as you might expect: provide a ticket to a request if one is available, assign some metadata from the request to the allocated ticket, and remove it from consideration for future client requests.

We had failed campaigns in the past (over-allocation, under-allocation, duplicate allocation, etc.), so our concern was accuracy. Clients would connect and request a ticket; we wanted to exclusively distribute only the set of tickets available from the pool. If the number of client requests exceeded the number of tickets, the system should protect against that.

We tried Redis, including the naive implementation of getting the lock, checking the lock, doing our thing, releasing the lock. It was ok, but the administrative overhead was a lot for us at the time. I'm glad we didn't go that route, though.

We ultimately settled on... Postgres. Our "distributed lock" was just a composite UPDATE statement using some Postgres-specific features. We effectively turned requests into a SET operation, where the database would return either a record that indicated the request was successful, or something that indicated it failed. ACID transactions for the win!

With accuracy solved, we next looked at scale/performance. We didn't need to support millions of requests/sec, but we did have some spikiness thresholds. We were able to optimize read/write db instances within our cluster, and strategically load larger/higher-demand campaigns onto allocated systems. We continued to improve on optimization over two years, but not once did we ever have a campaign with ticket distribution failures.

Note: I am not an expert of any kind in distributed-lock technology. I'm just someone who did their homework, focused on the problem to be solved, and found a solution after trying a few things.
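The commenter doesn't show the actual statement, but one plausible shape for an atomic "claim one free ticket" operation in Postgres looks like this. The schema, column names, and the use of SKIP LOCKED are assumptions for illustration, not the commenter's real query.

```python
# Hypothetical sketch of claiming one free ticket atomically in Postgres.
# Schema and column names are assumptions; the commenter only describes
# "a composite UPDATE statement using some Postgres-specific features".
import psycopg2


def claim_ticket(dsn: str, campaign_id: int, requester: str):
    conn = psycopg2.connect(dsn)
    try:
        with conn, conn.cursor() as cur:
            cur.execute(
                """
                UPDATE tickets
                   SET claimed_by = %s, claimed_at = now()
                 WHERE id = (
                       SELECT id FROM tickets
                        WHERE campaign_id = %s AND claimed_by IS NULL
                        ORDER BY id
                        LIMIT 1
                        FOR UPDATE SKIP LOCKED  -- concurrent requests skip rows
                 )                              -- already being claimed
                 RETURNING id
                """,
                (requester, campaign_id),
            )
            row = cur.fetchone()
            return row[0] if row else None  # None means the pool is exhausted
    finally:
        conn.close()
```

Because the claim is a single atomic statement, over-allocation and duplicate allocation cannot happen regardless of how many clients race for tickets.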
galeaspablo 7 months ago
Many engineers don't truly care about the correctness issue until it's too late. Similar to security.

Or they care but don't bother checking whether what they're doing is correct.

For example, in my field, where microservices/actors/processes pass messages to each other over a network, I dare say >95% of the implementations I see have edge cases where messages might be lost or processed out of order.

But there isn't an alignment of incentives that fixes this problem, i.e. the payment structures for executives and engineers aren't aligned with the best outcome for customers and shareholders.
jmull 7 months ago
This overcomplicates things...

* If you have something like what the article calls a fencing token, you don't need any locks.

* The token doesn't need to be monotonically increasing, just a passive unique value that both the client and storage have.

Let's call it a version token. It could be monotonically increasing, but a generated UUID, which is typically easier, would work too. (Technically, it could even be a hash of all the data in the store, though that's probably not practical.) The logic becomes:

(1) The client retrieves the current version token from storage, along with any data it may want to modify. There's no external lock, though the storage needs to retrieve the data and version token atomically, ensuring the token is specifically for the version of the data retrieved.

(2) The client sends the version token back along with any changes.

(3) Storage accepts the changes if the current token matches the one passed with the changes, and creates a new version token (atomically, but still no external locks).

Now, you can introduce locks for other reasons (hopefully good ones... they seem to be misused a lot). I'm just pointing out that they are/should be independent of storage integrity in a distributed system.

(I don't even like the term lock, because these locks are temporary/unguaranteed. Lease or reservation might be a term that better conveys the meaning.)
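A compact sketch of that version-token flow (optimistic concurrency) against Postgres; the table layout and the choice of UUID tokens are assumptions, and any storage with an atomic compare-and-set would do.

```python
# Sketch of the version-token flow described above, using Postgres via psycopg2.
# The "docs" table and its columns are hypothetical.
import uuid


def read_document(cur, doc_id):
    # Step 1: read the data and its version token in one atomic statement.
    cur.execute("SELECT body, version FROM docs WHERE id = %s", (doc_id,))
    return cur.fetchone()  # (body, version)


def write_document(cur, doc_id, new_body, expected_version) -> bool:
    # Steps 2-3: send the changes back with the token; storage accepts them only
    # if the token still matches, and issues a fresh token in the same statement.
    cur.execute(
        """
        UPDATE docs
           SET body = %s, version = %s
         WHERE id = %s AND version = %s
        """,
        (new_body, str(uuid.uuid4()), doc_id, expected_version),
    )
    return cur.rowcount == 1  # False: someone else wrote first; retry from step 1
```

A failed write simply means the client re-reads and retries, which is the trade-off versus blocking: no one ever waits, but writers can lose the race.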