Consistency Without Clocks: FaunaDB's Distributed Transaction Protocol

171 points | by evanweaver | over 6 years ago

20 comments

rystsov | over 6 years ago
"unlike Google Percolator, FoundationDB, or similar systems, FaunaDB places no constraints on replica distance and is practical to deploy at global internet latencies"

"For each batch of parallel transactions, they are inserted into a distributed, write-ahead transaction log"

"Replicas must achieve consensus for how to insert new transactions into the log. FaunaDB uses an optimized Raft implementation to achieve consensus."

There are constraints on running consensus (Raft) across the world: it adds at least 200 ms to serialize transactions. Higher latency also means a longer interval between heartbeats, and hence longer downtime if the leader is isolated - a known issue of leader-based consensus protocols (see the "There Is More Consensus in Egalitarian Parliaments" paper [1] or the "In search of a simple consensus algorithm" post [2]).

Google's Percolator doesn't depend on global consensus, just on a global TSO (timestamp oracle), which is possible to implement in a way that:

- doesn't suffer from leader isolation (no leader)
- doesn't have a bottleneck (each node handles requests)
- doesn't touch disk on each request

Details are in the "Quorum clock: leaderless distributed clock" post [3].

[1] https://www.cs.cmu.edu/~dga/papers/epaxos-sosp2013.pdf
[2] http://rystsov.info/2017/02/15/simple-consensus.html
[3] http://rystsov.info/2018/10/01/tso.html
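
A minimal sketch of the quorum-clock idea behind [3], for readers who don't follow the link (the `ClockNode` interface and both RPC names are hypothetical; rystsov's post describes the real design): a client reads the highest timestamp from a majority of nodes, bumps it, and writes the new value back to a majority so later readers can never go backwards. The leader-isolation and disk questions are covered in the post itself.

```typescript
// Hypothetical sketch of a leaderless timestamp oracle ("quorum clock").
// Node set, RPC names, and the fixed first-quorum choice are all illustrative.

interface ClockNode {
  readClock(): Promise<number>;            // highest timestamp this node has seen
  advanceClock(ts: number): Promise<void>; // remember ts if it exceeds the node's current value
}

async function nextTimestamp(nodes: ClockNode[]): Promise<number> {
  const quorum = Math.floor(nodes.length / 2) + 1;
  const members = nodes.slice(0, quorum);  // any responsive majority would do

  // Phase 1: learn the highest timestamp seen by a majority.
  const seen = await Promise.all(members.map(n => n.readClock()));
  const ts = Math.max(...seen) + 1;

  // Phase 2: record ts on a majority before handing it out, so any later
  // read quorum overlaps this write quorum and returns something >= ts.
  await Promise.all(members.map(n => n.advanceClock(ts)));
  return ts;
}
```
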
andydb | over 6 years ago
The paper contains this claim:

> FaunaDB is an elegant, software-only solution for achieving global ACID transactions, with complete guarantees of serializability and consistency.

The paper makes it sound like FaunaDB claims strict serializability for all transactions (like Spanner). This means that if Txn(A) ends before Txn(B) begins, then Txn(B) must be guaranteed to see all data written by Txn(A).

However, what if Txn(B) is a read-only transaction? According to the paper, read-only transactions do not contact the global sequencer. Here is an example where an application would fail to read its own writes:

1. Txn(A) performs a write W(X) of data on replica R1.

2. Txn(A) is assigned a logical timestamp of T1 by the global log.

3. Conflict verification is triggered. Replica R1 plays back the global log and verifies there is no conflict.

4. Txn(A) completes, since one replica has verified there are no conflicts, and passes control back to the application.

5. The application now triggers a read-only Txn(B).

6. The coordinator for Txn(B) decides to read replica R2 rather than R1. However, the coordinator is not yet aware that the global log is now at timestamp T1, and it picks T0 as its snapshot timestamp.

7. It reads the value of X on replica R2, which does not yet reflect W(X).

-- From the application's point of view, it did not read its own writes. --

I'm not familiar enough with FaunaDB's subtleties to know if this scenario is possible in practice. Perhaps you could comment?

I did notice that at the end of the article, the language is carefully phrased to only make the claim of "serializable" (not "strictly serializable") for read-only transactions. But that would fall short of the guarantees that Spanner and Calvin make, and undermine the "complete guarantees of serializability and consistency".
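
A toy model of steps 5-7 above, just to make the potential anomaly concrete (every name is invented for illustration and says nothing about what FaunaDB actually does internally):

```typescript
// Two replicas that have applied the global log up to different positions.
interface Replica {
  appliedThrough: number;                          // highest log timestamp applied locally
  read(key: string, snapshot: number): string | undefined;
}

function makeReplica(
  appliedThrough: number,
  versions: Record<string, Array<{ ts: number; value: string }>>,
): Replica {
  return {
    appliedThrough,
    read(key, snapshot) {
      // Return the newest version at or below the snapshot timestamp.
      const visible = (versions[key] ?? []).filter(v => v.ts <= snapshot);
      return visible.sort((a, b) => b.ts - a.ts)[0]?.value;
    },
  };
}

// Txn(A)'s write W(X) sits at T1 in the global log; R1 has applied it, R2 has not.
const r1 = makeReplica(1, { X: [{ ts: 1, value: "written by Txn(A)" }] });
const r2 = makeReplica(0, { X: [] });

// Read-only Txn(B) is routed to R2 and picks its snapshot locally (T0),
// without learning how far the global log has advanced.
console.log(r2.read("X", r2.appliedThrough)); // undefined: the app misses its own write
console.log(r1.read("X", r1.appliedThrough)); // "written by Txn(A)"
```
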
hardwaresofton | over 6 years ago
I've come to the realization that sharding might be the only way to actually scale multi-master systems without allowing stale (or potentially outright wrong) reads.

Sharding is complex and makes parallel queries a requirement, but it really does seem like the only way to distribute a database. In the end, once you have the op log on more than one machine, they're going to *have* to synchronize with each other; otherwise you risk giving out not just stale answers but completely wrong answers. This is fine for a site like FB but is not OK if it's attached to military equipment.

CRDTs don't help in the case where you don't want to deliver wrong answers; they only help make sure merges (at sync time) are resolvable without conflicts. Lamport clocks only give you a partial order, and you need something else (physical clocks) for total order, which is what you get from Raft (because you only let one node manage the log in the first place).

What I haven't seen yet, and want to try, is whether sharded + replicated Raft could introduce some interesting performance benefits. It's basically max complexity (the complexity, plus sharding, plus multiple instances of Raft at the same time), but it could be the only way to distribute write load (smart routing/forwarding of requests) and increase availability (eventually consistent replication for every node of every shard) at the same time. This is basically what FaunaDB does, but I'm not 100% sure about splitting a single table across replicas...
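
A rough sketch of the "sharded + replicated Raft" layout being described, i.e. one Raft group per shard with writes routed by key (the `RaftGroup` interface and hash-mod placement are assumptions for illustration, not how any particular system does it):

```typescript
// One Raft group per shard; only writes within a shard contend on that shard's log.
interface RaftGroup {
  propose(entry: { key: string; value: string }): Promise<void>; // replicate via this shard's Raft log
}

class ShardedStore {
  constructor(private shards: RaftGroup[]) {}

  private shardFor(key: string): RaftGroup {
    // Simple hash-mod placement; real systems tend to use range or consistent hashing.
    let h = 0;
    for (const c of key) h = (h * 31 + c.charCodeAt(0)) >>> 0;
    return this.shards[h % this.shards.length];
  }

  // Writes to different shards go through independent Raft instances,
  // so there is no single global log on the write path.
  write(key: string, value: string): Promise<void> {
    return this.shardFor(key).propose({ key, value });
  }
}
```

Cross-shard transactions are exactly where this stops being simple, which is where a global log or some commit protocol spanning shards has to come back in.
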
netgusto | over 6 years ago
Summary, taken from the article:

> To summarize the overall FaunaDB protocol, each transaction proceeds in three phases:

> 1. The first phase is a speculative phase in which reads are performed as of a recent snapshot, and writes are buffered.

> 2. Next, a consensus protocol is used (Raft) to insert the transaction into a global log, which results in the transaction receiving a global transaction identifier that specifies its equivalent serial order relative to all other transactions that are being processed by the system. This is the only point at which global consensus is required.

> 3. Finally, a check begins in each replica which verifies the speculative work. If that speculative work did not result in potential violations of serializability guarantees, then the work becomes permanent and the buffered writes written back to the database. Otherwise, the transaction is aborted and restarted.
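
The same three phases as a compact TypeScript sketch (the `Cluster` interface and every function name here are invented for illustration; this is not FaunaDB's actual API):

```typescript
interface Read { key: string; versionSeen: number }
interface Write { key: string; value: string }

// Stand-in for the cluster machinery; all of these operations are assumed.
interface Cluster {
  latestLocalSnapshot(): number;                                        // phase 1: recent snapshot
  appendToGlobalLog(reads: Read[], writes: Write[]): Promise<number>;   // phase 2: Raft, returns serial position
  readsStillValid(reads: Read[], snapshot: number, pos: number): Promise<boolean>; // phase 3: per-replica check
  applyWrites(writes: Write[], pos: number): Promise<void>;
}

async function runTransaction(
  cluster: Cluster,
  body: (snapshot: number, buffer: Write[]) => Promise<Read[]>,
): Promise<void> {
  // Phase 1: speculative execution -- read as of a recent snapshot, buffer the writes.
  const snapshot = cluster.latestLocalSnapshot();
  const buffer: Write[] = [];
  const reads = await body(snapshot, buffer);

  // Phase 2: the only globally coordinated step -- the transaction is inserted into
  // the global log and receives its position in the serial order.
  const pos = await cluster.appendToGlobalLog(reads, buffer);

  // Phase 3: each replica replays the log and checks that no earlier-ordered
  // transaction invalidated the speculative reads.
  if (await cluster.readsStillValid(reads, snapshot, pos)) {
    await cluster.applyWrites(buffer, pos);     // buffered writes become permanent
  } else {
    throw new Error("speculative reads invalidated: abort and restart");
  }
}
```
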
agentultra | over 6 years ago
I wish projects like this would publish their proofs with their protocols. It's interesting to be sure, but I find all of the prose and diagrams to be too verbose. I'd much rather read the mathematical model of the transaction protocol.
fernly | over 6 years ago
The concept of the "global log" is mentioned many times in the article, but not much detail on this critical piece:

> ... the consensus protocol is only being used for inserting transactions into a global log. For every other part of the protocol, replicas can proceed completely independently from each other.

Clearly this central resource can't be distributed or replicated. Right? So it, and the hardware it runs on, and the communications links to it, are a single point of failure. True?

If the global log goes down, or can't be reached at all, or perhaps only from some region of the regionized application, what happens?
evanweaver | over 6 years ago
We're here, ready for your questions to be consistently replicated across the WAN at the lowest latency information science allows ;-)
kuujo | over 6 years ago
“Consistency without clocks” seems like a misnomer considering Raft indexes are tantamount to a logical clock. Perhaps it should be “consistency without wall clocks.” But that’s not really novel by itself, except in the context of geo-distributed databases.
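
For concreteness, a minimal Lamport clock; a Raft log index plays the same role, handing out ever-larger positions without consulting any wall clock (the class below is a generic textbook sketch, not tied to any particular system):

```typescript
// Minimal Lamport clock: ordering from counters and message exchange, no wall clock.
class LamportClock {
  private time = 0;

  tick(): number {                      // local event
    return ++this.time;
  }

  receive(remoteTime: number): number { // merge on receiving a message
    this.time = Math.max(this.time, remoteTime) + 1;
    return this.time;
  }
}
```
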
gigatexal | over 6 years ago
This is a weakness imo:

`client.query( q.get( q.match( q.index("posts_by_title"), "My cat and other marvels" ) ))`

Having to specify the index in a query is a regression imo; at least, I've been spoiled by not having to do so when using Mongo or SQL databases, since the query engine will more often than not find the right index.
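
For comparison, a sketch of the two styles side by side using the faunadb and mongodb Node drivers (the secret, connection string, and collection names are placeholders, and the MongoDB index is assumed to already exist on `title`):

```typescript
import faunadb from "faunadb";
import { MongoClient } from "mongodb";

const q = faunadb.query;

async function compareIndexStyles(): Promise<void> {
  // FaunaDB: the index must be named explicitly in the query.
  const fauna = new faunadb.Client({ secret: "YOUR_FAUNA_SECRET" });
  await fauna.query(
    q.Get(q.Match(q.Index("posts_by_title"), "My cat and other marvels")),
  );

  // MongoDB: the planner picks a suitable index (e.g. one on "title") on its own;
  // no index name appears in the query.
  const mongo = await MongoClient.connect("mongodb://localhost:27017");
  await mongo.db("blog").collection("posts").findOne({ title: "My cat and other marvels" });
}
```
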
wikibob | over 6 years ago
This looks like a fascinating approach. Unfortunately I’m not well versed enough to intelligently compare it to alternatives.

Any plans to have Jepsen / Aphyr conduct a rigorous test and write a report on the results?
donpdonp | over 6 years ago
Thanks, drift.com (and driftt.com), for driving me to learn how to block specific sites in uBlock. Audio 'pops' for a webpage chatroom is a new level of annoyance.
whargarbl | over 6 years ago
As an aside, can somebody explain the constraints FoundationDB puts on replica distance?
grogers | over 6 years ago
I'm really confused how this scales to high transaction rates. If the replica has to redo all the reads (which means talking to multiple nodes) before it can make a commit/abort decision for the transaction, this could take tens of microseconds if all nodes are in the same datacenter (if serving from RAM). Since it also has to process transactions from the log in order, that seems like it would limit the transaction rate to tens of thousands of TPS? Forget about distributing a replica across data centers or having enough data that it may not be RAM resident.

Is this actually how it works, or am I missing something important?
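
Back-of-the-envelope version of that ceiling, taking 50 µs as an assumed midpoint of "tens of microseconds" per in-order verification:

```typescript
// If log entries are verified strictly one after another, per-transaction
// verification time caps throughput.
const verifyMicros = 50;                  // assumed cost of re-checking one transaction's reads
const maxTps = 1_000_000 / verifyMicros;  // 20,000 transactions per second
console.log(`upper bound ≈ ${maxTps} TPS`);
```
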
andras_gerlits | over 6 years ago
I published this article 5 months ago. Mine doesn't require global consensus for all commits, just a local one, and it works from there.

What's even more annoying is that I tweeted this article to them and they didn't say anything.

https://medium.com/p/optimistic-acid-transactions-4f193844bb5f
chappar | over 6 years ago
How does FaunaDB achieve atomicity without some sort of two-phase commit? What I understood from the article is that each node in a replica commits independently from the distributed transaction log. So, if a transaction updates data on multiple nodes in a replica, then it can happen that the commit on one of the nodes fails because of some failure. In that case, will there be a partial commit?
kerneltime | over 6 years ago
Apologies, I quickly skimmed the blog but in the summary did not see clear answers for: When is the transaction acked? What is the reader-writer consistency? If the transaction is acknowledged only after speculative work is validated after ordering, what is the value add for doing the speculative work? Would you achieve similar benefits from just batching commits?
gigatexal | over 6 years ago
Hmm. I’m intrigued though skeptical. Who’s using this in production that can give an unbiased take?
EGreg | over 6 years ago
This is very cool.

But there is an even harder super boss level for distributed systems: Byzantine Fault Tolerance.

Is FaunaDB Byzantine resistant? I have asked many projects, such as Dat, what happens if there are conflicts, and they haven’t built systems that are BFT yet.
EngineerBetter | over 6 years ago
The global log sounds like a lynchpin, but I didn't see a good explanation in the article. It sounds like all synchronization is done at the level of this log? If so, isn't that a bit of a bottleneck?
PHGamer | over 6 years ago
Is the issue not so much that other DBs can't provide that global consistency, but that other DBs just choose not to for the sake of speed? Like, there are trade-offs.