A lot of people seem to be misunderstanding this post. Here is some background that you should keep in mind:<p>1. The author is a well-known database researcher who studies deterministic databases. Deterministic databases eliminate all randomness and race conditions, allowing multiple nodes to execute a shared transaction log with minimal coordination.<p>2. Calvin is a deterministic database created by researchers, and FaunaDB is the commercial version of it.<p>3. In this post, the author argues that one aspect of Calvin---eliminating two-phase commit---could be separated from the overall design and used in traditional, non-deterministic distributed databases like Spanner, VoltDB, CockroachDB, and many others.
There are a ton of real-world systems that actually do deferred settlement and reconciliation at their distributed edge - for example, ATMs really work this way (and credit cards, in a way). These systems should be thought of as sources feeding into a distributed log of transactions which are then atomically transacted in a more traditional data store after reconciliation. In systems like this, you must have a well-defined scheme for handling late arrivals and dealing with conflicts, and often you need some kind of anti-entropy scheme for correctness station-keeping (which people should think about anyway and most people ignore). These systems are hard to reason about, have many challenges, and many of the ones I have personally seen are impossible to make robustly secure (a byproduct of having been implemented before security was a strong consideration).<p>In these applications, the deferred-abort problem is dealt with in the settlement and reconciliation phase, where aborts are themselves just events handled like any other.<p>But this article is blurring the line between underlying transactional data stores, where invariants can be guaranteed, and the kinds of front ends that loosen up on 2PC as a requirement.<p>As an observation, 2PC is not the problem; the problem is data scope. If the scope of the data necessary for the transaction to operate correctly is narrow enough, there is no problem scaling transactional back ends. This gets back to traditional sharding, though, which people don’t like.
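For the curious, here's a toy sketch of that deferred-settlement shape: edge devices append provisional events to a log, and a later settlement pass applies them in timestamp order against the authoritative ledger, with an explicit (and deliberately naive) conflict rule. All names, the settlement window, and the overdraft policy are invented for illustration.

```python
from datetime import datetime, timezone

event_log = []  # provisional events from the distributed edge

def atm_withdraw(account, amount):
    # The edge never waits on the core ledger; it records intent.
    event_log.append({"account": account, "amount": -amount,
                      "ts": datetime.now(timezone.utc)})

def settle(ledger, cutoff):
    # Reconciliation pass: apply events up to the cutoff in timestamp
    # order; anything newer is a "late arrival" for the next window.
    due = sorted((e for e in event_log if e["ts"] <= cutoff),
                 key=lambda e: e["ts"])
    for e in due:
        balance = ledger.get(e["account"], 0) + e["amount"]
        if balance < 0:
            # The conflict policy is the hard part; flagging an
            # overdraft here stands in for a real, well-defined rule.
            print("overdraft on", e["account"])
        ledger[e["account"]] = balance
        event_log.remove(e)
```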
90% of the article is a satisfying analysis of the problems with two-phase commit. The remaining 10% gives me no confidence that some alternative system is around the corner. Two-phase commit has the nice property that we know it's "correct", and there is a long history of alternative proposals that either aren't based on reasonable assumptions (e.g. that we can make the network ordered), don't provide properties that make it easy to build systems on top of them (e.g. don't provide consistency), or simply aren't correct.<p>So I'm not holding my breath until someone writes a paper about it. And even then, I would like to see someone build a system with it.
For this to work as easily as portrayed in the article, it would imply that distributed consensus must be possible in the presence of failures. But such consensus is unsolvable [0], so it seems that this approach is just moving the problem to another place.<p>It's difficult from reading this article to understand exactly what that place is. I would guess that it must, at least in part, involve limitations on the behavior of applications to eliminate category (1) application-generated aborts. If so, what do those applications look like?<p>[0] <a href="https://en.wikipedia.org/wiki/Consensus_(computer_science)" rel="nofollow">https://en.wikipedia.org/wiki/Consensus_(computer_science)</a>
Without going too far down a rabbit hole, I think protocols like RAFT are fine for systems where you explicitly trust all of the machines in the network.<p>That being said, designing with "I trust all machines in my network" as a core primitive feels unrealistic. Most people don't run a full Byzantine-quorum-style system, but if you're a big company it's totally possible one of your boxes got owned and is propagating bad data (hell, I'd even venture to say it's likely that some machines are owned in very big infrastructures). If that's the case, where do you put the protocols to control for malicious actors?<p>Quorum-based three-phase commits can give strong guarantees at the cost of some performance (see the Stellar protocol[0], which is actually pretty amazing). It's really cool to be able to make systems like this. That being said, very few use cases face truly Byzantine behavior (most of the time the set of malicious machines in your network is really small), so I think it's pretty safe for almost all use cases to use something like RAFT (but again, you need to explicitly acknowledge the threat-model tradeoff).<p>The question, as always, is: what performance metrics are you designing for, and what is your threat model? If you know those things, you can make educated choices about how to build your distributed systems.<p>[0] <a href="https://datatracker.ietf.org/doc/draft-mazieres-dinrg-scp/" rel="nofollow">https://datatracker.ietf.org/doc/draft-mazieres-dinrg-scp/</a>
So if I got it right, you're still left with a sort of two-phase system: first receive-work & ack-receive, followed by apply-work & ack-apply. However, the worker is free to start the second phase, and the next transaction after the final ack, as soon as it likes, without further coordination.<p>This works because <i>all</i> conditional checks required by the combined work are performed identically by each worker.<p>Doesn't this scale as O(n^2) in the worst case? If the work touches all the variables, and each variable has a constraint, then each worker must remotely read from every other worker, no?<p>Also, since as far as I can see the workers need to keep old versions of the variables around (in case a recovering remote worker needs to satisfy a remote read), how are these garbage collected?<p>Quite possibly dumb questions; this isn't my area at all. Enjoyed the read in any case.
> Category (1) can always be written in terms of conditional logic on the data as we did above.<p>is a bold and unjustified statement.<p>If any downstream system uses 2PC, then the upstream system cannot use this new scheme.<p>The other category seems to have just been hand-waved away (we designed the system to never crash in a way that loses the ability to perform transactions, so we don't worry about this scenario).
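For what it's worth, here's a minimal sketch of the rewrite the quoted claim describes, assuming a toy in-memory store: the abort becomes an ordinary conditional on the data that every worker evaluates identically, so no participant needs a unilateral abort path. Function names and the "accounts" dict are illustrative only.

```python
class Abort(Exception):
    pass

# Abort-style logic: the worker can bail out mid-transaction, which is
# exactly what forces a 2PC-style voting phase.
def debit_with_abort(accounts, acct, amount):
    if accounts[acct] < amount:
        raise Abort("insufficient funds")
    accounts[acct] -= amount

# Conditional-style logic: the transaction always runs to completion;
# "failure" is just a deterministic no-op outcome that any replica
# replaying the same input will reproduce.
def debit_conditional(accounts, acct, amount):
    if accounts[acct] >= amount:
        accounts[acct] -= amount
        return "applied"
    return "rejected"  # committed as a logical no-op
```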
To me, the system-induced abort scenario is the more difficult one to address, and this article hasn't really addressed it. It sounds like he's saying "just don't give workers the option to abort", as if the workers were deliberately causing issues.<p>One can say "just don't give the spaceship the option to go slower than the speed of light", but saying so doesn't change anything about the underlying physical constraints.
Article Summary: 2PC is not infallible, therefore, never use 2PC.<p>What he should have said: People often engineer thinking 2PC will never fail. In reality it can, and if you use 2PC in a certain way you can also exacerbate the issue (distributed transactions). Instead, you should make the surface area of 2PC as small as possible to minimize the impact of a failed 2PC. In addition, you are probably not monitoring for failures. Start doing that.
Can I just say how immensely well written the first two paragraphs are?<p>So clear and concise. Telling them upfront what will be discussed.<p>The article itself is many pages long, but I feel like I know exactly what I am in for and whether or not it is for me, so I know whether it will be worth the read. Thank you!
OP's argument boils down to this:<p>> it is always possible to rewrite any transaction... in order to replace abort logic in the code with if statements that conditionally check the abort conditions [in real-world systems].<p>I'm reminded of the original criticism of using NoSQL databases for systems of record - sure, removing relational constraints can give you massive performance benefits, but all data is inherently relational, and denying that was guaranteed to come back to bite you someday, as it did for many early adopters.<p>Of course it's always possible to front-load the reasons to abort the transaction to the client and instruct the client not to commit transactions that wouldn't be committable. But whether that's <i>always</i> possible in <i>real-world</i> systems? That needs a formal proof. My inclination is to dismiss this claim - not only does shifting verification to the client introduce security concerns, but the conservation of complexity guarantees that a wait is a wait, regardless of whether the client must wait until the server has confirmed that the client's proposed commit would be legal, or wait until the server has verified that the commit was accepted.<p>I'm not saying Calvin / FaunaDB won't have its uses - but I do reject the claim that any system that currently uses a relational database could switch to Calvin/FaunaDB, retain all of its current properties and guarantees, and become more performant in the process.
Some of the stuff in here is just wrong.<p>"they have to block --- wait until the coordinator recovers --- in order to find out the final decision"<p>This assumes there is only one coordinator in the system and that there cannot be another coordinator that 'takes over'.
Here's a good example of a real-world 2PC system that is <i>non-blocking</i> if the coordinator fails - NDB:
<a href="http://mikaelronstrom.blogspot.com/2018/09/non-blocking-two-phase-commit-in-ndb.html" rel="nofollow">http://mikaelronstrom.blogspot.com/2018/09/non-blocking-two-...</a><p>In NDB, if a participant fails, yes you have to wait until a failure detector indicates it has failed. But in production systems, that is typically 1.5 to 5 seconds. In effect, it is the same effect as the leader in Fast-Paxos failing - you need a failure detector and then run a leader election protocol. That's why we call practical Paxos 'abortable consensus' - it can abort and be retried. Similarly, TPC can be made 'non-blocking' with abortable retries if transactions fail. In effect, they can be made somewhat interchangeable ( "consensus on transaction commit").
Well, I don’t get it.<p>The OP never mentions that one reason 2PC exists is that the client has to know whether the transaction worked or whether to resubmit, and network issues are missing from the list of reasons for aborts.<p>It seems to me that in this world the client never receives a notice and the transaction commits in the background. I don’t know how devs are going to deal with that. Just always retry and swallow the error when it happens the second time and the data is already there?
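The usual answer to that is idempotent submission: the client attaches a unique transaction id and the server deduplicates, so retrying after a lost response is harmless. A hedged sketch, with invented names throughout:

```python
import uuid

applied = {}  # server-side dedup table: txn_id -> result

def server_submit(txn_id, operation, state):
    if txn_id in applied:          # retry of an already-applied txn
        return applied[txn_id]
    result = operation(state)      # apply exactly once
    applied[txn_id] = result
    return result

def client_call(operation, state, retries=3):
    txn_id = str(uuid.uuid4())     # one id, reused across every retry
    for attempt in range(retries):
        try:
            return server_submit(txn_id, operation, state)
        except ConnectionError:    # lost response: retrying is safe
            continue
    raise RuntimeError("could not reach the server")
```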
The alternative to 2PC is to have a compensating transaction log that always goes forward. Updates are recorded in the transaction log. Each update is then shipped to the workers, which check whether the update applies to them and commit the relevant ones in their own databases. There's no rollback. A logical "rollback" can be applied by issuing a compensating transaction to negate the previous effect, e.g. issuing a new transaction of a $10 debit to compensate for a previous transaction of a $10 credit.<p>For example, one worker database maintains the Inventory table. Another worker database maintains the Order table. When the frontend app creates an order, it records the transaction [ (decrement part in Inventory), (create new record in Order) ] in the compensating transaction log. Note that both updates are in one transaction.<p>When the Inventory worker receives the transaction, it applies the update relevant to its tables, i.e. the Inventory table, and ignores the Order update. The Order worker receives the same transaction and applies the relevant Order update while ignoring the Inventory update.<p>If the order is canceled, a compensating transaction of [ (increment part in Inventory), (mark record as canceled in Order) ] can be appended to the transaction log. The workers pick up the transaction and apply the updates relevant to them.<p>Both worker databases are decoupled and can work at their own pace. One can crash and come back up without affecting the other. At the end of the day, things are reconciled and consistent.<p>The downside to this scheme is the timeliness of the data. One system can be down, or slow to apply the updates, and lag behind the other.
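Here's a minimal sketch of that compensating-log pattern (essentially the saga pattern); the log format, table names, and cursor handling are illustrative only:

```python
log = []  # append-only transaction log shared by all workers

def append_txn(updates):
    """Each log entry is one atomic group of updates for all workers."""
    log.append(updates)

# Place an order: one log entry carries both updates.
append_txn([("inventory", "decrement", "part-42", 1),
            ("orders", "create", "order-7", {"part": "part-42"})])

# Cancel it later: no rollback, just a forward compensating entry.
append_txn([("inventory", "increment", "part-42", 1),
            ("orders", "mark_cancelled", "order-7", None)])

class Worker:
    """Each worker applies only the updates relevant to its own table."""
    def __init__(self, table):
        self.table, self.cursor, self.rows = table, 0, {}

    def catch_up(self):
        while self.cursor < len(log):
            for (table, op, key, arg) in log[self.cursor]:
                if table == self.table:   # ignore other workers' updates
                    self.apply(op, key, arg)
            self.cursor += 1              # durable cursor in a real system

    def apply(self, op, key, arg):
        if op == "decrement": self.rows[key] = self.rows.get(key, 0) - 1
        elif op == "increment": self.rows[key] = self.rows.get(key, 0) + 1
        elif op == "create": self.rows[key] = arg
        elif op == "mark_cancelled": self.rows[key]["cancelled"] = True
```

Each worker can call catch_up() whenever it likes (or after a crash), which is exactly the decoupling described above, at the cost of the lag noted at the end.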
With such a provocative title, it’s daunting to have to scroll through so many paragraphs to understand why.
Seriously, write the so-what / answer first, in the first paragraph, if you don’t want people to misunderstand or judge the content prematurely. I know academic literature prefers to keep analysis deep inside the text, but please write with the reader in mind. It’s not that hard.
I don't understand how this can be generalized to non-deterministic systems. First off, the definition of non-deterministic is unclear to me. His definition of deterministic seems wrong, as he's using a set of requests. Sets are unordered, so this would imply that the order in which requests execute is irrelevant - which is true for commutative operations, but Calvin doesn't seem to be limited to those.<p>Assuming that "deterministic" should be defined on lists of inputs, I still don't understand how the approach can be applied to non-deterministic databases. The crux of the approach seems to be consistent distributed reads (which you can get by globally ordering transactions and using MVCC). But for fault tolerance, these reads must also be repeatable, and I don't see how to make them repeatable if the state of replicas in a shard is allowed to diverge.
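For reference, the parenthetical can be made concrete with a tiny multiversioning sketch: given a global transaction order, any replica can serve a consistent read "as of" sequence number t. This sketch assumes writes arrive in sequence order, which is exactly the repeatability question raised above; all names are invented.

```python
versions = {}  # key -> append-only list of (txn_seq, value)

def write(key, txn_seq, value):
    versions.setdefault(key, []).append((txn_seq, value))

def read_as_of(key, txn_seq):
    # Latest version with sequence <= txn_seq. Consistent across keys
    # only because txn_seq comes from a single global transaction order.
    candidates = [v for (s, v) in versions.get(key, []) if s <= txn_seq]
    return candidates[-1] if candidates else None
```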
If you read it closely, this is very similar to RAMP transactions. Both have pretty significant write amplification from storing transaction metadata and multiple versions of each key. By storing txn metadata and multiple versions, it gains many nice properties: it's non-blocking, it supports concurrent transactions, etc.<p>The difference between Abadi's proposal and RAMP is that it moves the "second phase" to the worker, which performs the remote read to figure out the txn decision.<p>I think this proposal is better compared to RAMP than to 2PC. And even the RAMP paper states that this doesn't solve all the problems. E.g., how would you do traditional read-modify-writes?
Perhaps I'm not getting the main point right. Looks like the proposal is a minor optimization of the 2PC protocol, where a worker ACKs the transaction as soon as possible, presumably somewhere between updating the durable log and updating the durable data. However, said worker cannot proceed to execute a subsequent transaction depending on the updated data until the 2PC protocol completes, because _another_ worker may abort the original transaction for perfectly logical reasons, as opposed to transient failure reasons.
For every worker to be able to determine whether the transaction will abort or not requires that all workers have the information necessary to make this decision available to them (so data has to be duplicated).<p>So I'm wondering: if this information is available, is there any speed-up left for a distributed database to offer? Maybe there is no difference between non-distributed and distributed, except that the distributed version carries the slowdown of the commit protocol.
Doesn't this mean the DB can only execute stored procedures, since with interactive transactions it wouldn't have knowledge of the conditionals that cause aborts?<p>Has anyone used a DB like that in production? I'm curious because when using RDBMSs it's typical to avoid stored procedures completely. It seems like it would be very difficult to use, and hard to deploy new code for.
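To make the stored-procedure model concrete, here's a toy sketch of the contrast: the client submits the entire transaction logic (conditionals included) in one shot, so the database sees every possible abort condition up front instead of discovering it mid-conversation. The submit() API and the in-memory "db" dict are invented for this sketch.

```python
def transfer(db, src, dst, amount):
    # All abort logic is visible here as plain conditionals; there is no
    # point where the application holds locks while deciding what to do.
    if db.get(src, 0) >= amount:
        db[src] -= amount
        db[dst] = db.get(dst, 0) + amount
        return "ok"
    return "insufficient funds"

def submit(procedure, db, *args):
    # In a deterministic system this would be appended to a global log
    # and executed identically on every replica.
    return procedure(db, *args)

accounts = {"alice": 150, "bob": 20}
print(submit(transfer, accounts, "alice", "bob", 100))  # -> ok
```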
While plenty of the article is sensible, the author unfortunately skips very briefly past the one killer situation.<p>It's not enough to say "[Transaction restart gets a little tricky in nondeterministic systems if some of the volatile state associated with a transaction that was lost during a failure was observed by other machines that did not fail. But there are simple ways to solve this problem that are out of scope for this post.]"<p>Any distributed system has the potential to completely lose some state ... say, halfway through a transaction coordinating two servers, one bursts into flames and is completely destroyed, along with its non-volatile storage (log files). The other server must roll back (abort) the transaction, or we all accept that the system is no longer consistent.<p>There are no known ways to resolve this problem. Either accept the risk, or manage it outside the computer system.<p>PS. Don't bother adding extra phases to 2PC; that just delays the decision. The extra phases can't provide any more definitive commit/abort answer than 2PC would have provided.
This is an important problem, though certainly hard to solve. Has the author or anyone else tried something like this for changing lists in parallel (not just single values)? Making this work for addition to and removal from a collection would be really interesting.
The article's a little dense, so here's my lame attempt at a summary:<p>- 2PC exists because we all assume computers suck.<p>- 2PC is annoying and slow because it's a lot of extra work/complexity.<p>- Let's get around this by designing systems so transactions are not affected by computer failures, and then we don't need all the extra 2PC crap.<p>- Wtf? How?<p>- Avoid deadlocks, and restart a transaction on failure using original input.<p>- How?<p>- <i>waves hands</i> something something Calvin something something Anything Else Is A Bit Complicated<p>Personally I believe there is a solution here, but there needs to be more "proof" of how existing systems can be retooled to use it. It's not like people are just going to abandon Oracle tomorrow.
Kx moved on >20 years ago.<p><a href="https://a.kx.com/a/kdb/document/contention.txt" rel="nofollow">https://a.kx.com/a/kdb/document/contention.txt</a>