This is up for debate but this [0], IMHO, is pretty much the gold standard of explaining distributed algorithms.

[0] http://thesecretlivesofdata.com/raft/
Software engineers love Paxos because it takes something very complex (a distributed system) and makes it equivalent to working with a single machine: you only ever talk to the leader. It gives you redundancy at the expense of performance.

Paxos is used to achieve something called Strong Consistency, where each node sees the same messages in the same order. If you think of each node as a deterministic state machine, they are guaranteed to end up in the same state after responding to the same sequence of messages. It's nice and intuitive, but requiring global synchronization on every write is terrible for performance.

Other consistency schemes exist. A popular one is Eventual Consistency, where writes are made immediately at any node (not just the leader) and the system is expected to synchronize in the background and "converge" to the same state. However, this can result in merge conflicts: if you're editing a document in collaboration with other users, what if you edit a word in a paragraph while another user deletes that entire paragraph? Does the system resolve this automatically, or require user assistance? The answer to this question varies according to system requirements. I think most HN users have experienced the joys of resolving merge conflicts.

A newer model is something called Strong Eventual Consistency, which is similar to Eventual Consistency but merge conflicts are impossible by design: every update to the system must be commutative, associative, and idempotent with respect to other updates. It is not always possible to design your system this way. These systems are implemented with Conflict-Free Replicated Data Types (or ad-hoc equivalents) and have excellent liveness/throughput/performance characteristics compared to Strong Consistency.

CRDTs are not as simple as Paxos. You're forced out of the cozy one-system world and your system must deal with two nodes concurrently holding different values. For most applications, magic Paxos dust is all you need. For others, CRDTs are an excellent tool.
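To make the "conflicts are impossible by design" point concrete, here's a minimal sketch of a grow-only counter CRDT in Python. The names and structure are illustrative only, not taken from any particular library:

    # Minimal sketch of a grow-only counter (G-Counter) CRDT.
    # Each replica only increments its own slot; merge takes the
    # element-wise max, which is commutative, associative, and
    # idempotent, so replicas converge regardless of message order.

    class GCounter:
        def __init__(self, node_id, num_nodes):
            self.node_id = node_id
            self.counts = [0] * num_nodes

        def increment(self):
            self.counts[self.node_id] += 1

        def value(self):
            return sum(self.counts)

        def merge(self, other):
            self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

    # Two replicas accept writes independently and still converge:
    a, b = GCounter(0, 2), GCounter(1, 2)
    a.increment(); a.increment()
    b.increment()
    a.merge(b); b.merge(a)
    assert a.value() == b.value() == 3

Note there's no coordination on the write path at all; the price is that only operations with a merge like this (max, set union, etc.) fit the model.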
This well-illustrated post is technically about the core Synod agreement protocol in Paxos. Building a consistent distributed service on top requires additional scaffolding and infrastructure. Typically, people layer a "replicated state machine (RSM)" on top, which maintains the illusion of a single consistent object even though it is composed of distributed replicas. An illustrative sketch of that layer follows below.

Also keep in mind that Raft, Zab, and Viewstamped Replication (in reverse chronological order) are alternatives to the Synod protocol in Paxos. These protocols differ from Paxos by employing a different leader-election mechanism and a slightly different way of maintaining their invariants.

There have been many Paxos variants. This site [1] shows the various Paxos variants on a timeline and points out their contributions.

Those of you interested in building replicated state machines using Paxos should take a look at OpenReplica [2]. It is a full Multi-Paxos implementation that takes any Python object and makes it distributed and fault-tolerant, like an RPC package on steroids.

[1] http://paxos.systems/

[2] http://openreplica.org/faq/
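For a sense of what the RSM layer does once consensus has ordered the command log, here is an illustrative Python sketch (this is not OpenReplica's actual API; the class and log are made up for the example):

    # Once consensus has agreed on the order of the command log,
    # every replica applies the same commands in the same order to
    # a deterministic object, so all replicas expose the same state.

    class KVStore:
        def __init__(self):
            self.data = {}

        def apply(self, command):
            op, key, *rest = command
            if op == "put":
                self.data[key] = rest[0]
            elif op == "delete":
                self.data.pop(key, None)
            return self.data.get(key)

    # The agreed-upon log is what Multi-Paxos / Raft / Zab produce.
    log = [("put", "x", 1), ("put", "y", 2), ("delete", "x")]

    replicas = [KVStore() for _ in range(3)]
    for replica in replicas:
        for entry in log:          # same commands, same order
            replica.apply(entry)

    assert all(r.data == {"y": 2} for r in replicas)

The hard part (agreeing on the log despite failures) is the consensus protocol; the RSM part is just deterministic replay.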
Also interesting: [1]

> Raft is a consensus algorithm that is designed to be easy to understand. It's equivalent to Paxos in fault-tolerance and performance. The difference is that it's decomposed into relatively independent subproblems, and it cleanly addresses all major pieces needed for practical systems. We hope Raft will make consensus available to a wider audience, and that this wider audience will be able to develop a variety of higher quality consensus-based systems than are available today.

[1] https://raftconsensus.github.io/
> Honest-to-goodness real-life implementations of Paxos can be found at the heart of … Google’s magnificent Spanner database…<p>I'm not sure about Spanner and Paxos. Sebastian Kanthak said during his Google Spanner talk:<p>"If you've been to the Raft talk this morning, our Paxos implementation is actually closer to the Raft algorithm than to what you'd read in the Paxos paper, which is… if you haven't read it, don't read it, it's horrible." (at 7:43)<p><a href="http://www.infoq.com/presentations/spanner-distributed-google" rel="nofollow">http://www.infoq.com/presentations/spanner-distributed-googl...</a>
Nice animations.<p>How do you keep a broken or hostile node from advancing the sequence number to the end of the sequence number space?<p>There's an algorithm from one of Butler Lamson's grad students at MIT which fixes this, but it seems to require one more message per cycle. (<a href="http://pmg.csail.mit.edu/~castro/thesis.pdf" rel="nofollow">http://pmg.csail.mit.edu/~castro/thesis.pdf</a>) That paper later appears as a Microsoft Research paper on how to make an NFS-like file system with this consensus properly. Did Microsoft ever put that in a product?
Having looked at it for a few minutes, it really reminds me of the routing protocols used for distributing routes in networks (also Layer 2 stuff, IIRC). There you also find heartbeats, elections, etc.

Is there a connection?

Also, does it have anything to do with the Byzantine Generals Problem?
> Side note: it’s important that no two proposers ever use the same sequence number, and that they are sortable, so that they truly reference only one proposal, and precedence between proposals can be decided using a simple comparison.<p>They are moving the core problem into a different domain. Worst explanation of PAXOS ever... nice animations though.<p>Edit: 'Worst explanation' is just an exaggeration, obv. It is nice, but doesn't explain really important issues.