The exchange between Antirez and aphyr following the post about Redis Sentinel is a fascinating comparison between two engineering approaches. Antirez makes a qualitative argument (http://antirez.com/news/56, especially http://antirez.com/news/56#comment-910996445) about the behavior of the system in some 'real world' where complex network partitions are rare. Aphyr, on the other hand, makes a much more theoretically sound argument (including using TLA+ to demonstrate its validity) in his post (http://aphyr.com/posts/287-asynchronous-replication-with-failover).

Despite having a huge amount of respect for Antirez and Redis, I strongly believe that the approach aphyr took is the one we are going to need as we build larger and more complex systems on unreliable infrastructure. Our engineering intuition, as excellent as it may be for single-node systems, almost always fails us with distributed systems. To get around this, we need to replace intuition with something more systematic. The tools aphyr uses, such as TLA+ and carefully crafted counterexamples and diagrams, are an extremely good start in that direction. Getting a computer (in this case TLC, TLA+'s model checker) to exhaustively test a design specification is very powerful; comparing its results to the ones we expected is even more powerful. (A toy version of this idea is sketched after this comment.)

The comment made by Metaxis (http://antirez.com/news/56#comment-905001533) on Antirez's second reply is very good. Especially:

> I think your attempt to differentiate formal correctness and real world operations is deeply flawed and amounts to asserting anecdote - that what you have observed to be common makes for good design assumptions and better trade off decisions.

> Allow me to counter: Real world operations will inevitably tend to approach formal correctness in terms of observed failure modes. In other words, over time, you are more and more likely to see edge cases and freak occurrences that are predicted in theory but happen rarely in practice.

This closely matches my own experience. Just because I don't believe a network can behave in a particular way doesn't mean it won't. The real world is full of complex network partitions and Byzantine failures, and our systems need to be safe when they happen.
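To make the exhaustive-checking point concrete, here is a toy Python sketch (my own illustration, not aphyr's actual TLA+ spec; the model and all names are invented). It enumerates every reachable state of a drastically simplified asynchronous-replication-with-failover model, the way TLC walks the state space of a specification, and checks the invariant "no acknowledged write is ever lost" in every state:

    from collections import deque

    # Toy model of asynchronous replication with failover (illustrative only).
    # State: (primary_log, replica_log, acked_writes, failed_over)
    INITIAL = ((), (), (), False)

    def successors(state):
        primary, replica, acked, failed_over = state
        if failed_over:
            return  # terminal in this tiny model
        if len(primary) < 2:  # bound the model at two writes
            w = len(primary)
            # The primary appends and acknowledges before replication happens.
            yield (primary + (w,), replica, acked + (w,), False)
        if len(replica) < len(primary):
            # Asynchronously ship the next log entry to the replica.
            yield (primary, replica + (primary[len(replica)],), acked, False)
        # Failover: the (possibly lagging) replica is promoted as-is.
        yield (replica, replica, acked, True)

    def safe(state):
        primary, _, acked, _ = state
        # Invariant: every acknowledged write survives on the current primary.
        return all(w in primary for w in acked)

    def check():
        # Exhaustive breadth-first search of the reachable state space,
        # the same basic idea TLC applies to a real TLA+ specification.
        seen, frontier = {INITIAL}, deque([(INITIAL, (INITIAL,))])
        while frontier:
            state, trace = frontier.popleft()
            if not safe(state):
                return trace  # counterexample found
            for nxt in successors(state):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, trace + (nxt,)))
        return None

    if __name__ == "__main__":
        trace = check()
        if trace:
            print("Invariant violated; counterexample trace:")
            for s in trace:
                print(" ", s)

Running it immediately produces the familiar counterexample: the primary acknowledges a write, failover promotes the still-empty replica, and the acknowledged write is gone. A real specification checked by TLC covers far more behavior than this toy, but the principle is the same.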
Fantastic in-depth article. In my opinion, the tests he performs show not so much drawbacks of the tested systems as the fact that describing those systems in terms of the CAP theorem can be misleading. For example, a CP system should, in the case of a partition, wait until the partition is resolved, potentially forever. That is not useful in practice, where almost always some timeout is used. This is why, if I'm interpreting the results correctly, not even a Postgres instance running on a single node can claim to be CP.
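To make the timeout point concrete, here is a minimal Python sketch (the wire protocol, where the server answers b"+" on commit, is invented for illustration). Once a timeout is involved, a write has a third possible outcome that is neither success nor failure:

    import socket

    # Hypothetical wire protocol for illustration: the server replies b"+"
    # once a write has committed. All names here are invented.
    def write_with_timeout(conn, payload, timeout_s):
        conn.settimeout(timeout_s)
        conn.sendall(payload)
        try:
            ack = conn.recv(1)  # wait for the commit acknowledgement
            return "committed" if ack == b"+" else "aborted"
        except socket.timeout:
            # From the client's point of view the write is neither committed
            # nor aborted: the server may have committed it and the ack was
            # simply lost or delayed past the timeout.
            return "indeterminate"

This is why even the single-node case is tricky: the database itself may never violate consistency, but the client, the network, and the server together form a distributed system in which the outcome of a timed-out write is unknowable until the client can ask the server again.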