The network is not reliable, but usually the cost of manually fixing problems arising from infrequent kinds of instability is lower than the cost of pre-emptively addressing the issue.

As a practical example, our preferred HA solution for MySQL replication has effectively no network partition safety: if the network becomes partitioned, we'll end up with split brain. Yet in years of operation on hundreds of servers, we have not once had to deal with this particular problem.

That said, do assume that your AWS instances will frequently be unable to reach each other for 10+ seconds. Your life will be happier if you've already planned for that.
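To make that concrete, here is a minimal sketch (Python; the function and parameter names are illustrative, not from the article) of a client that rides out a short partition with retries and jittered backoff instead of failing on the first dropped connection:

    import random
    import time

    def call_with_retries(remote_call, attempts=6, base_delay=0.5, max_delay=8.0):
        # Retry a remote call with capped exponential backoff plus jitter,
        # so a transient partition of ~10 seconds is ridden out rather than
        # surfaced to the caller as an immediate hard failure.
        for attempt in range(attempts):
            try:
                return remote_call()
            except (ConnectionError, TimeoutError):
                if attempt == attempts - 1:
                    raise  # the outage outlasted our retry budget; give up
                delay = min(max_delay, base_delay * 2 ** attempt)
                time.sleep(random.uniform(0, delay))  # full jitter

The jitter matters: when the partition heals, you want recovering clients to spread out rather than stampede the server all at once.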
Takeaways:

* Network partition tolerance can be designed around, given infinite time and money
* Network partition tolerance depends on the application
* Mitigating potential failure requires taking a very long view of very fine details
* Most organizations will not be able to engineer solutions to every network partition-related outage
Great article. A lot of engineers don't have personal experience with these kinds of network failures, so sharing stories of their consequences means more engineers can make informed (and conscious) decisions about how much risk their applications can tolerate.

One thing you could glean from this article (and I think this is incorrect) is that the application or operations engineer is responsible for understanding the nuances of distributed systems. In my experience, the number of people relying on distributed systems is much larger than the number of people who understand these issues.

So what we really need are systems we can build on whose developers understand how to build (and test!) the nuances of data convergence, consensus algorithms, split-brain avoidance, etc. We need systems that gracefully, and automatically, deal with and recover from network failures.

Full disclosure: I'm an engineer at FoundationDB.
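As an illustration of the split-brain point, one common building block (a sketch of the general technique, not FoundationDB's actual mechanism) is a majority quorum check: a node only keeps acting as primary while it can reach a strict majority of the cluster, so at most one side of any partition can accept writes:

    def has_quorum(reachable_peers, cluster_size):
        # Count ourselves plus the peers we can currently reach; only a
        # strict majority may accept writes. In a 5-node cluster split
        # 3/2, only the 3-node side passes, so split brain is avoided.
        return (reachable_peers + 1) > cluster_size // 2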
I feel like the authors (or someone else) could do a lot more justice to their overall objective (i.e. teasing out patterns) by applying some kind of qualitative content analysis to the case studies [0].

[0] http://www.qualitative-research.net/index.php/fqs/article/view/75/153
There was some discussion of a preliminary version of this article/blog post [0] last year: https://news.ycombinator.com/item?id=5820245

[0] http://aphyr.com/posts/288-the-network-is-reliable
Related reading on data structures that make availability easier to maintain under network partition: http://writings.quilt.org/2014/05/12/distributed-systems-and-the-end-of-the-api/
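The linked post covers CRDTs; as a taste of why they stay available under partition, here is a minimal grow-only counter (a sketch for illustration, not code from the post). Each replica increments only its own slot, and merge takes the per-replica maximum, so both sides of a partition can keep accepting increments and still converge once connectivity returns:

    class GCounter:
        # Grow-only counter CRDT: state is a map from replica id to that
        # replica's local count; the counter's value is the sum of all slots.
        def __init__(self, replica_id):
            self.replica_id = replica_id
            self.counts = {}

        def increment(self, n=1):
            # A replica only ever updates its own slot.
            self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

        def value(self):
            return sum(self.counts.values())

        def merge(self, other):
            # Element-wise max is commutative, associative, and idempotent,
            # so replicas converge regardless of merge order or repetition.
            for rid, n in other.counts.items():
                if n > self.counts.get(rid, 0):
                    self.counts[rid] = n

    # Concurrent updates on both sides of a partition, then reconciliation:
    a, b = GCounter("a"), GCounter("b")
    a.increment()
    b.increment(2)
    a.merge(b)
    b.merge(a)
    assert a.value() == b.value() == 3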