The network is not reliable, but usually the cost of manually fixing problems arising from infrequent kinds of instability is lower than the cost of pre-emptively addressing the issue.

As a practical example, our preferred HA solution for MySQL replication has effectively no network partition safety: if the network becomes partitioned, we'll end up with split brain. Yet in years of operation on hundreds of servers, we have not once had to deal with this particular problem.

That said, do assume that your AWS instances will frequently be unable to reach each other for 10+ seconds. Your life will be happier if you've already planned for that.
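To make that concrete, here is a minimal sketch (Python; the function and parameter names are illustrative, not from the article) of a client that rides out a short partition with retries and jittered backoff instead of failing on the first dropped connection:

    import random
    import time

    def call_with_retries(remote_call, attempts=6, base_delay=0.5, max_delay=8.0):
        # Retry a remote call with capped exponential backoff plus jitter,
        # so a transient partition of ~10 seconds is ridden out rather than
        # surfaced to the caller as an immediate hard failure.
        for attempt in range(attempts):
            try:
                return remote_call()
            except (ConnectionError, TimeoutError):
                if attempt == attempts - 1:
                    raise  # the outage outlasted our retry budget; give up
                delay = min(max_delay, base_delay * 2 ** attempt)
                time.sleep(random.uniform(0, delay))  # full jitter

The jitter matters: when the partition heals, you want recovering clients to spread out rather than stampede the server all at once.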
Takeaways:

* Network partition tolerance can be designed around, given infinite time and money
* Network partition tolerance depends on the application
* Mitigating potential failure requires taking a very long view of very fine details
* Most organizations will not be able to engineer solutions to every network partition-related outage
Great article. A lot of engineers don't have personal experience with these kinds of network failures, so sharing stories of their consequences means more engineers can make informed (and conscious) decisions about how much risk their applications can tolerate.

One thing you could glean from this article (and I think this is incorrect) is that the application or operations engineer is responsible for understanding the nuances of distributed systems. In my experience, the number of people relying on distributed systems is much larger than the number of people who understand these issues.

So what we really need are systems we can build on whose developers understand how to build (and test!) the nuances of data convergence, consensus algorithms, split-brain avoidance, etc. We need systems that gracefully, and automatically, deal with and recover from network failures.

Full disclosure: I'm an engineer at FoundationDB.
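As an illustration of the split-brain point, one common building block (a sketch of the general technique, not FoundationDB's actual mechanism) is a majority quorum check: a node only keeps acting as primary while it can reach a strict majority of the cluster, so at most one side of any partition can accept writes:

    def has_quorum(reachable_peers, cluster_size):
        # Count ourselves plus the peers we can currently reach; only a
        # strict majority may accept writes. In a 5-node cluster split
        # 3/2, only the 3-node side passes, so split brain is avoided.
        return (reachable_peers + 1) > cluster_size // 2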
I feel like the authors (or someone else) could do a lot more justice to their overall objective (i.e. teasing out patterns) by applying some kind of qualitative content analysis to the case studies [0].

[0] http://www.qualitative-research.net/index.php/fqs/article/view/75/153
There was some discussion of a preliminary version of this article/blog post [0] last year: https://news.ycombinator.com/item?id=5820245

[0] http://aphyr.com/posts/288-the-network-is-reliable
Related reading on data structures that make availability easier to maintain under network partition: http://writings.quilt.org/2014/05/12/distributed-systems-and-the-end-of-the-api/
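The linked post covers CRDTs; as a taste of why they stay available under partition, here is a minimal grow-only counter (a sketch for illustration, not code from the post). Each replica increments only its own slot, and merge takes the per-replica maximum, so both sides of a partition can keep accepting increments and still converge once connectivity returns:

    class GCounter:
        # Grow-only counter CRDT: state is a map from replica id to that
        # replica's local count; the counter's value is the sum of all slots.
        def __init__(self, replica_id):
            self.replica_id = replica_id
            self.counts = {}

        def increment(self, n=1):
            # A replica only ever updates its own slot.
            self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

        def value(self):
            return sum(self.counts.values())

        def merge(self, other):
            # Element-wise max is commutative, associative, and idempotent,
            # so replicas converge regardless of merge order or repetition.
            for rid, n in other.counts.items():
                if n > self.counts.get(rid, 0):
                    self.counts[rid] = n

    # Concurrent updates on both sides of a partition, then reconciliation:
    a, b = GCounter("a"), GCounter("b")
    a.increment()
    b.increment(2)
    a.merge(b)
    b.merge(a)
    assert a.value() == b.value() == 3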