A couple of years ago I was looking at building a highly available database (MySQL in particular), and looked into the multi-master setup. While it sounds good on paper, its benefits don't warrant the high development and operation cost.<p>- The tables and the application layer both need to be changed to support it, which is a big hassle and very fragile. It's easy to introduce update conflicts, and it's a nightmare when dealing with a group of updates in a transaction: you can't really roll back a transaction at the replicated nodes.<p>- Whenever a node fails, the replication ring is broken; updates pile up at the previous node while subsequent nodes' data go stale. It requires immediate human attention to fix, which defeats the purpose of an HA cluster.<p>- Related to the above, it's very difficult to add a new master node without stopping the cluster. The "catchup" process is very manual and fragile.<p>- Data on different nodes go stale under high replication load, so clients reading from different masters get stale data. They are supposed to be masters, yet serve stale data?!<p>- Multi-master doesn't help write scalability at all; every node still has to apply every write, and MySQL's single-threaded replication apply doesn't help. For read scalability, master-slave is better.<p>I abandoned the design after a while and chose a different approach: disk-based replication with DRBD. A two-machine cluster forms the master cluster, one node active and one standby. Writes are replicated to both machines synchronously at the disk level. When the active node fails, the standby becomes active automatically within seconds, with the complete data already on disk.<p>The beauty of this approach is the simple design and setup. The data are always in sync, so there's no stale data. Failover is immediate and automatic, and a failed node automatically catches up when it comes back online. The database application doesn't need any change and all the SQL semantics are preserved.
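For reference, the synchronous two-node pair boils down to one small DRBD resource definition. A minimal sketch (hostnames, device paths, and IPs are made up for illustration; `protocol C` is the part that makes writes synchronous, i.e. a write completes only after it hits both disks):

```
resource r0 {
  protocol C;            # fully synchronous: ack only after both nodes have the write
  device    /dev/drbd0;  # the replicated block device the database sits on
  disk      /dev/sdb1;   # backing disk on each node
  meta-disk internal;
  on db1 {
    address 10.0.0.1:7789;
  }
  on db2 {
    address 10.0.0.2:7789;
  }
}
```

Only the active node mounts /dev/drbd0; a cluster manager (Heartbeat/Pacemaker, typically) promotes the standby and mounts it there on failover.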
The cluster has one IP, so clients don't need special connection logic; they just retry when a connection fails.<p>For disaster recovery, I built another two-machine cluster in a second datacenter acting as the slave, doing async replication from the master cluster. If the master cluster failed completely (as in, the datacenter burned down), the slave cluster could become master via a manual process within 30 minutes. The 30-minute SLA covers someone getting paged, looking at the situation, and deciding to fail over; there are too many uncertainties across datacenters to fail over automatically.<p>Added bonus: slaves can still hang off the master cluster for read scalability. And it works with any disk-based database, not just MySQL.
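The manual DR promotion is only a handful of steps, which is why 30 minutes is plenty. A rough sketch of the runbook, assuming DRBD with a resource named r0 and the database files on the replicated device (commands are illustrative, not a tested procedure):

```
# On the surviving DR node, after a human confirms the primary DC is gone:
drbdadm disconnect r0            # stop waiting for the dead peer
drbdadm primary --force r0       # promote despite the lost connection
mount /dev/drbd0 /var/lib/mysql  # bring up the replicated data
service mysql start              # DB runs its normal crash recovery
# Finally, repoint DNS / client config at the DR cluster.
```

Because the replication to the DR site is async, the last few transactions may be lost; that trade-off is what keeps cross-datacenter writes from slowing down the master.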
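The "just retry" client logic really is that simple, since the virtual IP moves to the standby during failover. A minimal sketch in Python (the VIP, port, and timing values are illustrative):

```python
import socket
import time

def connect_with_retry(vip, port, attempts=5, backoff=1.0):
    """Connect to the cluster's single virtual IP, retrying on failure.

    During a failover the VIP moves to the standby node within seconds,
    so a short retry loop is all the client needs -- no multi-host logic.
    """
    for attempt in range(attempts):
        try:
            return socket.create_connection((vip, port), timeout=3)
        except OSError:
            if attempt == attempts - 1:
                raise  # cluster still unreachable after all retries
            # back off a little longer each time while the standby takes over
            time.sleep(backoff * (attempt + 1))
```

The same shape applies with a real database driver: catch the connection error, sleep, reconnect to the same address.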