This article is pretty thin. We are currently building out a clustered application on corosync/pacemaker with PostgreSQL synchronous replication and Tomcat, and I have mixed feelings about Linux HA so far. Getting something basic set up isn't too bad (a cluster with a virtual IP, for instance; see the sketch below), but when things don't work it can be difficult to figure out why. If you are looking for a distributed filesystem, GFS2 in this stack isn't bad.

However, there are a lot of differences between package versions, and the interactions between heartbeat, corosync, pacemaker, crm, pcs, your OCF resource agent files, the STONITH resources, cluster-glue, and the underlying Linux packages make problems hard to track down, and much of the web info you do find is out of date. I've often had to resort to IRC or the mailing list to figure things out, and even then sometimes it seems like nobody knows. The whole thing feels a little shaky at first, but it is possible to build a solid cluster on top of it with enough effort.
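For reference, the basic virtual-IP case really is only a couple of pcs commands, roughly like this (a sketch; the resource name and address are placeholders):

    # floating IP managed by pacemaker; name and address are made up
    pcs resource create ClusterIP ocf:heartbeat:IPaddr2 \
        ip=192.168.0.120 cidr_netmask=24 op monitor interval=30s
    pcs status resources

It's everything past that point, ordering constraints, fencing, resource agents misbehaving across versions, where the pain starts.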
Here's my experience building HA on Linux. There are two key pieces: storage replication and failure detection. Replication ensures there's a standby system with the same persistent state ready to go; failure detection matters because the whole point of HA is to ensure operation continues in the face of failure.

For storage replication, Linux has the excellent DRBD (http://www.drbd.org/) to replicate disks at the block-device level. This is great because any kind of disk-based system can be supported: database servers, mail servers, file servers, DNS servers, etc.
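A minimal DRBD resource definition looks roughly like this (a sketch; the hostnames, disks, and addresses are placeholders):

    # /etc/drbd.d/r0.res -- same file on both nodes
    resource r0 {
      protocol C;                    # fully synchronous replication
      on alpha {
        device    /dev/drbd0;
        disk      /dev/sdb1;         # backing block device to replicate
        address   10.0.0.1:7788;
        meta-disk internal;
      }
      on bravo {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   10.0.0.2:7788;
        meta-disk internal;
      }
    }

You then mount /dev/drbd0 on whichever node is primary, and the standby gets every write.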
For failure detection, Linux has Heartbeat (http://www.linux-ha.org/wiki/Heartbeat), which detects failure at the machine level and ensures proper failover. Within a machine, there are other tools to monitor process-level failures and propagate them up to Heartbeat.

BTW, STONITH is a super simple way to avoid the partition problem: when a node looks dead, the survivor forcibly powers it off before taking over, so two nodes can never both believe they own the resources.
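In the classic (v1) Heartbeat setup, the whole failover set is one line in haresources (a sketch; the node name, VIP, mount point, and service are placeholders):

    # /etc/ha.d/haresources
    # preferred node, floating IP, then resources started left-to-right on failover:
    # promote DRBD r0, mount it, start the service
    alpha 192.168.0.100 drbddisk::r0 Filesystem::/dev/drbd0::/data::ext3 postfix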
A lot of tools are mentioned in both the article and this thread. So what is the simplest and best way to get a failover virtual IP assigned to cluster members? I don't want the tool to start services; it would be enough if it stopped sending traffic to a failed node using some simple logic, like checking whether port 80 is listening. Having a lot of alternatives is good but confusing, and powerful tools are also good, but when only simple things need to be achieved they take a lot of time to configure and are harder to troubleshoot. I prefer "keep it simple, stupid".
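To make that concrete, the closest thing I've found to what I'm describing is keepalived's VRRP with a check script, roughly like this (a sketch I haven't deployed; the interface, addresses, and check command are placeholders):

    # /etc/keepalived/keepalived.conf
    vrrp_script chk_http {
        script "/usr/bin/nc -z 127.0.0.1 80"   # demote this node if port 80 stops listening
        interval 2
    }
    vrrp_instance VI_1 {
        interface eth0
        state MASTER
        virtual_router_id 51
        priority 100                           # lower priority on the backup box
        virtual_ipaddress {
            192.168.0.100/24                   # the floating VIP
        }
        track_script {
            chk_http
        }
    }

No service management, just "move the IP when the check fails", which is about all I want.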
No mention of Wackamole¹ in the article, and I feel really compelled to mention it here, as it is really simple to set up an HA cluster with it. I followed this howto² a few months ago; it was easy to configure and has run stably since.

¹ http://www.backhand.org/wackamole/
² http://www.howtoforge.com/setting-up-a-high-availability-load-balancer-with-haproxy-wackamole-spread-on-debian-etch
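For flavor, the per-node config is about this big, from memory, so treat the details as approximate (addresses and group name are placeholders; Wackamole rides on a local Spread daemon):

    # /etc/wackamole.conf
    Spread = 4803                       # port of the local spread daemon
    Group = wack1                       # spread group the peers join
    Control = /var/run/wack.it
    Prefer None                         # no preferred owner for the VIPs
    VirtualInterfaces {
        { eth0:192.168.0.100/32 }       # floating address the group manages
    }
    Arp-Cache = 90s
    Mature = 5s
    Balance {
        AcquisitionsPerRound = all
        interval = 4s
    }

Whichever nodes are alive divide the virtual addresses among themselves; when one dies, the others pick up its IPs.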
Linux High Availability is certainly still used a lot within application infrastructures, especially smaller ones. What I find more interesting, though, are architectures that handle availability in the software layer itself, in the application or database code, since these are typically simpler and scale far better.
It's not mentioned, but ucarp is handy for those times when you want to float a VIP between two boxes for a bit of redundancy but don't need anything super intelligent (i.e., where a bit of flapping is OK).

http://www.ucarp.org/
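The whole setup is roughly one command per box plus two tiny scripts (a sketch; the addresses, password, and interface are placeholders):

    # run on each box; the node with the lower --advskew wins the master election
    ucarp --interface=eth0 --srcip=10.0.0.2 --vhid=1 --pass=sekret \
          --addr=10.0.0.100 --advskew=10 \
          --upscript=/etc/vip-up.sh --downscript=/etc/vip-down.sh

    # /etc/vip-up.sh (ucarp passes the interface name as $1)
    #!/bin/sh
    /sbin/ip addr add 10.0.0.100/24 dev "$1"

    # /etc/vip-down.sh
    #!/bin/sh
    /sbin/ip addr del 10.0.0.100/24 dev "$1"

No cluster stack, no quorum, just CARP-style advertisements and an IP that moves.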