RabbitMQ requiring a reliable network is causing problems for us in production. Has anyone else struggled with this?<p>We're running several clusters on different providers; one is on Digital Ocean, another on a partner's VMware vMotion-based system. The kernel gets soft lockups now and then (in the vMotion case, when VMs are automatically migrated to other physical nodes), which causes RabbitMQ partitions. The lockups may last a few seconds, but I've seen minute-long pauses.<p>When this happens, RabbitMQ starts throwing errors at clients. Queues disappear and clients can't do anything, even though the local node is running. Although I understand the behaviour, that's not what I want from a queue: I want a local node to store messages until it rejoins the cluster and can dispatch them, and I want the local node to continue offering messages to those listening.<p>Unfortunately, RabbitMQ's design doesn't allow this: queues live on the node they were created on, and RabbitMQ does not replicate them. We have turned on autohealing, but I don't like the fact that minority nodes simply wipe their data when they rejoin the cluster. The federation plugin doesn't look like a great solution either.<p>I really like RabbitMQ, but maybe it's time to consider something else. Any suggestions? Something equally lightweight and "multimaster", but without the partition problem?
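For context, this is the knob we turned. A sketch of the relevant bit of rabbitmq.config (classic Erlang-term syntax, RabbitMQ 3.1+; the surrounding file is illustrative), with the three partition-handling modes for comparison. Note that none of them gives the "keep serving locally, resync later" behaviour I'm after:

    %% /etc/rabbitmq/rabbitmq.config (sketch)
    [
      {rabbit, [
        %% ignore         - default: nodes carry on; the partition persists
        %%                  until you restart nodes yourself
        %% pause_minority - nodes on the minority side stop serving clients
        %%                  until the partition heals (no data wipe)
        %% autoheal       - on heal, a winning partition is chosen; losing
        %%                  nodes restart and resync, discarding their state
        {cluster_partition_handling, autoheal}
      ]}
    ].

pause_minority avoids the data wipe, but the minority node refuses clients instead of serving them, so it isn't the local-buffering behaviour I described either.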
If you didn't read the article, the first part relates to this blog post:<p><a href="http://www.rabbitmq.com/blog/2014/02/19/distributed-semaphores-with-rabbitmq/" rel="nofollow">http://www.rabbitmq.com/blog/2014/02/19/distributed-semaphor...</a><p>That post describes how to build a distributed semaphore using RabbitMQ in the "clustered" configuration option.<p>Aphyr's post is good, but RabbitMQ makes it pretty clear that this setup is not resilient to network partitions. They even mention it at the end of the blog post, and it is also clear in the documentation:<p><a href="https://www.rabbitmq.com/distributed.html" rel="nofollow">https://www.rabbitmq.com/distributed.html</a><p>The big general question is: how likely are you to encounter network partitions?<p>Aphyr believes they are very dangerous, likely to happen to you at some point, and deserving of far more attention from the programming world.<p>I tend more towards "it depends on your environment". With VM and cloud deployments, network partitions are quite likely, so they should be at the top of your "things to worry about" list. But I haven't seen one happen on a local LAN. During testing I have induced partitions by hand (by pulling the network cable out of a switch), but I just haven't seen them happen otherwise. I probably got lucky. I have, however, seen other things, like memory corruption, memory leaks in software, and so on; many things worry me more than a network partition on a LAN.<p>All that said, it is great to see these experiments run. Please read and study all the other "Call me maybe" posts; they are just very good. It turns out most products with a "distributed" component will fall over in the face of a partition. And keep in mind that it is better to have Aphyr discover these issues than your customers ;-)
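For those who skipped the post: the semaphore is essentially a queue seeded with N messages, where holding an unacked message means holding a permit. Roughly the idea, as a minimal Python/pika sketch (N=1, i.e. a mutex; the queue name and token body are mine, not from the post):

    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    ch = conn.channel()

    # One-time setup (run once per cluster, not per client): seed the
    # queue with a single persistent token message.
    ch.queue_declare(queue='mutex', durable=True)
    ch.basic_publish(exchange='', routing_key='mutex', body=b'token',
                     properties=pika.BasicProperties(delivery_mode=2))

    # Acquire: take the token off the queue without acking it.
    method, _props, _body = ch.basic_get(queue='mutex')
    if method is not None:
        try:
            pass  # critical section: we "hold the lock"
        finally:
            # Release: reject with requeue=True puts the token back.
            ch.basic_reject(delivery_tag=method.delivery_tag, requeue=True)

The failure mode follows directly: if the holder's connection dies, RabbitMQ requeues the unacked token automatically, and during a partition both sides can end up handing out a token, so two clients can "hold" the mutex at once, which is essentially what the article demonstrates.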
I thought that in order to create a distributed locking system, you need to be able to reliably fence unreachable nodes. "Network partition" sounds a lot like "split brain" to me. I am more familiar with traditional clustering solutions such as Pacemaker/Corosync and the GFS2 DLM than with RabbitMQ, so perhaps I am missing something here?
In relation to the second part of the article, regarding lost messages, it would be interesting to see the same test run with HA (mirrored) queues:<p><a href="http://www.rabbitmq.com/ha.html" rel="nofollow">http://www.rabbitmq.com/ha.html</a><p>I might be wrong, but I suspect this would solve the problem: while the partitioned nodes will still discard their data, it should be present on the master node.
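In 3.x, mirroring is turned on with a policy rather than per-queue arguments; something like the following, where the policy name and queue-name pattern are mine, not from the docs or the article:

    # Mirror every queue whose name starts with "ha." across all nodes,
    # and sync new mirrors automatically (name/pattern illustrative).
    rabbitmqctl set_policy ha-all "^ha\." '{"ha-mode":"all","ha-sync-mode":"automatic"}'

Even then it would be worth running the test: the HA docs warn that promoting an unsynchronised mirror can lose messages, so mirroring may narrow the window rather than provably close it.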
The seriousness of this depends on the consequences of the failure and on how much of the time the lock is held. If the failure double-charges a customer, that's likely far more serious than if it double-emails a customer. And if the lock spends most of its time not held, then a doubled message can probably be spotted and fixed far more easily than if the lock spends most of its time held.
A bit meta: that was a really well-explained description of a non-trivial problem. I wish there were (or I knew of) a curated collection of bloggers who wrote this well about software.