RabbitMQ requiring a reliable network is causing problems for us in production. Has anyone else struggled with this?<p>We're running several clusters on different providers; one is on Digital Ocean, another on a partner's VMware vMotion-based system. The kernel gets soft lockups now and then (in the vMotion case, when VMs are automatically migrated to other physical nodes), which causes RabbitMQ partitions. The lockups may last a few seconds, but I've seen minute-long pauses.<p>When this happens, RabbitMQ starts throwing errors at clients. Queues disappear and clients can't do anything, even though the local node is running. Although I understand the behaviour, that's not what I want from a queue: I want a local node to store messages until it rejoins the cluster and can dispatch them, and I want the local node to continue offering messages to those listening.<p>Unfortunately, RabbitMQ's design doesn't allow this: queues live on the node they were created on, and RabbitMQ does not replicate them. We have turned on autohealing, but I don't like the fact that minority nodes simply wipe their data when they rejoin the cluster. The federation plugin doesn't look like a great solution either.<p>I really like RabbitMQ, but maybe it's time to consider something else. Any suggestions? Something equally lightweight and "multimaster", but without the partition problem?
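For context, this is the knob we turned. A sketch of the relevant bit of rabbitmq.config (classic Erlang-term syntax, RabbitMQ 3.1+; the surrounding file is illustrative), with the three partition-handling modes for comparison. Note that none of them gives the "keep serving locally, resync later" behaviour I'm after:

    %% /etc/rabbitmq/rabbitmq.config (sketch)
    [
      {rabbit, [
        %% ignore         - default: nodes carry on; the partition persists
        %%                  until you restart nodes yourself
        %% pause_minority - nodes on the minority side stop serving clients
        %%                  until the partition heals (no data wipe)
        %% autoheal       - on heal, a winning partition is chosen; losing
        %%                  nodes restart and resync, discarding their state
        {cluster_partition_handling, autoheal}
      ]}
    ].

pause_minority avoids the data wipe, but the minority node refuses clients instead of serving them, so it isn't the local-buffering behaviour I described either.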
If you didn't read the article, the first part relates to this blog post:<p><a href="http://www.rabbitmq.com/blog/2014/02/19/distributed-semaphores-with-rabbitmq/" rel="nofollow">http://www.rabbitmq.com/blog/2014/02/19/distributed-semaphor...</a><p>That post describes how to build a distributed semaphore using RabbitMQ in the "clustered" configuration option.<p>Aphyr's post is good, but RabbitMQ makes it pretty clear that this setup is not resilient to network partitions. They even mention it at the end of the blog post, and it is also clear in the documentation:<p><a href="https://www.rabbitmq.com/distributed.html" rel="nofollow">https://www.rabbitmq.com/distributed.html</a><p>The big general question is: how likely are you to encounter network partitions?<p>Aphyr believes they are very dangerous, likely to happen to you at some point, and deserving of far more attention from the programming world.<p>I tend more towards "it depends on your environment". With VM and cloud deployments, network partitions are quite likely, so they should be at the top of your "things to worry about" list. But I haven't seen one happen on a local LAN. During testing I have induced partitions by hand (by pulling the network cable out of a switch), but I just haven't seen them happen otherwise. I probably got lucky. I have, however, seen other things, like memory corruption, memory leaks in software, and so on; many things worry me more than a network partition on a LAN.<p>All that said, it is great to see these experiments run. Please read and study all the other "Call me maybe" posts; they are just very good. It turns out most products with a "distributed" component will fall over in the face of a partition. And keep in mind that it is better to have Aphyr discover these issues than your customers ;-)
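For those who skipped the post: the semaphore is essentially a queue seeded with N messages, where holding an unacked message means holding a permit. Roughly the idea, as a minimal Python/pika sketch (N=1, i.e. a mutex; the queue name and token body are mine, not from the post):

    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    ch = conn.channel()

    # One-time setup (run once per cluster, not per client): seed the
    # queue with a single persistent token message.
    ch.queue_declare(queue='mutex', durable=True)
    ch.basic_publish(exchange='', routing_key='mutex', body=b'token',
                     properties=pika.BasicProperties(delivery_mode=2))

    # Acquire: take the token off the queue without acking it.
    method, _props, _body = ch.basic_get(queue='mutex')
    if method is not None:
        try:
            pass  # critical section: we "hold the lock"
        finally:
            # Release: reject with requeue=True puts the token back.
            ch.basic_reject(delivery_tag=method.delivery_tag, requeue=True)

The failure mode follows directly: if the holder's connection dies, RabbitMQ requeues the unacked token automatically, and during a partition both sides can end up handing out a token, so two clients can "hold" the mutex at once, which is essentially what the article demonstrates.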
I thought that in order to create a distributed locking system, you need to be able to reliably fence unreachable nodes. "Network partition" sounds a lot like "split brain" to me. I am more familiar with traditional clustering solutions such as Pacemaker/Corosync and the GFS2 DLM than with RabbitMQ, so perhaps I am missing something here?
In relation to the second part of the article, regarding lost messages, it would be interesting to see the same test run with HA (mirrored) queues:<p><a href="http://www.rabbitmq.com/ha.html" rel="nofollow">http://www.rabbitmq.com/ha.html</a><p>I might be wrong, but I suspect this would solve the problem: while the partitioned nodes will still discard their data, it should be present on the master node.
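In 3.x, mirroring is turned on with a policy rather than per-queue arguments; something like the following, where the policy name and queue-name pattern are mine, not from the docs or the article:

    # Mirror every queue whose name starts with "ha." across all nodes,
    # and sync new mirrors automatically (name/pattern illustrative).
    rabbitmqctl set_policy ha-all "^ha\." '{"ha-mode":"all","ha-sync-mode":"automatic"}'

Even then it would be worth running the test: the HA docs warn that promoting an unsynchronised mirror can lose messages, so mirroring may narrow the window rather than provably close it.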
The seriousness of this depends on the consequences of the failure and on how much of the time the lock is held. If the failure double-charges a customer, that's likely far more serious than if it double-emails a customer. And if the lock spends most of its time not held, then a doubled message can probably be spotted and fixed far more easily than if the lock spends most of its time held.
A bit meta: that was a really well-explained description of a non-trivial problem. I wish there were (or I knew of) a curated collection of bloggers who wrote this well about software.