I get that distributed messaging/queuing is difficult (been there, done that, often didn't do a great job of it), but the constraint that every node in the cluster has to be running the <i>exact</i> same version of RabbitMQ is ridiculous. I can't see how you could ever orchestrate zero-downtime upgrades. Requiring that they're all on the same major version sounds reasonable, with a further constraint that clients can only use features supported by every node in the cluster (e.g., if a new feature was introduced in 1.3.0 and some nodes are still running 1.2.x, clients shouldn't use that feature until every node has been upgraded). And there should still be some reasonable migration path to the next major version! It doesn't have to be simple, but it should at least be something you can orchestrate with no downtime.<p>Having the default behavior during a network partition be "whatever, just chug along as if nothing is wrong" is bonkers. Yes, the person who first set up the cluster should have read the documentation and gone over the configuration file line by line to see what might need changing, but... damn, that's a terrible default (the relevant setting is one line; there's a sketch at the end of this comment). Sure, some people's applications might value availability over consistency, but that's neither the <i>safest</i> choice nor the one that follows the principle of least surprise.<p>Using a higher-level library to interact with the cluster is really good advice in general. We used Kafka at my last company, and colleagues who actually knew what they were doing wrote a (simple) wrapper library that set things up properly for our cluster, so clueless people (such as myself) could write producers and consumers without having to understand what all the fiddly connection-setup settings did or how to handle various edge-case errors (rough sketch of that idea at the end as well). Before that, quite a few outages were due to producer/consumer misconfigurations.<p>Also, I can't imagine running this kind of infra on Windows servers. That sounds like a self-inflicted wound (by "self" I mean the company, not OP specifically). And the idea of Windows Update running on a prod server ruining your day... what? IMO infra should be as immutable as possible. Patching or updating software should be a matter of spinning up new machines from an already-updated image (that you've built and tested elsewhere), bringing them into the cluster, and then decommissioning the old ones. When colos and dedicated servers were all the rage, that was difficult (and sometimes impractical), but in this day and age, with on-demand cloud provisioning, there's no excuse for companies that can use that sort of infra.
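<p>To make the partition point concrete: the behavior is controlled by a single setting, cluster_partition_handling, which (last I checked; verify against the docs for your version) defaults to ignore, i.e. both sides of the split keep accepting work. pause_minority is the usual "prefer consistency" choice:

    # rabbitmq.conf
    # default is "ignore": keep serving on both sides of the partition
    # pause_minority stops nodes on the minority side until the partition heals
    cluster_partition_handling = pause_minority

There are also autoheal and pause_if_all_down modes; which one is right depends on whether you'd rather lose availability or risk split-brain.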
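<p>And to make the wrapper idea concrete, here's a minimal sketch of the shape of such a library. This is not the one we actually had; the names and defaults are made up for illustration, and it assumes the confluent-kafka Python client:

    # hypothetical wrapper: bake safe defaults in once so callers don't have to know them
    from confluent_kafka import Producer

    SAFE_DEFAULTS = {
        "acks": "all",               # wait for all in-sync replicas to ack
        "enable.idempotence": True,  # avoid duplicates on retry
        "compression.type": "lz4",
    }

    def make_producer(bootstrap_servers, **overrides):
        conf = {"bootstrap.servers": bootstrap_servers, **SAFE_DEFAULTS, **overrides}
        return Producer(conf)

    def send(producer, topic, key, value):
        # collect delivery errors and surface them instead of silently dropping messages
        errors = []
        producer.produce(topic, key=key, value=value,
                         on_delivery=lambda err, msg: errors.append(err) if err else None)
        producer.flush()  # runs the delivery callbacks
        if errors:
            raise RuntimeError("delivery failed: %s" % errors[0])

The point isn't these particular settings; it's that one team decides them once and everyone else just calls make_producer() and send().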