I get that distributed messaging/queuing is difficult (been there, done that, often didn't do a great job of it), but the constraint that every node in the cluster has to be running the <i>exact</i> same version of RabbitMQ is ridiculous. I can't see how you could ever orchestrate zero-downtime upgrades. Requiring that they're all on the same major version sounds reasonable, with a further constraint that clients can only use features supported by every node in the cluster (e.g., if a new feature was introduced in 1.3.0 and some nodes are still running 1.2.x, clients shouldn't use that feature until every node has been upgraded). And there should still be some reasonable migration path to the next major version! It doesn't have to be simple, but it should at least be something you can orchestrate with no downtime.<p>Having the default behavior during a network partition be "whatever, just chug along as if nothing is wrong" is bonkers. Yes, the person who first set up the cluster should have read the documentation and gone over the configuration file line by line to see what might need changing, but... damn, that's a terrible default (the relevant setting is one line; there's a sketch at the end of this comment). Sure, some people's applications might value availability over consistency, but that's neither the <i>safest</i> choice nor the one that follows the principle of least surprise.<p>Using a higher-level library to interact with the cluster is really good advice in general. We used Kafka at my last company, and colleagues who actually knew what they were doing wrote a (simple) wrapper library that set things up properly for our cluster, so clueless people (such as myself) could write producers and consumers without having to understand what all the fiddly connection-setup settings did or how to handle various edge-case errors (rough sketch of that idea at the end as well). Before that, quite a few outages were due to producer/consumer misconfigurations.<p>Also, I can't imagine running this kind of infra on Windows servers. That sounds like a self-inflicted wound (by "self" I mean the company, not OP specifically). And the idea of Windows Update running on a prod server ruining your day... what? IMO infra should be as immutable as possible. Patching or updating software should be a matter of spinning up new machines from an already-updated image (that you've built and tested elsewhere), bringing them into the cluster, and then decommissioning the old ones. When colos and dedicated servers were all the rage, that was difficult (and sometimes impractical), but in this day and age, with on-demand cloud provisioning, there's no excuse for companies that can use that sort of infra.
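<p>To make the partition point concrete: the behavior is controlled by a single setting, cluster_partition_handling, which (last I checked; verify against the docs for your version) defaults to ignore, i.e. both sides of the split keep accepting work. pause_minority is the usual "prefer consistency" choice:

    # rabbitmq.conf
    # default is "ignore": keep serving on both sides of the partition
    # pause_minority stops nodes on the minority side until the partition heals
    cluster_partition_handling = pause_minority

There are also autoheal and pause_if_all_down modes; which one is right depends on whether you'd rather lose availability or risk split-brain.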
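<p>And to make the wrapper idea concrete, here's a minimal sketch of the shape of such a library. This is not the one we actually had; the names and defaults are made up for illustration, and it assumes the confluent-kafka Python client:

    # hypothetical wrapper: bake safe defaults in once so callers don't have to know them
    from confluent_kafka import Producer

    SAFE_DEFAULTS = {
        "acks": "all",               # wait for all in-sync replicas to ack
        "enable.idempotence": True,  # avoid duplicates on retry
        "compression.type": "lz4",
    }

    def make_producer(bootstrap_servers, **overrides):
        conf = {"bootstrap.servers": bootstrap_servers, **SAFE_DEFAULTS, **overrides}
        return Producer(conf)

    def send(producer, topic, key, value):
        # collect delivery errors and surface them instead of silently dropping messages
        errors = []
        producer.produce(topic, key=key, value=value,
                         on_delivery=lambda err, msg: errors.append(err) if err else None)
        producer.flush()  # runs the delivery callbacks
        if errors:
            raise RuntimeError("delivery failed: %s" % errors[0])

The point isn't these particular settings; it's that one team decides them once and everyone else just calls make_producer() and send().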