Asynchronous Processing in Web Apps, Part 1: A Database Is Not a Queue

59 points by nesquena over 12 years ago

11 comments

fdr over 12 years ago

I think the article is good; however, I feel compelled to weigh in from the countervailing general direction: I am a message queue skeptic. Not that they should never be used, but rather a general feeling that complex, dedicated message queue software is often used for engineering problems two or three orders of magnitude too small for it to deliver value. And, for most projects, queue replacement is not too difficult, especially if one's use of a database-backed queue is relatively naive.

As such, I will -- with an open mind -- suggest that general purpose database management software masquerading as a queue is not an outright antipattern, and that the occasions to use it this way are probably more common than the opposite, where a dedicated queue delivers clear value.

Here are my main reasons for thinking that, which almost entirely have something to do with being able to address one's queue and one's other data in the same transaction:

* Performance: a pedantically unsound but basically reasonable rule of thumb (as experienced by me) would suggest that one needs to be processing somewhere between hundreds and thousands of messages per second, with at least fifty (and maybe up to a hundred or two) parallel executors processing or emitting messages, before there are performance issues where the lower constants of a dedicated queuing package become attractive. Below this kind of throughput, one starts to experience more pain than gain.

* Correctness: Most database + queue integrations do a lousy job of what is effectively two-phase commit between different data storage instances (so that would include two+ RDBMSes, even if they are of the same kind, e.g. 2xPostgreSQL), and frequently the queue has to be counted among these (exception: when the queue contents can be lost/can be rebuilt/is idempotent at all times). Systems do a lousy job of making this work because it's pretty finicky to do a good job in many situations, i.e. expensive and time consuming.

* Constants, when dealing with other systems: When one does do a good job and has interesting requirements in the 'correctness' case, it often means doing forms of two-phase commit, whether explicitly supported by the system (e.g. PREPARE TRANSACTION) or a spiritual equivalent via carefully thought out state machines. In principle these could be relatively cheap, but typically, to avoid complexity, more expensive approaches are employed, such as a couple of extra UPDATE requests to poke at some home-grown state machine. Also, my experience indicates that as systems evolve, there will be inevitable bugs in these state machines that, by nature, span systems. Be vigilant and make sure you get more value than pain, and try to avoid having too many of them.

* HA is still hard: clustering is generally possible in principle, but make sure you read the fine print. For example, many people use Redis as a queue, but it is not really unlike any other monolithic database most generally -- its main draw as a vanilla queue is good execution-time constants. The same could be said of Apache ActiveMQ in its least byzantine configuration. One might think that one would get a lot of leverage 'for free' given the simpler semantics of queues vs the diversity of access methods in most general purpose databases, but so far I have not seen that to be the case, for the very good reason that a lot of people expect a lot of reliability out of their queues (no less than the transactional nature of some databases), and doing that is either most natural in a monolithic system or slow, or complicated, or both in a multi-master distributed system.

All in all, if you think you need dedicated queuing software to send a few dozen emails a second (that's a lot of email for most people!), think twice: it might still be a good idea, but brace yourself for these pitfalls or convince yourself that they probably mostly don't apply to you.
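A minimal sketch of the "same transaction" point made in this comment, using Python's built-in `sqlite3` module as a stand-in for a full RDBMS. The `orders` and `jobs` tables and the `place_order` helper are hypothetical names chosen for illustration; they do not come from the article or the comment.

```python
import sqlite3


def place_order(conn, customer, fail=False):
    """Insert an order and its follow-up job in one transaction."""
    # "with conn" commits on success and rolls back if an exception escapes.
    with conn:
        cur = conn.execute("INSERT INTO orders (customer) VALUES (?)", (customer,))
        conn.execute(
            "INSERT INTO jobs (kind, order_id) VALUES ('send_receipt', ?)",
            (cur.lastrowid,),
        )
        if fail:
            raise RuntimeError("simulated failure before commit")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, kind TEXT, order_id INTEGER)")

place_order(conn, "alice")
try:
    place_order(conn, "bob", fail=True)
except RuntimeError:
    pass  # both of bob's rows were rolled back together

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
job_count = conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
print(order_count, job_count)
```

The failed call leaves neither an order nor a job behind, so the two tables cannot drift apart. Getting the same guarantee with a separate queue server would need some form of the two-phase commit or state machine that the comment describes.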


icebraining over 12 years ago

"With a traditional database this typically means a service that is constantly querying for new processing tasks or messages."

Traditionally, but not necessarily; PostgreSQL supports the LISTEN and NOTIFY commands for asynchronous notifications, without polling.


lsh123 over 12 years ago

1) The article claims that queues are a replacement for a database and thus solve all the problems of the DB solution. In reality, if you have to have reliability and data integrity guarantees, then a message queue system will be using a DB under the hood (in the best case, it will be an off-the-shelf DB; in the worst, something homegrown). All the same problems and considerations apply. You run into the same DB problems with the need to manually optimize queue tables, etc., because message queues rarely do a good job there. Of course, all the commit issues discussed above apply too.

2) The issue with polling is easily solved by using modern DBs (e.g. Postgres with LISTEN/NOTIFY) or just plain old triggers/stored procedures. Yes, it will be a custom solution, but it will be simpler from an operational perspective than a "generic" do-it-all solution.

3) Lastly, it all depends on the task. At WePay we use gearmand backed by a DB. We have seen it working really well and helping us handle 1000x spikes from normal load w/o a single problem. However, we made quite a few modifications to the "stock" code to customize it to our use case and have 200% data integrity and reliability guarantees.
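A rough, in-process analogy for point 2, assuming Python's built-in `sqlite3` module rather than PostgreSQL: an `AFTER INSERT` trigger calls an application-defined function, which wakes a waiting worker so nothing has to poll the table. This is not LISTEN/NOTIFY itself. The `jobs` table, the `jobs_notify` trigger, and the `job_added` function are hypothetical names, and the callback only reaches code sharing the same connection.

```python
import queue
import sqlite3

wakeups = queue.Queue()  # in-process stand-in for a notification channel

conn = sqlite3.connect(":memory:")
# Some SQLite builds refuse application-defined functions inside triggers
# unless the schema is trusted; older versions ignore this pragma.
conn.execute("PRAGMA trusted_schema = ON")
# Register a Python callback that SQL can call as job_added(id).
conn.create_function("job_added", 1, lambda job_id: wakeups.put(job_id))

conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute(
    "CREATE TRIGGER jobs_notify AFTER INSERT ON jobs "
    "BEGIN SELECT job_added(NEW.id); END"
)

with conn:
    conn.execute("INSERT INTO jobs (payload) VALUES ('resize image 1')")
    conn.execute("INSERT INTO jobs (payload) VALUES ('resize image 2')")

# A worker blocks here until a job id arrives instead of polling the table.
first = wakeups.get(timeout=1)
second = wakeups.get(timeout=1)
print(first, second)
```

One difference worth keeping in mind: this trigger fires when the row is inserted, whereas PostgreSQL delivers a NOTIFY only when the sending transaction commits, so a real worker built this way would still need to cope with rows that are later rolled back.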


PaulHoule over 12 years ago

I dunno. I've seen companies go through 4 or 5 different message queue systems and find that none of them work quite right.

12 years or so ago I developed a few systems that used qmail as a message queue and I was pretty happy with that.

Asynchronous processing is a necessary evil, but I think a lot of people underestimate the difficulty. It's one thing to compress a video in the background, but if you have one asynchronous task that spawns a bunch of asynchronous tasks and they spawn asynchronous tasks and someday they all come together... Well, maybe they come together someday. There's a definite "complexity barrier" you hit when asynchronous applications rapidly become harder to maintain.

There are ways around this, but I've frequently seen MQ-based systems that never get "done".


jhartmann over 12 years ago

Nice article. One thing you should talk about, in my opinion, is systems like Redis. Redis can be used as a generalized key-value store, but it can also be used as a messaging platform. In fact, many systems like Storm (which would be another great topic) have easy integration with Redis pub/sub. While Redis and other solutions like it are probably not a good fit for all your data, they are great for mixed supporting data, caching and messaging. There are also really nice integrations if you like Java: these days you can make Spring message-driven beans that consume messages from Redis pub/sub very easily with minimal configuration. ZeroMQ is probably another technology that is worth discussing, either with a layer like Storm on top or by itself. Not every messaging system has to be heavyweight and cumbersome like in the good ole' days.


blcArmadillo over 12 years ago

Look forward to reading the future articles. Right now I'm working on a side project that has a need for some asynchronous tasks. I was planning on using beanstalkd but the one thing that concerns me is that if the queue goes down the outstanding jobs are not persisted. Any recommendations on the best way around this?


aidos over 12 years ago

Nice clear introduction.

You can write a simple Async DB system without running into deadlocks. Deadlocks tend to occur when you have 2 processes trying to lock 2 different resources in differing orders. Not a case you'll run into here.

Also, you could do something like (correct me if I'm wrong, but I think it should be safe):

    update queue set owner = '1_time_use_rand_key' where owner is null order by id limit 1;

Not that I'm advocating this, you should definitely use a proper system to manage these types of tasks! :)
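The claim step in this comment can be tried out with Python's built-in `sqlite3` module. This is a sketch under assumptions, not the commenter's exact statement: the `UPDATE ... ORDER BY ... LIMIT` form is MySQL syntax that many SQLite builds do not accept, so a subquery is swapped in to pick the oldest unowned row. The `payload` column, the `claim_next_job` helper, and the use of `uuid` for the one-time key are hypothetical.

```python
import sqlite3
import uuid


def claim_next_job(conn):
    """Claim the oldest unowned job; return its id, or None if none is left."""
    token = uuid.uuid4().hex  # one-time random key identifying this claim
    with conn:  # the UPDATE and the lookup share one transaction
        cur = conn.execute(
            "UPDATE queue SET owner = ? "
            "WHERE id = (SELECT id FROM queue WHERE owner IS NULL "
            "ORDER BY id LIMIT 1)",
            (token,),
        )
        if cur.rowcount == 0:
            return None  # nothing was unowned, so nothing was claimed
        row = conn.execute(
            "SELECT id FROM queue WHERE owner = ?", (token,)
        ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE queue (id INTEGER PRIMARY KEY, payload TEXT, owner TEXT)")
with conn:
    conn.executemany(
        "INSERT INTO queue (payload) VALUES (?)", [("job a",), ("job b",)]
    )

claimed = [claim_next_job(conn) for _ in range(3)]
print(claimed)
```

The worker finds out which row it got by looking up its own one-time key, which is why the key has to be unique per claim. Whether the single-statement claim stays safe with many concurrent workers depends on the database and its isolation level, so this sketch does not settle the question the comment raises.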

matticakes over 12 years ago

disclaimer: co-author of NSQ [1] here

Agreed. Message queues play an important role for us (bitly) in being a layer of fault tolerance, buffering, and a means to perform various operational tasks.

They're so important to us that we decided to build something that worked exactly like we wanted.

NSQ is a realtime distributed message processing system where we've taken the approach of focussing on making it ops friendly and easy to get started.

IMO, solutions that make it easy to develop on and administrate are most important... because things break. NSQ is straightforward to deploy (limit dependencies), simple to configure (runtime discovery), and client libraries provide a lot of functionality important for handling failures (like backoff, deferred messages, etc.) for a variety of use cases.

We've written an in-depth introductory blog post [2] about NSQ that has more details.

[1]: https://github.com/bitly/nsq
[2]: http://word.bitly.com/post/33232969144/nsq


randomtree over 12 years ago

If you would consider using NoSQL, then MongoDB might be a good fit. It has a messaging queue. It's called capped collections with tailable cursors (http://www.mongodb.org/display/DOCS/Tailable+Cursors). It's persistent, you don't need polling, and you don't need to remove processed messages.

eternalban over 12 years ago

An in-(or out of) process (async) driver (exposing an API supporting future semantics) can provide DB specific queueing to some extent. MQs and MOMs are very useful but not always necessary and the additional latency due to node hops is sometimes not acceptable.

kamakazizuru over 12 years ago

great post! this is what I wish I had as a starting point when I was learning how to get web apps to do work smarter rather than diving straight into celery documentation cause someone told me that would be the way to go!