In many projects Celery is overkill. A common scenario I've seen:<p><pre><code> 1. We have a problem, let's use Celery.
 2. Now we have one more problem.
</code></pre>
I found <a href="http://python-rq.org/" rel="nofollow">http://python-rq.org/</a> much handier, and it covers most cases. It uses Redis as the queue broker. Flask and Django integrations are available: <a href="https://github.com/mattupstate/flask-rq/" rel="nofollow">https://github.com/mattupstate/flask-rq/</a> <a href="https://github.com/ui/django-rq" rel="nofollow">https://github.com/ui/django-rq</a>
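For illustration, a minimal rq sketch (the function and connection details are assumptions, not from the linked docs):<p><pre><code>from redis import Redis
from rq import Queue

from myapp.tasks import count_words   # any plain function works, no decorator needed

q = Queue(connection=Redis())          # Redis is the broker
job = q.enqueue(count_words, "some text to process")
print(job.id)                          # check job.result later, or watch an rq worker process it
</code></pre>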
Good, basic practices to follow. Here are a few more:<p>- If you're using AMQP/RabbitMQ as your result backend it will create a lot of dead queues to store results in. This can easily overwhelm your RabbitMQ server if you don't clear them out frequently. I believe newer releases of Celery do this daily, but it's worth keeping in mind if your RMQ instance falls over in prod.<p>- Use chaining to build up "sequential" tasks that need doing, instead of calling one after another from the same task (or worse, doing a big mouthful of work in one task), as Celery can prioritise many small tasks better than one "master" task synchronously calling several tasks in a row (see the sketch below).<p>- Try to keep a consistent module import pattern for Celery tasks, or explicitly name them, as Celery does a lot of magic in the background so that task spawning is seamless to the developer. This is very important: you should never mix relative and absolute importing when you are dealing with tasks. "from foo import mytask" may be picked up differently than "import foo" followed by "foo.mytask", resulting in some tasks not being picked up by Celery(!)<p>- The OP's advice to never pass database objects is true; but go one step further and don't pass complex objects at all if you can avoid it. I vaguely remember some of the urllib/httplib exceptions in Python not being serializable, causing very cryptic errors if you didn't capture the exception and sanitise it or re-raise your own.<p>- Use proper configuration management to set up and configure Celery plus whatever messaging broker/backend you use. There's nothing more frustrating than spending your time trying to replicate somebody's half-assed Celery/Rabbit configuration that they didn't nail down and test properly in a clean-room environment.
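To illustrate the chaining and explicit-naming points, a minimal sketch (the app, broker URL and task bodies here are assumptions):<p><pre><code>from celery import Celery, chain

app = Celery("proj", broker="redis://localhost:6379/0")   # assumed broker URL

@app.task(name="proj.tasks.fetch")    # explicit name, immune to relative-import surprises
def fetch(url):
    return "payload for " + url

@app.task(name="proj.tasks.parse")
def parse(payload):
    return len(payload)

# Each step is scheduled as its own task instead of one big "master" task:
chain(fetch.s("http://example.com"), parse.s())()
</code></pre>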
I've worked 4+ years with Celery on 3 different projects and found it incredibly difficult to manage, both from the sysadmin and the coder point of view.<p>With that experience, we wrote a task queue using Redis & gevent that puts visibility & tooling first: <a href="http://github.com/pricingassistant/mrq" rel="nofollow">http://github.com/pricingassistant/mrq</a><p>Would love to have some feedback on that!
I disagree with the characterization in #1 (although I can't speak to the Celery particulars). I feel like if you have a job that is critical to your business process, the job should be persisted to your database and created within the same database transaction as whatever is kicking off the job.<p>Consider how background jobs are typically managed with RabbitMQ, Redis, etc. They are usually created in an "after commit" hook from whatever gets persisted to your relational database. In this scenario, there is a gap between the database transaction being committed and the job being sent to and persisted by RabbitMQ or Redis; during this gap the only record of that task is being held in a process's memory.<p>If this process gets killed suddenly during this gap, that background job will be lost forever. It sounds unlikely, but if RabbitMQ or Redis is down and the process has to sit and retry, waiting for them to come back online, the gap can be sizable.
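A rough sketch of that transactional approach (Django-style; the Order/PendingJob models and send_receipt task are assumptions, and the "outbox" table needs a periodic sweeper):<p><pre><code>from django.db import transaction

def place_order(request):
    with transaction.atomic():
        order = Order.objects.create(user=request.user)          # assumed model
        job = PendingJob.objects.create(                          # assumed "outbox" table
            kind="send_receipt", payload={"order_id": order.pk})
    # The job row is now durable. Even if enqueueing fails right here
    # (broker down, process killed), a sweeper can find un-enqueued
    # PendingJob rows later and retry them.
    send_receipt.delay(job.pk)
</code></pre>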
I would add:<p>1. Use task-specific logging if you have a bunch of tasks: <a href="http://blog.mapado.com/task-specific-logging-in-celery/" rel="nofollow">http://blog.mapado.com/task-specific-logging-in-celery/</a><p>2. Use statsd counters to keep track of basic statistics (counts + timers) for each task<p>3. Use supervisor + monit to restart workers after a lack of activity (I have seen workers go idle a few times and have never been able to track down why, but this is an easy fix)
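For point 1, Celery ships a per-task logger helper; a minimal sketch (the task body is an assumption):<p><pre><code>from celery.utils.log import get_task_logger

logger = get_task_logger(__name__)   # records carry the task name and id when run inside a task

@app.task
def import_feed(feed_url):
    logger.info("importing %s", feed_url)
    # ... actual work ...
</code></pre>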
Excellent resource. I remember wrestling with learning Celery and how to do some simple things; I loved finding Flower to monitor everything.<p>I will say, though, that Celery is probably overkill for a lot of the tasks people think to use it for. In my case it was mandated to support scaling for a startup that never launched, partly because they kept looking at new technologies for problems they didn't have yet.
Points 1 and 2 are only valid because the Celery database backend implementation uses generic SQLAlchemy. Chances are, if you are using a relational database, it's PostgreSQL, and it does have an asynchronous notification system (LISTEN/NOTIFY) that lets you specify which channel to listen/notify on.<p>With the psycopg2 module you can use this mechanism together with select(), so your worker thread(s) don't have to poll at all. They even have an example in the documentation.<p><a href="http://www.postgresql.org/docs/9.3/interactive/sql-notify.html" rel="nofollow">http://www.postgresql.org/docs/9.3/interactive/sql-notify.ht...</a><p><a href="http://initd.org/psycopg/docs/advanced.html#async-notify" rel="nofollow">http://initd.org/psycopg/docs/advanced.html#async-notify</a>
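The psycopg2 pattern looks roughly like this (the connection string and channel name below are assumptions):<p><pre><code>import select
import psycopg2
import psycopg2.extensions

conn = psycopg2.connect("dbname=tasks")   # assumed DSN
conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)

cur = conn.cursor()
cur.execute("LISTEN new_task;")            # assumed channel name

while True:
    # Block on the socket instead of polling the table
    if select.select([conn], [], [], 60) == ([], [], []):
        continue                           # timeout, nothing yet
    conn.poll()
    while conn.notifies:
        notify = conn.notifies.pop(0)
        print("got notification:", notify.channel, notify.payload)
</code></pre>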
Once you scale your worker pool up beyond a couple of machines you need some sort of config management with Celery. We use SaltStack to manage a large pool of celery workers and it does a pretty good job.
This is not a Celery-specific tip, but as Celery also likes to "tweak" your logging configuration you can use <a href="https://pypi.python.org/pypi/logging_tree" rel="nofollow">https://pypi.python.org/pypi/logging_tree</a> to see what's going on under the hood.
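Usage is a one-liner; call it after Celery has finished setting itself up to dump the whole logger hierarchy:<p><pre><code>import logging_tree
logging_tree.printout()   # prints every logger with its handlers and levels
</code></pre>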
I've been looking at Python task queues recently. Does anyone have experience with how Celery and rq stack up?<p>Rq is a lot smaller, more than 10x smaller by line count. So if it works just as well, I'd go with the simpler implementation.
Passing objects to Celery and not querying for fresh objects is not always a bad practice. If you have millions of rows in your database, querying for them is going to slow you way down. In essence, the reason you shouldn't use your database as the Celery backend is the same reason you might not want to query it for fresh objects. It depends on your use case, of course. Passing plain values/strings should be strongly considered too, since serializing and passing whole objects when you only need a single value is not good either.
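In practice that means enqueueing only the primitives the task needs; a minimal sketch (the task and helper names are assumptions):<p><pre><code>@app.task
def send_notification(email, message):
    # plain strings serialize cheaply and reliably across any broker
    deliver_email(email, message)          # assumed helper

# Pass just the values the task needs, not the whole User object:
send_notification.delay(user.email, "Your export is ready")
</code></pre>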
If you combine Celery with supervisord it's important to check the official config file[1]. At least two settings there are really important - `stopwaitsecs=600` and `killasgroup=true`. If you don't use them you might end up with a bunch of orphaned child Celery processes and your tasks might be executed more than once.<p>[1] <a href="https://github.com/celery/celery/blob/ee46d0b78d8ffc068d5b80e9568a5a050c61d1a8/extra/supervisord/celeryd.conf#L18" rel="nofollow">https://github.com/celery/celery/blob/ee46d0b78d8ffc068d5b80...</a>
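The relevant part of such a config looks roughly like this (the program name and command are assumptions; see the linked celeryd.conf for the full version):<p><pre><code>[program:celery]
command=celery worker -A proj --loglevel=INFO
; give running tasks time to finish before supervisord gives up and kills them
stopwaitsecs=600
; kill the whole process group, not just the parent, to avoid orphaned children
killasgroup=true
</code></pre>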
Am I the only person who was genuinely disappointed that this wasn’t about the vegetable?<p>It’s a sadly under-rated ingredient! The flavor is subtle but unmistakable.
Wondering about something: if you need to run a long task (5s to 10s, or even longer) in the background for an AJAX request, what should you do:<p>- use gevent + gunicorn, or Tornado, to keep a socket open while the worker is processing the task?<p>- use polling? (less efficient)<p>- use websockets? (but then the implementation is perhaps a bit more complex)<p>Can you do this simply using Flask?
As one of the authors of taskflow I'd like to give a little shout-out for it (since it can do similar things to Celery, hopefully more elegantly and easily).<p>Pypi: <a href="https://pypi.python.org/pypi/taskflow" rel="nofollow">https://pypi.python.org/pypi/taskflow</a><p>Comments, feedback and questions welcome :-)
I've heard so much about Celery but still have no clue when it would be used. Could someone give some specific examples of when you have used it? I don't really even know what a distributed task is.
I'd also add:
Be wary of context-dependent actions (e.g. render_template, user.set_password, sign_url, base_url), as you aren't in the application/request context inside a Celery task.
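With Flask, for example, you typically have to push an application context yourself inside the task; a minimal sketch (the app/celery wiring and template name are assumptions):<p><pre><code>from flask import render_template

@celery.task
def send_welcome_email(user_id):
    # render_template and friends need an active application context,
    # which a worker process does not have by default
    with app.app_context():
        body = render_template("welcome.html", user_id=user_id)
        # ... hand `body` off to your mailer ...
</code></pre>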
Has anybody been able to make a priority queue (with a single worker) in Celery?<p>E.g., execute other tasks only if there are no pending important tasks.
> when you have a proper AMQP like RabbitMQ<p>AMQP = Advanced Message Queuing Protocol, so it's wrong to say that a message broker is "an AMQP". Also, give Redis a try - it's much easier to set up and uses fewer resources.<p>We should probably talk about the elephant in the room when addressing newbies: the Celery daemon needs to be restarted each time new tasks are added or existing ones are modified. I got past that with the ugly hack of having only one generic task[1], but people new to Celery need to know what they're getting into.<p>[1]: <a href="https://github.com/stefantalpalaru/generic_celery_task" rel="nofollow">https://github.com/stefantalpalaru/generic_celery_task</a>
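The gist of the single-generic-task workaround looks something like this (a sketch of the idea, not the linked project's actual code):<p><pre><code>import importlib

@app.task
def run_function(dotted_path, *args, **kwargs):
    # Resolve the callable at execution time, so no per-function task has to
    # be registered with the worker up front. New code still has to reach the
    # worker's machine, but adding callables doesn't require new task names.
    module_path, func_name = dotted_path.rsplit(".", 1)
    func = getattr(importlib.import_module(module_path), func_name)
    return func(*args, **kwargs)

# Usage: run_function.delay("myapp.emails.send_welcome", user_id=42)
</code></pre>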