I work on the infrastructure of a startup (Whova). Our backend is entirely Django, so our background tasks run on Celery, since that's the main tool for this in the Python community and a lot of our legacy code is built around it. We process millions of tasks per day for various things: CPU-bound logic, I/O (email/push notifications, third-party API calls, ...), scheduled tasks, ... These tasks are split across 20 queues processed by multiple workers on 6 dedicated machines.<p>My Celery experience so far has been quite awful, to say the least.<p>From the infrastructure point of view, Celery has been the least reliable component of our stack by far.
As others have mentioned, one problem is that Celery is frequently used for things it was not meant for. And that's true for us too. If you dig deep into the documentation, Celery was originally designed for short-lived, CPU-bound tasks, but it ended up being used for long-lived tasks. And processing long-lived tasks is the root of many problems until you find the correct settings to make it work. Of course, these settings are either not documented, or the documentation is useless at best (it sometimes creates even more confusion). There are also very few good-quality resources online about this.<p>For several months, we dealt with Celery workers getting stuck and not processing tasks, Celery workers running out of memory, ... until we found the correct solution. And even now, we still have some random issues that we have difficulty tracking down due to the poor quality of monitoring around Celery.
It actually made me smile a few weeks ago when the DoorDash engineering team published a blog article about Celery in which they mentioned several issues we had encountered, including some whose cause they still don't understand but managed to mitigate (in particular, the stuck Celery queue: they need to use -Ofair to fix the scheduling algorithm!) [1]<p>It's also very easy for developers to make mistakes with Celery: Celery routing in Django is messy (routing of individual tasks and scheduled tasks), adding new queues needs some coordination at deployment time until you automate it, generating too many scheduled tasks can make your workers run out of memory, ... Celery definitely requires solid training for all the engineers who will work with it. To be fair, this is very likely true of any background processing tool: it is usually a critical part of the tech stack, but resources and training for it are scarce.<p>We are still using Celery 3. A few months ago, when they released Celery 4, we looked into upgrading, but it was way more work than expected, as the release broke the entire configuration syntax. The testing needed to deploy that to production was not worth it, especially considering that it took us months to find the tricky settings that finally made Celery somewhat stable, so why risk losing that? Now they are already at Celery 5.0, and they plan to release even more breaking updates: seriously, WTF! And if you try to report issues while using Celery 3, you'll just be told to upgrade.<p>To be frank, I believe Celery is a good project. There aren't many alternatives in Python anyway. But they don't seem to listen to what their users need. It really seems that there is a gap between what they expect people to do with Celery and what people actually do with it. 
I understand it's hard to provide a good default configuration that suits everyone, but then provide appropriate documentation on how to tune Celery for your use case, or clearly state the intended use case and limitations. So the last thing we need is more breaking versions and more uncertainty around Celery; what we need is more documentation!<p>If they really continue down that path, it's clear that we will eventually ditch Celery for something else. Celery, in our experience, is not production friendly unless you put major effort into it, or unless your project is fairly simple.<p>[1] <a href="https://doordash.engineering/2020/09/03/eliminating-task-processing-outages-with-kafka/" rel="nofollow">https://doordash.engineering/2020/09/03/eliminating-task-pro...</a>