
Transactionally Staged Job Drains in Postgres

115 points · by johns · over 7 years ago

9 comments

brandur · over 7 years ago
(Author here.)

I've taken fire before for suggesting that any job should go into a database, but when you're using this sort of pattern with an ACID-compliant store like Postgres it is *so convenient*. Jobs stay invisible until they're committed with other data and ready to be worked. Transactions that roll back discard jobs along with everything else. You avoid so many edge cases and gain so much in terms of correctness and reliability.

Worker contention while locking can cause a variety of bad operational problems for a job queue that's put directly in a database (for the likes of delayed_job, Que, and queue_classic). The idea of staging the jobs first is meant as a compromise: all the benefits of transactional isolation, but with significantly less operational trouble, at the cost of only slightly delayed jobs as an enqueuer moves them out of the database and into a job queue.

I'd be curious to hear what people think.
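The transactional semantics brandur describes can be sketched in a minimal, self-contained Ruby simulation (this is an illustration, not the article's actual implementation; `FakeDB` and the job hashes are invented for the example): jobs inserted inside a transaction become visible only on commit, and a rollback discards them along with everything else.

```ruby
# In-memory sketch of transactionally staged jobs: inserts are buffered
# until COMMIT, so the enqueuer can never see a job whose transaction
# did not complete.
class FakeDB
  attr_reader :staged_jobs

  def initialize
    @staged_jobs = []   # committed, enqueuer-visible rows
  end

  # Run a block with transaction-like semantics: buffer inserts, then
  # publish them atomically on success, or discard everything on error.
  def transaction
    pending = []
    yield pending
    @staged_jobs.concat(pending)   # COMMIT: jobs become visible
  rescue StandardError
    # ROLLBACK: pending jobs are silently discarded
  end
end

db = FakeDB.new

# Committed transaction: the job is staged alongside the other writes.
db.transaction do |txn|
  txn << { job_name: "send_email", args: ["user@example.com"] }
end

# Rolled-back transaction: the job never becomes visible.
db.transaction do |txn|
  txn << { job_name: "charge_card", args: [42] }
  raise "business logic failed"
end

puts db.staged_jobs.length            # => 1
puts db.staged_jobs.first[:job_name]  # => send_email
```

The same guarantee falls out of Postgres for free: an uncommitted `INSERT` into `staged_jobs` is simply invisible to the enqueuer's snapshot.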
heinrichhartman · over 7 years ago
Content aside: I've never seen a blog article as carefully typeset as this one:

- Font choices and sizes
- TOC
- Figures
- Code samples

... all look perfect. It even includes a carefully spaced initial.

I'd love to be able to replicate this on my Jekyll blog, but it looks like most of this is hand-crafted HTML/CSS: https://github.com/brandur/sorg
memracom · over 7 years ago
I think it is great that PostgreSQL is strong enough to allow people to build robust queuing systems, but I still think you are better off in the long run using a real message queuing system like RabbitMQ for this job.

Start out by running RabbitMQ on the same server as PostgreSQL, but do limit its use of cores and RAM. Then, as your business grows, you can easily scale to a separate RabbitMQ server, to a cluster of MQ servers, and to a distributed RabbitMQ service using clusters in multiple data centers with global queues synchronized via a RabbitMQ plugin.

The benefit of using RabbitMQ is that you begin to learn how message queuing fits into a system architecture, and that you will not run into corner cases and weird behaviors as long as you heed the advice of moving to a dedicated RabbitMQ server when your usage gets large enough.

An additional benefit is that when you learn how to integrate functionality by using a message queue (actor model) rather than a link editor, you can avoid the monolithic big-ball-of-mud problem entirely and easily integrate both monolithic functions and microservices in your app.

Background jobs are just one part of what a robust message queue gives you. In my opinion, the desire for background jobs is a design smell indicating a flaw in your architecture, which you can fix by adding a message queue system.
rraval · over 7 years ago
```ruby
loop do
  DB.transaction do
    # pull jobs in large batches
    job_batch = StagedJobs.order('id').limit(1000)

    if job_batch.count > 0
      # insert each one into the real job queue
      job_batch.each do |job|
        Sidekiq.enqueue(job.job_name, *job.job_args)
      end

      # and in the same transaction remove these records
      StagedJobs.where('id <= ?', job_batch.last).delete
    end
  end
end
```

Isn't this essentially a busy loop? You can achieve something much more performant by using `LISTEN` and `NOTIFY` to fire an event every time a row is inserted.

Then the enqueuer can do a preliminary scan of the table when it boots up and then just `LISTEN` instead of polling the DB.
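The difference rraval is pointing at can be sketched without a database (a pure-Ruby illustration, where a `Thread::Queue` stands in for a Postgres `LISTEN`/`NOTIFY` channel; the channel name and messages are invented): instead of repeatedly querying the table, the enqueuer blocks until it is woken by a notification.

```ruby
# Event-driven enqueuer sketch: Queue#pop blocks like LISTEN, and each
# push stands in for a NOTIFY fired by a committing transaction.
notifications = Queue.new   # stands in for: LISTEN staged_jobs_channel
drained = Queue.new         # records which wakeups the enqueuer handled

enqueuer = Thread.new do
  loop do
    msg = notifications.pop        # blocks until notified; no busy loop
    break if msg == :shutdown
    # On wakeup, the real enqueuer would drain a batch from staged_jobs.
    drained << msg
  end
end

# Committing transactions fire a notification after inserting a job.
notifications << "staged_job:1"
notifications << "staged_job:2"
notifications << :shutdown
enqueuer.join

puts drained.size   # => 2
```

As rraval notes, a real implementation would still scan the table once on boot (and perhaps periodically), since notifications delivered while the enqueuer was down are lost.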
pnathan · over 7 years ago
Interesting.

I'm working on delivering a Postgres-based job system right now; we cycle through states from an ENUM, landing eventually on a terminal state. Worker jobs (containers on a cluster) don't directly manipulate the state of the table; there's a controller system for that. Each controller in the (3-node) cluster has 2 connections to Postgres. Old jobs are DELETE'd when it's been "long ago enough".

Prior to addressing deadlocks from doing too much per transaction, initial load testing for *this* system suggested that the database was not the bounding factor in system throughput, but rather worker throughput. Initial load is estimated at under 500/day (*yawn*), but pushing the load to 100K/day didn't alter the outcome, although it made the cluster admin mildly annoyed.

One key reason I prefer the state-machine / enum approach is that it's *logically* obvious. At a certain point, I'm sure it'd have to change. I'm not sure how many concurrent mutations to separate rows a Postgres table can tolerate, but that serves as a hard upper bound.

Author: what kind of volume do you tolerate with this kind of design?
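The ENUM-plus-terminal-state design pnathan describes might look like the following sketch (the state names and transition table are illustrative, not taken from the comment): jobs advance through a fixed set of states, illegal jumps are rejected, and terminal states accept no further transitions.

```ruby
# Sketch of an ENUM-backed job state machine with a terminal state.
STATES   = %i[pending running succeeded failed].freeze
TERMINAL = %i[succeeded failed].freeze

# Allowed transitions; anything else is a bug in the controller.
TRANSITIONS = {
  pending: %i[running],
  running: %i[succeeded failed],
}.freeze

def transition(state, to)
  allowed = TRANSITIONS.fetch(state, [])
  raise ArgumentError, "illegal transition #{state} -> #{to}" unless allowed.include?(to)
  to
end

state = :pending
state = transition(state, :running)
state = transition(state, :succeeded)
puts TERMINAL.include?(state)   # => true

begin
  transition(:succeeded, :running)   # terminal states have no outgoing edges
rescue ArgumentError => e
  puts e.message
end
```

In Postgres the same table would carry a real `ENUM` column, with the controller enforcing the transition table before each `UPDATE`.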
bgentry · over 7 years ago
This pattern would basically be a clean migration away from a pure Postgres queue if either table bloat or locking becomes a performance problem. You maintain the benefits of transactional job enqueueing while only slightly worsening edge cases that could cause jobs to be run multiple times.

Just be sure to run your enqueueing process as a singleton, or each worker would be redundantly enqueueing lots of jobs. This can be guarded with a session advisory lock or a Redis lock.

Knowing that this easy transition exists makes me even more confident in just using Que and not adding another service dependency (like Redis) until it's really needed.
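The singleton guard bgentry mentions can be sketched in shape (in Postgres this would be `pg_try_advisory_lock(key)`; here a Ruby `Mutex#try_lock` stands in as an analogy, so the whole example is illustrative): only the worker that wins the lock drains staged jobs, and the others skip the cycle instead of double-enqueueing.

```ruby
# Singleton-enqueuer sketch: one lock, many would-be enqueuers.
LOCK = Mutex.new

def run_enqueuer_cycle(results)
  if LOCK.try_lock            # analogous to SELECT pg_try_advisory_lock(...)
    begin
      results << :drained     # winner drains the staged_jobs table
    ensure
      LOCK.unlock             # analogous to pg_advisory_unlock(...)
    end
  else
    results << :skipped       # loser does nothing this cycle
  end
end

results = Queue.new

# Hold the lock in the main thread, then have a worker attempt a cycle:
LOCK.lock
t = Thread.new { run_enqueuer_cycle(results) }
t.join
LOCK.unlock

run_enqueuer_cycle(results)   # now uncontended: this cycle drains

out = []
out << results.pop until results.empty?
puts out.inspect   # => [:skipped, :drained]
```

A session-level advisory lock has the nice property that Postgres releases it automatically if the enqueuer's connection dies, so a crashed singleton doesn't wedge the drain.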
njharman · over 7 years ago
> by selecting primed jobs in bulk and feeding them into another store like Redis

Doesn't this just mean a bunch of lost jobs when Redis fails?

Why not keep jobs, with job states (wait, done, etc.), in the reliable ACID store?
sandGorgon · over 7 years ago
This is so awesome. For a small team building infrastructure on the cheap, building background jobs on Postgres is so much nicer than using more complex tools like RabbitMQ, etc.

Are you planning on productizing this?
meritt · over 7 years ago
Sequence allocations occur globally and outside your transaction.

```ruby
StagedJobs.where('id <= ?', job_batch.last).delete
```

This will end up deleting a job id that was reserved inside a transaction: your enqueuer kicks off and fetches the jobs, then your transaction writes the job to the staged_jobs table, just in time for the enqueuer to delete it without ever queueing it.

You need to delete the specifically queued ids, not a numeric range.
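The race meritt describes can be shown with a self-contained simulation (arrays stand in for the `staged_jobs` table; the ids and names are invented): because sequence values are allocated outside the transaction, a row with a *lower* id can become visible *after* the drain runs, and a range delete then destroys a job that was never enqueued.

```ruby
# id 2 has committed; id 1 was allocated earlier but its transaction
# is still open, so the drain cannot see it yet.
committed = [{ id: 2, name: "b" }]

# Drain pass: read what is visible, enqueue it, remember the ids.
batch    = committed.dup
enqueued = batch.map { |j| j[:name] }
seen_ids = batch.map { |j| j[:id] }
max_id   = seen_ids.max

# Meanwhile the straggler transaction commits its lower id:
committed << { id: 1, name: "a" }

# Buggy cleanup: range delete also removes never-enqueued job id 1.
range_deleted = committed.reject { |j| j[:id] <= max_id }

# Correct cleanup: delete only the ids that were actually enqueued.
id_deleted = committed.reject { |j| seen_ids.include?(j[:id]) }

puts range_deleted.map { |j| j[:name] }.inspect  # => [] ("a" silently lost)
puts id_deleted.map { |j| j[:name] }.inspect     # => ["a"] ("a" survives for the next drain)
```

With an `id IN (...)` delete, job "a" remains staged and is picked up by the next drain cycle instead of vanishing.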