If you can afford a one-off second of latency on your SQL queries, then using logical replication with pgbouncer seems way easier:

- Set up logical replication between the old and the new server (limitations exist on what is replicated, read the docs).

- PAUSE the pgbouncer (virtual) database. Your app will hang, but not disconnect from pgbouncer.

- Once replication has caught up, copy the sequences from the old to the new server. Sequences are not replicated with logical replication.

- Repoint pgbouncer at the new server and RESUME the virtual database.

You're done. If everything is automated, your app will see a temporary increase in SQL latency, but it will keep its TCP connections, so virtually no outage.
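A rough sketch of the cutover, assuming a pgbouncer admin console on port 6432, a virtual database named mydb, and PostgreSQL 10+ on the old server (for the pg_sequences view); all hostnames and names are placeholders:

    # 1. Pause the virtual database: clients block but stay connected
    psql -h pgbouncer-host -p 6432 -U admin pgbouncer -c "PAUSE mydb;"

    # 2. Copy sequence positions (logical replication does not carry them over)
    psql -h old-db -d app -At -c "
      SELECT format('SELECT setval(%L, %s, true);',
                    schemaname || '.' || sequencename, last_value)
      FROM pg_sequences WHERE last_value IS NOT NULL;" | psql -h new-db -d app

    # 3. Repoint pgbouncer at the new server (edit [databases] in pgbouncer.ini),
    #    reload the config, and resume
    psql -h pgbouncer-host -p 6432 -U admin pgbouncer -c "RELOAD;"
    psql -h pgbouncer-host -p 6432 -U admin pgbouncer -c "RESUME mydb;"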
Braintree (IIRC) had a really clever migration strategy, although I can't seem to find the blog post now. They paused all traffic at the load balancer, cut over to the new DB, and then resumed traffic. No requests failed, just a slight bump in latency while the LBs were paused.

The app in this article apparently had robust enough retry mechanisms that they were able to just eat the errors and not have customer issues. Color me impressed! I'm not sure how many teams can make that claim; that's a hard thing to nail down.
This is one of the areas where Postgres is so far behind MySQL that it's embarrassing. Zero-downtime migrations in MySQL have been common practice for over 10 years. The solution in the article is far from ideal due to its use of triggers, which can greatly increase the load on the database and slow down transactions. If you don't have a lot of load on your DB that's fine, but if you are pushing your DB hard this will bring it down.

In MySQL this started with triggers in pt-osc, but now there is gh-ost, which does it with replication. You can do something like this in Postgres using logical replication, but it still requires hand-holding, and to my knowledge there is no way to "throttle" the migration like you can with gh-ost. Where I work now we are building all this out so we can have first-class online migrations, but the chasm is still pretty big.
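For comparison, this is roughly what a throttled gh-ost run looks like on the MySQL side (table name and thresholds are made up; see the gh-ost docs for the real flag list):

    # --max-load backs off when the server is busy, --critical-load aborts,
    # and touching the throttle flag file pauses the row copy entirely
    gh-ost \
      --host=replica-host \
      --database=app \
      --table=orders \
      --alter="ADD COLUMN note VARCHAR(255)" \
      --max-load=Threads_running=25 \
      --critical-load=Threads_running=100 \
      --chunk-size=1000 \
      --throttle-flag-file=/tmp/gh-ost.throttle \
      --execute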
It's insane that it has to be this complex and require third-party software to accomplish…

Most modern RDBMS/NoSQL vendors allow a rolling upgrade where you roll in new servers and roll out the old ones seamlessly.

It's also crazy that AWS RDS doesn't do this with zero downtime by default by automating it this way. Why pay for hosted when the upgrade story is incomplete? Take downtime to upgrade a DB in 2021? Everyone must be joking.
I didn't read this article, but I really hate the tagline "done right". It expresses such a poor sense of humility, which is one of the most important traits in the world of software, perhaps the most important.
If you are in AWS, or have connectivity available, AWS Database Migration Service (DMS) makes this relatively trivial.

DMS for Postgres is based on logical replication, which is built into Postgres (unlike the trigger-based approach Bucardo uses behind the scenes). And DMS is very nearly point-and-click for this sort of migration.
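A sketch of the CLI side, assuming the source/target endpoints and the replication instance already exist (ARNs elided):

    # full load of the existing data, then ongoing change data capture (CDC)
    aws dms create-replication-task \
      --replication-task-identifier pg-migration \
      --source-endpoint-arn arn:aws:dms:...:endpoint/source \
      --target-endpoint-arn arn:aws:dms:...:endpoint/target \
      --replication-instance-arn arn:aws:dms:...:rep/instance \
      --migration-type full-load-and-cdc \
      --table-mappings file://table-mappings.json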
Bucardo has no performance impact, it just adds a trigger to every mutation. Negligible!

I really think articles of this kind are not useless, but they need to explicitly narrow their audience to set expectations at the start. This particular topic is PG replication for users who aren't very sensitive to write latency.
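To make the sarcasm concrete, this is the rough shape of trigger-based change capture (an illustration, not Bucardo's actual trigger): every INSERT, UPDATE, and DELETE on the table now performs an extra write, inside the same transaction.

    -- hypothetical delta table and trigger; EXECUTE FUNCTION is PostgreSQL 11+
    CREATE TABLE orders_delta (
      pk      bigint,
      txntime timestamptz DEFAULT now()
    );

    CREATE FUNCTION track_orders_delta() RETURNS trigger AS $$
    BEGIN
      IF TG_OP = 'DELETE' THEN
        INSERT INTO orders_delta (pk) VALUES (OLD.id);
      ELSE
        INSERT INTO orders_delta (pk) VALUES (NEW.id);
      END IF;
      RETURN NULL;  -- the return value is ignored for AFTER triggers
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER orders_delta_trg
      AFTER INSERT OR UPDATE OR DELETE ON orders
      FOR EACH ROW EXECUTE FUNCTION track_orders_delta();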
Has anyone here leveraged pglogical for this before? Looking at a similar migration, and pglogical has native extension support in AWS RDS. Would love to hear any success / horror stories!
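For context, my understanding of the basic setup is roughly this (hostnames and database names are placeholders; check the pglogical docs before trusting any of it):

    -- on the provider (the old server):
    CREATE EXTENSION pglogical;
    SELECT pglogical.create_node(
      node_name := 'provider',
      dsn       := 'host=old-db dbname=app');
    SELECT pglogical.replication_set_add_all_tables('default', ARRAY['public']);

    -- on the subscriber (the new server):
    CREATE EXTENSION pglogical;
    SELECT pglogical.create_node(
      node_name := 'subscriber',
      dsn       := 'host=new-db dbname=app');
    SELECT pglogical.create_subscription(
      subscription_name := 'migration',
      provider_dsn      := 'host=old-db dbname=app');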
Internet Archive link in case the original is overloaded: https://web.archive.org/web/20210611143214/https://engineering.theblueground.com/blog/zero-downtime-postgres-migration-done-right/
I wonder how much easier software engineering would be if there were a period where things are simply not available.

Which problems that are currently very difficult would become trivial if 6 hours of downtime every Sunday, say 10PM-4AM EST, were acceptable?
Can this be done using existing PostgreSQL functionality around replicas? I think there's a plugin for PostgreSQL that supports master-master replication as well.
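The built-in logical replication (PostgreSQL 10+) covers the one-direction migration case; a minimal sketch, assuming a database named app on both servers and the schema already copied to the new one:

    -- on the old server (publisher):
    CREATE PUBLICATION migration_pub FOR ALL TABLES;

    -- on the new server (subscriber):
    CREATE SUBSCRIPTION migration_sub
      CONNECTION 'host=old-db dbname=app user=replicator'
      PUBLICATION migration_pub;

True master-master needs an extension on top of this (pglogical, BDR, and the like).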
Very interesting article. But I have to ask: would taking down the system for a couple of hours be *that* bad?

I looked at the company, and while they seem rather large, they're not Netflix or AWS.

I imagine they need to be up for people to be able to check in, etc. But they could just schedule the planned maintenance around check-in times, announced far in advance. I'm sure there are a million other edge cases, but those can be thought out and weighed against the engineering effort.

Don't get me wrong, this is very cool. But I wonder what the engineering cost was. I'd think easily in the hundreds of thousands of dollars.
Does anyone know if this works when the target database is a replica/standby? The downside of using pg_dump is that it acquires a lock (ACCESS SHARE) on each table it's dumping, and doing this on production may cause some slowness.
I just don’t get all the “just shut down the db for an hour, you’re not Netflix” comments.

If you can do things properly, as an engineer, you absolutely should, even if your company serves “just” hundreds of thousands of users instead of hundreds of millions.

It’s not like they wrote their own database for this; they just used an open source tool.
> Blueground is a real estate tech company offering flexible and move-in ready furnished apartments across three continents and 12 of the world’s top cities. We search high and low for the best properties in the best cities, then our in-house design team transforms these spaces into turnkey spaces for 30 days or longer.

Seriously, how big can that DB be, and how bad would an hour of reduced availability / downtime be?

Seems like a lot of wasted engineering effort. “You are not Google.”