If you can afford a one-off second of latency on your SQL queries, then using logical replication with pgbouncer seems way easier:

- Set up logical replication between the old and the new server (limitations exist on what is replicated, read the docs).

- PAUSE the pgbouncer (virtual) database. Your app will hang, but not disconnect from pgbouncer.

- Once replication has caught up, copy the sequences from the old to the new server. Sequences are not replicated with logical replication.

- Repoint pgbouncer at the new server and RESUME the virtual database.

You're done. If everything is automated, your app will see a temporary increase in SQL latency, but it will keep its TCP connections, so virtually no outage.
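A rough sketch of the cutover, assuming a pgbouncer admin console on port 6432, a virtual database named mydb, and PostgreSQL 10+ on the old server (for the pg_sequences view); all hostnames and names are placeholders:

    # 1. Pause the virtual database: clients block but stay connected
    psql -h pgbouncer-host -p 6432 -U admin pgbouncer -c "PAUSE mydb;"

    # 2. Copy sequence positions (logical replication does not carry them over)
    psql -h old-db -d app -At -c "
      SELECT format('SELECT setval(%L, %s, true);',
                    schemaname || '.' || sequencename, last_value)
      FROM pg_sequences WHERE last_value IS NOT NULL;" | psql -h new-db -d app

    # 3. Repoint pgbouncer at the new server (edit [databases] in pgbouncer.ini),
    #    reload the config, and resume
    psql -h pgbouncer-host -p 6432 -U admin pgbouncer -c "RELOAD;"
    psql -h pgbouncer-host -p 6432 -U admin pgbouncer -c "RESUME mydb;"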
Braintree (IIRC) had a really clever migration strategy, although I can't seem to find the blog post now. They paused all traffic at the load balancer, cut over to the new DB, and then resumed traffic. No requests failed, just a slight bump in latency while the LBs were paused.

The app in this article apparently had robust enough retry mechanisms that they were able to just eat the errors and not have customer issues. Color me impressed! I'm not sure how many teams can make that claim; that's a hard thing to nail down.
This is one of the areas where Postgres is so far behind MySQL that it's embarrassing. Zero-downtime migrations in MySQL have been common practice for over 10 years. The solution in the article is far from ideal due to its use of triggers, which can greatly increase the load on the database and slow down transactions. If you don't have a lot of load on your DB that's fine, but if you are pushing your DB hard this will bring it down.

In MySQL this started with triggers in pt-osc, but now there is gh-ost, which does it with replication. You can do something like this in Postgres using logical replication, but it still requires hand-holding, and to my knowledge there is no way to "throttle" the migration like you can with gh-ost. Where I work now we are building all this out so we can have first-class online migrations, but the chasm is still pretty big.
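For comparison, this is roughly what a throttled gh-ost run looks like on the MySQL side (table name and thresholds are made up; see the gh-ost docs for the real flag list):

    # --max-load backs off when the server is busy, --critical-load aborts,
    # and touching the throttle flag file pauses the row copy entirely
    gh-ost \
      --host=replica-host \
      --database=app \
      --table=orders \
      --alter="ADD COLUMN note VARCHAR(255)" \
      --max-load=Threads_running=25 \
      --critical-load=Threads_running=100 \
      --chunk-size=1000 \
      --throttle-flag-file=/tmp/gh-ost.throttle \
      --execute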
It's insane that it has to be this complex and require third-party software to accomplish…

Most modern RDBMS/NoSQL vendors allow a rolling upgrade where you roll in new servers and roll out the old ones seamlessly.

It's also crazy that AWS RDS doesn't do this with zero downtime by default by automating it this way. Why pay for hosted when the upgrade story is incomplete? Take downtime to upgrade a DB in 2021? Everyone must be joking.
I didn't read this article, but I really hate the tagline "done right". It expresses such a poor sense of humility, which is one of the most important traits in the world of software, perhaps the most important.
If you are in AWS, or have connectivity available, AWS Database Migration Service (DMS) makes this relatively trivial.

DMS for Postgres is based on logical replication, which is built into Postgres (unlike the trigger-based approach Bucardo uses behind the scenes). And DMS is very nearly point-and-click for this sort of migration.
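A sketch of the CLI side, assuming the source/target endpoints and the replication instance already exist (ARNs elided):

    # full load of the existing data, then ongoing change data capture (CDC)
    aws dms create-replication-task \
      --replication-task-identifier pg-migration \
      --source-endpoint-arn arn:aws:dms:...:endpoint/source \
      --target-endpoint-arn arn:aws:dms:...:endpoint/target \
      --replication-instance-arn arn:aws:dms:...:rep/instance \
      --migration-type full-load-and-cdc \
      --table-mappings file://table-mappings.json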
Bucardo has no performance impact, it just adds a trigger to every mutation. Negligible!

I really think articles of this kind are not useless, but they need to explicitly narrow their audience to set expectations at the start. This particular topic is PG replication for users who aren't very sensitive to write latency.
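To make the sarcasm concrete, this is the rough shape of trigger-based change capture (an illustration, not Bucardo's actual trigger): every INSERT, UPDATE, and DELETE on the table now performs an extra write, inside the same transaction.

    -- hypothetical delta table and trigger; EXECUTE FUNCTION is PostgreSQL 11+
    CREATE TABLE orders_delta (
      pk      bigint,
      txntime timestamptz DEFAULT now()
    );

    CREATE FUNCTION track_orders_delta() RETURNS trigger AS $$
    BEGIN
      IF TG_OP = 'DELETE' THEN
        INSERT INTO orders_delta (pk) VALUES (OLD.id);
      ELSE
        INSERT INTO orders_delta (pk) VALUES (NEW.id);
      END IF;
      RETURN NULL;  -- the return value is ignored for AFTER triggers
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER orders_delta_trg
      AFTER INSERT OR UPDATE OR DELETE ON orders
      FOR EACH ROW EXECUTE FUNCTION track_orders_delta();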
Has anyone here leveraged pglogical for this before? Looking at a similar migration, and pglogical has native extension support in AWS RDS. Would love to hear any success / horror stories!
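For context, my understanding of the basic setup is roughly this (hostnames and database names are placeholders; check the pglogical docs before trusting any of it):

    -- on the provider (the old server):
    CREATE EXTENSION pglogical;
    SELECT pglogical.create_node(
      node_name := 'provider',
      dsn       := 'host=old-db dbname=app');
    SELECT pglogical.replication_set_add_all_tables('default', ARRAY['public']);

    -- on the subscriber (the new server):
    CREATE EXTENSION pglogical;
    SELECT pglogical.create_node(
      node_name := 'subscriber',
      dsn       := 'host=new-db dbname=app');
    SELECT pglogical.create_subscription(
      subscription_name := 'migration',
      provider_dsn      := 'host=old-db dbname=app');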
Internet Archive link in case the original is overloaded: https://web.archive.org/web/20210611143214/https://engineering.theblueground.com/blog/zero-downtime-postgres-migration-done-right/
I wonder how much easier software engineering would be if there were a period where things are simply not available.

Which problems that are currently very difficult would become trivial if 6 hours of downtime every Sunday, say 10PM-4AM EST, were acceptable?
Can this be done using existing PostgreSQL functionality around replicas? I think there's a plugin for PostgreSQL that supports master-master replication as well.
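The built-in logical replication (PostgreSQL 10+) covers the one-direction migration case; a minimal sketch, assuming a database named app on both servers and the schema already copied to the new one:

    -- on the old server (publisher):
    CREATE PUBLICATION migration_pub FOR ALL TABLES;

    -- on the new server (subscriber):
    CREATE SUBSCRIPTION migration_sub
      CONNECTION 'host=old-db dbname=app user=replicator'
      PUBLICATION migration_pub;

True master-master needs an extension on top of this (pglogical, BDR, and the like).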
Very interesting article. But I have to ask: would taking down the system for a couple of hours be *that* bad?

I looked at the company, and while they seem rather large, they're not Netflix or AWS.

I imagine they need to be up for people to be able to check in, etc. But they could just schedule the planned maintenance around check-in times, announced far in advance. I'm sure there are a million other edge cases, but those can be thought out and weighed against the engineering effort.

Don't get me wrong, this is very cool. But I wonder what the engineering cost was. I'd think easily in the hundreds of thousands of dollars.
Does anyone know if this works when the target database is a replica/standby? The downside of using pg_dump is that it acquires a lock (ACCESS SHARE) on each table it's dumping, and doing this on production may cause some slowness.
I just don’t get all the “just shut down the db for an hour, you’re not Netflix” comments.

If you can do things properly, as an engineer, you absolutely should, even if your company serves “just” hundreds of thousands of users instead of hundreds of millions.

It’s not like they wrote their own database for this; they just used an open source tool.
> Blueground is a real estate tech company offering flexible and move-in ready furnished apartments across three continents and 12 of the world’s top cities. We search high and low for the best properties in the best cities, then our in-house design team transforms these spaces into turnkey spaces for 30 days or longer.

Seriously, how big can that DB be, and how bad would an hour of reduced availability / downtime be?

Seems like a lot of wasted engineering effort. “You are not Google.”