
Building a better and scalable system for data migrations

68 points by YorickPeterse, 7 months ago

16 comments

WJW, 7 months ago

I agree with most points but would like to add one about "tunability". This article focuses mostly on correctness, which is of course important (and without which the whole system would be useless) but in itself is often not sufficient for large systems. I've seen it happen more than once that a big table rewrite causes so much load that the database might as well be down completely.

One thing I would love to see implemented more thoroughly in database systems is the ability to throttle and/or pause the load that DDL can impose. Normal SQL only says "what" the ALTER TABLE should do; the "how" is often rather lacking. At most you get CREATE INDEX CONCURRENTLY in Postgres or ALGORITHM=INSTANT in MySQL, but rarely do you get fine-grained enough control to say "use at most XYZ IOPS for this task", let alone the ability to vary that XYZ dynamically or to assign priorities to load caused by different queries.

AFAIK TiDB and pt-osc provide ways to pause a running migration, and gh-ost can also throttle a migration dynamically. Vitess also has several ways to manage migrations, as it leverages gh-ost. For Postgres, I don't think any of the currently popular tools have good ways to manage this, but I would love to be proven wrong.
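A minimal sketch of what this kind of tunability looks like when done by hand in application code, since the database won't do it for you: a batched backfill with an operator-controlled pause file and a sleep-based throttle. The table, column, pause-file path, and psycopg2 stack are all assumptions for illustration.

```python
# Sketch of a throttled, pausable batched backfill (illustrative names;
# assumes psycopg2 and an integer primary key that is reasonably dense).
import os
import time

import psycopg2

PAUSE_FILE = "/tmp/migration.pause"  # hypothetical: touch to pause, remove to resume
BATCH_SIZE = 1000                    # small chunks keep per-batch lock time short
SLEEP_SECS = 0.5                     # crude throttle: yield IOPS between batches

conn = psycopg2.connect("dbname=app")

with conn.cursor() as cur:
    cur.execute("SELECT coalesce(max(id), 0) FROM users")
    max_id = cur.fetchone()[0]

last_id = 0
while last_id < max_id:
    while os.path.exists(PAUSE_FILE):  # operator-controlled pause switch
        time.sleep(5)
    with conn.cursor() as cur:
        # Backfill in small, resumable chunks instead of one giant UPDATE
        # that rewrites the whole table under full load.
        cur.execute(
            """UPDATE users SET email_normalized = lower(email)
               WHERE id > %s AND id <= %s AND email_normalized IS NULL""",
            (last_id, last_id + BATCH_SIZE),
        )
    conn.commit()
    last_id += BATCH_SIZE
    time.sleep(SLEEP_SECS)
```

gh-ost's dynamic throttling works along roughly similar lines: external throttle criteria that the running migration checks between chunks.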
sgarland, 7 months ago

> As I've hinted at before, migrations should be functions. Functions as in "a function written in a programming language", not functions as in "SQL snippets I just refer to as functions because it makes me sound smart".

Disagree. SQL is a declarative language that is both clear and universal. Feel free to generate the migrations however you'd like, but I want to see the SQL.

Another bonus (and I'm sure I'll be told I'm wrong) is that you don't need to write tests for the migration, because it's declarative. Assuming you know what it means (if you don't, maybe you shouldn't be administering a DB) and what its locking methods entail, you will get _precisely_ what it says, and nothing more. If you get a failure from, say, a duplicate entry when creating a UNIQUE constraint, that's orthogonal to the migration itself – you described the end state, the DB tried to make that happen, but was unable to do so due to issues with your data. All the tests in the world wouldn't catch that, short of selecting and deduping the column[s], and at that point, you're just creating work.

I am firmly convinced that any and all infrastructure should be declaratively instantiated and declaratively maintained. I do not want or need to test my Terraform files, nor my DDL.
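A concrete sketch of the distinction being drawn here, with invented table, column, and constraint names (psycopg2 assumed): the migration is one declarative statement, and a data-dependent failure is separable from the migration itself.

```python
# The migration states the end state; the database either reaches it or
# refuses. A refusal caused by duplicate data is a data problem, not a
# bug in the migration. (All names here are invented for illustration.)
import psycopg2
from psycopg2 import errors

conn = psycopg2.connect("dbname=app")
try:
    with conn.cursor() as cur:
        cur.execute(
            "ALTER TABLE users ADD CONSTRAINT users_email_uniq UNIQUE (email)"
        )
    conn.commit()
except errors.UniqueViolation:
    # Postgres reports SQLSTATE 23505 when existing rows violate the new
    # constraint: the end state was described correctly, the data just
    # doesn't fit it yet.
    conn.rollback()
```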
toolslive, 7 months ago

Not exactly an SQL database, but we were migrating data _constantly_ (dead devices, load balancing, ...) for the data/metadata of an exabyte-scale object store...

Once client code can work with the mixed state (while the migration is in progress), it no longer matters how long the migration takes. Once the migration is robust enough that it can handle crashes, interrupts, and the like, it no longer matters how often you trigger the "continue". The migration is using too many IOPS? Just kill it and schedule a continuation later.

Also, your smallest step needs to be an atomic multi-update (you don't want to bother with partial failures).
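A sketch of the mixed-state pattern being described, using plain dicts as stand-ins for the old and new stores; everything here is an illustrative assumption, not the poster's actual system.

```python
# Mixed-state read path: correct regardless of migration progress.
old_store = {"k1": "v1", "k2": "v2"}  # data not yet migrated
new_store = {}                         # migration target

def read(key):
    # Clients tolerate the mixed state: check the new location first,
    # fall back to the old one. This stays correct no matter how far
    # along the background migration is, so its duration stops mattering.
    if key in new_store:
        return new_store[key]
    return old_store.get(key)

def migrate_one(key):
    # The smallest step is an atomic multi-update: in a real store this
    # put+delete pair would be a single transaction, so a crash can never
    # leave the key half-moved.
    if key in old_store:
        new_store[key] = old_store.pop(key)

# Safe to kill and resume at any point; reads stay correct throughout.
migrate_one("k1")
assert read("k1") == "v1" and read("k2") == "v2"
```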
fabianlindfors, 7 months ago

For people interested in this subject: you might also be interested in my project Reshape: https://github.com/fabianlindfors/reshape. It takes database migrations a step further than most tools, guaranteeing zero-downtime migrations for both schema and data changes.
yayitswei, 7 months ago

The database I'm working with, Rama, has an interesting way to do "instant migrations" at scale. Your migration function runs on every read until the data is durably migrated in the background. So you get migrated data immediately after deploy, while Rama quietly persists the changes during normal processing. No downtime.

https://blog.redplanetlabs.com/2024/09/30/migrating-terabytes-of-data-instantly-can-your-alter-table-do-this/
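The read-path idea generalizes beyond Rama; here is a generic sketch of the pattern (this illustrates the idea only, and is not Rama's actual API — record shapes and names are invented).

```python
# Read-path ("instant") migration: every read passes through the
# migration function until a background task has durably rewritten
# the stored record.
store = {"u1": {"version": 1, "name": "Ada Lovelace"}}

def migrate_v1_to_v2(record):
    # Example migration: split a single "name" field into two.
    first, _, last = record["name"].partition(" ")
    return {**record, "first_name": first, "last_name": last, "version": 2}

def read(key):
    record = store[key]
    if record["version"] < 2:
        record = migrate_v1_to_v2(record)  # applied on the fly, every read
    return record

def background_migrator():
    # Quietly persists the migration; reads already serve v2-shaped data,
    # so there is no downtime window.
    for key, record in list(store.items()):
        if record["version"] < 2:
            store[key] = migrate_v1_to_v2(record)

assert read("u1")["last_name"] == "Lovelace"  # correct immediately after deploy
background_migrator()
assert store["u1"]["version"] == 2            # now durably migrated
```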
tinodb, 7 months ago

I'm a bit surprised to hear that is the state of migrations in Rails. Being a Django developer, I'm used to having most of the points you mention covered for a really long time already.

It's *timeless* by looking at a captured schema, which doesn't change with later code changes. That means you can't import or use methods of your models and have to copy those over. So a different way than using the VCS for this, but still.

Not sure I'd name the second point "scalable", but you can easily do data migrations as well.

It is really easy to use!

And you can certainly write tests for it, although that isn't included in the framework [0]. But you can do that with just SQL too [1].

What I find much harder with (somewhat) larger DBs is a) determining whether a migration will lock things up too much, and b) whether it is backwards compatible (so that you can roll back). That is the splitting into pre- and post-migration steps the article mentions. We currently use a linter for this, but it is still a bit basic [2].

[0] https://www.caktusgroup.com/blog/2016/02/02/writing-unit-tests-django-migrations/

[1] https://tapoueh.org/blog/2017/08/sql-regression-tests/

[2] https://github.com/3YOURMIND/django-migration-linter
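For readers who haven't seen the Django machinery being described, a minimal data migration looks roughly like this (app, model, and field names are invented). Note that the model comes from the captured historical schema via apps.get_model(), not from an import — which is the "timeless" property and the reason model methods must be copied over.

```python
# A minimal Django data migration (illustrative app/model/field names).
from django.db import migrations

def populate_slug(apps, schema_editor):
    # apps.get_model() returns the *historical* model as captured at this
    # point in the migration graph, so later code changes can't break it.
    Article = apps.get_model("blog", "Article")
    for article in Article.objects.filter(slug=""):
        article.slug = article.title.lower().replace(" ", "-")
        article.save(update_fields=["slug"])

class Migration(migrations.Migration):
    dependencies = [("blog", "0002_article_slug")]
    operations = [
        # noop as the reverse function keeps the migration reversible.
        migrations.RunPython(populate_slug, migrations.RunPython.noop),
    ]
```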
jensenbox, 7 months ago

I have been eyeing up https://atlasgo.io/ - I am going to give it a whirl soon.
oulipo, 7 months ago

Not directly linked to migration (although I'd love to know the best way to handle those when you have a kind of data lake with a big number of past tables that have been archived on S3).

I'm trying to move my database to Postgres. There is a part which is "describing all the objects" (object ID, properties, etc.), and a huge table which is a log of events that I'm storing in case I want to data-mine it later.

Of course this last table:

- is huge (or should become huge at some point)
- is better suited to columnar storage
- might be archived from time to time on S3

My initial thinking was to store it in Postgres "natively", or as a "duckdb/clickhouse" extension with Postgres-querying capabilities, keep the last 90 days of data in the database, and regularly run a script to export the rest as Parquet files on S3.

Does this seem reasonable? Is there a "best practice" for doing this?

I also want to do the same with "audit logs" of everything going on in the system (modifications to the fields, actions taken by users on the dashboard, etc.).

What would you recommend?
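One way the proposed 90-day retention job could look as a sketch — table name, bucket, and the psycopg2/pyarrow/boto3 stack are all assumptions, not a recommendation:

```python
# Export events older than 90 days to Parquet on S3, then prune them
# from Postgres. A very large table would need batching; this keeps the
# shape of the idea only.
import datetime

import boto3
import psycopg2
import pyarrow as pa
import pyarrow.parquet as pq

cutoff = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=90)
conn = psycopg2.connect("dbname=app")

with conn.cursor() as cur:
    cur.execute("SELECT id, ts, payload FROM events WHERE ts < %s", (cutoff,))
    rows = cur.fetchall()

if rows:
    table = pa.table({
        "id": [r[0] for r in rows],
        "ts": [r[1] for r in rows],
        "payload": [r[2] for r in rows],
    })
    pq.write_table(table, "/tmp/export.parquet")
    key = f"events/{cutoff:%Y-%m-%d}.parquet"
    boto3.client("s3").upload_file("/tmp/export.parquet", "my-archive-bucket", key)

    # Only delete after the upload succeeded.
    with conn.cursor() as cur:
        cur.execute("DELETE FROM events WHERE ts < %s", (cutoff,))
    conn.commit()
```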
vc289, 7 months ago

It would be interesting to consider branchable DBs as part of the migration cycle, especially for forward and backward time skips.

While not dealing with that kind of scale yet, our application (an AI data engineer that has done migration work for users) needs to do before-and-after comparisons and find diffs. We use a branchable DB to compute those changes efficiently (DoltGres).

Could be an interesting thing to consider, since it's worked well for us for that part.

Our build if you wanna check that out too -> https://Ardentai.io
victorNicollet, 7 months ago

Avoiding SQL migrations was my #1 reason for moving to event sourcing.

This approach cuts the "database server" into an event stream (an append-only sequence of events) and a cached view (a read-only database that is kept up-to-date whenever events are added to the stream, and can be queried by the rest of the system).

Migrations are overwhelmingly cached-view migrations (which don't touch the event stream), and in very rare cases event stream migrations (which don't touch the cached view).

A cached-view migration is made trivial by the fact that multiple cached views can co-exist for a single event stream. Migrating consists of deploying the new version of the code to a subset of production machines, waiting for the new cached view to be populated and up-to-date (this can take a while, but the old version of the code, with the old cached view, is still running on most production machines at this point), and then deploying the new version to all other production machines. Rollback follows the same path in reverse (with the advantage that the old cached view is already up-to-date, so there is no need to wait).

An event stream migration requires a running process that transfers events from the old stream to the new stream as they appear (transforming them if necessary). Once the existing events have been migrated, flip a switch so that all writes point to the new stream instead of the old one.
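The two-view coexistence that makes cached-view migrations trivial can be sketched concretely — the event shapes and projections below are invented for illustration:

```python
# A cached view is just a fold over the append-only event stream, so a
# "migration" is simply a second fold built side by side with the first.

def project_v1(events):
    # Old cached view: running total per account.
    view = {}
    for e in events:
        if e["type"] == "deposit":
            view[e["account"]] = view.get(e["account"], 0) + e["amount"]
    return view

def project_v2(events):
    # New cached view: per-account deposit *history*, not just totals.
    view = {}
    for e in events:
        if e["type"] == "deposit":
            view.setdefault(e["account"], []).append(e["amount"])
    return view

events = [
    {"type": "deposit", "account": "a", "amount": 10},
    {"type": "deposit", "account": "a", "amount": 5},
]
# Both views co-exist during the deploy; once v2 is populated and
# up-to-date, traffic moves over and v1 can be dropped. Rollback is the
# same move in reverse, with v1 already warm.
assert project_v1(events) == {"a": 15}
assert project_v2(events) == {"a": [10, 5]}
```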
brunoarueira, 7 months ago

I liked the author's line of thought and have suffered through similar points at a much smaller scale! Once, I planned a migration which moved multiple columns (and the associated data) from one table to another, but I miscalculated the time to complete and the amount of CPU/RAM it would need. If I'd had the right tools, I probably would have scheduled it better and planned a better approach.
gregw2, 7 months ago

Have you considered doing the migration via Liquibase? I don't know that it supports parallelism in the migration process, but you could do most of the rest in it (pre/post-migration scripts, rollbacks, test cases within precondition changesets, etc.). Not a real programming language, though; it's XML config files or annotated SQL scripts...
tianzhou, 7 months ago

This post addresses the data migration specifics well. Beyond that, there is also the organizational side: scaling how a team deploys database migrations. We've built a project, https://github.com/bytebase/bytebase, to address this.
jakozaur, 7 months ago

SQL migrations run in place, make hard-to-reverse changes, and production is the only realistic environment for them.

I wish there were a better way to run blue/green DB deployments. Tooling for this is rare (e.g. gh-ost) and not that usable below big-tech scale.
bjornsing, 7 months ago

Having worked on migrating petabyte-scale sharded SQL databases, I get a sinking feeling when I think about automating this. I think it would actually be a pretty good test case for artificial superintelligence.
Temporary_31337, 7 months ago

I work for a NoSQL DB vendor. We don't have a fixed schema, so there are no migrations. If you want fast access to some parts of the data, you just add an index and reindex asynchronously. If you need SQL for Tableau or PowerBI, you just add a view, which is also like an index.

But the underlying data and its model can be in flux, and we handle exabyte scale, HA/DR, rebalancing, etc.