> <i>If it takes a long time to copy a database and it gets updated midway through, rsync may give me an invalid database file. The first half of the file is pre-update, the second half is post-update, and they don’t match. When I try to open the database locally, I get an error</i><p>Of course! You can't copy the file of a running, active db that is receiving updates; that can only result in corruption.<p>For replicating sqlite databases safely there is<p><a href="https://github.com/benbjohnson/litestream">https://github.com/benbjohnson/litestream</a>
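As a rough sketch of how that looks in practice (the paths and bucket URL here are placeholders, not from the article), litestream tails the WAL and streams changes to a replica:<p><pre><code> # continuously replicate a live database (placeholder paths/bucket)
 litestream replicate /path/db.sqlite s3://my-bucket/db
 # restore the latest replicated copy somewhere else
 litestream restore -o /path/restored.sqlite s3://my-bucket/db
</code></pre>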
How to copy databases between computers? Just send a circle and forget about the rest of the owl.<p>As others have mentioned, an incremental rsync would be much faster, but what bothers me the most is that he claims sending SQL statements is faster than sending the database while completely omitting the fact that you have to execute those statements. And then run /optimize/. And then run /vacuum/.<p>Currently I have a scenario in which I have to "incrementally rebuild" a database from CSV files. In my particular case recreating the database from scratch is more optimal, but despite heavy optimization it still takes half an hour just to run the batch inserts on an empty in-memory database, create the indexes, and so on.
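For context, a minimal sketch of that kind of rebuild using the sqlite3 shell's bulk import (table, column, and file names are made up):<p><pre><code> # placeholder table/column/file names; load the CSVs first, build indexes last
 {
   echo '.mode csv'
   echo '.import data.csv my_table'
   echo 'CREATE INDEX idx_my_table_col ON my_table(col);'
 } | sqlite3 rebuilt.db
</code></pre>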
SQLite has an official tool for this, fwiw: <a href="https://www.sqlite.org/rsync.html" rel="nofollow">https://www.sqlite.org/rsync.html</a><p>It works at the page level:<p>> The protocol is for the replica to send a cryptographic hash of each of its pages over to the origin side, then the origin sends back the complete content of any page for which the hash does not match.
The recently released sqlite3_rsync utility uses a version of the rsync algorithm optimized for the internal structure of a SQLite database. It compares the internal data pages efficiently, then only syncs changed or missing pages.<p>Nice tricks in the article, but you can more easily use the built-in utility now :)<p>I blogged about how it works in detail here: <a href="https://nochlin.com/blog/how-the-new-sqlite3_rsync-utility-works" rel="nofollow">https://nochlin.com/blog/how-the-new-sqlite3_rsync-utility-w...</a>
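Usage has the same shape as rsync itself: the first argument is the origin database, the second is the replica that gets updated, and either side can be remote. A sketch with placeholder host/paths:<p><pre><code> sqlite3_rsync username@server:my_remote_database.db my_local_database.db
</code></pre>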
Saving to a text file is inefficient. I save sqlite databases using VACUUM INTO, like this:<p><pre><code> sqlite3 -readonly /path/db.sqlite "VACUUM INTO '/path/backup.sqlite';"
</code></pre>
From <a href="https://sqlite.org/lang_vacuum.html" rel="nofollow">https://sqlite.org/lang_vacuum.html</a> :<p><pre><code> The VACUUM command with an INTO clause is an alternative to the backup API for generating backup copies of a live database. The advantage of using VACUUM INTO is that the resulting backup database is minimal in size and hence the amount of filesystem I/O may be reduced.</code></pre>
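If the goal is transfer rather than just a local backup, a sketch of the combined flow (placeholder paths/host) is to vacuum into a compact copy and rsync that:<p><pre><code> sqlite3 -readonly /path/db.sqlite "VACUUM INTO '/path/backup.sqlite';"
 rsync -avz /path/backup.sqlite username@server:/path/backup.sqlite
</code></pre>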
I was surprised that he didn't try the on-the-fly compression provided by rsync:<p><pre><code> -z, --compress compress file data during the transfer
--compress-level=NUM explicitly set compression level
</code></pre>
It's probably faster to compress with gzip first and transfer the result afterwards, but it's nice to have the option of improving the transfer with just a flag.
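A sketch of what that looks like for pulling a remote database (placeholder names; the compression level is arbitrary):<p><pre><code> rsync -z --compress-level=9 username@server:my_remote_database.db my_local_database.db
</code></pre>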
In DuckDB you can do the same but export to Parquet, this way the data is an order of magnitude smaller than using text-based SQL statements. It's faster to transfer and faster to load.<p><a href="https://duckdb.org/docs/stable/sql/statements/export.html" rel="nofollow">https://duckdb.org/docs/stable/sql/statements/export.html</a>
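A minimal sketch of that round trip with the duckdb CLI (database and directory names are placeholders):<p><pre><code> # export every table as Parquet plus the schema/load scripts
 duckdb my_database.duckdb "EXPORT DATABASE 'dump_dir' (FORMAT PARQUET);"
 # ...transfer dump_dir... then rebuild on the other side
 duckdb restored.duckdb "IMPORT DATABASE 'dump_dir';"
</code></pre>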
SQLite has an session extension, which will track changes to a set of tables and produce a changeset/patchset which can patch previous version of an SQLite database.<p><a href="https://www.sqlite.org/sessionintro.html" rel="nofollow">https://www.sqlite.org/sessionintro.html</a>
If you're regularly syncing from an older version to a new version, you can likely optimize further using gzip with the "--rsyncable" option. It reduces compression by ~1% but makes it so that differences from one version to the next stay localized instead of cascading through the full length of the compressed output.<p>Another alternative is to skip compressing the dump output, let rsync calculate the differences from a previous uncompressed dump to the current dump, then have rsync compress the change sets it sends over the network (rsync -z).
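--rsyncable isn't in every gzip build, but where it is available the flow is roughly (placeholder names):<p><pre><code> sqlite3 my_database.db .dump | gzip --rsyncable > my_database.sql.gz
 rsync -z my_database.sql.gz username@server:my_database.sql.gz
</code></pre>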
I think this could be a single pipeline?<p><pre><code> ssh username@server "sqlite3 my_remote_database.db .dump | gzip -c" | gunzip -c | sqlite3 my_local_database.db
</code></pre>
Great example of Cunningham's Law: <a href="https://en.wikipedia.org/wiki/Ward_Cunningham#:~:text=%22Cunningham's%20Law%22" rel="nofollow">https://en.wikipedia.org/wiki/Ward_Cunningham#:~:text=%22Cun...</a>
Wait... why would you even think about rsyncing a database that can get changed while being copied?<p>Isn't this a case for proper database servers with replication?<p>Or if it's an infrequent process done for dev purposes just shut down the application doing writes on the other side?
Does the author not know that rsync can use compression (rsync -z | --compress | --compress-level=<n>), or does he not think it worthwhile to compare that data point?<p>I just tried some comparisons (albeit with a fairly small sqlite file). The compressed text was only about 84% of the size of the compressed binary database, which isn't negligible, but not necessarily worth fussing over in every situation. (The binary compressed to 7.1%, so the 84% is relative to that.)<p>bzip2 performed better on both formats; its compression of the binary database was better than gzip's compression of the text (91.5%), and bzip2 on the text was better than on the binary (92.5%).<p>bzip2 isn't available inside rsync, but this suggests that if you're going with an external compression solution, gzip may not be the best choice when you care about every percentage point of reduction.<p>If you don't care about every percentage point, maybe just use rsync compression.<p>One thing worth mentioning: if you are updating an existing file, rsync will only compress what it actually sends. To replicate that with the text solution, you would have to retain the text dump on both sides to do the update between them.
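For anyone who wants to reproduce that kind of comparison on their own database, something along these lines works (placeholder filename; sizes are in bytes):<p><pre><code> # compressed size of the binary database vs. the text dump, gzip vs. bzip2
 gzip -c my_database.db | wc -c
 bzip2 -c my_database.db | wc -c
 sqlite3 my_database.db .dump | gzip -c | wc -c
 sqlite3 my_database.db .dump | bzip2 -c | wc -c
</code></pre>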
I am sure you can just pipe all this so you don't have to use an intermediate gzipped file.<p>Just ssh into the machine, dump the SQL, and load it straight into SQLite locally.
I used to work at a company that had a management interface that used sqlite as its database; its multi-node / failover approach was also just... copying the file and rsyncing it. I did wonder about data integrity though: what if the file is edited while it's being copied over? But there are probably safeguards in place.<p>Anyway, I don't think the database file size was really an issue. It was a relatively big schema but without many indices, and performance wasn't a big consideration - hence the backend would concatenate query results into an XML file, then pass it through an xml->json converter, causing 1-2 second response times on most requests. I worked on a rewrite using Go where requests were more like 10-15 milliseconds.<p>But I still used sqlite, because that was actually a pretty good solution for the problem at hand: relatively low concurrency (up to 10 active simultaneous users), no server-side dependencies or installation needed, etc.
SQLite provides the sqlite3_rsync command to safely copy databases: <a href="https://sqlite.org/rsync.html" rel="nofollow">https://sqlite.org/rsync.html</a>
Ignoring for the moment the issues of syncing a database where the source DB may be running inserts/updates:<p>if one dumps tables as separate CSV files/streams and uses DuckDB to convert them to individual Parquet files, the rsync should run faster, since hopefully not every table is modified between syncs. There is an obvious overhead from the back-and-forth conversions, but DuckDB can directly export a database to SQLite.
I have not tested it myself, so this is just brainstorming.<p>Last but not least: when compressing/decompressing text dumps, use igzip or pigz if you want to speed things up. Also benchmark the compression levels.
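A rough sketch of the per-table conversion with the duckdb CLI, plus the parallel-gzip variant for plain text dumps (all names are placeholders):<p><pre><code> # one Parquet file per exported CSV table
 duckdb :memory: "COPY (SELECT * FROM read_csv_auto('my_table.csv')) TO 'my_table.parquet' (FORMAT PARQUET);"
 # parallel gzip for the text-dump route
 sqlite3 my_database.db .dump | pigz > my_database.sql.gz
</code></pre>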
This reminds me of something I did back in 2008. We were using Postgres, not SQLite, and we needed to get a backup from one production machine onto several other machines. Copying saturated the network for an hour, and we needed to copy four times if I remember right. We didn't want to wait four hours.<p>I found this tool laying around, "udpcast", and used it to send the backup over the network just once to all the destinations at the same time.<p><a href="https://kylecordes.com/2008/multicast-your-db-backups-with-udpcast" rel="nofollow">https://kylecordes.com/2008/multicast-your-db-backups-with-u...</a>
Since sqlite is just a simple file-level locking DB, I'm pretty shocked they don't have an option to store the indexes in separate files, for all kinds of obvious and beneficial reasons: you could easily exclude them from backups, and you could force a rebuild just by deleting them. Probably their reason for keeping everything internal has to do with making sure indexes are never out of sync, but that could just as easily be accomplished with hashing algos.
Git works pretty well for syncing binary files. You can even do this:<p><pre><code> $ cat .gitattributes
*.db diff=sqlite3
</code></pre>
And you can see how the sqlite files changed between commits. But yeah, if you make commits while a process is writing new rows or updating data, you'll have data loss if you also make a commit on another machine; better to use something like litestream in that case.
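For the attribute to actually do anything, git also needs a matching diff driver configured. A sketch with a hypothetical helper script (the script name and location are made up):<p><pre><code> # hypothetical helper that renders a .db file as SQL text for diffing
 printf '#!/bin/sh\nsqlite3 "$1" .dump\n' > ~/bin/sqlite3-textconv
 chmod +x ~/bin/sqlite3-textconv
 # point the "sqlite3" diff driver from .gitattributes at that helper
 git config diff.sqlite3.textconv ~/bin/sqlite3-textconv
</code></pre>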
In case someone is looking for the proper way of making a db copy in PostgreSQL rather than SQLite, pg_basebackup is the way: <a href="https://www.postgresql.org/docs/current/app-pgbasebackup.html" rel="nofollow">https://www.postgresql.org/docs/current/app-pgbasebackup.htm...</a>
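A sketch of a typical invocation (host, user, and target directory are placeholders):<p><pre><code> pg_basebackup -h db.example.com -U replication_user -D /var/lib/pgsql/standby -X stream -P
</code></pre>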
It all depends what you’re optimizing for.<p>Clearly, when you don’t transfer indexes, you will not have to transfer as much data.<p>However, the tradeoff is that the database is going to have to do more work to regenerate those indexes when you reconstruct it from the text dump at the destination (as opposed to the case where the indexes were included)
???<p>Why not just compress the whole database using `gzip` or `lz4` before rsyncing it instead? `zstd` works too, but it seems like it had a bug around compressing files with modified content.<p>Better yet, split your sqlite file into smaller pieces. It's not like it needs to contain all the app data in a single sqlite file.
I’ve been looking into a way to replicate a SQLite database and came across the LiteFS project by Fly.io. Seems like a solid drop-in solution backed by FUSE and Consul. Anybody used it in production? My use case is high availability between multiple VMs.
I have recently discovered a tool called mscp which opens multiple scp threads to copy down large files. It works great for speeding up these sorts of downloads.<p><a href="https://github.com/upa/mscp">https://github.com/upa/mscp</a>
zstd would be a better choice. It’s bonkers fast (especially when used with multithreading) and still compresses better than gzip. Alternatively, I’d recommend looking into bzip3, but I’m not sure if it would save time.
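A sketch of the dump pipeline with multithreaded zstd (placeholder filenames; the level is just an example):<p><pre><code> sqlite3 my_database.db .dump | zstd -T0 -19 -o my_database.sql.zst
 # and on the receiving side
 zstd -d -c my_database.sql.zst | sqlite3 restored.db
</code></pre>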
How long does this procedure take in comparison to the network transfer?<p>My first try would've been to copy the db file, gzip it, and then transfer it, but I can't tell whether compression will be that useful on the binary format.
Pretty good point. I just wonder whether databases in general can be perfectly reconstructed from a text dump. For instance, does the insertion order change in any of the operations between dumping and importing?
I guess for me it is obvious that you don't try to copy a running DB, only a backup.<p>So I see that basic stuff needs to be repeated, as people still miss these kinds of things.<p>But I learned that you can easily dump SQLite to a text file - neat!
isn't this rather obvious? doesn't everyone do this when it makes sense? obviously, it applies to other DBs, and you don't even need to store the file (just a single ssh from dumper to remote undumper).<p>if retaining the snapshot file is of value, great.<p>I'd be a tiny bit surprised if rsync could recognize diffs in the dump, but it's certainly possible, assuming the dumper is "stable" (it probably is, because it's walking the tables as trees). the amount of change detected by rsync might actually be a useful thing to monitor.
how well does just the sqlite database gzip? the indexes are a lot of redundant data, so you're going to get some efficiencies there, but probably less locality of data than in the text file, so maybe less?