This is a lovely document, typical of the SQLite authors and part of why SQLite is such a great piece of software.<p>If they are going to ask kernel filesystem authors to provide a better API, I have long felt there is one obvious API that is missing from Linux and would serve filesystem databases better than other approaches:<p><pre><code> fdatasync_rangev()
</code></pre>
That is, like fdatasync() (including committing the size and filesystem structure blocks needed to retrieve the committed data later), but committing only the data within a set of byte ranges and returning when that's done, so that it doesn't have to wait for all the rest of the data to be written.<p>That would allow the WAL or journal to be efficiently part of the database file itself. No need for extra files, file creation and deletion, directory operations or syncing directories. That's a bunch of durable metadata writes that can be removed, and filesystem implementation details that can be avoided.<p>It would also allow atomic and durable writes without writing the data twice (first to the WAL or rollback journal, then to the main database file) in some cases, depending on how the data is structured. That's a bunch of data writes avoided.<p>In cases where it can avoid two writes, that would also remove the intermediate barrier sync, doubling the commit-to-sync throughput.<p>Ideally, like other I/O (or at least reads), that should be available in a sensible async manner too, with completion when the covered data is committed durably, and only that data. I'm not sure what system API would provide good async I/O for something like SQLite though, if it's not able to use AIO or io_uring due to the kernel API being poorly suited to a self-contained library.<p>Finally, a couple of variants (probably via flags).<p>"No rush" flag: let the caller wait until those ranges are committed durably, but don't force them to be written faster than normal. That would allow ordinary fast buffered-write speed, while still providing the usual ACID guarantee that a DB returns success from COMMIT only when the data is durably committed. For some workloads that would be fine.<p>"Barrier" flag: make the call return immediately but delay all subsequent writes on the same file descriptor until the ranges are synced durably. 
This is similar to what's mentioned in the article, but more versatile by attaching to ranges. It's also not strictly necessary if you have the "no rush" flag.
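<p>To make the shape of the call concrete, here is a rough sketch in C. Everything named here is an assumption for illustration: struct sync_range, the FDSR_NO_RUSH and FDSR_BARRIER flag names, and the function fdatasync_rangev_fallback() do not exist in Linux. The body is only a conservative userspace stand-in that calls the real fdatasync(), which commits a superset of the requested ranges, so it is correct but gives none of the performance benefit described above.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical: one byte range to make durable. Not a real kernel type. */
struct sync_range {
    off_t offset;   /* start of the range, in bytes */
    off_t length;   /* number of bytes to commit */
};

/* Hypothetical flag names for the two variants; not real kernel flags. */
#define FDSR_NO_RUSH 0x1u  /* wait for durability, don't force early writeback */
#define FDSR_BARRIER 0x2u  /* order later writes on this fd after the ranges */

/*
 * Userspace stand-in for the proposed call. It validates its arguments and
 * then syncs the whole file with fdatasync(), which covers every requested
 * range (and more). Both flags are accepted but ignored: a synchronous
 * full-file sync is stricter than either variant, just slower.
 * Returns 0 on success, -1 with errno set on failure.
 */
int fdatasync_rangev_fallback(int fd, const struct sync_range *ranges,
                              size_t count, unsigned flags)
{
    if (flags & ~(FDSR_NO_RUSH | FDSR_BARRIER)) {
        errno = EINVAL;
        return -1;
    }
    if (count > 0 && ranges == NULL) {
        errno = EINVAL;
        return -1;
    }
    for (size_t i = 0; i < count; i++) {
        if (ranges[i].offset < 0 || ranges[i].length < 0) {
            errno = EINVAL;
            return -1;
        }
    }
    if (count == 0)
        return 0;           /* nothing requested, nothing to commit */

    return fdatasync(fd);   /* over-syncs: commits all dirty data of the file */
}
```

A database using an in-file journal would pass the journal's byte range at COMMIT, e.g. a single struct sync_range covering the newly appended records. A real kernel implementation would be free to skip dirty pages outside those ranges, which this fallback cannot do.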