This entire situation is a barely contained disaster. Doing all the steps listed at the top of the article is not only exhausting and error-prone, it also dramatically lowers write performance.<p>The most frustrating part is that (as I understand it) SSD and NVMe hardware already exposes strict read and write ordering to the OS. That would allow file modification code to be written in a way that's both fast and correct. After all, these are the same primitives we already know how to use for memory concurrency, and they enable fast lockless libraries like ConcurrencyKit. But for files, POSIX only exposes an inconsistently implemented, coarse, heavyweight fsync(). End users are forced to navigate weird, undocumented, imprecise hoops that are hard to understand, hard to test, and perform poorly, just to do the one job the filesystem was supposed to have in the first place.<p>I'm curious how much better you could do by skipping the OS entirely and building a userspace disk API, similar to DPDK. If your database stores its data in a single file anyway, you wouldn't lose much by ditching the filesystem. (You would need specialised tools to do backups and to figure out how much free space you have, but it might be worth it.)<p>I've been writing my own tiny implementation of Kafka recently, and I was reading through Kafka's design docs to figure out how they solved this problem. Kafka basically gives up on trusting the OS to store files safely. Instead, it assumes any fault-tolerant Kafka deployment will be a cluster of machines, so it stores all messages (plus checksums) across all cluster instances and hopes at least one machine will survive without corruption when the power goes out.
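<p>For anyone who hasn't internalised the dance being complained about: the "safe" way to replace a file on POSIX usually looks something like the sketch below (write a temp file, fsync it, rename it into place, then fsync the directory so the rename itself is durable). This is a simplified illustration, not a portable guarantee; the exact requirements vary by OS and filesystem.

```python
import os

def atomic_write(path: str, data: bytes) -> None:
    """Crash-safe replacement of `path` with `data` (POSIX sketch)."""
    dirname = os.path.dirname(path) or "."
    tmp = os.path.join(dirname, ".tmp." + os.path.basename(path))

    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)           # flush file contents to stable storage
    finally:
        os.close(fd)

    os.rename(tmp, path)       # atomic replace on POSIX filesystems

    # fsync the containing directory so the rename itself survives a crash
    dfd = os.open(dirname, os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

Every one of those syscalls is a place to get it subtly wrong, and the two fsync() calls are exactly where the write-performance cost comes from.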
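<p>The checksum half of Kafka's approach is simple to sketch: each log record carries a CRC so a replica can detect corruption after an unclean shutdown and fall back to another machine's copy. This is just an illustration of length-plus-CRC framing, not Kafka's actual record format:

```python
import struct
import zlib

def encode_record(payload: bytes) -> bytes:
    """Frame a payload as: 4-byte length, 4-byte CRC32, then the bytes."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return struct.pack(">II", len(payload), crc) + payload

def decode_record(buf: bytes) -> bytes:
    """Verify the CRC before trusting the payload; raise on corruption."""
    length, crc = struct.unpack(">II", buf[:8])
    payload = buf[8:8 + length]
    if zlib.crc32(payload) & 0xFFFFFFFF != crc:
        raise ValueError("corrupt record: CRC mismatch")
    return payload
```

On startup a replica can scan its log tail, truncate at the first record whose CRC fails, and re-fetch the rest from a peer, which is why the design tolerates a sloppy filesystem.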