This entire situation is a barely contained disaster. Doing all the steps listed at the top of the article is not only exhausting and error-prone, it also dramatically lowers write performance.

The most frustrating part is that (as I understand it) SSD and NVMe hardware exposes strict read and write ordering to the OS anyway. That would allow file-modification code to be written in a way that's both fast and correct. After all, these are the same primitives we already know how to use for memory concurrency, the ones that enable fast lockless libraries like ConcurrencyKit. But for files, POSIX only exposes an inconsistently implemented, coarse, heavyweight fsync(). End users are forced to navigate weird, undocumented, imprecise hoops that are hard to understand, hard to test, and perform poorly, just to do the one job the filesystem was supposed to have in the first place.

I'm curious how much better you could do by skipping the OS entirely and making a userspace disk API, similar to DPDK. If your database code stores data in a single data file anyway, you wouldn't lose much by ditching the filesystem. (You would need specialised tools to do backups and to figure out how much free space you have, but it might be worth it.)

I've been writing my own tiny implementation of Kafka recently, and I was reading through Kafka's design docs to figure out how they solved this problem. Kafka basically gives up on trusting the OS to store files safely. Instead, it assumes any fault-tolerant Kafka deployment will be a cluster of machines, so it stores all messages (plus checksums) across all cluster instances and hopes at least one machine in the cluster will survive without corruption when the power goes out.
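For concreteness, here is a minimal sketch of the update dance those steps amount to, assuming the usual temp-file-plus-rename approach: write a temp file, fsync it, rename it over the target, then fsync the containing directory so the rename itself is durable. Function and file names are illustrative and error handling is abbreviated.

```c
/* Sketch of a crash-safe "replace file contents" sequence. */
#include <fcntl.h>
#include <stdio.h>    /* rename() */
#include <unistd.h>

int replace_file(const char *dir, const char *tmp, const char *dst,
                 const char *data, size_t len)
{
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    if (write(fd, data, len) != (ssize_t)len) { close(fd); return -1; }
    if (fsync(fd) != 0) { close(fd); return -1; }   /* flush file data */
    close(fd);

    if (rename(tmp, dst) != 0) return -1;           /* atomic swap of names */

    int dfd = open(dir, O_RDONLY | O_DIRECTORY);    /* persist the rename */
    if (dfd < 0) return -1;
    int rc = fsync(dfd);
    close(dfd);
    return rc;
}
```

Every one of those fsync() calls is a full, heavyweight barrier, which is exactly the performance complaint above.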
Looks to me like yet another reason to use SQLite instead of «flat files». Postgres, for example, seems like an increase in complexity of many orders of magnitude.

Possibly related: we've been using HDF5 for a lot of data storage at work (raw image data). I often discover corrupt files, even though we (think we) are flushing files etc. I'd love to see some work on reliability there, but it's hard to know whether the article is relevant to those issues.

Also, what happens when you're on RAID? Even more assumptions out the window, I'd imagine?
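As a rough illustration of the SQLite suggestion: with a couple of pragmas, SQLite takes over the fsync choreography on every commit instead of leaving it to the application. This is only a sketch; the file and table names are invented.

```c
/* Let SQLite handle durability: WAL journaling plus full syncs on commit. */
#include <sqlite3.h>

int main(void)
{
    sqlite3 *db;
    if (sqlite3_open("data.db", &db) != SQLITE_OK) return 1;

    sqlite3_exec(db, "PRAGMA journal_mode=WAL;", 0, 0, 0);
    sqlite3_exec(db, "PRAGMA synchronous=FULL;", 0, 0, 0);

    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS kv(k TEXT PRIMARY KEY, v BLOB);"
        "INSERT OR REPLACE INTO kv VALUES('frame-0001', x'cafe');",
        0, 0, 0);

    sqlite3_close(db);
    return 0;
}
```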
This was a cool read. I'd be interested in a perspective on how the increasing focus on distributed consistency is impacting design and research at this local-consistency level. In particular, given the findings about the frequency of errors, I wonder if there are guidelines for coordinating local filesystem settings with distributed-system settings to maximize performance at the distributed scale. Anybody out there doing this?
On Linux, renameat2() with RENAME_EXCHANGE on directories ought to be a very helpful primitive.

Does anyone know what the state of glibc support is? The last thing I saw was this thread: https://sourceware.org/ml/libc-alpha/2015-11/msg00459.html
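A sketch of what that looks like in practice: where the libc in use lacks a renameat2() wrapper, the raw syscall (available since Linux 3.15) can be invoked directly. The directory names here are hypothetical.

```c
/* Atomically swap a freshly built directory tree with the live one. */
#define _GNU_SOURCE
#include <fcntl.h>        /* AT_FDCWD */
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef RENAME_EXCHANGE
#define RENAME_EXCHANGE (1 << 1)   /* value from <linux/fs.h> */
#endif

int main(void)
{
    long rc = syscall(SYS_renameat2, AT_FDCWD, "data.new",
                      AT_FDCWD, "data", RENAME_EXCHANGE);
    if (rc != 0)
        perror("renameat2");
    return rc == 0 ? 0 : 1;
}
```

The appeal is that both directories exist at all times: there is no window where "data" is missing, which plain rename()-based swaps of directories can't give you.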
I'm wondering what an easy way is to run tests for such file-consistency issues, say, how the system reacts to power loss? Unplugging the power cable isn't something most people can automate.
Are there no filesystems which address this issue? What challenges are involved? Does hardware support safe append-write filesystems? Why/why not?
Discussed at the time: https://news.ycombinator.com/item?id=10725859