The other half of the thread was on linux-ext4, starting here:
<a href="https://lists.openwall.net/linux-ext4/2018/04/10/33" rel="nofollow">https://lists.openwall.net/linux-ext4/2018/04/10/33</a><p>The part I found most interesting was here:
<a href="https://lists.openwall.net/linux-ext4/2018/04/12/8" rel="nofollow">https://lists.openwall.net/linux-ext4/2018/04/12/8</a><p>where the ext4 maintainer writes:<p>«
The solution we use at Google is that we watch for I/O errors using a completely different process that is responsible for monitoring
machine health. It used to scrape dmesg, but we now arrange to have I/O errors get sent via a netlink channel to the machine health monitoring daemon.
»<p>He later says that the netlink channel stuff was never submitted to the upstream kernel.<p>It all feels like a situation where the people maintaining this code knew deep down that it wasn't really up to scratch, but were sufficiently used to workarounds ("use direct io", "scrape dmesg") that they no longer thought of it as a problem.
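For a rough idea of what the "scrape dmesg" style of monitoring mentioned above looks like, here is a minimal sketch (my own illustration, not Google's code; the substring match is only a heuristic, and a real health daemon would follow the kernel log continuously rather than take a one-shot snapshot):<p><pre><code>/* Sketch: run dmesg once and flag lines that look like block-layer
 * I/O errors. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *p = popen("dmesg", "r");
    if (!p) { perror("popen"); return 1; }

    char line[4096];
    while (fgets(line, sizeof line, p)) {
        if (strstr(line, "I/O error"))
            fprintf(stderr, "possible disk trouble: %s", line);
    }
    pclose(p);
    return 0;
}
</code></pre>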
I sure wonder how IBM mainframes and other computer systems handle this intractable failure case. Joking aside, here's an excerpt from ''The UNIX-HATERS Handbook'':<p><i>Only the Most Perfect Disk Pack Need Apply</i><p>One common problem with Unix is perfection: while offering none of its own, the operating system demands perfection from the hardware upon which it runs. That's because Unix programs usually don't check for hardware errors--they just blindly stumble along when things begin to fail, until they trip and panic. (Few people see this behavior nowadays, though, because most SCSI hard disks do know how to detect and map out blocks as the blocks begin to fail.)<p>...<p>In recent years, the Unix file system has appeared slightly more tolerant of disk woes simply because modern disk drives contain controllers that present the illusion of a perfect hard disk. (Indeed, when a modern SCSI hard disk controller detects a block going bad, it copies the data to another block elsewhere on the disk and then rewrites a mapping table. Unix never knows what happened.) But, as Seymour Cray used to say, ''You can't fake what you don't have.'' Sooner or later, the disk goes bad, and then the beauty of UFS shows through.
Related article: PostgreSQL's fsync() surprise <a href="https://lwn.net/Articles/752063/" rel="nofollow">https://lwn.net/Articles/752063/</a> (April 18, 2018)<p>And the followup coverage from LSFMM summit (linked also in the OP discussion): <a href="https://lwn.net/Articles/752613/" rel="nofollow">https://lwn.net/Articles/752613/</a>
One thing I took away from this thread: when someone tells you something surprising, it's best to ask, rather than deny. It doesn't look good to say things like:<p>"Moreover, POSIX is entirely clear that successful fsync means all preceding writes for the file have been completed, full stop, doesn't matter when they were issued."<p>When you have plainly not actually verified that this is the case.<p>Instead, you could say, "doesn't POSIX say . . . ?" This has the following benefits: you avoid egg on your face, the conversation takes on a less aggressive tone, and problems get resolved more quickly.
It's kinda depressing that after a quarter of a century of work, something in the region of a hundred thousand developer-years, and billions of dollars of investment, the world's most used operating system still fails at basic tasks like reliably writing files or allocating memory.
PostgreSQL will now PANIC on fsync() failure<p><a href="https://wiki.postgresql.org/wiki/Fsync_Errors" rel="nofollow">https://wiki.postgresql.org/wiki/Fsync_Errors</a><p><a href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9ccdd7f66e3324d2b6d3dec282cfa9ff084083f1" rel="nofollow">https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit...</a><p><a href="https://lwn.net/Articles/752063/" rel="nofollow">https://lwn.net/Articles/752063/</a>
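In code terms the new behaviour amounts to "treat any fsync() error as fatal and recover from the WAL instead of retrying". A minimal sketch of that pattern (my own illustration, not PostgreSQL's actual code):<p><pre><code>/* If fsync() reports an error, do not retry: on Linux the error state
 * may already be cleared and the dirty pages dropped, so a retry can
 * falsely report success. Crash instead and let WAL replay recover. */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void checkpoint_fsync(int fd, const char *path)
{
    if (fsync(fd) != 0) {
        fprintf(stderr, "PANIC: fsync of \"%s\" failed: %s\n",
                path, strerror(errno));
        abort();
    }
}
</code></pre>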
Well, at least in this case one can abort (or go into some read-only mode) when fsync() returns failure. With most storage media that is the correct thing to do anyway. Having multiple processes, and having the fsync() error returned to only one of them, is problematic though.<p>I recently found out (through data loss :/) that syncfs() doesn't return an error at all in most cases. It's being worked on ... <a href="https://lkml.org/lkml/2018/6/1/640" rel="nofollow">https://lkml.org/lkml/2018/6/1/640</a><p>It's astonishing that such critical issues are still present in such a widely used piece of software.
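For reference, syncfs() flushes the whole filesystem containing the given file descriptor; the point above is that, before the fix being worked on, a zero return was not proof that writeback had succeeded. A small sketch of the call (the /mnt/data mount point is made up):<p><pre><code>/* Flush everything on the filesystem that holds /mnt/data. Note that
 * on older kernels syncfs() could return 0 even when some dirty pages
 * never reached the disk, which is the problem described above. */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/data", O_RDONLY | O_DIRECTORY);
    if (fd < 0) { perror("open"); return 1; }

    if (syncfs(fd) != 0)
        fprintf(stderr, "syncfs failed: %s\n", strerror(errno));

    close(fd);
    return 0;
}
</code></pre>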
Imagine a juggler, juggling 5 balls at a time. At a set interval, he drops one ball and accepts another ball thrown at him. He handles this very well because there is order and cadence.<p>Now imagine asking him to accept balls thrown randomly, at any interval. He may make it work, but I'd imagine he will stutter a bit.<p>In my experience, any time you interrupt the page cache's normal routine, it stutters everything. I've seen the "sync" command freeze my Ubuntu machine (music player, GUI, etc.).<p>I work on embedded devices, and my employer wanted to reduce the window in which data loss could happen for 30 MB+ files (video capture). It wasn't a supported use case: "But why not! It makes our product theoretically better!" I put my foot down. We aren't touching the page cache until there is a clear benefit to the user. It almost got me fired, but good riddance if so.
If fsync() fails, isn't Valhalla lost anyhow? If you can't write things down because your pencil is broken, it's probably time to stop what you're doing and get a new pencil.<p>If the kernel can't flush dirty write buffers, maybe it's time to send up a flag and panic in the kernel itself?
This is sad. I know that one large user of Linux found this problem in 2009 or so and fixed it for the version of Linux they ran on their fleet of servers. I am surprised it didn't make it upstream back then.
I read the start and the end of the thread but couldn't get an understanding of what the current situation is: did Linux update its fsync behavior? Does pg now panic on Linux on the first fsync failure?
Every computer component can fail in arbitrary ways, including drives.<p>If you’re not robust against that, then when things like fsync fail you’ll lose availability and/or data.<p>Even though Linux’s fsync behavior is clearly broken, it is far from the craziest behavior I’ve seen from the I/O stack.<p>Anyway, the main lesson here is that untested error handling is worse than no error handling. They should have figured out how to test that this path actually proceeds correctly (on real, intermittently failing hardware), or just panicked the process.
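One cheap way to get that error path under test when intermittently failing hardware isn't available is to route fsync() through a wrapper the test suite can force to fail. A sketch (the injection flag is a made-up test hook, not any library's facility):<p><pre><code>/* Wrap fsync() so tests can simulate a write-back failure and check
 * that the caller's error handling actually does the right thing. */
#include <errno.h>
#include <stdbool.h>
#include <unistd.h>

static bool inject_fsync_eio = false;   /* set to true by the test */

static int checked_fsync(int fd)
{
    if (inject_fsync_eio) {
        errno = EIO;                    /* pretend the device failed */
        return -1;
    }
    return fsync(fd);
}
</code></pre>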
Love that FreeBSD is doing things right - and has been for 20 years.<p><a href="https://wiki.postgresql.org/wiki/Fsync_Errors" rel="nofollow">https://wiki.postgresql.org/wiki/Fsync_Errors</a>
2007, Linus rant: <a href="https://lkml.org/lkml/2007/1/10/233" rel="nofollow">https://lkml.org/lkml/2007/1/10/233</a><p><pre><code> The right way to do it is to just not use O_DIRECT.
The whole notion of "direct IO" is totally braindamaged. Just say no.
This is your brain: O
This is your brain on O_DIRECT: .
Any questions?
I should have fought back harder. There really is no valid reason for EVER
using O_DIRECT. You need a buffer whatever IO you do, and it might as well
be the page cache. There are better ways to control the page cache than
play games and think that a page cache isn't necessary.
So don't use O_DIRECT. Use things like madvise() and posix_fadvise()
instead.
</code></pre>
2019, how things are:<p><pre><code> This is your brain: O
This is your brain on O_DIRECT: .
And... this is your brain when cached: ?!
The right way to do it is to just use O_DIRECT.
The whole notion of "kernel IO" is fsync and games. Just say no.</code></pre>
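For what it's worth, the posix_fadvise() route the 2007 quote recommends looks roughly like this: write through the page cache, force it to disk, then tell the kernel the cached pages won't be reused. A sketch (error handling abbreviated):<p><pre><code>/* Write via the page cache, fsync, then drop the cached pages instead
 * of bypassing the cache with O_DIRECT. Note that posix_fadvise()
 * returns an error number directly rather than setting errno. */
#include <fcntl.h>
#include <unistd.h>

int write_and_drop_cache(int fd, const void *buf, size_t len)
{
    if (write(fd, buf, len) != (ssize_t)len)
        return -1;
    if (fsync(fd) != 0)
        return -1;
    return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) == 0 ? 0 : -1;
}
</code></pre>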
This is one reason we chose to use Windows Storage Spaces Direct and transactional NTFS. They really are better.<p><a href="https://docs.microsoft.com/en-us/windows-server/storage/storage-spaces/storage-spaces-direct-overview" rel="nofollow">https://docs.microsoft.com/en-us/windows-server/storage/stor...</a>
So if fsync fails, what is one supposed to do? You can't retry it, and you don't know how much of the file has been synced.<p>The only feasible option is to create a completely new file and retry writing there? And if that fails your disk is probably bust or ejected, which should require user interaction about the new file location. Doesn't seem too unreasonable?<p>This would require you to have the complete file contents elsewhere so you can rewrite them? Or would it still be possible to read the original file's contents from the unflushed buffer? And in the disk ejected+remounted case, the old contents should still be intact thanks to ext4 journaling?
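The "create a completely new file" idea is usually done with the classic write-temp-then-rename pattern: write and fsync a temporary file, rename() it over the old one, then fsync the directory, so a failed sync never leaves a half-updated original. A sketch (paths and error handling simplified):<p><pre><code>/* Replace dst with new contents without ever truncating the original:
 * if anything fails before the rename, the old file is untouched. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int replace_file(const char *dir, const char *tmp, const char *dst,
                 const void *buf, size_t len)
{
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    close(fd);

    if (rename(tmp, dst) != 0) return -1;

    /* Persist the new directory entry so the rename survives a crash. */
    int dfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dfd < 0) return -1;
    int rc = fsync(dfd);
    close(dfd);
    return rc;
}
</code></pre>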
I didn't read the entire thread, so maybe this was answered: has anyone actually made a system that's "fully" correct with regard to file system errors? Most people throw the errors away, but even programs that try to account for them get them wrong on some system (or the system changes behavior out from under them…). Is there a library that does this?