I think part of the problem is that fsync() is an insufficient interface. Most of the time, you want two things:
* write ordering ("write B must hit disk after write A")
* notification ("let me know when write A is on disk")
In particular, you often *don't* want to force I/O to happen immediately, since for performance reasons it's better to let the kernel buffer as much as it wants. In other words, what you want should be nearly free, but instead you have to do a very expensive operation.

As an example of the notification case: suppose I have a temporary file with data that is being journaled into a data store. The operation I want to do is:
1. Apply changes to the store
2. Wait until all of those writes hit disk
3. Delete the temporary file
I don't care if step 2 takes 5 minutes, nor do I want the kernel to schedule my writes in any particular way. But if you implement step 2 as an fsync() (or fdatasync()) you have a potentially huge impact on I/O throughput. I've seen these frequent fsync()s cause 50x performance drops!
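For what it's worth, here is a minimal sketch (my own; the file names and the apply_changes() step are hypothetical, error handling abbreviated) of how step 2 has to be done today. The point is that the only portable primitive, fdatasync(), forces the I/O to happen right now instead of just telling you when it's done:

    /* Minimal sketch of the temp-file journaling pattern described above,
     * using fdatasync() for step 2. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int store_fd = open("store.db", O_WRONLY);   /* hypothetical data store */
        if (store_fd < 0) { perror("open"); return 1; }

        /* Step 1: apply the journaled changes to the store (writes are buffered). */
        /* apply_changes(store_fd); */

        /* Step 2: wait until those writes are on disk. fdatasync() is the only
         * portable way to do this, but it forces the I/O to happen *now*,
         * which is exactly the expensive behaviour complained about above. */
        if (fdatasync(store_fd) != 0) { perror("fdatasync"); return 1; }

        /* Step 3: only now is it safe to delete the temporary journal file. */
        unlink("journal.tmp");
        close(store_fd);
        return 0;
    }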
It always amazes me that after all these years, Linux still hasn't fixed this.

In my experience, any program that overloads I/O will make the system grind to a halt on Linux. Any notion of graceful degradation is gone and your system just thrashes for a while.

My theory about this has always been that any I/O related to page faults is starved, which means that every process spends its time slice just trying to swap in its program pages (and evicting other programs from the cache, ensuring that the thrashing will continue).

I've never gotten hard data to prove this, and part of me laments that SSDs are "fast enough" that this may never actually get fixed.

Can anyone who knows more about this comment? It seems like a good rule inside Linux would be never to evict pages that are mapped executable if you can help it.

Has anyone experimented with ionice or iotop? http://www.electricmonk.nl/log/2012/07/30/setting-io-priorities-on-linux/
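For whoever tries ionice: it just wraps the ioprio_set() system call, so you can do the same thing from inside a program. A rough sketch, with the constants copied from the kernel's ioprio.h (an assumption on my part; glibc has no wrapper, so the call goes through syscall()):

    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define IOPRIO_CLASS_IDLE  3                 /* only gets disk time when idle */
    #define IOPRIO_WHO_PROCESS 1
    #define IOPRIO_PRIO_VALUE(cls, data) (((cls) << 13) | (data))

    int main(void)
    {
        /* Drop this process (pid 0 = self) to the idle I/O class, roughly
         * what "ionice -c 3" does. */
        if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                    IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0)) != 0) {
            perror("ioprio_set");
            return 1;
        }
        /* ... do bulk I/O here without starving interactive processes ... */
        return 0;
    }

Note this only affects how the CFQ scheduler orders requests; it doesn't help if the problem really is page-fault I/O being starved.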
Interesting couple of related articles / rants by Jeff Darcy:

http://pl.atyp.us/2013-08-local-filesystems-suck.html

http://pl.atyp.us/2013-11-fixing-fsync.html
Good to see this summit involving the kernel developers, since their past situation sounds rather bleak interaction-wise: they were using a kernel version from 2009 and hadn't tested the improvements in the (2012) 3.2 kernel.

BTW, Linux provides the direct I/O interface O_DIRECT, which allows apps to bypass the kernel's caching business altogether. This is also discussed in Mel Gorman's message that this blog post borrows from.
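A rough sketch of what using O_DIRECT looks like, assuming a 4096-byte alignment and a made-up file name: the buffer, file offset, and transfer size all have to be aligned (typically to the logical block size), and the page cache is bypassed entirely:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        void *buf;
        if (posix_memalign(&buf, 4096, 4096) != 0) return 1;   /* aligned buffer */

        int fd = open("data.bin", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        /* Reads go straight to the device; the kernel does no caching or
         * readahead, so the application owns all buffering decisions. */
        ssize_t n = pread(fd, buf, 4096, 0);
        if (n < 0) perror("pread");

        close(fd);
        free(buf);
        return 0;
    }

The flip side is that once you opt out of the page cache, you have to provide your own readahead and write scheduling, which is exactly the work the database discussion below is about.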
I think this is a very important area of improvement for Linux. While we call it "multitasking", there are a lot of situations where one might doubt it deserves that title.

I have been experimenting with very low-cost computing setups that optimize for robustness, and that led me to pretty slow disk I/O. While that's not a typical scenario for desktop computing, it can and should be possible with the limited but sane resources I ended up with. In practice, however, certain loads freeze the whole system until a single, usually non-urgent, write finishes. Basically the whole throughput is used for a big write, and then X (and others) freeze because they are waiting for the filesystem (probably just a stat() and similar).

There are differences between applications. Some "behave" worse than others. Some even manage to choke themselves (ever seen GIMP take over an hour to write 4 MB to an NFS RAID with 128 kB/s throughput?).

I guess this is a hard problem, but I would wish for an OS to never stall under load. Even slowing down exponentially is better than halting other tasks. Ideally the system would be smart and deprioritize long-running tasks so that small, presumably urgent, tasks are impacted as little as possible.
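One user-space workaround I've seen for the "one big write stalls everything" case is to force writeback in bounded chunks with the Linux-specific sync_file_range() instead of letting dirty pages pile up. A sketch, with an arbitrary 8 MB chunk size, a placeholder file name, and zeroed data standing in for real output:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define CHUNK (8 * 1024 * 1024)

    int main(void)
    {
        int fd = open("big-output.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        char *buf = malloc(CHUNK);
        memset(buf, 0, CHUNK);                 /* stand-in for real data */

        off_t off = 0, prev = -1;
        for (int i = 0; i < 64; i++) {         /* ~512 MB total */
            if (write(fd, buf, CHUNK) != CHUNK) { perror("write"); break; }

            /* Start asynchronous writeback of the chunk just written... */
            sync_file_range(fd, off, CHUNK, SYNC_FILE_RANGE_WRITE);

            /* ...and wait for the previous chunk, so at most ~2 chunks of
             * dirty data are ever outstanding. */
            if (prev >= 0)
                sync_file_range(fd, prev, CHUNK,
                                SYNC_FILE_RANGE_WAIT_BEFORE |
                                SYNC_FILE_RANGE_WRITE |
                                SYNC_FILE_RANGE_WAIT_AFTER);
            prev = off;
            off += CHUNK;
        }
        free(buf);
        close(fd);
        return 0;
    }

It helps the writer behave, but of course it doesn't fix the underlying problem that one misbehaving program shouldn't be able to freeze everyone else in the first place.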
Re Mel Gorman's details in http://article.gmane.org/gmane.linux.kernel/1663694

I don't understand why the PostgreSQL people don't want to write their own I/O scheduler and buffer management. It's not that hard to implement (even a multithreaded I/O + buffer manager is not really complicated), and there are major advantages:

- you become truly platform-independent instead of relying on the particulars of some kernel [the only thing you need from the OS is some form of O_DIRECT; it exists also on Win32]

- you have total control over buffer memory allocation and I/O scheduling

- whatever scheduling and buffer management policy you're using, you can more easily adapt it to SSDs and other storage types, which are still in their infancy (e.g., memristors) [thus not depending on the kernel developers' goodwill]

I mean, really: these people have implemented an RDBMS with a bunch of extensions to standard SQL, and an I/O + buffer management layer is suddenly complicated, or [quote from the link]: "While some database vendors have this option, the Postgres community do not have the resources to implement something of this magnitude."

This smells more like politics than a technical issue.
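To illustrate what "write your own buffer management on top of O_DIRECT" means, here is a toy sketch (mine, not anything from PostgreSQL): a fixed pool of aligned page buffers with a trivially simple direct-mapped lookup standing in for a real replacement policy (LRU, clock, ARC, ...):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define PAGE_SZ    8192
    #define POOL_SLOTS 128

    struct slot { long pageno; char *buf; int valid; };
    static struct slot pool[POOL_SLOTS];

    static void pool_init(void)
    {
        for (int i = 0; i < POOL_SLOTS; i++) {
            posix_memalign((void **)&pool[i].buf, 4096, PAGE_SZ);  /* O_DIRECT needs alignment */
            pool[i].valid = 0;
        }
    }

    /* Return a cached copy of page 'pageno', reading it with direct I/O on a
     * miss. Eviction is direct-mapped: page N always lives in slot N % POOL_SLOTS. */
    static char *get_page(int fd, long pageno)
    {
        struct slot *s = &pool[pageno % POOL_SLOTS];
        if (s->valid && s->pageno == pageno)
            return s->buf;                                  /* cache hit */

        if (pread(fd, s->buf, PAGE_SZ, pageno * (off_t)PAGE_SZ) != PAGE_SZ)
            return NULL;                                    /* read error / short read */
        s->pageno = pageno;
        s->valid = 1;
        return s->buf;
    }

    int main(void)
    {
        pool_init();
        int fd = open("table.dat", O_RDONLY | O_DIRECT);    /* hypothetical data file */
        if (fd < 0) return 1;
        char *p = get_page(fd, 42);
        (void)p;
        close(fd);
        return 0;
    }

Obviously the hard part is the policy (write ordering, dirty-page flushing, crash consistency), not the mechanism, which is presumably what the "resources" quote is really about.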
Here is a mail from Mel Gorman with many more details:

http://mid.gmane.org/%3C20140310101537.GC10663%40suse.de%3E
Ditch Linux and port PgSQL to run on top of raw Xen interfaces. You get to control your own buffering, worker thread scheduling, and talk directly to the (virtual) disk driver. I believe it'd be a win.
Oh yes, that annoying problem (especially for MongoDB) that data should eventually be committed to disk.)

Informix (and PostgreSQL) allows the DBA to choose checkpoint/vacuum intervals.

The rule of thumb, unless you are a Mongo fan, is that checkpoints should be performed often enough that they don't take too long, which depends only on the actual insert/update data flow.

But any real DBA will tell you the same: sync quickly, sync often, so the server will run smoothly, though not "at web scale", and the pain of recovery will be less severe.)