The diagram is a bit misleading. There is no system call (switching to the kernel) on read or write hits, assuming the memory is already mapped, has a page table entry, and is resident. You only incur switches when performing the mapping and on the various page fault paths (lazy mapping, non-resident data, copy-on-write).<p><i>> This usually happens when the ratio of storage size to RAM size is significantly higher than 1:1. Every page that is brought into cache causes another page to be evicted.</i><p>While true, you can optimize around this by unmapping larger ranges in bulk (to reduce cache pressure) or prefetching them (to reduce blocking) with madvise, letting the kernel do the loading asynchronously while you're still working on the previously prefetched ranges.<p>If you know your read and write patterns well, you can effectively use this as nearly asynchronous IO without the pains of AIO and with few to no extra threads.
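A minimal sketch of that prefetch/drop pattern, assuming a large read-only file mapping; the window size and file name are just placeholders:
<pre><code>
/* Sketch: stream over a large mmap'ed file, hinting the next window
 * with MADV_WILLNEED and dropping the previous one with MADV_DONTNEED.
 * Window size, file name, and error handling are illustrative. */
#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;sys/mman.h&gt;
#include &lt;sys/stat.h&gt;
#include &lt;unistd.h&gt;

#define WINDOW (64UL << 20)   /* 64 MiB chunks; a multiple of the page size */

int main(void)
{
    int fd = open("big.data", O_RDONLY);          /* placeholder file name */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }
    size_t len = st.st_size;

    char *base = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    volatile char sink = 0;
    for (size_t off = 0; off < len; off += WINDOW) {
        size_t chunk = len - off < WINDOW ? len - off : WINDOW;
        size_t next  = off + WINDOW;

        /* Hint: start reading the next window in the background. */
        if (next < len)
            madvise(base + next, len - next < WINDOW ? len - next : WINDOW,
                    MADV_WILLNEED);

        /* Stand-in for real processing of base[off .. off+chunk). */
        for (size_t i = 0; i < chunk; i += 4096)
            sink ^= base[off + i];

        /* Hint: done with the previous window; drop our mapping of it
           so the kernel can reclaim those pages under pressure. */
        if (off >= WINDOW)
            madvise(base + off - WINDOW, WINDOW, MADV_DONTNEED);
    }

    munmap(base, len);
    close(fd);
    return 0;
}
</code></pre>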
Very nice job, I like the diagrams and the description.<p>Noticed this bit "block size which is typically 512 or 4096 bytes" and was wondering how the application would know how to align. Does it query the file descriptor for the block size? Is there an ioctl call for that?<p>When it comes to IO it's also possible to differentiate between blocking/non-blocking and synchronous/asynchronous, and those categories are orthogonal in general.<p>So there is blocking synchronous: read, write, readv, writev. The calling thread blocks until data is ready and while it gets copied to user memory.<p>Non-blocking synchronous: using non-blocking file descriptors with select/poll/epoll/kqueue. Checking when data is ready is done asynchronously, but then read/write still happens inline and the thread waits for the data to be copied to user space. This works for socket IO on Linux but not for disk.<p>Non-blocking asynchronous: using AIO on Linux. Here both waiting until data is ready to be transferred and the transfer itself happen asynchronously. aio_read returns right away, before the read has finished; then you have to use aio_error to check its status. This works for disk but not socket IO on Linux.<p>Blocking asynchronous: nothing here.
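A minimal sketch of the non-blocking asynchronous case with POSIX AIO; the file name and buffer size are just placeholders, and note that on glibc this interface is serviced by a user-space thread pool rather than a true kernel async path:
<pre><code>
/* Sketch: non-blocking asynchronous read with POSIX AIO.
 * aio_read() returns immediately; aio_error() reports EINPROGRESS
 * until the transfer completes, then aio_return() yields the byte
 * count. Link with -lrt on glibc. */
#include &lt;aio.h&gt;
#include &lt;errno.h&gt;
#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;unistd.h&gt;

int main(void)
{
    int fd = open("some.file", O_RDONLY);   /* placeholder file name */
    if (fd < 0) { perror("open"); return 1; }

    static char buf[4096];
    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;

    if (aio_read(&cb) != 0) { perror("aio_read"); return 1; }

    /* Do other work here; the read proceeds in the background. */
    while (aio_error(&cb) == EINPROGRESS)
        usleep(1000);                 /* or check between other tasks */

    ssize_t n = aio_return(&cb);      /* completed: fetch the result */
    if (n < 0) perror("aio_read (completion)");
    else printf("read %zd bytes asynchronously\n", n);

    close(fd);
    return 0;
}
</code></pre>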
> The great advantage of letting the kernel control caching is that great effort has been invested by the kernel developers over many decades into tuning the algorithms used by the cache.<p>Some other advantages:<p>The kernel has a global view of what is going on with all the different applications running on the system, whereas your application only knows about itself.<p>The cache can be shared amongst different applications.<p>You can restart applications and the cache will stay warm.
Nice article. I agree with the strategy. No matter how clever the OS is in general, it cannot be more clever than a well-designed db kernel.<p>A little off topic, but I have been waiting over a decade for Linux to merge all async waiting into one system call.<p>Wouldn't it be nice if there were a kqueue system call in POSIX? It would then force Linux to finally implement it.
The author's name, Avi Kivity, sounded familiar. Turns out he created KVM, the Linux kernel hypervisor.<p><a href="https://il.linkedin.com/in/avikivity" rel="nofollow">https://il.linkedin.com/in/avikivity</a>
Great article for a noob like me. Any suggestions for similar articles that give a good high-level overview of kernel internals, specifically IO topics like what happens when an application requests a file, right from issuing the system call down to the disk interactions?<p>I was always confused, and still am, about the different caching layers involved in an IO operation.
There's also a class of functions that act like read/write but do the transfer entirely in the kernel, without the data passing through userspace (sendfile/(vm)splice/copy_file_range).
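For illustration, a minimal sketch of an in-kernel copy using copy_file_range; the file names are placeholders, and sendfile follows a similar pattern with an out_fd/in_fd pair:
<pre><code>
/* Sketch: in-kernel file copy with copy_file_range(2); the data never
 * passes through a user-space buffer. Needs Linux 4.5+ and glibc 2.27+. */
#define _GNU_SOURCE
#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;sys/stat.h&gt;
#include &lt;unistd.h&gt;

int main(void)
{
    int in  = open("src.dat", O_RDONLY);                       /* placeholders */
    int out = open("dst.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (in < 0 || out < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(in, &st) < 0) { perror("fstat"); return 1; }
    off_t remaining = st.st_size;

    while (remaining > 0) {
        ssize_t n = copy_file_range(in, NULL, out, NULL, remaining, 0);
        if (n <= 0) { perror("copy_file_range"); return 1; }
        remaining -= n;   /* file offsets advance automatically when NULL */
    }

    close(in);
    close(out);
    return 0;
}
</code></pre>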
They seem to target the situation where the storage-to-memory ratio is very high.<p>But this raises the question: isn't this ratio decreasing with RAM getting cheaper and cheaper these days?<p>(I've seen a lot of systems already moving to completely in-memory databases, which takes this to the extreme, so this is already a reality.)
What are the typical use cases for direct I/O? I've never stumbled upon anything that used it. Intuitively, I would assume specialized logging applications or databases, but even these seem to use other mechanisms.