This is a pretty old argument and IMO it's far out of date.<p>Taking full control of your I/O and buffer management is great if (a) your developers are all smart and experienced enough to be kernel programmers and (b) your DBMS is the only process running on the machine. In practice, (a) is never true, and (b) is no longer true because everyone is running apps inside containers inside shared VMs. In the modern application/server environment, no user-level process has accurate information about the total state of the machine; only the kernel (or hypervisor) does, and it's an exercise in futility to try to manage paging, etc., at the user level.<p>As Dr. Michael Stonebraker put it: The Traditional RDBMS Wisdom is (Almost Certainly) All Wrong. <a href="https://slideshot.epfl.ch/play/suri_stonebraker" rel="nofollow noreferrer">https://slideshot.epfl.ch/play/suri_stonebraker</a> (See the slide at 21:25 into the video). Modern DBMSs spend 96% of their time managing buffers and locks, and only 4% doing actual useful work for the caller.<p>Granted, even using mmap you still need to know wtf you're doing. MongoDB's original mmap backing store was a poster child for Doing It Wrong, getting all of the reliability problems and none of the performance benefits. LMDB is an example of doing it right: crash-proof reliability, linear read scalability across arbitrarily many CPUs with zero-copy reads and no wasted effort, and a hot code path that fits in a CPU's 32 KB L1 instruction cache.
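For a sense of what that zero-copy read path looks like in practice, here is a minimal sketch against the LMDB C API; the "./db" path and "hello" key are placeholders for illustration, not anything from the comment above.

    /* Minimal LMDB read sketch: mdb_get() returns a pointer directly into the
     * memory-mapped file, so no copy is made.  Link with -llmdb.
     * "./db" and "hello" are placeholders. */
    #include <lmdb.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        MDB_env *env;
        MDB_txn *txn;
        MDB_dbi dbi;
        MDB_val key, data;

        mdb_env_create(&env);
        mdb_env_open(env, "./db", MDB_RDONLY, 0664);   /* maps the file once */
        mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);    /* read-only transaction */
        mdb_dbi_open(txn, NULL, 0, &dbi);

        key.mv_size = strlen("hello");
        key.mv_data = "hello";
        if (mdb_get(txn, dbi, &key, &data) == 0)       /* data.mv_data points into the map */
            printf("%.*s\n", (int)data.mv_size, (char *)data.mv_data);

        mdb_txn_abort(txn);                            /* read-only txns are simply aborted */
        mdb_env_close(env);
        return 0;
    }

The value returned by mdb_get() is a pointer into the mapping itself, which is where the zero-copy claim comes from.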
Another interesting limitation of mmap() is that real-world storage volumes can exceed the virtual address space a CPU can actually map. A 64-bit CPU has 64-bit pointers but typically implements nowhere near 64 address bits, virtually or physically (48 virtual bits, i.e. 256 TiB, is common on x86-64). A normal buffer pool does not have this limitation. You can get EC2 instances on AWS with more direct-attached storage than the local microarchitecture can address virtually.
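If you want to see the real limits on a given box, CPUID leaf 0x80000008 reports the implemented physical and linear (virtual) address widths. A quick check, assuming GCC/Clang on an x86-64 host:

    /* Query the CPU's physical and virtual (linear) address widths.
     * CPUID leaf 0x80000008 returns physical bits in EAX[7:0] and
     * linear bits in EAX[15:8]. */
    #include <cpuid.h>
    #include <stdio.h>

    int main(void) {
        unsigned int eax, ebx, ecx, edx;
        if (!__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx)) {
            fprintf(stderr, "CPUID leaf 0x80000008 not supported\n");
            return 1;
        }
        unsigned int phys_bits = eax & 0xff;
        unsigned int virt_bits = (eax >> 8) & 0xff;
        printf("physical: %u bits (%.0f TiB), virtual: %u bits (%.0f TiB)\n",
               phys_bits, (double)(1ULL << phys_bits) / (1ULL << 40),
               virt_bits, (double)(1ULL << virt_bits) / (1ULL << 40));
        return 0;
    }

On a typical x86-64 server this prints something like 46 bits physical / 48 bits virtual, i.e. 256 TiB of virtual space, and Linux typically hands only half of that to user space.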
Not just databases - we ran into the same issues when we needed a high-performance caching HTTP reverse proxy for a research project. We were going to drop in Varnish, which is mmap-based, but performance was so poor that we had to write our own.<p>Note that Varnish dates to 2006, in the days of hard disk drives, SCSI, and 2-core server CPUs. Mmap might well have been as good as, or even better than, explicit I/O back then - a lot of the issues discussed in this paper (TLB shootdown overhead, a single flush thread) get much worse as the core count increases.
Related:<p><i>Are You Sure You Want to Use MMAP in Your Database Management System? [pdf]</i> - <a href="https://news.ycombinator.com/item?id=31504052">https://news.ycombinator.com/item?id=31504052</a> - May 2022 (43 comments)<p><i>Are you sure you want to use MMAP in your database management system? [pdf]</i> - <a href="https://news.ycombinator.com/item?id=29936104">https://news.ycombinator.com/item?id=29936104</a> - Jan 2022 (127 comments)
Many general-purpose OS abstractions start to leak when you're building systems-level software.<p>You notice it when web servers do kernel bypass for zero-copy, low-latency networking, or when database engines throw away the kernel's page cache to implement their own file buffering.
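The page-cache half of that usually means O_DIRECT: read straight from the device into an aligned buffer the application manages itself. A minimal sketch; the file name and the 4 KiB block size are placeholders, and O_DIRECT's alignment requirements vary by filesystem and device.

    /* Bypass the kernel page cache with O_DIRECT and read into an aligned,
     * application-managed buffer -- the core move of a DIY buffer pool. */
    #define _GNU_SOURCE            /* for O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLOCK 4096             /* placeholder; must match the device's logical block size */

    int main(void) {
        int fd = open("data.db", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        if (posix_memalign(&buf, BLOCK, BLOCK) != 0) return 1;  /* O_DIRECT needs aligned memory */

        ssize_t n = pread(fd, buf, BLOCK, 0);                   /* offset and length must also be aligned */
        if (n < 0) perror("pread");
        else printf("read %zd bytes, cached by us, not the kernel\n", n);

        free(buf);
        close(fd);
        return 0;
    }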
It sounds like a lot of the performance issues are TLB-related. Am I right in thinking huge pages would help here? If so, it's a bit unfortunate they didn't test this in the paper.<p>Edit: Hm, it might not be possible to mmap regular files with huge pages. This LWN article[1] from five years ago talks about the work that would be required, but I haven't seen any follow-ups.<p>[1]: <a href="https://lwn.net/Articles/718102/" rel="nofollow noreferrer">https://lwn.net/Articles/718102/</a>
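For what it's worth, huge pages are straightforward for anonymous memory (i.e. a buffer pool you manage yourself); it's regular file-backed mappings that are the hard case the LWN article covers. A hedged sketch, assuming huge pages have been reserved via /proc/sys/vm/nr_hugepages; the 64 MiB size is a placeholder.

    /* Huge pages are easy to get for anonymous memory (a DIY buffer pool),
     * but not for regular file-backed mmap()s. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    #define POOL_SIZE (64UL << 20)   /* 64 MiB buffer pool -- placeholder size */

    int main(void) {
        /* Explicit 2 MiB huge pages for an anonymous mapping. */
        void *pool = mmap(NULL, POOL_SIZE, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (pool == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");   /* fails if no huge pages are reserved */
            /* Fallback: normal mapping plus a transparent-huge-page hint. */
            pool = mmap(NULL, POOL_SIZE, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (pool == MAP_FAILED) { perror("mmap"); return 1; }
            madvise(pool, POOL_SIZE, MADV_HUGEPAGE);   /* hint only; kernel may ignore it */
        }
        printf("buffer pool at %p\n", pool);
        munmap(pool, POOL_SIZE);
        return 0;
    }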
Memory-mapped files = hard faults when a disk read fails: SIGBUS on POSIX, an in-page I/O error structured exception on Windows. If you're not prepared to handle those, don't use memory-mapped files. (They land on an ordinary memory access, much like dereferencing a null pointer, rather than as an error return from a read call.)<p>Then there's the delayed-write problem. Be prepared for blocks not necessarily reaching disk in the order they were written, and perhaps ten seconds after the fact. This can turn a power failure into an inconsistent database.
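On Linux that failure arrives as SIGBUS at whatever instruction touched the page. A crude sketch of surviving it with sigsetjmp/siglongjmp; data.db is a placeholder, and a real system would need per-thread jump buffers and careful signal-safety.

    /* A failed page-in of a memory-mapped file is delivered as SIGBUS at the
     * access site.  Bracket mapped accesses with sigsetjmp/siglongjmp to
     * survive it.  Sketch only. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <setjmp.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static sigjmp_buf jump_buf;

    static void on_sigbus(int sig) {
        (void)sig;
        siglongjmp(jump_buf, 1);          /* unwind back to the access site */
    }

    int main(void) {
        struct sigaction sa = {0};
        sa.sa_handler = on_sigbus;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGBUS, &sa, NULL);

        int fd = open("data.db", O_RDONLY);            /* placeholder file */
        if (fd < 0) { perror("open"); return 1; }
        char *p = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        if (sigsetjmp(jump_buf, 1) == 0)
            printf("first byte: %d\n", p[0]);          /* may fault if the page-in fails */
        else
            fprintf(stderr, "I/O error surfaced as SIGBUS during mapped read\n");

        munmap(p, 4096);
        close(fd);
        return 0;
    }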
The TL;DR is that mmap sort of does what you want, but DBMSes need more control over how and when data is paged in and out of memory. Without that extra control, you run into problems with both transactional safety and performance.
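For context, the control mmap does give you is roughly madvise() hints plus msync() on a range, and that isn't enough: msync(MS_SYNC) forces a given range out, but the kernel may still write back any other dirty page whenever it likes, which is what breaks log-before-data ordering. A minimal sketch of those calls; the path and sizes are placeholders.

    /* The control mmap does offer: madvise() hints and msync() on a range.
     * Note that dirty pages outside the msync'd range can still be written
     * back at any time on the kernel's schedule. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define PAGE 4096

    int main(void) {
        int fd = open("data.db", O_RDWR);              /* placeholder file */
        if (fd < 0) { perror("open"); return 1; }

        char *map = mmap(NULL, PAGE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return 1; }

        madvise(map, PAGE, MADV_SEQUENTIAL);           /* a hint, not a guarantee */

        memcpy(map, "dirty page", 10);                 /* dirty the page in place */

        /* Force just this range to stable storage now. */
        if (msync(map, PAGE, MS_SYNC) != 0) perror("msync");

        munmap(map, PAGE);
        close(fd);
        return 0;
    }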
For all of its usefulness in the good old days of rusty disks, I wonder whether virtual memory is worth having at all for dedicated databases, caches, and storage heads. Avoiding TLB flushes entirely sounds like a huge win for massively multithreaded software, and memory management in a large shared flat address space doesn't sound impossibly hard.
I've become convinced that there are very few, if any, reasons to MMAP a file on disk. It seems to simplify things in the common case, but in the end it adds a massive amount of unnecessary complexity.
A well-written bespoke function can beat a generalized one at a specific task.<p>If you have the resources to write and maintain the bespoke version, great - the large database developers probably do. For everyone else, please don't take this link and go around claiming mmap is bad; that gets tiresome and is misguided. Mmap is a shortcut for accessing large files in a non-linear fashion, and it's good at that. Just not as good as a bespoke implementation.
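That shortcut is real, to be fair: random access to a big file through a single pointer, with no read()/lseek() bookkeeping. A minimal sketch; big.dat and the offsets are placeholders.

    /* The convenience case for mmap: treat a big file as an array and poke at
     * arbitrary offsets. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("big.dat", O_RDONLY);            /* placeholder file */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

        const unsigned char *data =
            mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED) { perror("mmap"); return 1; }

        /* Non-linear access: the kernel pages in whatever we happen to touch. */
        off_t offsets[] = { 0, st.st_size / 2, st.st_size - 1 };
        for (int i = 0; i < 3; i++)
            printf("byte at %lld: %u\n", (long long)offsets[i], data[offsets[i]]);

        munmap((void *)data, st.st_size);
        close(fd);
        return 0;
    }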