The author notices that Bolt doesn't use mmap for writes. The reason is surprisingly simple once you know how it works. Say you want to overwrite a page at some location that isn't present in memory. You write to it and think that's that. But when this happens the CPU triggers a page fault, and the OS steps in and reads the underlying page into memory. It then hands control back to the application, which carries on overwriting that page.<p>So every write to a page that isn't resident in memory triggers a read first. Bad.<p>Early versions of Varnish Cache struggled with this, and it was the reason they made a malloc-based backend instead. mmaps are great for reads, but you really shouldn't write through them.
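A minimal sketch of the split described above, in Python for brevity: lookups go through a read-only mapping, while writes go through the file descriptor with `os.pwrite`. The names `overwrite_page` and `demo_pwrite` are made up for illustration, and this is not Bolt's actual code. It assumes a Unix system where a shared mapping and `write()` see the same page cache, as on Linux.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE  # page size of the running system


def overwrite_page(path, page_no, payload):
    """Overwrite one whole page via pwrite() and read it back via a read-only map."""
    assert len(payload) == PAGE
    fd = os.open(path, os.O_RDWR)
    try:
        # Read-only shared mapping: used for lookups only, never stored to.
        view = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
        try:
            # The write goes through the descriptor, so no store ever hits the
            # mapping and no write fault is taken on a non-resident page.
            os.pwrite(fd, payload, page_no * PAGE)
            os.fsync(fd)
            # Assumption: mapping and write() share one page cache (true on Linux).
            return view[page_no * PAGE:(page_no + 1) * PAGE]
        finally:
            view.close()
    finally:
        os.close(fd)


def demo_pwrite():
    # Hypothetical helper: build a 4-page scratch file and overwrite page 2.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"\x00" * (4 * PAGE))
        path = f.name
    try:
        return overwrite_page(path, 2, b"x" * PAGE)
    finally:
        os.unlink(path)
```

Because the write covers a whole aligned page, the kernel generally has no reason to fetch the old contents first, which is the read the mmap store would have forced.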
The right answer is that they shouldn't. A database has much more information than the operating system about what, how and when to cache. Therefore the database should handle its own I/O caching using O_DIRECT on Linux or the equivalent on Windows or other Unixes.<p>The article at <a href="https://www.scylladb.com/2017/10/05/io-access-methods-scylla/" rel="nofollow">https://www.scylladb.com/2017/10/05/io-access-methods-scylla...</a> is a bit old (2017) but it explains the trade-offs.
See also: sublime HQ blog about complexities of shipping a desktop application using mmap [1] and corresponding 200+ comment HN thread [2]:<p>> When we implemented the git portion of Sublime Merge, we chose to use mmap for reading git object files. This turned out to be considerably more difficult than we had first thought. Using mmap in desktop applications has some serious caveats [...]<p>> you can rewrite your code to not use memory mapping. Instead of passing around a long lived pointer into a memory mapped file all around the codebase, you can use functions such as pread to copy only the portions of the file that you require into memory. This is less elegant initially than using mmap, but it avoids all the problems you're otherwise going to have.<p>> Through some quick benchmarks for the way Sublime Merge reads git object files, pread was around ⅔ as fast as mmap on linux. In hindsight it's difficult to justify using mmap over pread, but now the beast has been tamed and there's little reason to change any more.<p>[1] <a href="https://www.sublimetext.com/blog/articles/use-mmap-with-care" rel="nofollow">https://www.sublimetext.com/blog/articles/use-mmap-with-care</a>
[2] <a href="https://news.ycombinator.com/item?id=19805675" rel="nofollow">https://news.ycombinator.com/item?id=19805675</a>
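For anyone wondering what the pread approach from the quote looks like in practice, here is a rough sketch in Python. `read_range` and `demo_pread` are hypothetical names, not anything from Sublime Merge. The point is that only the requested bytes are copied into memory the process owns, instead of handing out a long-lived pointer into a mapping.

```python
import os
import tempfile


def read_range(path, offset, length):
    """Copy `length` bytes starting at `offset` into a private bytes object."""
    fd = os.open(path, os.O_RDONLY)
    try:
        chunks = []
        while length > 0:
            chunk = os.pread(fd, length, offset)  # does not move the file offset
            if not chunk:  # reached end of file
                break
            chunks.append(chunk)
            offset += len(chunk)
            length -= len(chunk)
        return b"".join(chunks)
    finally:
        os.close(fd)


def demo_pread():
    # Hypothetical helper: write a small file and read back its middle part.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"header" + b"payload" + b"trailer")
        path = f.name
    try:
        return read_range(path, 6, 7)
    finally:
        os.unlink(path)
```

An I/O failure here shows up as an `OSError` at the call site, whereas with a mapping it typically arrives as a SIGBUS wherever the memory happens to be touched.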
I like mmap and I don't.<p>It is incompatible with non-blocking I/O, since your process will be stopped if it tries to access a part of the file that is not resident in memory -- this isn't a blocking syscall (which you might work around) but rather any attempt to access mapped memory.<p>I like mmap for tasks like seeking into ZIP files, where you can look at the back 1% of the file, then locate and extract one of the subfiles; the trouble is that the really fun case is to do this over the network with HTTP (say to solve Python dependencies, by extracting the metadata from wheel files), in which case this method doesn't work.
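To make the ZIP case concrete, here is a small Python sketch that maps an archive and looks only at its tail to find the end-of-central-directory record. `central_directory_info` and `demo_zip` are illustrative names. It assumes a plain (non-Zip64), single-disk archive and does not guard against the signature bytes appearing inside the archive comment.

```python
import mmap
import os
import struct
import tempfile
import zipfile

EOCD_SIG = b"PK\x05\x06"   # end-of-central-directory signature
EOCD_FMT = "<4sHHHHIIH"    # fixed 22-byte part of the record


def central_directory_info(path):
    """Return (entry_count, cd_offset, cd_size) by touching only the file's tail."""
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        # The record lives in the last 22 + 65535 bytes (comment is at most 64 KiB),
        # so only those pages need to be faulted in.
        start = max(0, len(m) - (22 + 0xFFFF))
        pos = m.rfind(EOCD_SIG, start)
        if pos < 0:
            raise ValueError("no end-of-central-directory record found")
        (_sig, _disk, _cd_disk, _n_disk, n_total,
         cd_size, cd_offset, _comment_len) = struct.unpack_from(EOCD_FMT, m, pos)
        return n_total, cd_offset, cd_size


def demo_zip():
    # Hypothetical helper: build a three-entry archive and count its entries.
    with tempfile.NamedTemporaryFile(suffix=".zip", delete=False) as f:
        path = f.name
    try:
        with zipfile.ZipFile(path, "w") as z:
            for name in ("a.txt", "b.txt", "METADATA"):
                z.writestr(name, "hello " + name)
        return central_directory_info(path)[0]
    finally:
        os.unlink(path)
```

From `cd_offset` and `cd_size` you can then walk the central directory and jump straight to the one member you want.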
Thanks for diving into this DB! I find it interesting that many databases share such similar architectural principles. NIH. It's super fun to build a database so why not.<p>Also, don't beat yourself up over how deep you'll be diving into the design. Why apologize for this? Those that want a deep exposition would quickly move on.
This is one area where Rust, a modern systems language, has disappointed me. You can't allocate data structures inside mmap'ed areas and expect them to work when you load them again (i.e., the mmap'ed area's base address might have changed). I hope that future languages take this use case into account.
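The usual workaround, in any language, is to store offsets relative to the start of the mapping rather than absolute pointers, so the structure stays valid wherever the file lands in the address space. A rough Python sketch of that idea follows. The node layout and the names `build_list` and `read_list` are invented for illustration.

```python
import mmap
import os
import struct
import tempfile

HEAD = struct.Struct("<q")    # file header: offset of the first node (0 = empty)
NODE = struct.Struct("<qq")   # node: (value, offset of next node; 0 = end)


def build_list(path, values):
    """Write a singly linked list into a file-backed mapping using offsets only."""
    size = HEAD.size + NODE.size * len(values)
    with open(path, "wb") as f:
        f.truncate(size)
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), size) as m:
        offset = HEAD.size
        HEAD.pack_into(m, 0, offset if values else 0)
        for i, value in enumerate(values):
            nxt = offset + NODE.size if i + 1 < len(values) else 0
            NODE.pack_into(m, offset, value, nxt)
            offset += NODE.size
        m.flush()


def read_list(path):
    """Map the file again, at whatever address the OS picks, and walk the list."""
    out = []
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        (offset,) = HEAD.unpack_from(m, 0)
        while offset:
            value, offset = NODE.unpack_from(m, offset)
            out.append(value)
    return out


def demo_offsets():
    # Hypothetical helper: round-trip a short list through a scratch file.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        path = f.name
    try:
        build_list(path, [10, 20, 30])
        return read_list(path)
    finally:
        os.unlink(path)
```

Nothing in the file depends on the base address, which is exactly the property that ordinary pointer-based structures lack.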
Very strange to see few to no references to io_uring here. I guess it's still too new. As I've seen many times before, so much complexity is replicated in userspace to reproduce kernel behavior out of mmap or DIO/AIO, in order to break the latency, caching, and prioritization into a micromanaged state tuned for a narrow set of applications... and is then applied to database code used in a myriad of applications which violate those assumptions and have their own needs. io_uring can't take over fast enough.
mmap is not as free as people think. The VM subsystem is full of inefficient locks. Here is a very good writeup on a problem the BBC encountered with Varnish: <a href="https://www.bbc.co.uk/blogs/internet/entries/17d22fb8-cea2-49d5-be14-86e7a1dcde04" rel="nofollow">https://www.bbc.co.uk/blogs/internet/entries/17d22fb8-cea2-4...</a>
I found some of the 'sizing' snippets in the example came across as disingenuous: if you KNOW the size of the file, mmap it initially using that without the looping overhead. And you presumably know how much memory you have on a given system.
The description (at least as I read the article) implies Bolt is a truly naive implementation of a key/value DB.