But how, exactly, do databases use mmap?

208 points by brunoac over 4 years ago

13 comments

perbu over 4 years ago

The author notices that Bolt doesn't use mmap for writes. The reason is surprisingly simple, once you know how it works. Say you want to overwrite a page at some location that isn't present in memory. You'd write to it and you'd think that is that. But when this happens the CPU triggers a page fault, and the OS steps in and reads the underlying page into memory. It then relinquishes control back to the application. The application then continues to overwrite that page.

So for each write to a page that isn't already resident in memory, you'll trigger a read. Bad.

Early versions of Varnish Cache struggled with this, and this was the reason they made a malloc-based backend instead. mmaps are great for reads, but you really shouldn't write through them.
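
To make the difference concrete, here is a minimal Python sketch (Unix only) of the two ways to overwrite a whole page. The function names and the page-sized, page-aligned write are assumptions made for illustration; nothing here is taken from Bolt or Varnish. With the mapping, the store into a non-resident page generally has to fault the old contents in first. With `os.pwrite`, the kernel is handed the complete new page in one call, so it is typically able to skip reading the old one.

```python
import mmap
import os

PAGE = mmap.PAGESIZE  # page size of the running system


def overwrite_page_mmap(path, page_no, data):
    """Overwrite one page through a shared, writable mapping.

    If the target page is not resident, the first store faults it in,
    which means the old contents are read from disk before being replaced.
    """
    assert len(data) == PAGE
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as mm:  # length 0 maps the whole file
            start = page_no * PAGE
            mm[start:start + PAGE] = data
            mm.flush()  # write the dirty page back to the file


def overwrite_page_pwrite(path, page_no, data):
    """Overwrite one page with an explicit positioned write."""
    assert len(data) == PAGE
    fd = os.open(path, os.O_WRONLY)
    try:
        written = os.pwrite(fd, data, page_no * PAGE)
        assert written == PAGE  # a sketch: real code would loop on short writes
    finally:
        os.close(fd)
```

Both functions leave the file in the same state; the difference is only in how much I/O the kernel may have to do behind the scenes.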

bonzini over 4 years ago

The right answer is that they shouldn't. A database has much more information than the operating system about what, how and when to cache information. Therefore the database should handle its own I/O caching using O_DIRECT on Linux or the equivalent on Windows or other Unixes.

The article at https://www.scylladb.com/2017/10/05/io-access-methods-scylla/ is a bit old (2017) but it explains the trade-offs.

shoo over 4 years ago

See also: the Sublime HQ blog post about the complexities of shipping a desktop application using mmap [1] and the corresponding 200+ comment HN thread [2]:

> When we implemented the git portion of Sublime Merge, we chose to use mmap for reading git object files. This turned out to be considerably more difficult than we had first thought. Using mmap in desktop applications has some serious caveats [...]

> you can rewrite your code to not use memory mapping. Instead of passing around a long lived pointer into a memory mapped file all around the codebase, you can use functions such as pread to copy only the portions of the file that you require into memory. This is less elegant initially than using mmap, but it avoids all the problems you're otherwise going to have.

> Through some quick benchmarks for the way Sublime Merge reads git object files, pread was around ⅔ as fast as mmap on linux. In hindsight it's difficult to justify using mmap over pread, but now the beast has been tamed and there's little reason to change any more.

[1] https://www.sublimetext.com/blog/articles/use-mmap-with-care

[2] https://news.ycombinator.com/item?id=19805675
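
The pread approach from the quote can be sketched in a few lines of Python (Unix only). The helper name `read_exact` is hypothetical and this is not Sublime Merge's code. The point is that a file that turns out shorter than expected shows up as an ordinary short read you can handle, whereas touching a mapped page beyond the end of a truncated file raises SIGBUS.

```python
import os


def read_exact(fd, offset, length):
    """Copy `length` bytes starting at `offset` into a new bytes object.

    Raises EOFError if the file ends early (for example because another
    process truncated it), so the caller sees a normal, catchable error.
    """
    chunks = []
    remaining = length
    while remaining > 0:
        chunk = os.pread(fd, remaining, offset)  # does not move the file position
        if not chunk:
            raise EOFError(f"file ended {remaining} bytes short of the request")
        chunks.append(chunk)
        offset += len(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)
```

The cost is one copy per read, which is consistent with the quoted benchmark showing pread somewhat slower than mmap.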

PaulHoule over 4 years ago

I like mmap and I don't.

It is incompatible with non-blocking I/O since your process will be stopped if it tries to access part of the file that is not mapped -- this isn't a syscall blocking (which you might work around) but rather any attempt to access mapped memory.

I like mmap for tasks like seeking into ZIP files, where you can look at the back 1% of the file, then locate and extract one of the subfiles; the trouble there is that the really fun case is to do this over the network with HTTP (say to solve Python dependencies, to extract the metadata from wheel files), in which case this method doesn't work.
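
As a rough illustration of the ZIP case, the Python sketch below maps an archive, finds the end-of-central-directory record near the end of the file, and walks the central directory to list member names. Only the pages actually touched get read from disk. The function name is made up for this example, and the code makes simplifying assumptions: no ZIP64, no multi-disk archives, UTF-8 file names, and an archive comment that does not itself contain the record signature.

```python
import mmap
import struct

EOCD = struct.Struct("<4sHHHHIIH")     # end of central directory, 22 bytes
CDIR = struct.Struct("<4s6H3I5H2I")    # central directory file header, 46 bytes


def list_zip_names(path):
    """Return member names by reading only the tail of the archive."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Search backwards for the EOCD signature.
            eocd_pos = mm.rfind(b"PK\x05\x06")
            if eocd_pos < 0:
                raise ValueError("not a ZIP file (no end-of-central-directory record)")
            fields = EOCD.unpack_from(mm, eocd_pos)
            total_entries, cd_offset = fields[4], fields[6]

            names = []
            pos = cd_offset
            for _ in range(total_entries):
                header = CDIR.unpack_from(mm, pos)
                if header[0] != b"PK\x01\x02":
                    raise ValueError("corrupt central directory")
                name_len, extra_len, comment_len = header[10], header[11], header[12]
                start = pos + CDIR.size
                names.append(mm[start:start + name_len].decode("utf-8"))
                # Skip the variable-length tail to reach the next header.
                pos = start + name_len + extra_len + comment_len
            return names
```

Over HTTP the same access pattern would need range requests instead of page faults, which is the limitation described above.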

waynesonfire over 4 years ago

Thanks for diving into this DB! I find it interesting that many databases share such similar architectural principles. NIH. It's super fun to build a database so why not.

Also, don't beat yourself up over how deep you'll be diving into the design. Why apologize for this? Those that want a deep expository would quickly move on.

amelius over 4 years ago

This is one area where Rust, a modern systems language, has disappointed me. You can't allocate data structures inside mmap'ed areas and expect them to work when you load them again (i.e., the mmap'ed area's base address might have changed). I hope that future languages take this use case into account.
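
The usual workaround for a changing base address is to store offsets relative to the start of the mapping instead of raw pointers. The sketch below shows the idea in Python rather than Rust, with a singly linked list laid out inside a mapped file. The record layout and function names are invented for this example; offset 0 is reserved as the "null" link because the header occupies the first bytes of the file.

```python
import mmap
import struct

HEADER = struct.Struct("<Q")   # offset of the first node, 0 if the list is empty
NODE = struct.Struct("<qQ")    # (value, offset of the next node or 0)


def build_list(path, values):
    """Lay out a linked list inside a mapped file using relative offsets."""
    size = HEADER.size + NODE.size * len(values)
    with open(path, "wb") as f:
        f.truncate(size)  # zero-filled file of the final size
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), size) as mm:
        offset = HEADER.size
        for i, value in enumerate(values):
            is_last = i == len(values) - 1
            next_offset = 0 if is_last else offset + NODE.size
            NODE.pack_into(mm, offset, value, next_offset)
            offset += NODE.size
        HEADER.pack_into(mm, 0, HEADER.size if values else 0)
        mm.flush()


def read_list(path):
    """Map the file again (at whatever address) and follow the offsets."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            (offset,) = HEADER.unpack_from(mm, 0)
            values = []
            while offset != 0:
                value, offset = NODE.unpack_from(mm, offset)
                values.append(value)
            return values
```

Because every link is an offset, the structure stays valid no matter where the second mapping lands in the address space. The complaint above is that this bookkeeping has to be done by hand instead of by the language.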

boxfire over 4 years ago

Very strange to see few to no references to io_uring here. I guess it's still too new. As I've seen many times before, so much complexity is replicated in userspace to reproduce kernel behavior out of mmap or DIO/AIO, in order to break the latency, caching, and prioritization into a micromanaged state tuned for a narrow set of applications... and is then applied to database code used in a myriad of applications which violate those assumptions and have their own needs. io_uring can't take over fast enough.

minitoar over 4 years ago

Interana mmaps the heck out of stuff. I’ve found that relying on the file cache works great. Though our access patterns are admittedly pretty simple.

rossmohax over 4 years ago

mmap is not as free as people think. The VM subsystem is full of inefficient locks. Here is a very good writeup on a problem the BBC encountered with Varnish: https://www.bbc.co.uk/blogs/internet/entries/17d22fb8-cea2-49d5-be14-86e7a1dcde04

rcgorton over 4 years ago

I found some of the 'sizing' snippets in the example came across as disingenuous: if you KNOW the size of the file, mmap it initially using that size, without the looping overhead. And you presumably know how much memory you have on a given system. The description (at least as I read the article) implies Bolt is a truly naive implementation of a key/value DB.

ramoz over 4 years ago

Perhaps a part 2 would dive a bit deeper into os caching and hardware (SSDs, their interfaces etc)

jeffbee over 4 years ago

Apparently in a way that the author of the article, and probably the authors of bolt, do not really understand.

29athrowaway over 4 years ago

malloc is implemented using mmap.

You map memory manually when you need very low level control over memory.
