I was playing around with some stuff that required a 48GB hash table and, to the very best of my ability to understand this stuff, the run time was completely dominated by TLB misses. I say this because, based on my throughput, every lookup was taking the time of about 3 memory accesses on average; i.e. there were page table lookups for every single memory access I made. I don't know the tools that would let me actually monitor the true number of TLB misses.<p>Had I pursued it further, it seems that using a hugepages interface could alleviate this, but hugepages are a royal pain in the ass to get going: they require kernel parameters, a reboot, special memory allocation routines, and praying that your memory doesn't get fragmented. Of course I was doing this in C, and if my application had been in any other language it might have been extremely difficult to get this to work.<p>My use case may have been unusual, but as we store more and more data in RAM it's going to become less unusual. When we care deeply about latency, virtual memory page size is going to be a big problem, and already it seems there are few use cases where 4kB pages are large enough.
I was pondering, a while ago, an operating system that--as well as exposing a raw "allocate me a block of memory" function--exposed a managed, typed key-value representation of virtual memory (picture, say, a Redis kernel module), from which one could allocate hashes, trees, linked-lists, and so forth. Given a NUMA architecture, this K-V store could then just be <i>clustered</i> between each memory pool in the same system in exactly the same way (save optimizations) one would cluster it against remote systems.
Just some background - the solution/benchmark grew out of an index lookup latency issue. In our search engine, we generate enormous b-tree indexes and store them in memory (rsync from master, then mmap). After adding more logic, intersects, and unions, the search engine started to miss its SLA.<p>Eventually, we traced the problem back to additional latency in the vmalloc code path. The get_free_page* API code path had much lower latency, and llds was born (llds uses k*alloc, which is a wrapper around GFP).<p>llds is also being used in low-energy compute environments (like SeaMicro machines), where every CPU cycle is expensive due to increased hardware latency.
I remember when there was a webserver in the Linux kernel. However, it was considered a bug that you couldn't get equal performance from userspace, and eventually it was removed.<p>It should be possible to fix that for this type of case too. A kernel module is the easy solution, though, and gives you a benchmark.
Reminds me of exokernels. Being able to freely roll or adapt your own virtual memory management system tuned to your application was one of the signature uses.
I am not able to fully understand what it is shooting for. The README says it avoids the VM layer (which seems impossible in a pure software solution), but the code suggests it's merely doing a kmem_cache_zalloc. Am I missing something?<p>It's true that VM is an overhead now; with very large memories, the concept of virtual memory is outdated. TLB misses are too high and huge pages just don't cut it. This has been repeated over and over, but we need to re-design the VM/hardware to support TLB-less access for a portion of memory the size of your primary application's working set.