Hi HN,
NVMe disks, when addressed natively in userland, offer massive performance improvements compared to other forms of persistent storage. However, in spite of the existence of projects like SPDK and SplinterDB, there don't seem to be any open source, non-embedded key-value stores or databases out in the wild yet.<p>Why do you think that is? Are there other projects out there that I'm not familiar with?
I don't remember exactly why I have any of them saved, but these are some experimental data stores that seem to be roughly what you're looking for:<p>- <a href="https://github.com/DataManagementLab/ScaleStore">https://github.com/DataManagementLab/ScaleStore</a> - "A Fast and Cost-Efficient Storage Engine using DRAM, NVMe, and RDMA"<p>- <a href="https://github.com/unum-cloud/udisk">https://github.com/unum-cloud/udisk</a> (<a href="https://github.com/unum-cloud/ustore">https://github.com/unum-cloud/ustore</a>) - "The fastest ACID-transactional persisted Key-Value store designed for NVMe block-devices with GPU-acceleration and SPDK to bypass the Linux kernel."<p>- <a href="https://github.com/capsuleman/ssd-nvme-database">https://github.com/capsuleman/ssd-nvme-database</a> - "Columnar database on SSD NVMe"
There's actually an NVMe command set that lets you use the FTL directly as a K/V store. (It's limited to 16-byte keys [1], however, so it's not that useful and probably not implemented anywhere. My guess is Samsung looked at this for some hyperscaler, whipped up a prototype in their customer-specific firmware, and the benefits were smaller than expected, so it's dead now.)<p>[1] These slides claim up to 32 bytes, which would be a practically useful length: <a href="https://www.snia.org/sites/default/files/ESF/Key-Value-Storage-Standard-Final.pdf" rel="nofollow noreferrer">https://www.snia.org/sites/default/files/ESF/Key-Value-Stora...</a> but the current revision of the standard only permits two 64-bit words as the key ("The maximum KV key size is 16 bytes"): <a href="https://nvmexpress.org/wp-content/uploads/NVM-Express-Key-Value-Command-Set-Specification-1.0c-2022.10.03-Ratified-1.pdf" rel="nofollow noreferrer">https://nvmexpress.org/wp-content/uploads/NVM-Express-Key-Va...</a>
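To make the size limit concrete: if you wanted to layer arbitrary application keys on top of that command set, you'd have to hash them down to 16 bytes host-side and deal with collisions yourself. A rough sketch of what that looks like (my own illustration, nothing to do with the spec or anyone's firmware; the struct and helper names are made up):

    /* Illustrative only: squeezing arbitrary keys into the spec's 16-byte limit. */
    #include <stdint.h>
    #include <stddef.h>

    struct nvme_kv_key { uint64_t hi, lo; };   /* 16 bytes: two 64-bit words */

    /* 64-bit FNV-1a; run twice with different seeds as a stand-in 128-bit hash. */
    static uint64_t fnv1a64(const void *p, size_t n, uint64_t h)
    {
        const unsigned char *s = p;
        while (n--) { h ^= *s++; h *= 0x100000001b3ULL; }
        return h;
    }

    static struct nvme_kv_key make_key(const void *name, size_t len)
    {
        struct nvme_kv_key k;
        k.hi = fnv1a64(name, len, 0xcbf29ce484222325ULL);  /* standard offset basis */
        k.lo = fnv1a64(name, len, 0x84222325cbf29ce4ULL);  /* arbitrary second seed */
        return k;
    }

Collision handling (and anything resembling range scans) stays the host's problem, which is part of why a 16-byte key space isn't that compelling.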
Note that some cloud VM types expose entire NVMe drives as-is, directly to the guest operating system, without hypervisors or other abstractions in the way.<p>The Azure Lv3/Lsv3/Lav3/Lasv3 series all provide this capability, for example.<p>Ref: <a href="https://learn.microsoft.com/en-us/azure/virtual-machines/lasv3-series" rel="nofollow noreferrer">https://learn.microsoft.com/en-us/azure/virtual-machines/las...</a>
What do you mean by non-embedded?<p>You might also be interested in xNVMe and the RocksDB/Ceph KV drivers:<p><a href="https://github.com/OpenMPDK/xNVMe">https://github.com/OpenMPDK/xNVMe</a><p><a href="https://github.com/OpenMPDK/KVSSD">https://github.com/OpenMPDK/KVSSD</a><p><a href="https://github.com/OpenMPDK/KVRocks">https://github.com/OpenMPDK/KVRocks</a>
Eatonphil posted a link to this paper <a href="https://web.archive.org/web/20230624195551/https://www.vldb.org/pvldb/vol16/p2090-haas.pdf" rel="nofollow noreferrer">https://web.archive.org/web/20230624195551/https://www.vldb....</a> a couple hours after this post (zero comments [0]):<p>> NVMe SSDs based on flash are cheap and offer high throughput. Combining several of these devices into a single server enables 10 million I/O operations per second or more. Our experiments show that existing out-of-memory database systems and storage engines achieve only a fraction of this performance. In this work, we demonstrate that it is possible to close the performance gap between hardware and software through an I/O optimized storage engine design. In a heavy out-of-memory setting, where the dataset is 10 times larger than main memory, our system can achieve more than 1 million TPC-C transactions per second.<p>[0] <a href="https://news.ycombinator.com/item?id=37899886">https://news.ycombinator.com/item?id=37899886</a>
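If you're curious what the I/O side of that looks like in practice, here's a rough sketch (mine, not from the paper) of a batched 4 KiB random-read loop using io_uring with O_DIRECT against a raw block device; the paper's engine obviously does far more, but this is the general shape of an I/O path that can get anywhere near the device's rated IOPS (build with -luring, error handling mostly omitted):

    #define _GNU_SOURCE          /* for O_DIRECT */
    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define QD 256               /* queue depth: keep the device busy */
    #define BS 4096              /* block size, O_DIRECT-aligned */

    int main(int argc, char **argv)
    {
        if (argc < 2) { fprintf(stderr, "usage: %s /dev/nvme0n1\n", argv[0]); return 1; }

        int fd = open(argv[1], O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        struct io_uring ring;
        io_uring_queue_init(QD, &ring, 0);

        void *buf[QD];
        for (int i = 0; i < QD; i++)
            posix_memalign(&buf[i], BS, BS);

        /* Prime the ring with a full batch of random reads. */
        for (int i = 0; i < QD; i++) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            off_t off = (off_t)(rand() % (1 << 20)) * BS;   /* toy offset range */
            io_uring_prep_read(sqe, fd, buf[i], BS, off);
            io_uring_sqe_set_data(sqe, buf[i]);
        }
        io_uring_submit(&ring);

        /* Reap one completion, immediately resubmit with a new random offset. */
        for (long done = 0; done < 1000000; done++) {
            struct io_uring_cqe *cqe;
            io_uring_wait_cqe(&ring, &cqe);
            void *b = io_uring_cqe_get_data(cqe);
            io_uring_cqe_seen(&ring, cqe);

            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            off_t off = (off_t)(rand() % (1 << 20)) * BS;
            io_uring_prep_read(sqe, fd, b, BS, off);
            io_uring_sqe_set_data(sqe, b);
            io_uring_submit(&ring);
        }
        io_uring_queue_exit(&ring);
        return 0;
    }

The point the paper makes is that the buffer manager and the rest of the engine have to be designed around this kind of asynchronous, deeply queued I/O, otherwise the software becomes the bottleneck long before the SSDs do.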
Crail [1] is a distributed K/V store on top of NVMe-oF.<p>[1] <a href="https://craillabs.github.io" rel="nofollow noreferrer">https://craillabs.github.io</a>
Aerospike does direct NVMe access.<p><a href="https://github.com/aerospike/aerospike-server/blob/master/cf/src/hardware.c#L83">https://github.com/aerospike/aerospike-server/blob/master/cf...</a><p>There are other occurrences in the codebase, but that is the most prominent one.
<a href="https://github.com/OpenMPDK/KVRocks">https://github.com/OpenMPDK/KVRocks</a><p>Given however, that most of the world has shifted to VMs, I don't think KV storage is accessible for that reason alone because the disks are often split out to multiple users. So the overall demand for this would be low.
I work on a database that is a KV store if you squint enough, and we're taking advantage of NVMe.<p>One thing they don't tell you about NVMe is that you'll end up bottlenecked on CPU and memory bandwidth if you do it right. The problem is that after eliminating all of the speed bumps in your IO pathway, you have a vertical performance mountain face to climb. People are just starting to run into these problems, so it's hard to say what the future holds. It's all very exciting.
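To put rough numbers on it (my own back-of-envelope, not the parent's):

    10,000,000 IOPS x 4 KiB                      ~ 40 GB/s coming off the devices
    + one memcpy + one checksum pass per page    ~ 120 GB/s of total memory traffic

That's a sizeable chunk of a single socket's memory bandwidth before the query engine has done any actual work, so the bottleneck really does move from the drives to CPU and memory.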
> non-embedded key value stores or DBs out in the wild yet<p>I like how you reference the performance benefits of NVMe direct addressing, but then immediately lament that you can't access these benefits across a <i>SEVEN LAYER STACK OF ABSTRACTIONS</i>.<p>You can either lament the dearth of userland direct-addressable performant software, OR lament the dearth of convenient network APIs that thrash your cache lines and dramatically increase your access latency.<p>You don't get to do both simultaneously.<p>Embedded is a feature for performance-aware software, not a bug.
Interesting article here: <a href="https://grafana.com/blog/2023/08/23/how-we-scaled-grafana-cloud-logs-memcached-cluster-to-50tb-and-improved-reliability/" rel="nofollow noreferrer">https://grafana.com/blog/2023/08/23/how-we-scaled-grafana-cl...</a><p>Utilizing: <a href="https://memcached.org/blog/nvm-caching/" rel="nofollow noreferrer">https://memcached.org/blog/nvm-caching/</a> and <a href="https://github.com/memcached/memcached/wiki/Extstore" rel="nofollow noreferrer">https://github.com/memcached/memcached/wiki/Extstore</a><p>TL;DR: Grafana Cloud needed tons of caching, and it was expensive, so they used extstore in memcached to push most of it out to NVMe disks. This massively reduced their costs.
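For reference, turning extstore on is essentially one flag giving a backing path and size (the tuning knobs are on the wiki linked above; the path and sizes here are made up for illustration):

    memcached -m 4096 -o ext_path=/mnt/nvme0/extstore:512G

RAM still holds the hash table and hot items; only the item data spills to the NVMe file, which is why it works so well for large, mostly-cold caches.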
There’s Kvrocks. It uses the Redis protocol and it’s built on RocksDB <a href="https://github.com/apache/kvrocks">https://github.com/apache/kvrocks</a>
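Since it speaks the Redis protocol, existing Redis clients work against it unchanged, e.g. with redis-cli (6666 is, I believe, the default port, but it's configurable):

    redis-cli -p 6666 SET user:42 hello
    redis-cli -p 6666 GET user:42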
A SeaweedFS volume store sounds like a good candidate for splitting some of the performance volumes across the NVMe queues. You're supposed to give it a whole disk to use anyway.
I'm building one:
<a href="https://github.com/yottaStore/yottaStore">https://github.com/yottaStore/yottaStore</a>
Is there any performance gain over writing append-only data to a file?<p>I mean, using a Merkle tree or something like that to make sense of the underlying data.
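For concreteness, the baseline being asked about is roughly this (a minimal sketch, Linux-flavoured, ignoring batching and group commit):

    /* Append a record to a log file, then force it to stable storage.
     * The question is how much an NVMe-native engine buys over this path. */
    #include <fcntl.h>
    #include <unistd.h>

    int log_append(const char *path, const void *rec, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) return -1;
        ssize_t n = write(fd, rec, len);                 /* appended at EOF */
        int ok = (n == (ssize_t)len) && fdatasync(fd) == 0;  /* durability barrier */
        close(fd);
        return ok ? 0 : -1;
    }

The userland-NVMe argument is mostly about what happens around this: the syscall, page cache, and filesystem journaling overhead per operation, and the latency of fdatasync versus issuing flushes/writes directly against the device queues.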
“Lazyweb, find me an NVMe key-value store” is how we phrased requests like this twenty years ago.<p>Who could afford to develop and maintain such a niche thing, in today’s economy, without either a universal basic income or a “non-free” license to guarantee revenue?
It becomes complex when you want to support multiple NVMe drives.<p>Even more complex when you want any kind of redundancy, as you'd essentially need to build something RAID-like into your database.<p>Also, a few terabytes of NVMe in RAID 10 plus PostgreSQL or something similar covers about 99% of companies' needs for speed.<p>So you're left with the 1% that needs that kind of speed.