Hi HN,
NVMe disks, when addressed natively in userland, offer massive performance improvements compared to other forms of persistent storage. However, in spite of the existence of projects like SPDK and SplinterDB, there don't seem to be any open source, non-embedded key-value stores or databases out in the wild yet.<p>Why do you think that is? Are there other projects out there that I'm not familiar with?
I don't remember exactly why I have any of them saved, but these are some experimental data stores that seem to be roughly what you're looking for:<p>- <a href="https://github.com/DataManagementLab/ScaleStore">https://github.com/DataManagementLab/ScaleStore</a> - "A Fast and Cost-Efficient Storage Engine using DRAM, NVMe, and RDMA"<p>- <a href="https://github.com/unum-cloud/udisk">https://github.com/unum-cloud/udisk</a> (<a href="https://github.com/unum-cloud/ustore">https://github.com/unum-cloud/ustore</a>) - "The fastest ACID-transactional persisted Key-Value store designed for NVMe block-devices with GPU-acceleration and SPDK to bypass the Linux kernel."<p>- <a href="https://github.com/capsuleman/ssd-nvme-database">https://github.com/capsuleman/ssd-nvme-database</a> - "Columnar database on SSD NVMe"
There's actually an NVMe command set that lets you use the FTL directly as a K/V store. (It's limited to 16-byte keys [1], however, so it's not that useful and probably not implemented anywhere. My guess is Samsung looked at this for some hyperscaler, whipped up a prototype in their customer-specific firmware, and the benefits were smaller than expected, so it's dead now.)<p>[1] These slides claim up to 32 bytes, which would be a practically useful length: <a href="https://www.snia.org/sites/default/files/ESF/Key-Value-Storage-Standard-Final.pdf" rel="nofollow noreferrer">https://www.snia.org/sites/default/files/ESF/Key-Value-Stora...</a> but the current revision of the standard only permits two 64-bit words as the key ("The maximum KV key size is 16 bytes"): <a href="https://nvmexpress.org/wp-content/uploads/NVM-Express-Key-Value-Command-Set-Specification-1.0c-2022.10.03-Ratified-1.pdf" rel="nofollow noreferrer">https://nvmexpress.org/wp-content/uploads/NVM-Express-Key-Va...</a>
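To make the size limit concrete: if you wanted to layer arbitrary application keys on top of that command set, you'd have to hash them down to 16 bytes host-side and deal with collisions yourself. A rough sketch of what that looks like (my own illustration, nothing to do with the spec or anyone's firmware; the struct and helper names are made up):

    /* Illustrative only: squeezing arbitrary keys into the spec's 16-byte limit. */
    #include <stdint.h>
    #include <stddef.h>

    struct nvme_kv_key { uint64_t hi, lo; };   /* 16 bytes: two 64-bit words */

    /* 64-bit FNV-1a; run twice with different seeds as a stand-in 128-bit hash. */
    static uint64_t fnv1a64(const void *p, size_t n, uint64_t h)
    {
        const unsigned char *s = p;
        while (n--) { h ^= *s++; h *= 0x100000001b3ULL; }
        return h;
    }

    static struct nvme_kv_key make_key(const void *name, size_t len)
    {
        struct nvme_kv_key k;
        k.hi = fnv1a64(name, len, 0xcbf29ce484222325ULL);  /* standard offset basis */
        k.lo = fnv1a64(name, len, 0x84222325cbf29ce4ULL);  /* arbitrary second seed */
        return k;
    }

Collision handling (and anything resembling range scans) stays the host's problem, which is part of why a 16-byte key space isn't that compelling.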
Note that some cloud VM types expose entire NVMe drives as-is, directly to the guest operating system, without hypervisors or other abstractions in the way.<p>The Azure Lv3/Lsv3/Lav3/Lasv3 series all provide this capability, for example.<p>Ref: <a href="https://learn.microsoft.com/en-us/azure/virtual-machines/lasv3-series" rel="nofollow noreferrer">https://learn.microsoft.com/en-us/azure/virtual-machines/las...</a>
What do you mean by non-embedded?<p>You might also be interested in xNVMe and the RocksDB/Ceph KV drivers:<p><a href="https://github.com/OpenMPDK/xNVMe">https://github.com/OpenMPDK/xNVMe</a><p><a href="https://github.com/OpenMPDK/KVSSD">https://github.com/OpenMPDK/KVSSD</a><p><a href="https://github.com/OpenMPDK/KVRocks">https://github.com/OpenMPDK/KVRocks</a>
Eatonphil posted a link to this paper <a href="https://web.archive.org/web/20230624195551/https://www.vldb.org/pvldb/vol16/p2090-haas.pdf" rel="nofollow noreferrer">https://web.archive.org/web/20230624195551/https://www.vldb....</a> a couple hours after this post (zero comments [0]):<p>> NVMe SSDs based on flash are cheap and offer high throughput. Combining several of these devices into a single server enables 10 million I/O operations per second or more. Our experiments show that existing out-of-memory database systems and storage engines achieve only a fraction of this performance. In this work, we demonstrate that it is possible to close the performance gap between hardware and software through an I/O optimized storage engine design. In a heavy out-of-memory setting, where the dataset is 10 times larger than main memory, our system can achieve more than 1 million TPC-C transactions per second.<p>[0] <a href="https://news.ycombinator.com/item?id=37899886">https://news.ycombinator.com/item?id=37899886</a>
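If you're curious what the I/O side of that looks like in practice, here's a rough sketch (mine, not from the paper) of a batched 4 KiB random-read loop using io_uring with O_DIRECT against a raw block device; the paper's engine obviously does far more, but this is the general shape of an I/O path that can get anywhere near the device's rated IOPS (build with -luring, error handling mostly omitted):

    #define _GNU_SOURCE          /* for O_DIRECT */
    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define QD 256               /* queue depth: keep the device busy */
    #define BS 4096              /* block size, O_DIRECT-aligned */

    int main(int argc, char **argv)
    {
        if (argc < 2) { fprintf(stderr, "usage: %s /dev/nvme0n1\n", argv[0]); return 1; }

        int fd = open(argv[1], O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        struct io_uring ring;
        io_uring_queue_init(QD, &ring, 0);

        void *buf[QD];
        for (int i = 0; i < QD; i++)
            posix_memalign(&buf[i], BS, BS);

        /* Prime the ring with a full batch of random reads. */
        for (int i = 0; i < QD; i++) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            off_t off = (off_t)(rand() % (1 << 20)) * BS;   /* toy offset range */
            io_uring_prep_read(sqe, fd, buf[i], BS, off);
            io_uring_sqe_set_data(sqe, buf[i]);
        }
        io_uring_submit(&ring);

        /* Reap one completion, immediately resubmit with a new random offset. */
        for (long done = 0; done < 1000000; done++) {
            struct io_uring_cqe *cqe;
            io_uring_wait_cqe(&ring, &cqe);
            void *b = io_uring_cqe_get_data(cqe);
            io_uring_cqe_seen(&ring, cqe);

            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            off_t off = (off_t)(rand() % (1 << 20)) * BS;
            io_uring_prep_read(sqe, fd, b, BS, off);
            io_uring_sqe_set_data(sqe, b);
            io_uring_submit(&ring);
        }
        io_uring_queue_exit(&ring);
        return 0;
    }

The point the paper makes is that the buffer manager and the rest of the engine have to be designed around this kind of asynchronous, deeply queued I/O, otherwise the software becomes the bottleneck long before the SSDs do.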
Crail [1] is a distributed K/V store on top of NVMe-oF.<p>[1] <a href="https://craillabs.github.io" rel="nofollow noreferrer">https://craillabs.github.io</a>
Aerospike does direct NVMe access.<p><a href="https://github.com/aerospike/aerospike-server/blob/master/cf/src/hardware.c#L83">https://github.com/aerospike/aerospike-server/blob/master/cf...</a><p>There are other occurrences in the codebase, but that is the most prominent one.
<a href="https://github.com/OpenMPDK/KVRocks">https://github.com/OpenMPDK/KVRocks</a><p>Given however, that most of the world has shifted to VMs, I don't think KV storage is accessible for that reason alone because the disks are often split out to multiple users. So the overall demand for this would be low.
I work on a database that is a KV store if you squint enough, and we're taking advantage of NVMe.<p>One thing they don't tell you about NVMe is that you'll end up bottlenecked on CPU and memory bandwidth if you do it right. The problem is that after eliminating all of the speed bumps in your IO pathway, you have a vertical performance mountain face to climb. People are just starting to run into these problems, so it's hard to say what the future holds. It's all very exciting.
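To put rough numbers on it (my own back-of-envelope, not the parent's):

    10,000,000 IOPS x 4 KiB                      ~ 40 GB/s coming off the devices
    + one memcpy + one checksum pass per page    ~ 120 GB/s of total memory traffic

That's a sizeable chunk of a single socket's memory bandwidth before the query engine has done any actual work, so the bottleneck really does move from the drives to CPU and memory.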
> non-embedded key value stores or DBs out in the wild yet<p>I like how you reference the performance benefits of NVMe direct addressing, but then immediately lament that you can't access these benefits across a <i>SEVEN LAYER STACK OF ABSTRACTIONS</i>.<p>You can either lament the dearth of userland direct-addressable performant software, OR lament the dearth of convenient network APIs that thrash your cache lines and dramatically increase your access latency.<p>You don't get to do both simultaneously.<p>Embedded is a feature for performance-aware software, not a bug.
Interesting article here: <a href="https://grafana.com/blog/2023/08/23/how-we-scaled-grafana-cloud-logs-memcached-cluster-to-50tb-and-improved-reliability/" rel="nofollow noreferrer">https://grafana.com/blog/2023/08/23/how-we-scaled-grafana-cl...</a><p>Utilizing: <a href="https://memcached.org/blog/nvm-caching/" rel="nofollow noreferrer">https://memcached.org/blog/nvm-caching/</a> and <a href="https://github.com/memcached/memcached/wiki/Extstore" rel="nofollow noreferrer">https://github.com/memcached/memcached/wiki/Extstore</a><p>TL;DR: Grafana Cloud needed tons of caching, and it was expensive, so they used extstore in memcached to push most of it out to NVMe disks. This massively reduced their costs.
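For reference, turning extstore on is essentially one flag giving a backing path and size (the tuning knobs are on the wiki linked above; the path and sizes here are made up for illustration):

    memcached -m 4096 -o ext_path=/mnt/nvme0/extstore:512G

RAM still holds the hash table and hot items; only the item data spills to the NVMe file, which is why it works so well for large, mostly-cold caches.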
There’s Kvrocks. It uses the Redis protocol and it’s built on RocksDB <a href="https://github.com/apache/kvrocks">https://github.com/apache/kvrocks</a>
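Since it speaks the Redis protocol, existing Redis clients work against it unchanged, e.g. with redis-cli (6666 is, I believe, the default port, but it's configurable):

    redis-cli -p 6666 SET user:42 hello
    redis-cli -p 6666 GET user:42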
A SeaweedFS volume store sounds like a good candidate for splitting some of the performance volumes across the NVMe queues. You're supposed to give it a whole disk to use anyway.
I'm building one:
<a href="https://github.com/yottaStore/yottaStore">https://github.com/yottaStore/yottaStore</a>
Is there any performance gain over writing append-only data to a file?<p>I mean, using a Merkle tree or something like that to make sense of the underlying data.
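For concreteness, the baseline being asked about is roughly this (a minimal sketch, Linux-flavoured, ignoring batching and group commit):

    /* Append a record to a log file, then force it to stable storage.
     * The question is how much an NVMe-native engine buys over this path. */
    #include <fcntl.h>
    #include <unistd.h>

    int log_append(const char *path, const void *rec, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) return -1;
        ssize_t n = write(fd, rec, len);                 /* appended at EOF */
        int ok = (n == (ssize_t)len) && fdatasync(fd) == 0;  /* durability barrier */
        close(fd);
        return ok ? 0 : -1;
    }

The userland-NVMe argument is mostly about what happens around this: the syscall, page cache, and filesystem journaling overhead per operation, and the latency of fdatasync versus issuing flushes/writes directly against the device queues.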
“Lazyweb, find me an NVMe key-value store” is how we phrased requests like this twenty years ago.<p>Who could afford to develop and maintain such a niche thing, in today’s economy, without either a universal basic income or a “non-free” license to guarantee revenue?
It becomes complex when you want to support multiple NVMe drives.<p>Even more complex when you want any kind of redundancy, as you'd essentially need to build something RAID-like into your database.<p>Also, a few terabytes of NVMe in RAID 10 plus PostgreSQL or something similar covers about 99% of companies' needs for speed.<p>So you're left with the 1% that needs that kind of speed.