Ask HN: Why are there no open source NVMe-native key value stores in 2023?

99 points by nphase over 1 year ago
Hi HN, NVMe disks, when addressed natively in userland, offer massive performance improvements compared to other forms of persistent storage. However, in spite of the existence of projects like SPDK and SplinterDB, there don't seem to be any open source, non-embedded key value stores or DBs out in the wild yet.

Why do you think that is? Are there possibly other projects out there that I'm not familiar with?

22 comments

diggan over 1 year ago
I don't remember exactly why I have any of them saved, but these are some experimental data stores that seem to fit what you're looking for somewhat:

- https://github.com/DataManagementLab/ScaleStore - "A Fast and Cost-Efficient Storage Engine using DRAM, NVMe, and RDMA"
- https://github.com/unum-cloud/udisk (https://github.com/unum-cloud/ustore) - "The fastest ACID-transactional persisted Key-Value store designed for NVMe block-devices with GPU-acceleration and SPDK to bypass the Linux kernel."
- https://github.com/capsuleman/ssd-nvme-database - "Columnar database on SSD NVMe"
formerly_proven over 1 year ago
There's actually an NVMe command set which allows you to use the FTL directly as a K/V store. (This is limited to 16-byte keys [1], however, so it is not that useful and probably not implemented anywhere. My guess is Samsung looked at this for some hyperscaler, whipped up a prototype in their customer-specific firmware, and the benefits were smaller than expected, so it's dead now.)

[1] These slides claim up to 32 bytes, which would be a practically useful length: https://www.snia.org/sites/default/files/ESF/Key-Value-Storage-Standard-Final.pdf but the current revision of the standard only permits two 64-bit words as the key ("The maximum KV key size is 16 bytes"): https://nvmexpress.org/wp-content/uploads/NVM-Express-Key-Value-Command-Set-Specification-1.0c-2022.10.03-Ratified-1.pdf
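To make the 16-byte limit concrete, here is a rough sketch of what fitting an application key into "two 64-bit words" might look like. The zero padding, the little-endian word order, the rejection of empty keys, and the function name `pack_kv_key` are all assumptions made for illustration; none of them is taken from the specification.

```python
import struct

# Limit quoted above from the NVMe Key Value Command Set specification.
MAX_KV_KEY_BYTES = 16


def pack_kv_key(key: bytes) -> tuple[int, int]:
    """Split a key of at most 16 bytes into two 64-bit words.

    Padding and byte order are illustrative assumptions, not the
    on-the-wire layout defined by the specification.
    """
    if not key:
        raise ValueError("empty keys are rejected in this sketch")
    if len(key) > MAX_KV_KEY_BYTES:
        raise ValueError(
            f"key is {len(key)} bytes, limit is {MAX_KV_KEY_BYTES}"
        )
    # Zero-pad on the right so every key occupies exactly 16 bytes.
    padded = key.ljust(MAX_KV_KEY_BYTES, b"\x00")
    # "<QQ" reads two unsigned 64-bit little-endian integers.
    low_word, high_word = struct.unpack("<QQ", padded)
    return low_word, high_word
```

A UUID (16 bytes) fits exactly, while anything like a URL or a composite key would have to be hashed or mapped through a separate index first, which is roughly why the limit makes the command set less attractive.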
jiggawatts over 1 year ago
Note that some cloud VM types expose entire NVMe drives as-is directly to the guest operating system without hypervisors or other abstractions in the way.

The Azure Lv3/Lsv3/Lav3/Lasv3 series all provide this capability, for example.

Ref: https://learn.microsoft.com/en-us/azure/virtual-machines/lasv3-series
gavinray over 1 year ago
What do you mean by non-embedded?

You might also be interested in xNVMe and the RocksDB/Ceph KV drivers:

https://github.com/OpenMPDK/xNVMe

https://github.com/OpenMPDK/KVSSD

https://github.com/OpenMPDK/KVRocks
nerpderp82 over 1 year ago
Eatonphil posted a link to this paper https://web.archive.org/web/20230624195551/https://www.vldb.org/pvldb/vol16/p2090-haas.pdf a couple hours after this post (zero comments [0])

> NVMe SSDs based on flash are cheap and offer high throughput. Combining several of these devices into a single server enables 10 million I/O operations per second or more. Our experiments show that existing out-of-memory database systems and storage engines achieve only a fraction of this performance. In this work, we demonstrate that it is possible to close the performance gap between hardware and software through an I/O optimized storage engine design. In a heavy out-of-memory setting, where the dataset is 10 times larger than main memory, our system can achieve more than 1 million TPC-C transactions per second.

[0] https://news.ycombinator.com/item?id=37899886
threeseed over 1 year ago
There's Crail [1], which is a distributed K/V store on top of NVMe-oF.

[1] https://craillabs.github.io
nerpderp82 over 1 year ago
Aerospike does direct NVMe access.

https://github.com/aerospike/aerospike-server/blob/master/cf/src/hardware.c#L83

There are other occurrences in the codebase, but that is the most prominent one.
bestouff over 1 year ago
Naive question: are there really gains to be expected from addressing an NVMe disk natively, compared with using a regular key-value database on a filesystem?
delfinom over 1 year ago
https://github.com/OpenMPDK/KVRocks

However, given that most of the world has shifted to VMs, I don't think KV storage is accessible, for that reason alone: the disks are often split among multiple users. So the overall demand for this would be low.
otterley over 1 year ago
Because you haven't written it yet!
infamouscow over 1 year ago
I work on a database that is a KV-store if you squint enough and we're taking advantage of NVMe.

One thing they don't tell you about NVMe is you'll end up bottlenecked on CPU and memory bandwidth if you do it right. The problem is after eliminating all of the speed bumps in your IO pathway, you have a vertical performance mountain face to climb. People are just starting to run into these problems, so it's hard to say what the future holds. It's all very exciting.
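A back-of-the-envelope calculation shows why CPU and memory bandwidth become the next wall. It takes the 10 million I/O operations per second figure from the paper abstract quoted elsewhere in this thread; the 4 KiB page size, the 64 cores, and the 3 GHz clock are assumed numbers chosen only for illustration, and `per_io_budget` is a hypothetical helper.

```python
def per_io_budget(iops: int, page_bytes: int, cores: int, clock_hz: int):
    """Return (bytes moved per second, CPU cycles available per I/O)."""
    # Every completed I/O moves one page through memory at least once.
    bandwidth_bytes_per_s = iops * page_bytes
    # Total cycles per second across all cores, shared by all I/Os.
    cycles_per_io = cores * clock_hz / iops
    return bandwidth_bytes_per_s, cycles_per_io


bandwidth, cycles = per_io_budget(
    iops=10_000_000,         # figure quoted from the paper abstract
    page_bytes=4096,         # assumed page size
    cores=64,                # assumed core count
    clock_hz=3_000_000_000,  # assumed clock speed
)
print(f"{bandwidth / 1e9:.1f} GB/s of page traffic")
print(f"{cycles:,.0f} CPU cycles per I/O")
```

Under these assumptions the drives deliver about 41 GB/s of page traffic, and each I/O gets roughly 19,200 cycles across the whole machine for submission, completion, lookup, copying, and transaction logic. Each extra copy of a page multiplies the memory traffic again.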
caeril over 1 year ago
> non-embedded key value stores or DBs out in the wild yet

I like how you reference the performance benefits of NVMe direct addressing, but then immediately lament that you can't access these benefits across a SEVEN LAYER STACK OF ABSTRACTIONS.

You can either lament the dearth of userland direct-addressable performant software, OR lament the dearth of convenient network APIs that thrash your cache lines and dramatically increase your access latency.

You don't get to do both simultaneously.

Embedded is a feature for performance-aware software, not a bug.
rubiquity over 1 year ago
I think it's mostly because, while the internal parallelism of NVMe is fantastic, our logical use of them is still largely sequential.
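One way to put numbers on this is Little's law: throughput equals requests in flight divided by time per request. The sketch below assumes a 100 microsecond device latency and a device ceiling of 1 million IOPS; both are made-up round numbers, and `iops_at_queue_depth` is a hypothetical helper.

```python
def iops_at_queue_depth(
    queue_depth: int, latency_us: int, device_max_iops: int
) -> float:
    """Little's law: throughput = requests in flight / time per request."""
    uncapped = queue_depth * 1_000_000 / latency_us
    # The device cannot go faster than its own ceiling.
    return min(uncapped, float(device_max_iops))


for depth in (1, 8, 64, 256):
    rate = iops_at_queue_depth(
        depth,
        latency_us=100,             # assumed per-request latency
        device_max_iops=1_000_000,  # assumed device ceiling
    )
    print(f"queue depth {depth:>3}: {rate:>11,.0f} IOPS")
```

With these assumed numbers, issuing one request at a time yields 10,000 IOPS, about 1% of what the device could do. A store whose logic waits for each read before issuing the next never gets past that, however fast the drive is.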
CubsFan1060 over 1 year ago
Interesting article here: https://grafana.com/blog/2023/08/23/how-we-scaled-grafana-cloud-logs-memcached-cluster-to-50tb-and-improved-reliability/

Utilizing: https://memcached.org/blog/nvm-caching/, https://github.com/memcached/memcached/wiki/Extstore

TL;DR: Grafana Cloud needed tons of caching, and it was expensive. So they used extstore in memcached to hold most of it on NVMe disks. This massively reduced their costs.
javierhonduco over 1 year ago
There’s Kvrocks. It uses the Redis protocol and it’s built on RocksDB: https://github.com/apache/kvrocks
Already__Taken over 1 year ago
A SeaweedFS volume store sounds like a good candidate to split some of the performance volumes across the NVMe queues. You're supposed to give it a whole disk to use anyway.
espoal over 1 year ago
I'm building one: https://github.com/yottaStore/yottaStore
zupa-hu over 1 year ago
Is there any performance gain over writing append-only data to a file?

I mean, using a merkle tree or something like that to make sense of the underlying data.
znpy over 1 year ago
I once attended a presentation by some presales engineer from Aerospike and IIRC they're doing some NVMe-in-userspace stuff.
altairprime over 1 year ago
“Lazyweb, find me an NVMe key-value store” is how we phrased requests like this twenty years ago.

Who could afford to develop and maintain such a niche thing, in today’s economy, without either a universal basic income or a “non-free” license to guarantee revenue?
brightball over 1 year ago
SolidCache and SolidQueue from Rails will be doing that when released.

Otherwise though…you have the file system. Is that not enough?
ilyt over 1 year ago
It becomes complex when you want to support multiple NVMes.

Even more complex when you want to have any kind of redundancy, as you'd essentially need to build some kind of RAID-like layer into your database.

Also, a few terabytes of NVMe in RAID10 plus PostgreSQL or something similar covers about 99% of companies' needs for speed.

So you're left with the 1% needing that kind of speed.