One thing about LSM trees that are implemented as large numbers of large files in a filesystem, such as RocksDB, is that they defer to the filesystem to deal with fragmentation and block lookup issues. That's not actually free.<p>LSM tree descriptions typically imply or say outright that each layer is laid out linearly, written sequentially, and read sequentially for merging, and that looking up a block within a layer is an O(1) operation, doing random access I/O to that location.<p>But really, the underlying filesystem is doing a lot of heavy lifting. It maintains the illusion of linear allocation by hiding how the large files are fragmented. That sequential writing is mostly sequential, but typically becomes more fragmented at the filesystem layer as the disk gets closer to full, and over time as various uses of the filesystem leave fewer large contiguous regions. More fragmented free space makes the allocation algorithms do more work, sometimes more I/O, just to allocate space for the LSM tree's "linear" writes.<p>Looking up a block inside a layer requires the filesystem to search its extent tree or, with older filesystems, walk indirect block lookups. Those are hidden from the LSM tree database, but are not without overhead.<p>Writing sequentially to a layer generally requires the filesystem to update its free space structures as well as its extent tree or indirect blocks.<p>Even a simple operation like the LSM tree database deleting a layer file it has finished with is not necessarily simple and quick at the filesystem layer.<p>In other words, when analysing performance, filesystems are the unsung heroes underlying some LSM tree databases. Their algorithmic overhead is often not included in the big-O analysis of LSM tree algorithms running over them, but should be, and their behaviour changes as disk space shrinks and over time due to fragmentation.
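<p>To make that hidden cost concrete, here is a minimal sketch, not any real filesystem's code (the Extent struct and function name are made up), of what a "random read inside a layer" turns into once the file has been split across E extents:

    // Minimal, hypothetical sketch: mapping a logical file offset to a
    // physical block via an extent list. Real filesystems keep this
    // mapping in an on-disk B-tree and may need extra I/O to page it in.
    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct Extent {
      uint64_t logical_start;   // offset within the file
      uint64_t physical_start;  // offset on the device
      uint64_t length;          // bytes covered by this extent
    };

    // O(log E) in the number of extents E -- the cost that the LSM tree's
    // "O(1) random read inside a layer" quietly inherits.
    uint64_t LogicalToPhysical(const std::vector<Extent>& extents,
                               uint64_t off) {
      auto it = std::upper_bound(
          extents.begin(), extents.end(), off,
          [](uint64_t o, const Extent& e) { return o < e.logical_start; });
      --it;  // last extent starting at or before `off` (assumes off is mapped)
      return it->physical_start + (off - it->logical_start);
    }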
A bit of a tangent, but HNers often have the kind of hands-on experience that's hard to find in internet searches, so I'll ask away :)<p>A long time ago we had a big MySQL TokuDB database and were keen to migrate to MyRocks. But MyRocks put every table into a single big file, rather than a file per partition.<p>Partition-per-file is a big deal if you are retaining N days of data in a DB and dropping an old day every night. If your DB stores each partition in separate files, it can simply delete them. But if your DB stores all the partitions in a single file, it ends up having to compact your absolutely massive dataset. That was completely unworkable for us.<p>Has this changed?
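<p>(For reference, at the raw RocksDB level one pattern for this kind of retention workload is a column family per day: dropping a column family lets the engine delete that day's SST files without compacting everything else. Whether MyRocks exposes anything equivalent per partition is exactly what I'm asking. A rough sketch of the RocksDB-level pattern, with made-up names:)

    #include <cassert>
    #include <string>
    #include "rocksdb/db.h"

    // Sketch: one column family per retained day. Dropping a day's column
    // family lets RocksDB delete that day's SST files directly instead of
    // compacting the whole dataset. Names here are illustrative.
    void RotateOldestDay(rocksdb::DB* db,
                         rocksdb::ColumnFamilyHandle* oldest_day,
                         const std::string& new_day_name,
                         rocksdb::ColumnFamilyHandle** new_day) {
      rocksdb::Status s = db->DropColumnFamily(oldest_day);
      assert(s.ok());
      s = db->DestroyColumnFamilyHandle(oldest_day);
      assert(s.ok());
      s = db->CreateColumnFamily(rocksdb::ColumnFamilyOptions(),
                                 new_day_name, new_day);
      assert(s.ok());
    }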
I am looking for an optimal (KV) storage engine that can temporarily store operational telemetry at the source node. Operational telemetry is generated frequently, and similar operations need to be merged frequently (a little compaction). Once it reaches a decent size (~100 MB), we can transfer it to a dedicated time series database engine through various mechanisms. I am struggling to find a fast, write-heavy, memory-efficient store for this.<p>RocksDB seems to tick a few boxes, but there could be a much better solution, since we don't need operations like deletes or range scans.<p>Any suggestions?
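<p>If you do land on RocksDB, its merge operator fits the "fold similar operations together without reading them back" part well. A minimal sketch, where the CounterMerge class and the raw uint64 encoding are just illustrative assumptions:

    #include <cstdint>
    #include <cstring>
    #include <memory>
    #include <string>
    #include "rocksdb/db.h"
    #include "rocksdb/merge_operator.h"
    #include "rocksdb/options.h"

    // Sketch of a RocksDB merge operator that folds repeated telemetry
    // samples for the same key into a running counter, so similar writes
    // get combined during flush/compaction instead of piling up as
    // separate versions.
    class CounterMerge : public rocksdb::AssociativeMergeOperator {
     public:
      bool Merge(const rocksdb::Slice& /*key*/,
                 const rocksdb::Slice* existing_value,
                 const rocksdb::Slice& value,
                 std::string* new_value,
                 rocksdb::Logger* /*logger*/) const override {
        uint64_t sum = 0;
        if (existing_value && existing_value->size() == sizeof(sum)) {
          std::memcpy(&sum, existing_value->data(), sizeof(sum));
        }
        uint64_t delta = 0;
        if (value.size() == sizeof(delta)) {
          std::memcpy(&delta, value.data(), sizeof(delta));
        }
        sum += delta;
        new_value->assign(reinterpret_cast<const char*>(&sum), sizeof(sum));
        return true;
      }
      const char* Name() const override { return "CounterMerge"; }
    };

    // Usage: options.merge_operator = std::make_shared<CounterMerge>();
    //        db->Merge(rocksdb::WriteOptions(), "node1:cpu", delta_bytes);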
RocksDB is awesome, though don't use it with regular glibc malloc because it can cause extreme memory fragmentation. Use jemalloc, tcmalloc, or mimalloc: basically any other advanced malloc library that can effectively reuse memory.
RocksDB is an amazing piece of engineering that deserves to be better known.<p>It is battle tested. It does one job and does it well.<p>I have used it in the past as a middleware database taking an average of 2-3k req/sec with over 400 GB of data stored. It works like a charm.<p>If I had a single reproach to make, it would be around the instrumentation. It is not that straightforward to get proper metrics and reporting of the internals.
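<p>For anyone hitting the same wall: the built-in Statistics object plus the "rocksdb.stats" property cover a fair amount of it, though you still have to wire the output into your own metrics pipeline. A minimal sketch (the path and the specific ticker are just examples):

    #include <iostream>
    #include <string>
    #include "rocksdb/db.h"
    #include "rocksdb/options.h"
    #include "rocksdb/statistics.h"

    int main() {
      rocksdb::Options options;
      options.create_if_missing = true;
      // Enable the built-in tickers/histograms (small CPU overhead).
      options.statistics = rocksdb::CreateDBStatistics();

      rocksdb::DB* db = nullptr;
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/metrics_demo", &db);
      if (!s.ok()) return 1;

      // One-shot counters, e.g. block cache misses since startup.
      std::cout << "block cache misses: "
                << options.statistics->getTickerCount(rocksdb::BLOCK_CACHE_MISS)
                << "\n";

      // Human-readable dumps of internal compaction/memtable state.
      std::string stats;
      db->GetProperty("rocksdb.stats", &stats);
      std::cout << stats << "\n";
      std::cout << options.statistics->ToString() << "\n";

      delete db;
      return 0;
    }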
You might want to also listen to the SE Daily episode with Dhruba and Igor of RocksDB; they cover similar aspects of RocksDB in detail:
<a href="https://softwareengineeringdaily.com/2019/02/05/rocksdb-with-dhruba-borthakur-and-igor-canadi/" rel="nofollow">https://softwareengineeringdaily.com/2019/02/05/rocksdb-with...</a>
Well written article--one clarification on how Meta uses it, though: it is not TAO, it is ZippyDB:
<a href="https://engineering.fb.com/2021/08/06/core-data/zippydb/" rel="nofollow">https://engineering.fb.com/2021/08/06/core-data/zippydb/</a>
How does flushing in a background process work if it's an embeddable database that runs inside your application? The article says there is no external process, so how is there a background process that performs compaction and flushing?
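<p>(My current understanding, which may be wrong: "background" here means background threads that the library spawns inside your own process, not a separate daemon, and you size that in-process thread pool through the options. A rough sketch, with arbitrary numbers:)

    #include "rocksdb/db.h"
    #include "rocksdb/options.h"

    // Sketch: flushes and compactions run on threads RocksDB creates inside
    // the embedding process. These knobs size that in-process thread pool;
    // the values are arbitrary examples.
    rocksdb::Options MakeOptions() {
      rocksdb::Options options;
      options.create_if_missing = true;
      options.IncreaseParallelism(4);   // threads for flushes + compactions
      options.max_background_jobs = 4;  // cap on concurrent flush/compaction jobs
      return options;
    }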
> To find a specific key, we could use binary search on the SST file blocks.<p>I don't think that's possible unless both key and value are fixed size (which is not the case in the example shown).