
Ask HN: Books on designing disk-optimized data structures?

103 points by memset, over 2 years ago

Are there canonical books, resources, or readings on how to design data structures that will primarily be read from and written to a *disk* rather than memory? Most of what I learned in school about big-O assumes, for example, that random access is O(1). However, random disk reads are really slow because of poor spatial locality.

People who write databases obviously have solutions to this problem; for example, DuckDB is based on a number of papers that have come out over the years on this topic.

If I wanted to design, e.g., a tree structure intended to be read/written from disk, are there general principles or patterns that have been developed to take advantage of locality of reference, minimize random reads, or decrease the overhead of writes, that I could familiarize myself with?

What is the CLRS for disk?

19 comments

tlb, over 2 years ago

These days, most people need the opposite. They're probably using systems optimized for spinning disks but running them on flash, and all the layers of complexity added over the years to optimize disk latency are just slowing things down.

Like, this is the kind of bullshit people used to do research on: https://tlb.org/docs/usenixw95.pdf (I'm an author). This paper (and hundreds of others) exists only because sequential reads & writes on disks were much faster than random access, because you had to (a) move the head, and (b) wait for the sector you're interested in to come around. At 7200 RPM, that wait averages about 4.2 milliseconds, so it was worth burning tens of thousands of CPU instructions trying to sort read requests into an order that might avoid waiting for another rotation. Many popular disk controllers couldn't read sequential sectors, so it was better to sort write requests so that they hit every second sector or something. Madness.

Anyway, today there are only three kinds of secondary storage worth caring about: flash (sector granularity, but no seeks), cloud (like S3), and archive (like Glacier).

But if you're working in some old-timey environment and really need to know about optimizing for spinning disks, most of the ideas were published in the late 80s and 90s. This book (McKusick et al.) is excellent and describes a filesystem that works well on a wide variety of workloads: https://www.amazon.co.uk/Design-Implementation-Operating-Addison-Wesley-Systems/dp/0201549794. Or see the references in the Blackwell paper above.
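The rotation-wait arithmetic above is easy to reproduce. A quick back-of-the-envelope sketch (pure arithmetic, no real hardware assumed; the 1 GHz clock is an illustrative number):

```python
# Back-of-the-envelope rotational latency for a 7200 RPM spinning disk.
RPM = 7200
rotation_ms = 60_000 / RPM        # one full rotation: ~8.33 ms
avg_wait_ms = rotation_ms / 2     # average wait for a sector: ~4.2 ms

# Even at a modest 1 GHz, that average wait is millions of CPU cycles,
# which is why burning tens of thousands of instructions on request
# ordering was a clear win.
cycles_per_wait = int(1e9 * avg_wait_ms / 1000)
```

This is where the "4.2 milliseconds" figure comes from, and why disk schedulers could afford so much bookkeeping per request.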
chubot, over 2 years ago

Taking into account the comment by tlb@ about "do you really want that?":

If you really do, a good keyword to search for is "external memory algorithms" (also "secondary storage"), e.g. this 2001 survey:

*External Memory Algorithms and Data Structures: Dealing with Massive Data*

https://users.cs.duke.edu/~reif/courses/alglectures/vitter.papers/Vit.IO_survey.pdf

I think there is at least one other survey out there with more references. Of course, not all of these algorithms have actually been tried in production :) which is what I tend to look for.

And this is all pre-cloud, which changes things a lot for "massive data".
avl999, over 2 years ago

*Designing Data Intensive Applications*, specifically chapters 3 and 4, which cover strategies and algorithms for storing and encoding data on disk, along with their pros and cons.

Once you've read that, I'd suggest reading the source of a simple embedded key-value database. I wouldn't bother with RDBMSs, as they are complex beasts and contain far more than you need. BoltDB is a good project to read: https://github.com/boltdb/bolt. The whole thing is <10k lines of code, yet it's a full-blown production-grade system with ACID semantics, so it packs a lot into those 10k lines and isn't merely a toy.
pknerd, over 2 years ago

You might like to visit the code of your favorite DB/storage engine and see how it works. During the COVID lockdown a couple of years ago, I started exploring Golang and produced two libraries, GoCache and Fehrist, to understand the algorithms used by Memcache and Elasticsearch respectively. Blog posts about them:

- http://blog.adnansiddiqi.me/gocache-lru-cache-implementation-in-go/
- http://blog.adnansiddiqi.me/fehrist-document-indexing-library-in-go/
nindalf, over 2 years ago

*Designing Data Intensive Applications* is the book you want. The first section covers the data structures used to store data on disk in detail, specifically B-trees and LSM trees, and how various databases use them.

If you want to dive deeper into the subject of persistence, *Operating Systems: Three Easy Pieces* has a section devoted to it.
aslak, over 2 years ago

This is definitely a case where you should look into traditional disk-oriented DBMS architecture.

Google the two CMU DB courses, intro and advanced; they have the required material and references. It's a case where practice > theory.

For example, to reduce random reads in a B+tree (the data structure used for indexes), you leave room in each node for the index data to grow, so the DBMS doesn't need to allocate a new page immediately (that new page would be a random, non-sequential access on a later read). Google "index fragmentation" to find out more.
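The "leave room to grow" idea can be sketched in a few lines. This is a toy illustration; the node capacity and fill factor below are made-up numbers, not any particular DBMS's defaults:

```python
ORDER = 8    # max keys per leaf node (page); illustrative only
FILL = 0.7   # target fill factor: deliberately leave ~30% slack per page

def bulk_load_leaves(sorted_keys):
    """Pack sorted keys into B+tree leaf pages at ~70% full instead of
    100%, so later inserts land in existing pages rather than forcing
    an immediate split and a new page (which would be a random,
    non-sequential read later)."""
    per_page = max(1, int(ORDER * FILL))   # 5 keys per page here
    return [sorted_keys[i:i + per_page]
            for i in range(0, len(sorted_keys), per_page)]
```

For example, `bulk_load_leaves(list(range(12)))` yields three pages of at most 5 keys each, every one with free slots for future inserts.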
kadoban, over 2 years ago

Just to spice things up a bit, you may be interested in "cache-oblivious" algorithms and data structures. The name is a bit counter-intuitive: they are data structures that care about memory hierarchies without encoding the specifics of block sizes or cache lines.
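A concrete instance of the idea is the classic cache-oblivious matrix transpose: it recursively halves the larger dimension, so at some recursion depth the tiles fit whatever cache or block size the machine happens to have, without the code ever naming that size. A minimal sketch (the base-case threshold of 4 is arbitrary for illustration):

```python
def transpose(a, b, r0=0, r1=None, c0=0, c1=None):
    """Cache-oblivious transpose: write a[r][c] into b[c][r] by
    recursively splitting the larger dimension until tiles are tiny.
    No block size or cache-line size appears anywhere in the code."""
    if r1 is None:
        r1, c1 = len(a), len(a[0])
    if (r1 - r0) * (c1 - c0) <= 4:           # tiny tile: transpose directly
        for r in range(r0, r1):
            for c in range(c0, c1):
                b[c][r] = a[r][c]
    elif r1 - r0 >= c1 - c0:                 # split the row range
        rm = (r0 + r1) // 2
        transpose(a, b, r0, rm, c0, c1)
        transpose(a, b, rm, r1, c0, c1)
    else:                                    # split the column range
        cm = (c0 + c1) // 2
        transpose(a, b, r0, r1, c0, cm)
        transpose(a, b, r0, r1, cm, c1)
```

The same divide-until-it-fits recursion underlies cache-oblivious B-trees (the van Emde Boas layout) mentioned in the literature.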
boulos, over 2 years ago

Edit: duh, a sibling comment (https://news.ycombinator.com/item?id=32965989) *did* say B-trees and LSMs an hour ago... sorry.

Btw, because nobody has said it explicitly yet: you should start with B-trees [1]. As you guessed, the database folks were the primary "we have to deal with spinning disk and it really matters to be fast" people.

There's also the Log-Structured <Noun> universe of things that 'tlb is pointing you at. Roughly, you turn random writes into appends (to a "log"), often followed by periodic compaction to make reads reasonable (since in-place updates to a "file" are now appended at the end of the log, you've suddenly made reads worse).

[1] https://en.m.wikipedia.org/wiki/B-tree
denniskubes, over 2 years ago

I think you may be interested in file structures. There are relatively few books published, and they are all 20+ years old, but they describe the kind of on-disk data structures that power many of today's databases and datastores. For search-specific file structures, look for information retrieval books and one title called *Managing Gigabytes*.
adamgordonbell, over 2 years ago

*Database Internals*, published by O'Reilly, is specifically about this, at least for large parts of the book. It's by a Cassandra contributor but covers how various databases implement their storage layers.
efficax, over 2 years ago

https://www.google.com/books/edition/File_Structures/cqwrnwEACAAJ?hl=en is a classic in the genre. However, it predates SSDs, which make structures like LSM trees and other append-only structures far better suited to write-heavy workloads (for read-heavy workloads, the B-tree is still the order of the day!).
tanelpoder, over 2 years ago

I bought the *Algorithms and Data Structures for Massive Datasets* book recently but haven't had a chance to read it yet. "Massive" datasets should automatically imply using (mostly) on-disk data structures...

Edit: add link: https://www.manning.com/books/algorithms-and-data-structures-for-massive-datasets
alexott, over 2 years ago

Have you seen *Database Internals*? https://www.databass.dev/
narag, over 2 years ago

Sorry if it's dated; I had this in college (a long, long time ago):

https://www.amazon.com/File-Structures-2nd-Michael-Folk/dp/0201557134

It was very good, describing B-trees and their variants.
DamonHD, over 2 years ago

It may be a bit obvious, but (books/papers on) disk filesystems themselves, e.g. UFS or ext4.
jmpman, over 2 years ago

Assuming you're looking for physical HDD advice: most opt for a B-tree with the data stored in a scan-optimal extent, large enough that it can be read with sizable disk-spindle efficiency (shoot for 75%+, which today equates to around 16 MB on random reads), while keeping the metadata in a higher cache tier, typically RAM or SSD. Recently, Log-Structured Merge arrangements have become more common, but the choice between the two comes down to the read/write ratio, with LSM favored by write-heavy workloads.

I'd recommend reading up on the WAFL filesystem design, ARIES, and WAL. The real race horses of this space today are HPC file systems; check out the Top 500 file systems. The new Chinese file systems have some impressive metrics, beating the Intel ones significantly; I believe they take advantage of RocksDB (LSM).

At a higher level, the outer diameter of an HDD is significantly faster than the inner diameter for sequential reads/writes, and LBA addressing starts at the outer diameter. Systems typically maintain access metrics on their data and relocate hotter data to the outer tracks and colder data to the inner ones.

While accessing the drive, if you are optimizing for reads, an adaptive prefetch algorithm will maintain sequential disk access patterns without wasting the head's time dwelling over data that will be discarded.

If you have a battery-backed write-back cache, you have the luxury of optimizing writes to the disk. To maintain optimal disk write performance, keep your writes ordered and present them to the disk with a high queue depth. Ideally, take advantage of HDD queues and send latency-sensitive reads to a higher-priority queue, or to the head of the queue. Additionally, with a write-back cache, you have the opportunity to enable the write cache on the HDD. Each HDD vendor/model has a slightly different write-cache handling mechanism, so I'd recommend testing.

I've been impressed with PingCAP's use of some of the new algorithms. Check out their YouTube architecture overview talks, which detail exactly which papers/algorithms they're using.

If your goal is to use Shingled Magnetic Recording drives, these optimizations become even more complex, with best practices not yet fully defined.
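The "present writes in an ordered manner" advice is essentially elevator scheduling. A minimal C-SCAN-style sketch, assuming simple integer LBAs (real schedulers also weigh request age and priority):

```python
def cscan_order(pending_lbas, head_lba):
    """Order pending requests C-SCAN style: sweep upward from the
    current head position in ascending LBA order, then wrap around to
    the lowest outstanding LBA. This keeps the head moving in one
    direction instead of zigzagging across the platter."""
    ahead = sorted(lba for lba in pending_lbas if lba >= head_lba)
    behind = sorted(lba for lba in pending_lbas if lba < head_lba)
    return ahead + behind
```

With the head at LBA 50 and requests at 10, 95, 30, and 70, the sweep services 70 and 95 on the way out, then wraps to 10 and 30.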
dapids, over 2 years ago

I think most commenters are completely glossing over constrained systems, namely embedded systems.

Sure, on higher-performance systems you will be dealing with bigger demons, such as cache and TLB performance, depending on data size. But many embedded systems are performance-limited for cost and power reasons. There is nothing here the cloud will solve, nor anything other than more expensive NAND flash, which requires more power and money. Hence, designing algorithms for critical data throughput is not as simple as reaching for the cloud or a filesystem.
trasz, over 2 years ago
Anything that describes external sort algorithms perhaps?
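For a concrete sense of what such references cover, here is a minimal external merge sort: sort runs that fit in memory, spill each run to a temporary file, then stream-merge the runs. This is a sketch; real implementations read and write in large aligned blocks rather than line by line:

```python
import heapq
import itertools
import tempfile

def external_sort(items, run_size=1000):
    """Sort an iterable too large for memory: spill sorted runs to
    temp files, then k-way merge them, holding only one item per run
    in memory during the merge."""
    runs = []
    it = iter(items)
    while True:
        run = sorted(itertools.islice(it, run_size))  # one in-memory run
        if not run:
            break
        f = tempfile.TemporaryFile(mode="w+")         # spill run to disk
        f.writelines(f"{x}\n" for x in run)
        f.seek(0)
        runs.append(f)

    def read_run(f):
        for line in f:
            yield int(line)

    # heapq.merge streams the sorted runs without loading them whole.
    return list(heapq.merge(*(read_run(f) for f in runs)))
```

The number of passes over the data, as a function of memory size and fan-in, is exactly what the external-memory literature analyzes.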
rramadass, over 2 years ago

Related oldie (but not a DS book): *Memory Systems: Cache, DRAM, Disk* by Bruce Jacob et al.