PGM Indexes: Learned indexes that match B-tree performance with 83x less space

583 points, by hbrundage, over 4 years ago

23 comments

jltsiren, over 4 years ago

Some context: A few years ago, there was a paper from Google (https://dl.acm.org/doi/10.1145/3183713.3196909) that made learned data structures popular for a while. They started from the idea that indexes such as B-trees approximate an increasing function with one-sided error. By using that perspective and allowing two-sided error, they were able to make the index very small (and consequently quite fast).

Many data structure researchers got interested in the idea and developed a number of improvements. The PGM-index is one of those. Its main idea is to use piecewise linear approximations (which can be built in a single quick pass over the data) instead of the machine learning black box the Google paper was using.
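A minimal Python sketch of the piecewise-linear idea described above (a greedy simplification for illustration, not the authors' optimal single-pass construction and not the pgm-index library API): fit line segments over the sorted keys so that every key's predicted rank is within ε of its true rank, then answer a lookup by binary-searching only the small window around the prediction.

```python
import bisect
import random

EPSILON = 4  # maximum allowed error of a predicted rank, in array slots


def build_segments(keys, eps=EPSILON):
    """Greedily split sorted, distinct keys into linear segments mapping key -> rank.

    Quadratic and greedy, purely for illustration; the real PGM-index builds an
    optimal (fewest-segments) piecewise linear approximation in one linear pass.
    """
    segments = []  # (first_key, first_rank, slope)
    n, start = len(keys), 0
    while start < n:
        good_slope, end = 0.0, start
        while end + 1 < n:
            cand = end + 1
            slope = (cand - start) / (keys[cand] - keys[start])
            # every key covered so far must stay within eps of the line
            within_eps = all(
                abs(start + (keys[i] - keys[start]) * slope - i) <= eps
                for i in range(start, cand + 1)
            )
            if not within_eps:
                break
            good_slope, end = slope, cand
        segments.append((keys[start], start, good_slope))
        start = end + 1
    return segments


def lookup(keys, segments, q, eps=EPSILON):
    """Return the rank of an existing key q: predict, then search a 2*eps window."""
    first_keys = [s[0] for s in segments]
    i = max(bisect.bisect_right(first_keys, q) - 1, 0)
    first_key, first_rank, slope = segments[i]
    guess = int(first_rank + (q - first_key) * slope)
    lo, hi = max(guess - eps, 0), min(guess + eps + 1, len(keys))
    j = bisect.bisect_left(keys, q, lo, hi)
    return j if j < len(keys) and keys[j] == q else None


if __name__ == "__main__":
    random.seed(42)
    keys = sorted(random.sample(range(1_000_000), 5_000))
    segs = build_segments(keys)
    print(f"{len(keys)} keys covered by {len(segs)} segments")
    assert all(lookup(keys, segs, k) == r for r, k in enumerate(keys))
```

With ε = 4 and 5,000 random keys this typically produces a few hundred (key, rank, slope) triples; the actual PGM-index then recurses on the segments' first keys to build a multi-level structure, which this sketch omits.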
inciampati, over 4 years ago

This is a major practical advance from the succinct data structure community.

This community has produced so many brilliant results in the past years. But they work in the shadows. Since the rise of interest in neural network methods, I've often described their work as "machine learning where epsilon goes to 0." It's not sexy, but it is extremely useful.

For instance, Ferragina previously helped to develop the FM-index that enabled the sequence alignment algorithms used for the primary analysis of short genomic reads (100-250bp). These tools were simply transformative, because they reduced the amount of memory required to write genome mappers by orders of magnitude, allowing the construction of full-text indexes of the genome on what was then (~2009) commodity hardware.
joe_the_user, over 4 years ago

I don't get it. I've implemented B-trees. The majority of the space used by a B-tree is the data itself. Each N-ary leaf of the tree is basically a vector of data with maybe some bookkeeping at the ends. The leaves are more than half of the tree.

Sure, you can compress the data. But that depends on the data: completely random data can't be compressed, other data can be. But a point-blank 83x space claim seems bizarre, or it's comparing to a very inefficient implementation of a B-tree.

Edit: It seems the 83x claim is a product of the HN submission; I could not find it on the page. But even so, the page should say something like "a compressed index that allows full-speed lookup" (akin to succinct data structures), and then it would make sense.
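A rough back-of-the-envelope illustration of that point (all numbers below are assumptions for illustration, not figures from the paper): the keys and values have to be stored either way, so a large space factor can only refer to the index layer that routes a lookup to the right page, not to the leaves holding the data.

```python
# Illustrative assumptions only: contrast the size of the stored data with the
# size of the routing index that sits on top of it.
n = 1_000_000_000                           # keys
key_bytes = 8
page_bytes = 4096
keys_per_page = page_bytes // key_bytes     # 512 keys per leaf page

data_gib = n * key_bytes / 2**30
# B-tree: roughly one router entry (key + child pointer) per leaf page.
btree_router_mib = (n // keys_per_page) * (key_bytes + 8) / 2**20
# PGM: one small segment (key, slope, intercept) per run of near-linear keys;
# the run length of 50_000 below is a made-up value purely for illustration.
pgm_segment_bytes = 24
pgm_index_mib = (n // 50_000) * pgm_segment_bytes / 2**20

print(f"data itself:        {data_gib:8.1f} GiB")
print(f"B-tree router keys: {btree_router_mib:8.0f} MiB")
print(f"PGM segments:       {pgm_index_mib:8.1f} MiB")
```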
thesz, over 4 years ago

Their slides about the PGM-index (https://pgm.di.unipi.it/slides-pgm-index-vldb.pdf), page 21.

They stop at a page size of 1024 *bytes*, which indicates they tested the in-memory situation. And, what is worse, their compression-ratio advantage almost halves when the block size is doubled. So what about a B-tree with blocks of 16K or even 256K?

Also, what about log-structured merge trees, where bigger levels can use bigger pages and, quite importantly, those bigger levels can be constructed using a (partial) data scan? These bigger levels can (and should) be immutable, which enables simple byte slicing of keys and RLE compression.

So, where's a comparison with more or less contemporary data structures and algorithms? Why beat a half-century-old data structure using settings of said data structure that favor your approach?

My former colleague once said "give your baseline some love and it will surprise you". I see no love for B-trees in the PGM work.
gvinciguerra, over 4 years ago

Hello everyone. I'm Giorgio, the co-author of the PGM-index paper together with Paolo Ferragina.

First of all, I'd like to thank @hbrundage for sharing our work here and also all those interested in it. I'll do my best to answer any doubt in this thread.

Also, I'd like to mention two other related papers:

- "Why are learned indexes so effective?", presented at ICML 20 and co-authored with Paolo Ferragina and Fabrizio Lillo.

PDF, slides and video: http://pages.di.unipi.it/vinciguerra/publication/learned-indexes-effectiveness/

TL;DR: In the VLDB 20 paper, we proved a (rather pessimistic) statement that "the PGM-index has the same worst-case query and space bounds of B-trees". Here, we show that actually, under some general assumptions on the input data, the PGM-index improves the space bounds of B-trees from O(n/B) to O(n/B^2) with high probability, where B is the disk page size.

- "A 'learned' approach to quicken and compress rank/select dictionaries", presented at ALENEX 21 and co-authored with Antonio Boffa and Paolo Ferragina.

PDF and code: http://pages.di.unipi.it/vinciguerra/publication/learned-rank-select/

TL;DR: You can use piecewise linear approximations to compress not only the index but the data too! We present a compressed bitvector/container supporting efficient rank and select queries, which is competitive with several well-established implementations of succinct data structures.
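To make the quoted bounds concrete, a small illustrative calculation with assumed values of n and B (constant factors ignored):

```python
# Plugging assumed numbers into the bounds quoted above, constants ignored.
n = 1_000_000_000   # keys
B = 512             # entries per disk page, e.g. 4 KiB pages of 8-byte keys

btree_bound = n / B      # O(n/B) index entries for a B-tree
pgm_bound = n / B**2     # O(n/B^2) with high probability, per the ICML 20 result

print(f"O(n/B):   ~{btree_bound:>13,.0f}")
print(f"O(n/B^2): ~{pgm_bound:>13,.0f}   # a factor of B = {B} smaller")
```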
zupa-hu, over 4 years ago

The slides: https://pgm.di.unipi.it/slides-pgm-index-vldb.pdf

It seems they are only talking about compressing the index (keys), not the values.

Also, the slides seem to imply the keys need to be inserted in sorted order? That way their memory locations will be in increasing order too. That's quite an important limitation; it means the index is read-only in practice once populated. Though it may still be useful in some cases.

Did I misunderstand?
jabberwcky, over 4 years ago

Only watched the video; I was disappointed by https://youtu.be/gCKJ29RaggU?t=408, where they compare against tiny B-tree page sizes that nothing uses any more - 4k, 16k and 64k are way more common.
whyuselearning, over 4 years ago

Why use learning when you can fit?

http://databasearchitects.blogspot.com/2019/05/why-use-learning-when-you-can-fit.html
etaioinshrdlu, over 4 years ago
How would one (very roughly) approximate what this index does in terms of big-O notation for time and space? Is it the same as a b-tree in time but with linearly less space?
byteshift, over 4 years ago

For a detailed study of learned indexes, see this work: https://vldb.org/pvldb/vol14/p1-marcus.pdf

All code is available in open source: https://github.com/learnedsystems/SOSD
jokoon, over 4 years ago

I've already thought about the idea of using statistics to optimize access time, so I guess this is a viable implementation that does it correctly.

That's pretty amazing... I can somehow imagine this tech landing on every modern computer, allowing users to search for anything that is on their machine.
fulafel, over 4 years ago
Many devs are probably familiar with perfect hashes as the gperf tool seems omnipresent on Linux machines. Is this a related concept? The learning part makes me suspect so but the slopes and interpolation part makes me doubt it.
crazypython, over 4 years ago

This is interesting. Could this be adapted to store 2D data, like how a quadtree is a 2D range tree? (If you link me to a paper / pseudocode for that, I could implement it.) I imagine it would be useful in GIS, gaming, etc.
xucheng, over 4 years ago

Also, a learned index from Microsoft: https://github.com/microsoft/ALEX
legulere, over 4 years ago
In the example they sort the data array. Does that mean this works just on sorted arrays? Insert and delete performance would be horrible I guess.
saurabhnanda, over 4 years ago
Any Postgres implementation of this yet?
midjji, over 4 years ago

Could this be used for, or generalized to, N-dimensional spatial proximity lookup tables?
gigatexal, over 4 years ago
Any chance it will make its way to Postgres?
est, over 4 years ago
Reminds me of TokuDB. What happened to it?
dmos62, over 4 years ago

I've only heard of B-trees in passing. In what kinds of situations are these data structures used?
mooneater, over 4 years ago
And here I was thinking probabilistic graphical models were finally getting the spotlight :)
The_rationalist, over 4 years ago

I wonder if databases and/or browsers will make use of it.
harperlee, over 4 years ago

They should have chosen another name; the acronym PGM already stands for Probabilistic Graphical Model, and the two overlap in possible usages.