> We haven't discussed other solutions like Partitioned Elias-Fano indexes or Tree-Encoded Bitmaps. We simply did not investigate them. There are no efficient off-the-shelf implementations of these solutions, so we had to implement one.<p>I guess they mean there are no implementations <i>in Rust</i>? The Partitioned Elias-Fano (PEF) paper links to the C++ implementation ([1], which I wrote) used for the experiments, it was efficient 9 years ago :) The library was later forked [2] and used in a few other papers, I believe it is somewhat maintained.
The EF core algorithm implemented in folly [3] may be a bit faster, and implementing partitioning on top of that is relatively easy.<p>It would definitely compress much better than roaring bitmaps. In terms of performance, it depends on the access patterns. If very sparse (large jumps), PEF would likely be faster, if dense (visit a large fraction of the bitmap), it'd be slower.<p>It is possible to squeeze a bit more compression out of PEF by introducing a chunk type for Elias-Fano of the chunk complement (for very dense chunks), but you lose the operation of skipping to a given position, which is however not needed in inverted indexes (you only need to skip past a given id, and that can be supported efficiently). That is not mentioned in the paper because at the time I thought the skip-to-position operation was a non-negotiable.<p>[1] <a href="https://github.com/ot/ds2i/">https://github.com/ot/ds2i/</a><p>[2] <a href="https://github.com/pisa-engine/pisa">https://github.com/pisa-engine/pisa</a><p>[3] <a href="https://github.com/facebook/folly/blob/main/folly/experimental/EliasFanoCoding.h">https://github.com/facebook/folly/blob/main/folly/experiment...</a>
Also known as rank/select dictionaries: <a href="https://en.wikipedia.org/wiki/Succinct_data_structure#Succinct_indexable_dictionaries" rel="nofollow noreferrer">https://en.wikipedia.org/wiki/Succinct_data_structure#Succin...</a>
Interesting. I once made something like this for a search engine I built. It turned out to be a very useful data structure, especially when working with large, sparse sets.