I see a common theme among very fast practical algorithms (like timsort or pdqsort): there is no secret math or algorithm, just a bunch of special cases based on heuristics. Often they involve specific knowledge of the hardware instead of treating software as abstract. To me this is the big difference between “computer science” and “computer engineering”.
This is an excellent write-up on Roaring Bitmap. The concrete examples really help to illustrate the algorithm.

While Roaring Bitmap saves space with good performance, my gut feeling is there are better ways to achieve the same goals, especially since it involves sorting on insertion and binary search on lookup, both on the sorted top-level list and on the sorted lower-bit integers.

Also, it works well on 32-bit integers and probably on 64-bit integers, but it's not clear it can scale to bigger integers.

Nevertheless, it's an excellent algorithm that works well in practice.
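To make the concern concrete, the lookup path is roughly a binary search over the 16-bit high keys, then a search inside one small container. A toy sketch of that path (my own simplification, not the library's actual API; the ArrayContainer type and array-only containers here are assumptions):

    #include <stdint.h>
    #include <stdlib.h>

    /* Toy two-level lookup: split a 32-bit value into a 16-bit key (high
       bits) and a 16-bit payload (low bits); containers in this sketch are
       always sorted arrays. */
    typedef struct {
        uint16_t  key;   /* high 16 bits shared by all values in container */
        uint16_t *vals;  /* sorted low 16 bits */
        int       n;
    } ArrayContainer;

    static int cmp_u16(const void *a, const void *b) {
        return (int)*(const uint16_t *)a - (int)*(const uint16_t *)b;
    }

    /* Binary search the top-level container list by key, then the container. */
    int roaring_like_contains(const ArrayContainer *cs, int ncs, uint32_t x) {
        uint16_t key = x >> 16, low = x & 0xFFFF;
        int lo = 0, hi = ncs - 1;
        while (lo <= hi) {                 /* find the container, O(log n) */
            int mid = (lo + hi) / 2;
            if (cs[mid].key < key) lo = mid + 1;
            else if (cs[mid].key > key) hi = mid - 1;
            else return bsearch(&low, cs[mid].vals, cs[mid].n,
                                sizeof(uint16_t), cmp_u16) != NULL;
        }
        return 0;
    }

    /* usage: one container for key 0 holding {1, 5}, then two queries */
    int main(void) {
        uint16_t vals[] = {1, 5};
        ArrayContainer c = {0, vals, 2};
        return !(roaring_like_contains(&c, 1, 5) &&
                !roaring_like_contains(&c, 1, 4));
    }

So a lookup is two logarithmic searches over small, contiguous arrays, which is cheap in practice even if it sounds like a lot of searching.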
Not .bmp bitmaps used to store photographic images, as I would have guessed:

> Bitmaps are arrays of bits used to store sets of integers.

They work by setting the Nth bit when an integer N is in the set.
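In code, for a small universe, that is just bit twiddling; a minimal toy version (mine, not from the article):

    #include <stdint.h>
    #include <stdio.h>

    /* Toy bitmap over the integers 0..63: bit N is set iff N is in the set. */
    int main(void) {
        uint64_t set = 0;
        set |= UINT64_C(1) << 3;                    /* add 3  */
        set |= UINT64_C(1) << 42;                   /* add 42 */
        printf("%d %d\n", (int)((set >> 42) & 1),   /* 1: 42 is in the set */
                          (int)((set >> 7)  & 1));  /* 0: 7 is not         */
        return 0;
    }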
> When a set is sparse, traditional bitmaps compress poorly.

They waste space. But if you were to compress them, they'd compress *very* well. As in, for example, if you were to compress them using roaring bitmaps.
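To put numbers on it: a sparse bitmap is almost entirely runs of zero bytes, which is exactly what run-length schemes (including roaring's run containers) eat for breakfast. A quick sketch (the example set and the byte-level RLE are mine, just to make the point):

    #include <stdint.h>
    #include <stdio.h>

    /* The sparse set {7, 1000000} as a plain bitmap: 1,000,001 bits
       is ~125 KB, almost all zero bytes. */
    enum { NBYTES = 1000001 / 8 + 1 };
    static uint8_t bm[NBYTES];

    int main(void) {
        bm[7 / 8]       |= 1 << (7 % 8);
        bm[1000000 / 8] |= 1 << (1000000 % 8);

        /* Count the runs a trivial byte-level RLE would emit. */
        long runs = 1;
        for (long i = 1; i < NBYTES; i++)
            if (bm[i] != bm[i - 1]) runs++;
        printf("%d bytes raw -> %ld RLE runs\n", NBYTES, runs);
        /* prints: 125001 bytes raw -> 3 RLE runs */
        return 0;
    }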
I thought the trick was going to be indirection.

Sorted-list/bitmap/runs, in a two-level tree. Cool. (A sketch of the choice between the three is below.)

It's technically missing a sorted complement-list, i.e. one holding only the items that are missing. But in the worst case, runs use only twice as much space, so that would be more complexity without much savings, especially considering real workloads.

It performs better with sequential ids than random uuids because you use fewer pages.

Further research:

- investigating the effect of adding more layers to the tree

- using additional set representations, e.g. a balanced tree
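For reference, the sorted-list/bitmap/runs choice falls out of the sizes: a 2^16 universe as a bitmap costs a fixed 8 KB, so 16-bit sorted arrays win below 4096 elements and runs win when there are few of them. A sketch of that decision (thresholds as described in the write-up; the enum and function are my own guess, not the library's code):

    #include <stdint.h>

    enum ContainerKind { ARRAY, BITMAP, RUNS };

    /* Pick the cheapest representation for one 2^16-value chunk.
     * array:  2 bytes/element            -> wins while n <= 4096 (<= 8 KB)
     * bitmap: fixed 65536 bits = 8 KB
     * runs:   4 bytes/run (start,length) -> wins when there are few runs
     */
    enum ContainerKind pick_container(uint32_t n_elements, uint32_t n_runs) {
        uint32_t array_bytes  = 2 * n_elements;
        uint32_t bitmap_bytes = 65536 / 8;
        uint32_t run_bytes    = 4 * n_runs;
        if (run_bytes < array_bytes && run_bytes < bitmap_bytes) return RUNS;
        return (array_bytes <= bitmap_bytes) ? ARRAY : BITMAP;  /* n <= 4096 */
    }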
Those who do not know judy1 are doomed to reinvent it.

http://judy.sourceforge.net/
Bitmaps are amazing, and we believe that they will replace all other data formats for analytics.

Today we open-sourced about $25m in R&D (more to come)! They take a bit of harnessing, but once you do, the results are amazing. We have released an OLAP database called FeatureBase, built entirely on bitmaps (well, bitmaps in a b-tree)...

Here is a bit (pun intended) more about our big bet on bitmaps: https://www.featurebase.com/blog/bitmaps-making-real-time-analytics-real
Can someone explain how this is laid out in memory and what it would look like serialized? It's easy to have an array of size N that fits all bits and just write it to disk. You can also mmap a file as a large array of bytes, mutate it, and let the OS handle the disk syncs, if you're okay with losing some recent state changes when the machine crashes.

What would an array of pointers to X containers, each of size N, look like on disk?
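My best guess at a layout would be something like the following (purely hypothetical, all field names are mine; the actual Roaring format spec should be checked before relying on any of this): a fixed-size directory of container keys and types up front, then offsets, then payloads back to back, so a reader can mmap the file and binary-search the directory without touching the payloads.

    #include <stdint.h>

    /* Hypothetical serialized layout:
     *
     *   [magic u32][n_containers u32]
     *   n * [key u16][type u8][pad u8][cardinality u32]  <- directory
     *   n * [offset u32]                                 <- payload positions
     *   payloads: arrays as raw u16[], bitmaps as raw u64[1024],
     *             runs as (start,length) u16 pairs
     *
     * Fixed-size directory entries mean a reader can binary-search by key
     * directly in the mapped file, then jump to offset[i] for the payload.
     */
    #pragma pack(push, 1)
    typedef struct {
        uint16_t key;         /* high 16 bits of the values in this container */
        uint8_t  type;        /* 0 = array, 1 = bitmap, 2 = runs */
        uint8_t  pad;
        uint32_t cardinality; /* lets readers size reads / skip containers */
    } DirEntry;
    #pragma pack(pop)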
I enjoyed this walkthrough. I'm interested in Roaring Bitmaps because they're used in Lucene, but I had never dug in and assumed (without much thought) that the trick had to do with consecutive runs of 1s and 0s. Wrong!
It looks a bit too simplistic: blocks with up to 4k items are stored as sorted lists. Having to insert into a sorted list of up to 4k items seems very expensive to me.
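Though thinking about it, 4096 16-bit values is only 8 KB, so the insert is one binary search plus one memmove within a couple of pages of contiguous, cache-friendly memory. A sketch (my code, assuming the sorted-array container the post describes):

    #include <stdint.h>
    #include <string.h>

    /* Insert into a sorted array container (capacity 4096 * 2 bytes = 8 KB).
     * Binary search finds the slot; one memmove shifts the tail. Returns 0
     * if the value was already present or the container is full. */
    int array_container_add(uint16_t *a, int *n, uint16_t v) {
        int lo = 0, hi = *n;
        while (lo < hi) {               /* lower bound, O(log n) compares */
            int mid = (lo + hi) / 2;
            if (a[mid] < v) lo = mid + 1; else hi = mid;
        }
        if (lo < *n && a[lo] == v) return 0;  /* already in the set */
        if (*n >= 4096) return 0;             /* caller converts to bitmap */
        memmove(&a[lo + 1], &a[lo], (*n - lo) * sizeof(uint16_t));
        a[lo] = v;
        (*n)++;
        return 1;
    }

It's still O(n) moves per insert in the worst case, which is presumably exactly why the format flips to a plain bitmap past 4096 elements.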
I love roaring bitmaps and maybe even overuse this library. On the other hand, it made a lot of things possible for legacy.uniprot.org over the years that would have been unaffordable otherwise. Quick faceting using permanent filters over 500+ million documents would have been really hard without it.