Wavelet trees are awesome. There are several ways to speed up the rank/select operations that the article doesn't mention. For example, you can build a simple (popcount, offset) index for each bitmap. Then when you need to find the nth set bit (the select operation), you can do a binary search over the index to find the offset with the nearest cumulative count. Likewise, for access (or the rank operation) you can look up the popcount at the nearest sampled offset in the index and only scan the remaining bits.

I use wavelet trees heavily in my job. They are very useful as compact representations of order-preserving collections of strings that support fast lookups.

Update: Sux4j has some interesting implementations of fast select/rank:

http://sux.dsi.unimi.it/

http://sux4j.dsi.unimi.it/docs/it/unimi/dsi/sux4j/bits/Select.html
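A minimal sketch of that (popcount, offset) index idea, assuming an LSB-first uint64_t bitmap with one cumulative count stored every SAMPLE_WORDS words; the names and sampling rate are made up for illustration:

    #include <stdint.h>
    #include <stddef.h>

    #define SAMPLE_WORDS 8   /* one index entry per 512 bits */

    typedef struct {
        const uint64_t *bits;      /* the raw bitmap */
        size_t          nwords;
        uint64_t       *cum;       /* cum[i] = popcount of bits[0 .. i*SAMPLE_WORDS) */
        size_t          nsamples;
    } indexed_bitmap;

    /* rank1(pos): number of 1 bits in positions [0, pos) */
    static uint64_t rank1(const indexed_bitmap *b, size_t pos)
    {
        size_t word = pos / 64, block = word / SAMPLE_WORDS;
        uint64_t r = b->cum[block];
        for (size_t w = block * SAMPLE_WORDS; w < word; w++)
            r += (uint64_t)__builtin_popcountll(b->bits[w]);
        if (pos % 64)   /* partial word: count only its low pos%64 bits */
            r += (uint64_t)__builtin_popcountll(b->bits[word] << (64 - pos % 64));
        return r;
    }

    /* select1(n): position of the n-th 1 bit (1-based, n >= 1); binary search
       the sampled counts, then scan forward. Returns (size_t)-1 if out of range. */
    static size_t select1(const indexed_bitmap *b, uint64_t n)
    {
        size_t lo = 0, hi = b->nsamples;       /* find last sample with cum[] < n */
        while (lo + 1 < hi) {
            size_t mid = (lo + hi) / 2;
            if (b->cum[mid] < n) lo = mid; else hi = mid;
        }
        uint64_t seen = b->cum[lo];
        for (size_t w = lo * SAMPLE_WORDS; w < b->nwords; w++) {
            uint64_t c = (uint64_t)__builtin_popcountll(b->bits[w]);
            if (seen + c >= n)                 /* the n-th 1 bit is in this word */
                for (size_t bit = 0; bit < 64; bit++)
                    if (((b->bits[w] >> bit) & 1) && ++seen == n)
                        return w * 64 + bit;
            seen += c;
        }
        return (size_t)-1;
    }

With a 64-bit count every 512 bits the index adds about 12.5% on top of the bitmap, and the scan between samples stays bounded by SAMPLE_WORDS words.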
What kind of compressed data structure I would use depends heavily on so many things (the probability distribution of the data, the probability distribution of the different access patterns) that it is very hard to design a data structure for the generic case. Designing for a specific use case is a lot of fun though!

For example, this:
"Use less memory than an array in all circumstances"
is just plain impossible if your data distribution is uniform. You just cannot store 1000 random 32-bit integers (preserving their order) in fewer than 1000*32 bits; this follows from simple combinatorial considerations.
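To make the counting argument concrete (my own back-of-the-envelope figures, not from the article): there are (2^32)^1000 = 2^32000 equally likely order-preserving inputs, while there are fewer than 2^32000 distinct bit strings shorter than 32000 bits, so any lossless encoding must spend at least 1000*32 bits on some input; for a uniform distribution the average length is essentially the same.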
So wait... You have 64-bit keys and 32-bit values. Then why not store the values in the keys? Use the lower 32 bits for the value, and the higher 32 bits to count how many duplicates you have.

    #include <stdint.h>
    #include <stdlib.h>

    /* one 32-bit counter per possible value: 2^32 * 4 bytes = 16 GB,
       allocated once with calloc(1ULL << 32, sizeof *big_array) */
    static uint32_t *big_array;

    /* returns the key */
    uint64_t write(uint32_t value)
    {
        big_array[value]++;
        return ((uint64_t)big_array[value] << 32) | value;
    }

    /* returns the value */
    uint32_t read(uint64_t key)
    {
        return (uint32_t)key;
    }

    /* all counts between 1 and the return value are valid higher-half
       values for the key. The lower half is the value itself. */
    uint32_t seek(uint32_t value)
    {
        return big_array[value];
    }
memory: 16 GB

access: O(1)

seek: O(1)

Problem solved!

:)

edit: I'm missing the "Use less memory than an array in all circumstances" requirement, but hey, you talk about "billions of records" in your post.
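Hypothetical usage of the sketch above (the figures in the comments are illustrative):

    /* storing the same value twice yields two distinct keys that both
       read back as the original value */
    int main(void)
    {
        big_array = calloc(1ULL << 32, sizeof *big_array);   /* ~16 GB */
        if (!big_array) return 1;
        uint64_t k1 = write(42);   /* upper half 1, lower half 42 */
        uint64_t k2 = write(42);   /* upper half 2, lower half 42 */
        /* read(k1) == 42, read(k2) == 42, seek(42) == 2 */
        return 0;
    }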
Bit confused here --

The author rewrites the initial values as:

1 0 2 0 2 3

He defines the keys array as:

2 3 4 6

Using the method defined in the article to find the values of each of the elements of the rewritten array, wouldn't the values be:

3 2 4 2 4 6

which does not equal the original values of:

3 2 4 6 2 6
Depending on your data set, there might be a way to use Compressed Bit Vectors: http://sdm.lbl.gov/fastbit/

General intro: http://crd-legacy.lbl.gov/~kewu/ps/LBNL-2164E.pdf

"Word Aligned Hybrid" compression (a rough sketch of the idea follows below): http://crd-legacy.lbl.gov/~kewu/ps/LBNL-49627.pdf

Bit maps as an alternative to an inverted index: http://crd-legacy.lbl.gov/~kewu/ps/LBNL-61768.pdf

Bit maps for range queries: http://crd-legacy.lbl.gov/~kewu/ps/LBNL-60891.pdf
I've seen slab allocation schemes (not sure if that's a standard term for it) work in scenarios like inverted indices where you want to keep your datum key size small (32- vs. 64-bit).

The intuition is that you don't need granular memory addressing for storing large chunks. You only need addressability at the min(sizeof(doc)) level, so you can usually store the key in a 32-bit int or less.

More detail can be found here: https://gist.github.com/1947190
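A tiny sketch of that intuition, with a made-up 256-byte granularity: if every record is allocated on a fixed-size boundary, a 32-bit slot index can address 2^32 * 256 bytes = 1 TiB.

    #include <stdint.h>

    #define SLAB_GRANULARITY 256u   /* hypothetical minimum record size */

    /* byte offset of the record stored in a given 32-bit slot */
    static inline uint64_t slot_to_offset(uint32_t slot)
    {
        return (uint64_t)slot * SLAB_GRANULARITY;
    }

    /* 32-bit slot index for a record starting at a given byte offset */
    static inline uint32_t offset_to_slot(uint64_t offset)
    {
        return (uint32_t)(offset / SLAB_GRANULARITY);
    }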
A very interesting data structure. Reading this inspired me to give an implementation a try. It was a lot harder than it seemed at first and it's certainly a naive attempt, but if anyone is interested you can take a look at it here:

https://gist.github.com/1947371
So you have 64-bit integer keys and 64-bit integer values, and you need to store these key-value pairs such that you get good lookups from key to list of values and from value to list of keys?

How often are you updating? What is reading from this data structure? What's the query structure?

If you're adding lots of records often, I'd go for a B+tree. There's also a trie (and a B-trie) if your keys share lots of common prefixes.

If you are making consecutive queries, then you'd do well to take advantage of what pages are in L1 and L2 cache, but if you're randomly picking things, that's less important.
*(particularly in the “refine” and “access” methods, where utilizing the popcnt SSE4.2 instruction would no doubt have a massive impact on performance.)*
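For reference, this is roughly what a hardware popcount looks like in a within-word rank step (a generic illustration, not the article's code); with GCC/Clang and -mpopcnt or -msse4.2, __builtin_popcountll compiles down to the popcnt instruction:

    #include <stdint.h>

    /* number of 1 bits among the low nbits bits of word, for 0 <= nbits <= 63 */
    static inline unsigned rank_in_word(uint64_t word, unsigned nbits)
    {
        return (unsigned)__builtin_popcountll(word & ((1ULL << nbits) - 1));
    }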