I recently discovered "compact approximators", which can be seen as a generalisation of Bloom filters. Instead of telling you "element X is probably in the set", they tell you a one-sided bound on the value associated with key X (an upper bound, in the usual max-on-write, min-on-read construction: collisions can only push the answer up, never down).

If there are no collisions, you get the true value. The data structure also doesn't "fill up" the way a Bloom filter does; the older data just gets probabilistically worse, so in some use cases you can keep using the one structure continuously.

My particular use case (which led me to "invent" the structure and then discover that of course it's already been described in the literature, haha) is implicit cache invalidation. The approximator structure(s) store keys like "latest time member X was removed from group Y". I can then use that to lazily decide which cached values I can still use, instead of doing a potentially expensive search for what to invalidate at the moment the member is removed. An inflated answer just means an occasional unnecessary cache miss, never a stale read.

I'm new to using them, so I keep getting excited about new use cases, much like when I was first introduced to Bloom filters.
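To make that concrete, here's a minimal Go sketch of the construction as I understand it, applied to the removal-timestamp use case (all names and sizes are made up):

    package main

    import (
        "fmt"
        "hash/maphash"
    )

    // Approximator: like a Bloom filter, but each slot holds a value
    // (here a timestamp) instead of a bit. Writes take the max, reads
    // take the min, so a read returns a value >= the true one and
    // collisions can only inflate the answer.
    type Approximator struct {
        slots []uint64
        seeds []maphash.Seed
    }

    func NewApproximator(numSlots, numHashes int) *Approximator {
        a := &Approximator{slots: make([]uint64, numSlots)}
        for i := 0; i < numHashes; i++ {
            a.seeds = append(a.seeds, maphash.MakeSeed())
        }
        return a
    }

    func (a *Approximator) Update(key string, value uint64) {
        for _, seed := range a.seeds {
            i := maphash.String(seed, key) % uint64(len(a.slots))
            if a.slots[i] < value {
                a.slots[i] = value
            }
        }
    }

    // Query returns an upper bound on the largest value stored for key.
    func (a *Approximator) Query(key string) uint64 {
        min := ^uint64(0)
        for _, seed := range a.seeds {
            i := maphash.String(seed, key) % uint64(len(a.slots))
            if a.slots[i] < min {
                min = a.slots[i]
            }
        }
        return min
    }

    func main() {
        removals := NewApproximator(1024, 4)
        removals.Update("member:42/group:7", 1700000000) // removal at t=1700000000
        cachedAt := uint64(1700000500)
        // Entry is safe iff it was computed after the (over-estimated) last
        // removal; collisions cause extra invalidation, never staleness.
        fmt.Println("still valid:", removals.Query("member:42/group:7") <= cachedAt)
    }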
Would recommend reading the RocksDB implementations of the Bloom filter and the ribbon filter to anyone wanting to learn more about the production-level implementation side. They are extremely well explained in the comments and are, as far as I know, the state of the art.

https://github.com/facebook/rocksdb/blob/main/util/bloom_impl.h

https://github.com/facebook/rocksdb/blob/main/util/ribbon_alg.h
There's a useful but non-obvious property of these filters.

If you have two Bloom filters that use the same set of possible keys and the same filter setup, you can compute the intersection of their represented sets with a bitwise AND over the filters' bitsets.

If you store the bitsets aligned and padded to the SIMD vector size (for example, an array of __m128i in C++, as opposed to the array of uint64 values in the golang example), computing such an intersection is very efficient on modern computers.

I once used that technique to filter 3D objects from a large set using the intersection of 3 range queries, one per coordinate. Compute two Bloom filters filled with the objects returned by the X and Y queries, intersect the filters, then iterate over the objects returned by the remaining Z query, testing each against the intersected Bloom filter. Due to the probabilistic nature of Bloom filters you still need to re-check objects that pass the filter against the actual X and Y queries, but with a proper setup the filter should reject the vast majority of them.
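Here's a scalar Go sketch of the trick (toy filter, made-up names); the SIMD version is the same AND loop over wider registers. The two filters must share size and seeds for the intersection to be meaningful:

    package main

    import (
        "fmt"
        "hash/maphash"
    )

    // Toy Bloom filter backed by 64-bit words. Pad/align the word array
    // to the SIMD register width and the AND loop below vectorises to
    // 2-8 words per instruction.
    type Bloom struct {
        words []uint64
        seeds []maphash.Seed
    }

    func NewBloom(numWords int, seeds []maphash.Seed) *Bloom {
        return &Bloom{words: make([]uint64, numWords), seeds: seeds}
    }

    func (b *Bloom) Add(key string) {
        for _, s := range b.seeds {
            i := maphash.String(s, key) % uint64(64*len(b.words))
            b.words[i/64] |= 1 << (i % 64)
        }
    }

    func (b *Bloom) MayContain(key string) bool {
        for _, s := range b.seeds {
            i := maphash.String(s, key) % uint64(64*len(b.words))
            if b.words[i/64]&(1<<(i%64)) == 0 {
                return false
            }
        }
        return true
    }

    // Intersect ANDs other into b. The result acts as a Bloom filter for
    // (a superset of) the intersection of the two represented sets.
    func (b *Bloom) Intersect(other *Bloom) {
        for i := range b.words {
            b.words[i] &= other.words[i]
        }
    }

    func main() {
        seeds := []maphash.Seed{maphash.MakeSeed(), maphash.MakeSeed()} // shared setup
        x, y := NewBloom(16, seeds), NewBloom(16, seeds)
        for _, id := range []string{"a", "b", "c"} {
            x.Add(id)
        }
        for _, id := range []string{"b", "c", "d"} {
            y.Add(id)
        }
        x.Intersect(y)
        fmt.Println(x.MayContain("b"), x.MayContain("a")) // true, (probably) false
    }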
The author says that with a 1.2 GB Bloom filter and 7 hash functions, a lookup takes only 30 ns. I have to assume that's because the benchmark's working set is sitting in cache; my guess is the benchmark only tests a few elements. In real-world use, with a filter this big, most of those 7 probes will hit cold cache, since 1.2 GB is far larger than any CPU cache, so lookups will take much longer than 30 ns. Probably still faster than going to the network or a database, though.
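If you want to sanity-check that on your own hardware, one rough approach is to benchmark just the memory-access pattern, stripped of the filter logic; something like this hypothetical Go benchmark (in a _test.go file):

    package bloombench

    import (
        "math/rand"
        "testing"
    )

    // Times 7 random reads per op, mimicking a 7-hash Bloom lookup while
    // stripping out the filter maths: only the memory access pattern.
    func benchProbes(b *testing.B, bytes int) {
        words := make([]uint64, bytes/8)
        rng := rand.New(rand.NewSource(1))
        var sink uint64
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            for k := 0; k < 7; k++ {
                sink += words[rng.Intn(len(words))]
            }
        }
        _ = sink
    }

    func BenchmarkHotCache(b *testing.B)  { benchProbes(b, 1<<20) }    // ~1 MB, cache-resident
    func BenchmarkColdCache(b *testing.B) { benchProbes(b, 1200<<20) } // ~1.2 GB, mostly misses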
Sam Rose has also written a great article on Bloom filters, with some excellent animations to help illustrate them. Definitely worth checking out: https://samwho.dev/bloom-filters/
Hash functions are kinda magic. Bloom filters are one of the fun tricks you can play with them (though Bloom filters themselves are quite old at this point; check the literature for the various alternatives that are better in specific situations). The field of data sketches and streaming algorithms has many others. k-Minimum Values is a nice example of a set-cardinality estimator that is very easy to understand.
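Since it's so short, here's a rough Go sketch of k-Minimum Values: keep the k smallest distinct hashes seen, and the k-th smallest tells you how densely the hash space is populated, hence the cardinality (the exact estimator varies a bit across papers):

    package main

    import (
        "fmt"
        "hash/maphash"
        "sort"
    )

    // KMV keeps the k smallest distinct hashes seen so far.
    type KMV struct {
        k      int
        seed   maphash.Seed
        minima []uint64 // sorted ascending, length <= k
    }

    func NewKMV(k int) *KMV { return &KMV{k: k, seed: maphash.MakeSeed()} }

    func (s *KMV) Add(item string) {
        h := maphash.String(s.seed, item)
        i := sort.Search(len(s.minima), func(j int) bool { return s.minima[j] >= h })
        if i < len(s.minima) && s.minima[i] == h {
            return // duplicate hash, already tracked
        }
        if len(s.minima) == s.k {
            if i == s.k {
                return // larger than the current k-th minimum, ignore
            }
            s.minima = s.minima[:s.k-1] // drop the largest to make room
        }
        s.minima = append(s.minima, 0) // sorted insert at position i
        copy(s.minima[i+1:], s.minima[i:])
        s.minima[i] = h
    }

    func (s *KMV) Estimate() float64 {
        if len(s.minima) < s.k {
            return float64(len(s.minima)) // fewer than k distinct items: exact
        }
        // The k-th smallest of n uniform hashes sits near k/(n+1) of the
        // hash space, so invert that to estimate n.
        frac := float64(s.minima[s.k-1]) / float64(^uint64(0))
        return float64(s.k)/frac - 1
    }

    func main() {
        sketch := NewKMV(256)
        for i := 0; i < 100000; i++ {
            sketch.Add(fmt.Sprintf("user-%d", i%10000)) // 10,000 distinct items
        }
        fmt.Printf("estimate: %.0f (true: 10000)\n", sketch.Estimate())
    }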
Related: https://news.ycombinator.com/item?id=42293832

Also, this somehow-related Ask HN is a true classic, at least for me: https://news.ycombinator.com/item?id=32186203
Available in Redis: https://redis.io/docs/stack/bloom/
You can even use Bloom filters for rate limiting, with a counting Bloom filter:

https://github.com/FastFilter/fastfilter_cpp/blob/f27873fac4fae130c58592bbb1ca295f7296cf90/src/bloom/counting_bloom.h#L610
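The basic shape of the idea, as a fixed-window Go sketch (the linked C++ code is far more refined; all names here are invented):

    package main

    import (
        "fmt"
        "hash/maphash"
    )

    // Counting Bloom filter as a fixed-window rate limiter: each slot is
    // a small counter instead of a bit. The count for a key is the min
    // over its slots, so collisions can only over-count (throttle a bit
    // too early), never under-count.
    type CountingBloom struct {
        counts []uint8
        seeds  []maphash.Seed
    }

    func NewCountingBloom(numSlots, numHashes int) *CountingBloom {
        cb := &CountingBloom{counts: make([]uint8, numSlots)}
        for i := 0; i < numHashes; i++ {
            cb.seeds = append(cb.seeds, maphash.MakeSeed())
        }
        return cb
    }

    // Allow increments key's counters and reports whether the key is
    // still within limit. Reset the filter at each window boundary.
    func (cb *CountingBloom) Allow(key string, limit uint8) bool {
        min := uint8(255)
        for _, s := range cb.seeds {
            i := maphash.String(s, key) % uint64(len(cb.counts))
            if cb.counts[i] < 255 {
                cb.counts[i]++ // saturate rather than wrap
            }
            if cb.counts[i] < min {
                min = cb.counts[i]
            }
        }
        return min <= limit
    }

    func main() {
        rl := NewCountingBloom(1<<16, 4)
        for i := 1; i <= 6; i++ {
            fmt.Printf("request %d allowed: %v\n", i, rl.Allow("client-1", 5))
        }
    }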
Here's my use case: I want really small filters (say 64 or 128 bits) to represent 50-100 element subsets of a ~1000-2000 element set.

I know that with a traditional Bloom filter this would give me a lot of false positives. Is there an alternative that would fare better?
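Quick back-of-envelope on why it's so lossy (standard formula, plugged into Go):

    package main

    import (
        "fmt"
        "math"
    )

    // Textbook false-positive rate for an m-bit Bloom filter holding n
    // elements with k hash functions: (1 - e^(-kn/m))^k.
    func fpp(m, n, k float64) float64 {
        return math.Pow(1-math.Exp(-k*n/m), k)
    }

    func main() {
        // 128 bits, 75 elements: the optimal k = (m/n) ln 2 rounds to 1,
        // and even then the false-positive rate is ~44%.
        fmt.Printf("m=128 n=75 k=1: %.2f\n", fpp(128, 75, 1))
    }

And if I remember the bound right, no alternative structure can escape the underlying constraint: approximate membership needs roughly log2(1/epsilon) bits per element, and 128 bits over ~75 elements is under 2 bits each, which puts a floor of roughly 30% on the false-positive rate whichever filter you pick.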
A whole load of data structures, like Bloom filters, are "efficiency tricks".

For example, "check if we might have Mr Smith's data cached by looking in this Bloom filter". As long as the answer is usually right, that's good enough.

I do wonder if in the future we'll use a mini online-trained AI to achieve the same thing: "Ask the AI if we have Mr Smith's data cached."

Rather like you might ask an office clerk, "Is Mr Smith that troublesome customer you were dealing with last week? Do you still have his file on your desk?"