I recently discovered "compact approximators", which can be seen as a generalisation of Bloom filters. Instead of telling you "element X is probably in the set", they tell you a one-sided bound on the value associated with key X (an upper bound, in the usual max-on-write, min-on-read construction: collisions can only push the answer up, never down).

If there are no collisions, you get the true value. The data structure also doesn't "fill up" the way a Bloom filter does; the older data just gets probabilistically worse, so in some use cases you can keep using the one structure continuously.

My particular use case (which led me to "invent" the structure and then discover that of course it's already been described in the literature, haha) is implicit cache invalidation. The approximator structure(s) store keys like "latest time member X was removed from group Y". I can then use that to lazily decide which cached values I can still use, instead of doing a potentially expensive search for what to invalidate at the moment the member is removed. An inflated answer just means an occasional unnecessary cache miss, never a stale read.

I'm new to using them, so I keep getting excited about new use cases, much like when I was first introduced to Bloom filters.
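To make that concrete, here's a minimal Go sketch of the construction as I understand it, applied to the removal-timestamp use case (all names and sizes are made up):

    package main

    import (
        "fmt"
        "hash/maphash"
    )

    // Approximator: like a Bloom filter, but each slot holds a value
    // (here a timestamp) instead of a bit. Writes take the max, reads
    // take the min, so a read returns a value >= the true one and
    // collisions can only inflate the answer.
    type Approximator struct {
        slots []uint64
        seeds []maphash.Seed
    }

    func NewApproximator(numSlots, numHashes int) *Approximator {
        a := &Approximator{slots: make([]uint64, numSlots)}
        for i := 0; i < numHashes; i++ {
            a.seeds = append(a.seeds, maphash.MakeSeed())
        }
        return a
    }

    func (a *Approximator) Update(key string, value uint64) {
        for _, seed := range a.seeds {
            i := maphash.String(seed, key) % uint64(len(a.slots))
            if a.slots[i] < value {
                a.slots[i] = value
            }
        }
    }

    // Query returns an upper bound on the largest value stored for key.
    func (a *Approximator) Query(key string) uint64 {
        min := ^uint64(0)
        for _, seed := range a.seeds {
            i := maphash.String(seed, key) % uint64(len(a.slots))
            if a.slots[i] < min {
                min = a.slots[i]
            }
        }
        return min
    }

    func main() {
        removals := NewApproximator(1024, 4)
        removals.Update("member:42/group:7", 1700000000) // removal at t=1700000000
        cachedAt := uint64(1700000500)
        // Entry is safe iff it was computed after the (over-estimated) last
        // removal; collisions cause extra invalidation, never staleness.
        fmt.Println("still valid:", removals.Query("member:42/group:7") <= cachedAt)
    }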
Would recommend reading the RocksDB implementations of the Bloom filter and the ribbon filter to anyone wanting to learn more about the production-level implementation side. They are extremely well explained in the comments and are, as far as I know, the state of the art.

https://github.com/facebook/rocksdb/blob/main/util/bloom_impl.h

https://github.com/facebook/rocksdb/blob/main/util/ribbon_alg.h
There's a useful but non-obvious property of these filters.

If you have two Bloom filters that use the same set of possible keys and the same filter setup, you can compute the intersection of their represented sets with a bitwise AND over the filters' bitsets.

If you store the bitsets aligned and padded to the SIMD vector size (for example, an array of __m128i in C++, as opposed to the array of uint64 values in the golang example), computing such an intersection is very efficient on modern computers.

I once used that technique to filter 3D objects from a large set using the intersection of 3 range queries, one per coordinate. Compute two Bloom filters filled with the objects returned by the X and Y queries, intersect the filters, then iterate over the objects returned by the remaining Z query, testing each against the intersected Bloom filter. Due to the probabilistic nature of Bloom filters you still need to re-check objects that pass the filter against the actual X and Y queries, but with a proper setup the filter should reject the vast majority of them.
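Here's a scalar Go sketch of the trick (toy filter, made-up names); the SIMD version is the same AND loop over wider registers. The two filters must share size and seeds for the intersection to be meaningful:

    package main

    import (
        "fmt"
        "hash/maphash"
    )

    // Toy Bloom filter backed by 64-bit words. Pad/align the word array
    // to the SIMD register width and the AND loop below vectorises to
    // 2-8 words per instruction.
    type Bloom struct {
        words []uint64
        seeds []maphash.Seed
    }

    func NewBloom(numWords int, seeds []maphash.Seed) *Bloom {
        return &Bloom{words: make([]uint64, numWords), seeds: seeds}
    }

    func (b *Bloom) Add(key string) {
        for _, s := range b.seeds {
            i := maphash.String(s, key) % uint64(64*len(b.words))
            b.words[i/64] |= 1 << (i % 64)
        }
    }

    func (b *Bloom) MayContain(key string) bool {
        for _, s := range b.seeds {
            i := maphash.String(s, key) % uint64(64*len(b.words))
            if b.words[i/64]&(1<<(i%64)) == 0 {
                return false
            }
        }
        return true
    }

    // Intersect ANDs other into b. The result acts as a Bloom filter for
    // (a superset of) the intersection of the two represented sets.
    func (b *Bloom) Intersect(other *Bloom) {
        for i := range b.words {
            b.words[i] &= other.words[i]
        }
    }

    func main() {
        seeds := []maphash.Seed{maphash.MakeSeed(), maphash.MakeSeed()} // shared setup
        x, y := NewBloom(16, seeds), NewBloom(16, seeds)
        for _, id := range []string{"a", "b", "c"} {
            x.Add(id)
        }
        for _, id := range []string{"b", "c", "d"} {
            y.Add(id)
        }
        x.Intersect(y)
        fmt.Println(x.MayContain("b"), x.MayContain("a")) // true, (probably) false
    }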
The author says that with a 1.2 GB Bloom filter and 7 hash functions, a lookup takes only 30 ns. I have to assume that's because the benchmark's working set is sitting in cache; my guess is the benchmark only tests a few elements. In real-world use, with a filter this big, most of those 7 probes will hit cold cache, since 1.2 GB is far larger than any CPU cache, so lookups will take much longer than 30 ns. Probably still faster than going to the network or a database, though.
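If you want to sanity-check that on your own hardware, one rough approach is to benchmark just the memory-access pattern, stripped of the filter logic; something like this hypothetical Go benchmark (in a _test.go file):

    package bloombench

    import (
        "math/rand"
        "testing"
    )

    // Times 7 random reads per op, mimicking a 7-hash Bloom lookup while
    // stripping out the filter maths: only the memory access pattern.
    func benchProbes(b *testing.B, bytes int) {
        words := make([]uint64, bytes/8)
        rng := rand.New(rand.NewSource(1))
        var sink uint64
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            for k := 0; k < 7; k++ {
                sink += words[rng.Intn(len(words))]
            }
        }
        _ = sink
    }

    func BenchmarkHotCache(b *testing.B)  { benchProbes(b, 1<<20) }    // ~1 MB, cache-resident
    func BenchmarkColdCache(b *testing.B) { benchProbes(b, 1200<<20) } // ~1.2 GB, mostly misses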
Sam Rose has also written a great article on Bloom filters, with some excellent animations to help illustrate them. Definitely worth checking out: https://samwho.dev/bloom-filters/
Hash functions are kinda magic. Bloom filters are one of the fun tricks you can play with them (though Bloom filters themselves are quite old at this point; check the literature for the various alternatives that are better in specific situations). The field of data sketches and streaming algorithms has many others. k-Minimum Values is a nice example of a set-cardinality estimator that is very easy to understand.
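Since it's so short, here's a rough Go sketch of k-Minimum Values: keep the k smallest distinct hashes seen, and the k-th smallest tells you how densely the hash space is populated, hence the cardinality (the exact estimator varies a bit across papers):

    package main

    import (
        "fmt"
        "hash/maphash"
        "sort"
    )

    // KMV keeps the k smallest distinct hashes seen so far.
    type KMV struct {
        k      int
        seed   maphash.Seed
        minima []uint64 // sorted ascending, length <= k
    }

    func NewKMV(k int) *KMV { return &KMV{k: k, seed: maphash.MakeSeed()} }

    func (s *KMV) Add(item string) {
        h := maphash.String(s.seed, item)
        i := sort.Search(len(s.minima), func(j int) bool { return s.minima[j] >= h })
        if i < len(s.minima) && s.minima[i] == h {
            return // duplicate hash, already tracked
        }
        if len(s.minima) == s.k {
            if i == s.k {
                return // larger than the current k-th minimum, ignore
            }
            s.minima = s.minima[:s.k-1] // drop the largest to make room
        }
        s.minima = append(s.minima, 0) // sorted insert at position i
        copy(s.minima[i+1:], s.minima[i:])
        s.minima[i] = h
    }

    func (s *KMV) Estimate() float64 {
        if len(s.minima) < s.k {
            return float64(len(s.minima)) // fewer than k distinct items: exact
        }
        // The k-th smallest of n uniform hashes sits near k/(n+1) of the
        // hash space, so invert that to estimate n.
        frac := float64(s.minima[s.k-1]) / float64(^uint64(0))
        return float64(s.k)/frac - 1
    }

    func main() {
        sketch := NewKMV(256)
        for i := 0; i < 100000; i++ {
            sketch.Add(fmt.Sprintf("user-%d", i%10000)) // 10,000 distinct items
        }
        fmt.Printf("estimate: %.0f (true: 10000)\n", sketch.Estimate())
    }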
Related: https://news.ycombinator.com/item?id=42293832

Also, this somehow-related Ask HN is a true classic, at least for me: https://news.ycombinator.com/item?id=32186203
Available in Redis: https://redis.io/docs/stack/bloom/
You can even use Bloom filters for rate limiting, with a counting Bloom filter:

https://github.com/FastFilter/fastfilter_cpp/blob/f27873fac4fae130c58592bbb1ca295f7296cf90/src/bloom/counting_bloom.h#L610
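The basic shape of the idea, as a fixed-window Go sketch (the linked C++ code is far more refined; all names here are invented):

    package main

    import (
        "fmt"
        "hash/maphash"
    )

    // Counting Bloom filter as a fixed-window rate limiter: each slot is
    // a small counter instead of a bit. The count for a key is the min
    // over its slots, so collisions can only over-count (throttle a bit
    // too early), never under-count.
    type CountingBloom struct {
        counts []uint8
        seeds  []maphash.Seed
    }

    func NewCountingBloom(numSlots, numHashes int) *CountingBloom {
        cb := &CountingBloom{counts: make([]uint8, numSlots)}
        for i := 0; i < numHashes; i++ {
            cb.seeds = append(cb.seeds, maphash.MakeSeed())
        }
        return cb
    }

    // Allow increments key's counters and reports whether the key is
    // still within limit. Reset the filter at each window boundary.
    func (cb *CountingBloom) Allow(key string, limit uint8) bool {
        min := uint8(255)
        for _, s := range cb.seeds {
            i := maphash.String(s, key) % uint64(len(cb.counts))
            if cb.counts[i] < 255 {
                cb.counts[i]++ // saturate rather than wrap
            }
            if cb.counts[i] < min {
                min = cb.counts[i]
            }
        }
        return min <= limit
    }

    func main() {
        rl := NewCountingBloom(1<<16, 4)
        for i := 1; i <= 6; i++ {
            fmt.Printf("request %d allowed: %v\n", i, rl.Allow("client-1", 5))
        }
    }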
Here's my use case: I want really small filters (say 64 or 128 bits) to represent 50-100 element subsets of a ~1000-2000 element set.

I know that with a traditional Bloom filter this would give me a lot of false positives. Is there an alternative that would fare better?
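Quick back-of-envelope on why it's so lossy (standard formula, plugged into Go):

    package main

    import (
        "fmt"
        "math"
    )

    // Textbook false-positive rate for an m-bit Bloom filter holding n
    // elements with k hash functions: (1 - e^(-kn/m))^k.
    func fpp(m, n, k float64) float64 {
        return math.Pow(1-math.Exp(-k*n/m), k)
    }

    func main() {
        // 128 bits, 75 elements: the optimal k = (m/n) ln 2 rounds to 1,
        // and even then the false-positive rate is ~44%.
        fmt.Printf("m=128 n=75 k=1: %.2f\n", fpp(128, 75, 1))
    }

And if I remember the bound right, no alternative structure can escape the underlying constraint: approximate membership needs roughly log2(1/epsilon) bits per element, and 128 bits over ~75 elements is under 2 bits each, which puts a floor of roughly 30% on the false-positive rate whichever filter you pick.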
A whole load of data structures, like Bloom filters, are "efficiency tricks".

For example, "check if we might have Mr Smith's data cached by looking in this Bloom filter". As long as the answer is usually right, that's good enough.

I do wonder if in the future we'll use a mini online-trained AI to achieve the same thing: "Ask the AI if we have Mr Smith's data cached."

Rather like you might ask an office clerk, "Is Mr Smith that troublesome customer you were dealing with last week? Do you still have his file on your desk?"