Sam's visual tutorials are as good as ever (doggos are very important, by the way).<p>And if you want to try some code, I've prepared three toy Bloom filter implementations to accompany the article:<p>JavaScript: <a href="https://codapi.org/embed/?engine=browser&sandbox=javascript&src=gist:e7bde93f98c5e47ca38d359a5104fd88:bloom.js" rel="nofollow">https://codapi.org/embed/?engine=browser&sandbox=javascript&...</a><p>Python: <a href="https://codapi.org/embed/?sandbox=python&src=gist:e7bde93f98c5e47ca38d359a5104fd88:bloom.py" rel="nofollow">https://codapi.org/embed/?sandbox=python&src=gist:e7bde93f98...</a><p>Go: <a href="https://codapi.org/embed/?sandbox=go&src=gist:e7bde93f98c5e47ca38d359a5104fd88:bloom.go" rel="nofollow">https://codapi.org/embed/?sandbox=go&src=gist:e7bde93f98c5e4...</a>
I remember discussing bloom and cuckoo filters a while back and someone linked.to.great papers from late 2010s with much more advanced probabilistic structures that were handling modification much better or had other interesting properties. Anyone here knows where to look for these? I didn't take notes...
A wonderful article, thank you.<p>Rather than using multiple hash functions, would it make more sense to use a single algorithm over (prefix | input), with k different prefixes? This may allow computing those hashes in parallel, using SIMD for example, and caching the prehash state of the prefixes.<p>Edit: looks like there has been some research on this: <a href="https://ieeexplore.ieee.org/document/8462781" rel="nofollow">https://ieeexplore.ieee.org/document/8462781</a>
this is a great article! I love how you explained it. here are my takeaways from it:
- Bloom filter is a space-efficient data structure
- You're trading off some accuracy (possibility of false positives) for the significantly smaller memory footprint and faster membership checks
Very good explanation of Bloom filters, but are they optimum from a complexity point of view given a false positive rate and a bounded number of bits? Are there other kind of filters if we relax de condición of zero false negative rate?, are there other better but more complex data structures able to provide the same effect?
For whom that are interested I suggest also to have a look on Cukoo Filter: <a href="https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf" rel="nofollow">https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf</a>
Would it be better to store the hash of each output into 3 different arrays?<p>e.g. currently, it's doing:<p><pre><code> [input] -> [3 hashes] -> [single storage]
</code></pre>
but, I'm wondering if it's better to do:<p><pre><code> [input]
-> [hash a] -> [storage a]
-> [hash b] -> [storage b]
-> [hash c] -> [storage c]
</code></pre>
...this way, the output of the 3 hashes don't affect each other during the set and membership-check. I wonder how that would affect how much more data you can store. I'm sure someone has considered this.