TE
TechEcho
Home24h TopNewestBestAskShowJobs
GitHubTwitter
Home

TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.

GitHubTwitter

Home

HomeNewestBestAskShowJobs

Resources

HackerNews APIOriginal HackerNewsNext.js

© 2025 TechEcho. All rights reserved.

Let’s implement a Bloom Filter

208 pointsby onatmalmost 5 years ago

10 comments

FreakLegionalmost 5 years ago
Since the submitter is the author:<p>&gt; 6 days after I started to write this post, it is proved that the math behind false positive probability was wrong for 50 years<p>This isn&#x27;t true.<p>The &quot;Bloom math is wrong!&quot; clamor started at least 12 years ago with [1], which brazenly cites Bloom, then proceeds to give an analysis of Knuth&#x27;s subtly different filter from TAOCP instead.<p>There are issues with [1], but its exact false positive formula was eventually proven correct. Its argument is incompatible with Bloom&#x27;s original filter, though, and Knuth explicitly states that the false positive formula he gives is approximate.<p>In other words, nothing was proved wrong. The analysis doesn&#x27;t apply to Bloom and doesn&#x27;t contradict Knuth.<p>Now, [1] may be the first paper to give the exact false positive formula for Knuth&#x27;s filter (and Bloom&#x27;s was given in 1979 by [2]), so it&#x27;s not entirely without merit. But in practice none of this matters anyway. At useful sizes the approximations are accurate. Knuth&#x27;s is a lower, Bloom&#x27;s an upper bound, and the two become functionally equivalent very quickly.<p>1. <a href="https:&#x2F;&#x2F;cglab.ca&#x2F;~morin&#x2F;publications&#x2F;ds&#x2F;bloom-submitted.pdf" rel="nofollow">https:&#x2F;&#x2F;cglab.ca&#x2F;~morin&#x2F;publications&#x2F;ds&#x2F;bloom-submitted.pdf</a><p>2. Algorithm-A in <a href="https:&#x2F;&#x2F;sci-hub.tw&#x2F;10.1109&#x2F;proc.1979.11543" rel="nofollow">https:&#x2F;&#x2F;sci-hub.tw&#x2F;10.1109&#x2F;proc.1979.11543</a>
评论 #24117994 未加载
评论 #24119268 未加载
veselinalmost 5 years ago
The article has a good implemenetation, but when reading about Bloom filters, I often see claims about optimality, but it does not say optimial for what. In this example, an optimal Bloom filter for 1M elements does 7 lookups in a bitvector of 1.14MB, which is usually 3 to 5 L2 cache misses (with 256KB or 512KB cache).<p>But looking up the 1M elements directly in a BTree will usually lead to 1-2 cache misses and depending on the data element size, may even fit in the L3 cache. Then, what is this bloom filter saving me from?
评论 #24119817 未加载
评论 #24122443 未加载
评论 #24119045 未加载
hajilealmost 5 years ago
Bing&#x27;s BitFunnel paper is a pretty interesting read. The parts about scaling bloom filters by flipping their axis and using multiply filter layers is particularly of note if you are actually interested in production use.<p><a href="https:&#x2F;&#x2F;danluu.com&#x2F;bitfunnel-sigir.pdf" rel="nofollow">https:&#x2F;&#x2F;danluu.com&#x2F;bitfunnel-sigir.pdf</a>
_jsdwalmost 5 years ago
“This looks really promising! A Bloom Filter that represents a set of million items with a false-positive rate of 0.01 requires only 9585059 bits and 7 hash functions.“<p>My immediate thought here was that this wasn&#x27;t very space efficient as it requires over 9 bits per item?<p>I&#x27;d have to dig more into this though to confirm or deny my intuition!<p>Edit: this page agrees with the blog post and shows some interesting graphs <a href="https:&#x2F;&#x2F;hur.st&#x2F;bloomfilter&#x2F;?n=1000000&amp;p=0.01&amp;m=&amp;k=" rel="nofollow">https:&#x2F;&#x2F;hur.st&#x2F;bloomfilter&#x2F;?n=1000000&amp;p=0.01&amp;m=&amp;k=</a>
评论 #24117947 未加载
anonytraryalmost 5 years ago
Bloom filters confused me at first, but they&#x27;re really cool. I like to think of bloom filters with a (oversimplified) physics analogy. When you collide two particles together, other smaller particles come flying out of the explosion. You measure their tracks with a film. If you keep smashing particles together, eventually the film will be saturated and you will basically lose the ability to ascertain with confidence whether or not two specific particles were smashed together.
wruzaalmost 5 years ago
I&#x27;ve read that some old factories used bloom filters to quickly look up where a worker is. At the morning they pressed three buttons on a wall that represented a workshop, and with a knowledge of these numbers one could check if that person was not in the workshop.<p>The similar technique but in a reverse I&#x27;ve seen in a factory myself, when 5-7 closest people at the gate queue pressed their buttons, and a controller collected their outstanding passes from the other side and then checked against their faces very quickly. That greatly saved time in a queue.
kilotarasalmost 5 years ago
&gt; hashers: [DefaultHasher; 2],<p>Bikeshedding a little, but that probably shouldn&#x27;t be an array. hashers[0] and hashers[1] aren&#x27;t semantically equivalent and the only usage that array gets is just getting element by index. Something like `startHasher` and `stepHasher` would probably better convey intent.
gmanisalmost 5 years ago
Very nice read. I knew bloom filters but the detailed explanation and the rust code (almost Greek to me) were very nicely explained. I also read through the linked page of Kiran.
hn_throwaway_99almost 5 years ago
I think most people who took CS classes in college are familiar with hash table implementations, but perhaps not bloom filters.<p>Bloom filters are just chaining hash tables, without the chain.
ballooneyalmost 5 years ago
I like hand drawn diagrams and use them a lot, especially since work gave me an ipad, but whenever I see them in ‘high-production-value’, well researched and edited blog posts like this, I am reminded that it still seems to be a hard problem to quickly produce good quality vector diagrams (with foss software).<p>Has the state of the art moved on recently, or is it still a case of fighting TikZ?
评论 #24120386 未加载
评论 #24119187 未加载
评论 #24119717 未加载