
Probabilistic Many-to-Many Relationships (with Bloom Filters)

180 points by zacharyvoase over 12 years ago

12 comments

fusiongyro over 12 years ago
This is a really cool technique and warrants some investigation, but I can't let this go unaddressed:

> and the upper bound on the time taken to join all three tables will be the square of that

These kinds of from-first-principles assertions about what Postgres's (or other DBs') performance will be like sound helpful but usually aren't. The kinds of queries you issue can change everything. Indexing can change everything. Postgres's configuration can change everything. The actual size of the table can change everything. For example, if the table is small, Postgres will keep it in memory, and your plans will have scary-looking but actually innocent sequential scans, which I think is what happened in his join-table example.

Anyway, it's good to have a lot of tools in your toolbox, and this is an interesting tool with interesting uses. I just think it would be a grave error to take the performance ratios here as fixed.
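(A minimal sketch of the "measure, don't assert" point: run the same EXPLAIN ANALYZE before and after changing a single variable and compare actual times. The table and column names follow the article's schema and are assumptions here.)

    -- Baseline: time the join as it stands.
    EXPLAIN ANALYZE
    SELECT m.*
    FROM movie m
    JOIN movie_person mp ON mp.movie_id = m.id
    WHERE mp.person_id = 160;

    -- Change exactly one variable (e.g. add an index), then re-run
    -- the same query and compare the actual timings.
    CREATE INDEX ON movie_person (person_id);
    EXPLAIN ANALYZE
    SELECT m.*
    FROM movie m
    JOIN movie_person mp ON mp.movie_id = m.id
    WHERE mp.person_id = 160;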
zacharyvoase over 12 years ago
OP here.

The lack of an index on the junction table definitely did have a major effect. By just doing the following:

    CREATE INDEX ON movie_person (movie_id);
    CREATE INDEX ON movie_person (person_id);

the junction query speeds up to around 2ms, making it comparable to or faster than the bloom filter query. But the trade-off is revealed when you see the total size of the movie_person table (including indexes):

    SELECT pg_size_pretty(pg_total_relation_size('movie_person'));
    => 45MB

Whereas, by my calculations, the total size added by the bloom filters and hashes on movie and person is just 2094kB in total.

I plan on adding an erratum to my article explaining my error, the time/memory trade-off, and ideas for further improvement or exploration, potentially including bloom-based GiST indexes and the opportunities for parallelization.
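(One way to check that 2094kB figure directly is to sum the on-disk size of the filter and hash columns themselves. This is a sketch; the column names person_filter, movie_filter, and hash are assumptions based on the article.)

    -- Per-column overhead added to the movie table.
    SELECT pg_size_pretty(
             sum(pg_column_size(person_filter) + pg_column_size(hash))::bigint
           ) AS movie_overhead
    FROM movie;

    -- Per-column overhead added to the person table.
    SELECT pg_size_pretty(
             sum(pg_column_size(movie_filter) + pg_column_size(hash))::bigint
           ) AS person_overhead
    FROM person;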
slig over 12 years ago
Now *that's* the kind of content I'd love to see more of here. It covers everything from basic tools like sed and awk to nice concepts I didn't know about, like the BIT field in Postgres.

Any recommended book or set of articles for getting started with Postgres?
DEinspanjer over 12 years ago
I like the exploration of this method, but I would have liked to see an actual accounting of the false positives. Bad data can be acceptable in statistical analysis, but if you were showing someone a list of their ratings, or the actors who were in the latest Kevin Bacon movie, false positives have a much stronger impact.

Is there any chance the bloom could be used as a short-circuit filter, but still followed up with the m2m join to filter out the false positives? If the query optimizer can take advantage of that, then you could likely balance the size and cost of the bloom field.
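(A hedged sketch of what that two-stage query could look like: the bloom column prunes candidates cheaply, then an EXISTS against the junction table removes the false positives. The bit-AND membership test and the compute_bloom() helper are hypothetical stand-ins for the article's filter encoding, not anything from the article itself.)

    -- Stage 1: probabilistic pre-filter on the bloom column.
    -- Stage 2: exact recheck against the junction table.
    -- compute_bloom() is a hypothetical helper that returns the
    -- bloom bits for a given person id.
    SELECT m.*
    FROM movie m
    WHERE (m.person_filter & compute_bloom(160)) = compute_bloom(160)
      AND EXISTS (SELECT 1
                  FROM movie_person mp
                  WHERE mp.movie_id = m.id
                    AND mp.person_id = 160);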
mhale over 12 years ago
Using a custom index implementation that uses bloom filters internally is probably going to work out better in the long run. It should be far more efficient than storing the data in a bit field, using app-layer code to generate the bloom filter values, and then doing bitwise comparisons on the fly at query time.

The Postgres query planner can also recheck constraints automatically, to recover from bloom filter false-positive matches at query time.

FYI: bloom filters are already used internally within the PostgreSQL intarray contrib module and the full-text search functionality.

See:
http://postgresql.1045698.n5.nabble.com/bloom-filter-indexes-td1899247.html
http://code.google.com/p/postgres-learning/wiki/BloomFilter

EDIT: for clarity, typo correction
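(For the curious, a minimal sketch of the intarray route mhale points at: keep the related ids in an int[] column under a GiST index, whose lossy signatures are what trigger the automatic recheck against the heap. Table and column names are assumptions, not the article's schema.)

    -- Requires the intarray contrib module.
    CREATE EXTENSION intarray;

    ALTER TABLE movie ADD COLUMN person_ids int[];
    CREATE INDEX movie_person_ids_idx
        ON movie USING gist (person_ids gist__int_ops);

    -- "Movies containing person 160"; lossy index matches are
    -- rechecked against the actual rows automatically.
    SELECT * FROM movie WHERE person_ids @> ARRAY[160];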
ocharles over 12 years ago
These seem to produce entirely different sets of results.

For the bloom filter: (actual time=0.033..2.546 rows=430 loops=1)

And for the join: (actual time=7.440..64.843 rows=9 loops=1)

So the join returned 9 movies for person_id=160, while the bloom filter returned 430.

I understand it's a probabilistic model, but that's a pretty whopping difference in data. Have I missed something?
brown9-2 over 12 years ago
Regarding the space consumption, it seems like this is a trade-off between storing two integers (plus the row overhead) per rating and storing 335 bytes per movie/user.

For the join table, that's 2 integers * 575,281 ratings * 4 bytes = 4,602,248 bytes.

With the filter, each movie row needs to store 1632 bits for the person_filter and 1048 bits for the hash, so 3,883 movies * (1632 bits + 1048 bits) = 1,300,805 bytes.

Each user row needs the same number of bits for the filter and hash, so 6,040 users * (1632 bits + 1048 bits) = 2,023,400 bytes.

Is my math here wrong? With this approach you save about 1.22MB, or about 27%, over the join table (ignoring the per-row and per-page overhead of storing the tables).

Depending on the dataset, it doesn't seem like the space savings would be worth the sacrifice in accuracy.
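(The arithmetic checks out; here it is as a query you can paste in, using the row counts from this comment and the bit widths from the article.)

    SELECT 2 * 575281 * 4            AS join_table_bytes,   -- 4,602,248
           3883 * (1632 + 1048) / 8  AS movie_filter_bytes, -- 1,300,805
           6040 * (1632 + 1048) / 8  AS user_filter_bytes;  -- 2,023,400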
tmoertel over 12 years ago
EDITED ONE LAST TIME (I PROMISE, I REALLY DO, THIS IS THE LAST TIME) TO ADD:

I think the reason the OP's join-based queries are slow is that there are no indexes over his junction table's foreign keys:

    CREATE TABLE movies_people (
        movie_id  INTEGER REFERENCES movie,
        person_id INTEGER REFERENCES person
    );

Thus when he wants the movies associated with person 160, the database must examine the entire junction table:

    EXPLAIN ANALYZE SELECT * FROM movies_for_people_junction WHERE person_id = 160;

    Hash Join  (cost=282.37..10401.08 rows=97 width=33) (actual time=7.440..64.843 rows=9 loops=1)
      Hash Cond: (movie_person.movie_id = movie.id)
      ->  Seq Scan on movie_person  (cost=0.00..10117.01 rows=97 width=8) (actual time=2.540..59.933 rows=9 loops=1)
            Filter: (person_id = 160)
      ->  Hash  (cost=233.83..233.83 rows=3883 width=29) (actual time=4.884..4.884 rows=3883 loops=1)
            Buckets: 1024  Batches: 1  Memory Usage: 233kB
            ->  Seq Scan on movie  (cost=0.00..233.83 rows=3883 width=29) (actual time=0.010..2.610 rows=3883 loops=1)
    Total runtime: 64.887 ms

Note the sequential scan on movie_person that accounts for 2.5 to 60 seconds. If there were an index on movie_person(person_id), this could be an index scan.

(EDITED TO ADD: I *totally* misread the timings in milliseconds as timings in seconds, so most of what I originally wrote is off by a factor of 1000. I'm leaving it here for entertainment value, and because you might want to play with the data set in SQLite. But my point is still valid: a vanilla join is comparable in performance to the OP's bloom-filter method.)

I'm having a hard time believing that the straightforward join on a data set as small as the OP's sample is really going to take 65 seconds on PostgreSQL. Maybe that's what EXPLAIN predicts (with spotty stats, I'd wager), but EXPLAIN is not a reliable way to measure performance. For this data, I'd expect real queries to perform much better.

EDITED TO ADD: The OP's article shows the results for EXPLAIN ANALYZE, which ought to have performed the queries. So I'm not sure why the results are so slow.

Heck, even SQLite, when processing a superset of the data set on my 4-year-old computer, can do the OP's final query (and return additional ratings data) almost instantly:

    $ time sqlite3 ratings.db '
      select *
      from users natural join ratings natural join movies
      where user_id = 160
    ' > /dev/null

    real    0m0.006s
    user    0m0.002s
    sys     0m0.004s

If you want to try some queries yourself, here you go:

    $ wget http://www.grouplens.org/system/files/ml-1m.zip
    $ unzip ml-1m.zip
    $ cd ml-1m
    $ sqlite3 ratings.db <<EOF
    CREATE TABLE movies (
        movie_id INTEGER PRIMARY KEY NOT NULL,
        title    TEXT NOT NULL,
        genres   TEXT NOT NULL
    );
    CREATE TABLE users (
        user_id    INTEGER PRIMARY KEY NOT NULL,
        gender     TEXT NOT NULL,
        age        TEXT NOT NULL,
        occupation TEXT NOT NULL,
        zipcode    TEXT NOT NULL
    );
    CREATE TABLE ratings (
        user_id   INTEGER REFERENCES users(user_id),
        movie_id  INTEGER REFERENCES movies(movie_id),
        rating    INTEGER NOT NULL,
        timestamp INTEGER NOT NULL,
        PRIMARY KEY (user_id, movie_id)
    );
    .separator ::
    .import movies.dat movies
    .import users.dat users
    .import ratings.dat ratings
    EOF

Fully enumerating the joined data takes under a second:

    $ time sqlite3 ratings.db '
      select count(*)
      from users natural join ratings natural join movies
    '
    1000209

    real    0m0.953s
    user    0m0.925s
    sys     0m0.021s

And even going so far as to *print* the joined data takes under six seconds:

    $ time sqlite3 ratings.db '
      select *
      from users natural join ratings natural join movies
    ' > /dev/null

    real    0m5.586s
    user    0m5.497s
    sys     0m0.059s

Sometimes the old stuff works better than we give it credit for.
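(A small follow-up sketch to the schema above, not part of the original comment: the composite primary key on ratings already serves user_id lookups in SQLite, so only the movie side needs its own index to get the index scan tmoertel describes.)

    -- The (user_id, movie_id) primary key covers user_id prefixes;
    -- movie-side lookups get their own index.
    CREATE INDEX ratings_movie_idx ON ratings (movie_id);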
b0b0b0b over 12 years ago
Thanks for this interesting writeup; it's definitely thought-provoking.

Because you can't index that bloom column, it seems you'd always be doing full table scans.

In fact, it doesn't appear any indexes were used throughout this whole exercise. Is that right?
pr0filer_net over 12 years ago
Nice article!

I see the table `movies_people` uses (signed) INTEGER columns, but they reference an unsigned BIGINT (SERIAL).
luney over 12 years ago
Can anyone recommend any good interactive charting examples for many-to-many relationships?
secure over 12 years ago
As the article mentions at the very bottom, this technique is not accurate.

That makes it unusable for many use cases, but it's an interesting and good article nevertheless.