EDITED ONE LAST TIME – I PROMISE, I REALLY DO, THIS IS THE LAST TIME – TO ADD:<p>I think the reason the OP's join-based queries are slow is that there are no indexes on his junction table's foreign keys:<p><pre><code> CREATE TABLE movies_people (
movie_id INTEGER REFERENCES movie,
person_id INTEGER REFERENCES person
);
</code></pre>
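The fix is cheap. Here's a sketch of the indexes I would add (I'm inferring the table and column names from the schema above, and the index names are my own; adjust to match the real schema):<p><pre><code> -- speed up person -&gt; movies lookups
 CREATE INDEX movies_people_person_id_idx ON movies_people (person_id);
 -- speed up movie -&gt; people lookups
 CREATE INDEX movies_people_movie_id_idx ON movies_people (movie_id);
</code></pre>
With those in place, re-running the EXPLAIN ANALYZE below should tell a very different story.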
Thus when he wants the movies associated with person 160, the database must scan the entire junction table:<p><pre><code> EXPLAIN ANALYZE SELECT * FROM movies_for_people_junction WHERE person_id = 160;
 Hash Join  (cost=282.37..10401.08 rows=97 width=33) (actual time=7.440..64.843 rows=9 loops=1)
   Hash Cond: (movie_person.movie_id = movie.id)
   ->  Seq Scan on movie_person  (cost=0.00..10117.01 rows=97 width=8) (actual time=2.540..59.933 rows=9 loops=1)
         Filter: (person_id = 160)
   ->  Hash  (cost=233.83..233.83 rows=3883 width=29) (actual time=4.884..4.884 rows=3883 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 233kB
         ->  Seq Scan on movie  (cost=0.00..233.83 rows=3883 width=29) (actual time=0.010..2.610 rows=3883 loops=1)
 Total runtime: 64.887 ms
</code></pre>
Note the sequential scan on movie_person, which accounts for 2.5 to 60 seconds of the runtime. If there were an index on movie_person(person_id), this could be an index scan.<p>(EDITED TO ADD: I <i>totally</i> misread the millisecond timings as seconds, so most of what I originally wrote is off by a factor of 1000. I'm leaving it here for entertainment value and because you might want to play with the data set in SQLite. But my point still stands: a vanilla join is comparable in performance to the OP's bloom-filter method.)<p>I'm having a hard time believing that a straightforward join over a data set as small as the OP's sample really takes 65 seconds on PostgreSQL. Maybe that's what EXPLAIN predicts (with spotty stats, I'd wager), but EXPLAIN is not a reliable way to measure performance. For this data, I'd expect real queries to perform much better.<p>EDITED TO ADD: The OP's article shows results from EXPLAIN ANALYZE, which actually executes the queries it reports on. So I'm not sure why the results are so slow.<p>Heck, even SQLite, processing a superset of the data set on my 4-year-old computer, can run the OP's final query (and return additional ratings data) almost instantly:<p><pre><code> $ time sqlite3 ratings.db '
select *
from
users
natural join ratings
natural join movies
where user_id = 160
' > /dev/null
real 0m0.006s
user 0m0.002s
sys 0m0.004s
</code></pre>
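Why so fast? The composite primary key on ratings (declared in the setup script below) doubles as an index whose leading column is user_id, so the user_id = 160 lookup never has to scan the table. If you're curious how SQLite resolves the joins, ask EXPLAIN QUERY PLAN (output omitted; run it against your own build):<p><pre><code> $ sqlite3 ratings.db 'EXPLAIN QUERY PLAN
 select *
 from
 users
 natural join ratings
 natural join movies
 where user_id = 160
 '
</code></pre>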
If you want to try some queries yourself, here you go:<p><pre><code> $ wget http://www.grouplens.org/system/files/ml-1m.zip
$ unzip ml-1m.zip
$ cd ml-1m
$ sqlite3 ratings.db <<EOF
CREATE TABLE movies (
movie_id INTEGER PRIMARY KEY NOT NULL
, title TEXT NOT NULL
, genres TEXT NOT NULL
);
CREATE TABLE users (
user_id INTEGER PRIMARY KEY NOT NULL
, gender TEXT NOT NULL
, age TEXT NOT NULL
, occupation TEXT NOT NULL
, zipcode TEXT NOT NULL
);
CREATE TABLE ratings (
user_id INTEGER REFERENCES users(user_id)
, movie_id INTEGER REFERENCES movies(movie_id)
, rating INTEGER NOT NULL
, timestamp INTEGER NOT NULL
, PRIMARY KEY (user_id, movie_id)
);
.separator ::
.import movies.dat movies
.import users.dat users
.import ratings.dat ratings
EOF
</code></pre>
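One caveat with this schema: the composite primary key on ratings covers lookups by user_id (its leading column), but not lookups by movie_id alone. If you also want to go from a movie to its raters, add an index for the other direction (the index name here is my own invention):<p><pre><code> $ sqlite3 ratings.db \
     'CREATE INDEX ratings_movie_id_idx ON ratings (movie_id);'
</code></pre>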
Fully enumerating the joined data takes under a second:<p><pre><code> $ time sqlite3 ratings.db '
select count(*)
from
users
natural join ratings
natural join movies
'
1000209
real 0m0.953s
user 0m0.925s
sys 0m0.021s
</code></pre>
And even going so far as to <i>print</i> the joined data takes under six seconds:<p><pre><code> $ time sqlite3 ratings.db '
select *
from
users
natural join ratings
natural join movies
' > /dev/null
real 0m5.586s
user 0m5.497s
sys 0m0.059s
</code></pre>
Sometimes, the old stuff works better than we give it credit for.