TechEcho

1 comment

mgsouthalmost 7 years ago

tldr: CrateDB implemented hashed joins, where both tables are too big to fit into memory, by reading left table in chunks and scanning entire right table for each chunk.<p>Part one at <a href="https://crate.io/a/lab-notes-how-we-made-joins-23-thousand-times-faster-part-one/" rel="nofollow">https://crate.io/a/lab-notes-how-we-made-joins-23-thousand-t...</a><p>The hashing is used for equi-joins, where two tables are related with (possibly multiple) equality operators; e.g. "select * from t1 join t2 on t1.a = t2.b and t1.x = t2.y".<p>The benchmarks show a very large improvement over the previous algorithm, but it's still O(M/c*N/d). It would be interesting to see CrateDB only keep the hashes in memory, possibly using Bloom filters or such, then re-read the tables, ignoring any rows that don't match a hash. If the _selected_ rows from the second table will fit into memory, then you can get O(2M+N).

Lab Notes: How We Made Joins 23 Thousand Times Faster, Part Two

1 comment

Lab Notes: How We Made Joins 23 Thousand Times Faster, Part Two

1 comment