科技回声

5 comments

senderista, over 1 year ago

Since the blog post mentioned a PR to replace linear probing with Robin Hood, I just wanted to mention that I found bidirectional linear probing to outperform Robin Hood across the board in my Java integer set benchmarks: <a href="https://github.com/senderista/hashtable-benchmarks/blob/master/src/main/java/set/int64/BLPLongHashSet.java">https://github.com/senderista/hashtable-benchmarks/blob/mast...</a> <a href="https://github.com/senderista/hashtable-benchmarks/wiki/64-bit-benchmarks">https://github.com/senderista/hashtable-benchmarks/wiki/64-b...</a>

gavinray, over 1 year ago

I always enjoy reading stuff written by Andrey, he's a brilliant fellow for sure. Can highly recommend his personal blog as well: <a href="https://puzpuzpuz.dev/" rel="nofollow noreferrer">https://puzpuzpuz.dev/</a>

_a_a_a_, over 1 year ago

From the article:

"Imagine that we run this query over a few hundred million rows. This means at least a few hundred million hash table operations. As you might imagine, a slow hash table would make for a slower query. A faster hash table? Faster queries!"

I'll read the article properly after this, this is just a quick skim, but I can't see how this quote can be correct. Unless I'm missing something, the hashing function is fast compared to random bouncing around inside RAM, very much faster than random memory accesses. So I can't see how it makes a difference.

Okay, I'll read the article now…

Edit:

"If you insert "John" and then "Jane" string keys into a FastMap, then that would later become the iteration order. While it doesn't sound like a big deal for most applications, this guarantee is important in the database world.

If the underlying table data or index-based access returns sorted data, then we may want to keep the order to avoid having to sort the result set. This is helpful in case of a query with an ORDER BY clause. Performance-wise, direct iteration over the heap is also beneficial as it means sequential memory access."

But "...if the underlying table data or index-based access returns sorted data...", then you've got sorted data, in which case surely use a merge join instead of a hash join.
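For readers following the quoted passage, the insertion-order guarantee can be pictured as an append-only entry array plus an open-addressing index that stores positions into it. The sketch below is only a minimal illustration under that assumption: the class name `InsertionOrderedMapSketch`, its fields, and the fixed capacity are hypothetical and are not taken from QuestDB's FastMap.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names and layout are assumptions, not FastMap's code.
final class InsertionOrderedMapSketch {
    private final String[] keys;  // append-only entries, kept in insertion order
    private final long[] values;  // values[i] belongs to keys[i]
    private final int[] index;    // open-addressing table: entry position + 1, 0 means empty
    private final int mask;
    private int size;

    InsertionOrderedMapSketch(int capacityPow2) {
        keys = new String[capacityPow2];
        values = new long[capacityPow2];
        index = new int[capacityPow2 * 2]; // twice as large, so a probe always finds an empty slot
        mask = index.length - 1;
    }

    void put(String key, long value) {
        int i = key.hashCode() & mask;
        while (index[i] != 0) {
            int pos = index[i] - 1;
            if (keys[pos].equals(key)) {
                values[pos] = value; // existing key: update in place, order unchanged
                return;
            }
            i = (i + 1) & mask; // linear probing
        }
        if (size == keys.length) {
            throw new IllegalStateException("full; a real map would resize here");
        }
        keys[size] = key;
        values[size] = value;
        index[i] = ++size; // store position + 1
    }

    long get(String key, long missing) {
        for (int i = key.hashCode() & mask; index[i] != 0; i = (i + 1) & mask) {
            int pos = index[i] - 1;
            if (keys[pos].equals(key)) {
                return values[pos];
            }
        }
        return missing;
    }

    // Iteration is a sequential scan over the entry array, not over the hash slots.
    List<String> keysInOrder() {
        List<String> out = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            out.add(keys[i]);
        }
        return out;
    }
}
```

Inserting "John" and then "Jane" makes `keysInOrder()` return them in that order, and updating "John" later does not move it.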

rkerno, over 1 year ago

Hi, I'm curious how you deal with the potential for hash collisions across a large data set - is that a post-join check?
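As general background on this question (and not a statement about what QuestDB does), a hash join usually does not need a post-join check, because the probe compares the full key whenever two keys land in the same slot. Here is a minimal sketch of that idea with linear probing; the class `HashJoinSketch` and the deliberately weak hash are hypothetical, chosen only to force collisions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of collision handling; not QuestDB's implementation.
final class HashJoinSketch {
    private final long[] keys;    // build-side join keys
    private final int[] rows;     // build-side row ids
    private final boolean[] used; // slot occupancy
    private final int mask;
    private int size;

    HashJoinSketch(int capacityPow2) {
        keys = new long[capacityPow2];
        rows = new int[capacityPow2];
        used = new boolean[capacityPow2];
        mask = capacityPow2 - 1;
    }

    // Deliberately weak hash (low two bits only) so distinct keys collide often.
    private int slot(long key) {
        return (int) (key & 3) & mask;
    }

    // Build phase: every (key, row) pair takes the next free slot.
    void put(long key, int row) {
        if (size == mask) {
            throw new IllegalStateException("full; a real table would resize here");
        }
        int i = slot(key);
        while (used[i]) {
            i = (i + 1) & mask; // linear probing
        }
        used[i] = true;
        keys[i] = key;
        rows[i] = row;
        size++;
    }

    // Probe phase: walk the cluster and keep only rows whose full key matches.
    List<Integer> probe(long key) {
        List<Integer> out = new ArrayList<>();
        for (int i = slot(key); used[i]; i = (i + 1) & mask) {
            if (keys[i] == key) { // full key comparison filters out colliding keys
                out.add(rows[i]);
            }
        }
        return out;
    }
}
```

With keys 1, 5 and 9 all hashing to the same slot, `probe(1L)` still returns only the rows that were inserted with key 1.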

pixelpoet, over 1 year ago

Serious question: if performance is the lynchpin, why write it in Java?

Especially considering they use unsafe "heavily", for big joins they could easily just call out to some native code if the surrounding code reaaaaally must be Java (again, why?). Using unsafe Java is the worst of both worlds: you don't get native speed, there's loads of memory overhead from everything being an Object (besides the rest of the VM stuff), you get to "enjoy" GC pauses in the middle of your hot loops, and you have fewer safety guarantees than with something like Rust.

Building a faster hash table for high performance SQL joins
