Fast Perfect Hashing

138 pointsby dmitover 6 years ago

15 comments

munificentover 6 years ago

This article is interesting but... weird. It seems to assume a bunch of things, and then states some conclusions based on that, but then also assumes the value of the conclusions is obvious.> A perfect hash function is one that is collision-free. By implication, the hash must be at least as many bytes as the key and the function is theoretically reversible, though not always tractably so.This is true because of the pigeonhole principle, but most people won't assume that the key size is equivalent to the domain size. Usually when using perfect hashing, you are hashing (much) fewer elements than the total range a key can represent.A perfect hash function that uniquely assigns hash values to the eight items you need to store, but gives you back integers anywhere in the 32 bit range isn't super helpful. Or, at least, it's not obvious to me why it would be. They do state the algorithm can be scaled down to 8 or 16 bits, so I guess there's that.> Because the hash is no smaller than the key, the primary use case is randomizing small values like integral types.This just doesn't make sense to me. The key can be much larger than the hash, even with perfect hashing, as long as the number of keys in your data set is smaller than the maximum hash value.The author says "primary use case" but what they really mean is that this isn't what most people would consider a perfect hash at all. It's just an integer permuter. In other words, here as an even faster implementation of what the author would describe as a perfect hash over small integer keys:<pre><code> uint32_t hash(uint32_t key) { return key; } </code></pre> This also guarantees there are no collisions so is a "perfect hash". It's just not a very useful one.The author assumes the reader knows the value of randomizing your integer keys. I'm not sure what that value is in cases where you know you don't have collisions anyway.

评论 #19156197 未加载

评论 #19155994 未加载

nightcrackerover 6 years ago

The OP seems to be confusing terminology. What they construct is a permutation. Everything they say is relevant to permutations, not perfect hashing.In particular:> A perfect hash function is one that is collision-free. By implication, the hash must be at least as many bytes as the keyThis is just false. Perfect hashing is often if not always performed on subsets of the full domain, e.g. the keywords in a programming language.

评论 #19155183 未加载

评论 #19154822 未加载

gumbyover 6 years ago

> A perfect hash function is one that is collision-free. By implication, the hash must be at least as many bytes as the keyI don't understand this lemma. An indexed array is a degenerate example in which the "hash" (pointer) is presumably a lot smaller than the key (index).BTW for those not familiar with a perfect hash function they are great for things like symbol tables that are immutable at runtime. You can make a very fast lookup that requires no probing (bucket size is always one) so you can also eliminate a lot of implementation both of code and data structure.

评论 #19156039 未加载

评论 #19155433 未加载

评论 #19157521 未加载

评论 #19154793 未加载

inciampatiover 6 years ago

Coincidentally, I was just testing an actual perfect hash function generator <a href="https://github.com/rizkg/BBHash" rel="nofollow">https://github.com/rizkg/BBHash</a>. It requires 2-3 bits per entry irrespective of key or table size. They show that they can build the table for 10^12 entries.One problem is that identifying keys that aren't in the table can be as expensive as storing all the keys. I was thinking that it would be interesting to build two tables with different hash functions for the same set of keys and then record the mapping between them. This would double the table size and add n log n bits to record the relationship, but it would let you make probabilistic (but pretty confident) estimates of what keys weren't in the table, all with a fixed number of bits per key.To clarify, if you map a key into the table with both functions, by construction it will map to a position in each table that are linked together. If a key maps to unlinked positions in each table then we know with probability ~(1-1/n) that the key isn't in the input to the table. When keys are big and n is big you could be very confident in this estimate and save a huge amount of space over typical structures to do this.

chubotover 6 years ago

The algorithms presented below are much faster than most non-perfect hash functions on recent CPUs.This is a very vague statement. I would like to see some numbers, as well as real-world use cases.Last month there was a thread about perfect hashing in the postgres parser, but I wonder if a DFA / trie is better, since you don't have to always look at every byte.<a href="https://news.ycombinator.com/item?id=18880374" rel="nofollow">https://news.ycombinator.com/item?id=18880374</a>

CoolGuySteveover 6 years ago

This is a nice simple mixing function on newer Intel chips.But since the XOR doesn't increase entropy, I think you can save a cacheline load of the 0xDEADBEEF constant by passing _mm_setzero_si128() instead of _mm_set1_epi32(0xDEADBEEF).

Piezoidover 6 years ago

A good blog post about xor-shift-multiply reversible hash functions: <a href="https://nullprogram.com/blog/2018/07/31/" rel="nofollow">https://nullprogram.com/blog/2018/07/31/</a> The inverse functions are easily found by computing the modular inverses of the odd multiplicative factors.Real perfect hashing is more about hashing a non contiguous set of elements with no collisions, unlike the set of all integers that fit in n bits.A different problem is Minimal Perfect Hashing. In addition to being injective, the function should have a minimal image of the input set: the set of keys is mapped to consecutive integers without vacant slots.The implementations in optimal space are non-practical but some libraries construct such functions with a bit more memory (2-3bits/key), like emphf or BBhash (both on github). The latter use a series of 1-hash bloom filters where keys having collision on one layer are removed to be handled by deeper layers. The overall rank() of the bit where a key ultimately lands gives its hash.

berbcover 6 years ago

As pointed out by some of you, this looks more like permutation of sequences. Also, since a symmetric cipher is used, I'm surprised the author didn't mention Black and Rogaway [1].Their algorithm is a permutation (int -> int) that works on a domain of any size up to a limit. A typical application for this is encrypting credit card numbers so that the ciphertext still looks like a credit card number (non-trivial because the size of domain isn't a power of two) or efficiently shuffling sequences, randomly in appearance but repeatably if you know the seed.For instance, this is used by Masscan to randomize the order in which IP addresses and ports are scanned [2]. I've built a Python package that could help you use this algorithm [3] (mostly for fun, but maybe that's useful, let me know :)).[1]: <a href="https://en.wikipedia.org/wiki/Format-preserving_encryption#The_FPE_constructions_of_Black_and_Rogaway" rel="nofollow">https://en.wikipedia.org/wiki/Format-preserving_encryption#T...</a> [2]: <a href="https://github.com/robertdavidgraham/masscan/blob/6c15edc280645b59aadb562333511cebd3405246/src/rand-blackrock.c" rel="nofollow">https://github.com/robertdavidgraham/masscan/blob/6c15edc280...</a> [3]: <a href="https://github.com/bbc2/shuffled" rel="nofollow">https://github.com/bbc2/shuffled</a>

jemfinchover 6 years ago

Seems to accomplish the same purpose as Fibonacci Hashing[0], but with more steps.[0] <a href="https://probablydance.com/2018/06/16/fibonacci-hashing-the-optimization-that-the-world-forgot-or-a-better-alternative-to-integer-modulo/" rel="nofollow">https://probablydance.com/2018/06/16/fibonacci-hashing-the-o...</a>

norswapover 6 years ago

And to think I was excited to finally get a good explanation of what people usually mean when they say "perfect hashing" (i.e, given a set of keys, construct a hashing function that will assign each key a unique hash so as to avoid collision, with a hash-range that has typically little overhead over the number of keys).If someone has good resources on that, I'm still interested. I found it hard to fully internalize the stuff I've found when looking a while back. In particular, I'm confused about the (complexity, time) bounds of such a hash construction process.

评论 #19157912 未加载

beyondCriticsover 6 years ago

As already mentionend, what he describes should be called a bijective mixing function. Now when i am given n different values, i can mix them them with his function and will get n other pairwise different values. That helps me nothing in practive with regard to the task of perfect hashing. Technically i would say, he describes in deed a perfect hashing function, but of questionable quality.

senderistaover 6 years ago

This article is about integer permutations, not perfect hashing. That said, integer permutations have many nice applications. One is in hash tables with integer keys: one can store the "hash code" in lieu of the key, since permutations are reversible.

bmhover 6 years ago

This is awesome! In case you're wondering, the latency of _mm_aesenc_si128 on Skylake and above is only 4 clock cycles

convolvatronover 6 years ago

this..seems extremely cool (?) does someone have a reference or pointer for further study about how this is 'perfect' and not just 'really good'

评论 #19154325 未加载

评论 #19157053 未加载

评论 #19154336 未加载

评论 #19154610 未加载

samirmover 6 years ago

cryptographic hash^