If anyone is interested, I have a cute nearly-minimal perfect hashing algorithm designed to be cache-friendly. It works very well in practice, and in my particular application it performed faster and more consistently (no long tail) than anything I found in CMPH.<p>Briefly, it is somewhat similar to hopscotch (or robin-hood) hashing, except that you pre-calculate the positions of the elements, placing them into optimal spots by solving the assignment problem (via a Hungarian-algorithm solver or similar). Works for up to about 50k elements. It feels like it might have good theoretical properties, maybe even optimal (after all, we're pretty much computing optimal positions for the elements based on the costs of our memory/cache accesses and the access probability of each element), but it's been a while since I took an algorithms class.<p>If anyone is interested in doing a writeup and publishing clean source code - you'd be welcome, ping me via e-mail.
<i>we can optimize memcmp away by doing word-size comparison directly. This is 50% - 2000% faster than doing memcmp on each string</i><p>This is what a good implementation of memcmp would do anyway (and on x86 one can compare up to 16 bytes at once using SSE), but the problem is that memcmp must be generic enough to work for all lengths, so for short comparisons you lose more cycles dispatching among all the special cases than the comparison itself costs.
For the love of FSM, <i>please</i> don't call your library "phash". "phash" is already in wide use for "perceptual hash", i.e. the hashing mechanisms used for things like fuzzy image searching.
One caveat of perfect hash functions, including CMPH, is that for a very large number of keys the data structures they use internally become very large. When I tried CMPH with around 30 million keys, the result was several MBs, which no longer fit in L3 cache. For large dictionaries, ordinary hash functions like xxhash and farmhash are usually a better option than a perfect hash function.
cmph is not the fastest library for MPHFs, mostly because it uses a slow ranking function to turn a PHF into an MPHF.<p>I wrote a small library some time ago that performs much faster, for both in-memory construction and lookups [1, see the linked paper for benchmarks]. Unfortunately I have no time to maintain it, but I recently found out that some projects are using it, so it wasn't all wasted time :)<p>[1] <a href="https://github.com/ot/emphf" rel="nofollow">https://github.com/ot/emphf</a><p></shameless plug>
I still fail to see the enormous fuss about hash tables, as if they were some revelatory data structure.<p>The special cases where you can construct a perfect hash are normally amenable to assigning each item a fixed index 'hash', storing the items in an array, and then always referring to them by index in code, never looking them up by value.<p>In heavily compiled languages like C and C++, when you know everything at compile time, the run-time cost should become zero. Anything more is a failure to use the language to sufficiently inform the compiler of what is happening.<p>I suspect this is why perfect hashes do not see widespread use: there is a solution for the most common use cases with the exceptional run-time complexity of O(0).<p>That is enormously faster than the necessarily more complicated perfect hash functions, and it only falls down when you have a run-time requirement to look up by value instead of by key... which in my experience goes hand in hand with not knowing your data until run time, which rules out a perfect hash function anyway.<p>In most run-time cases a binary tree is good enough; it's very rare that you need better, and even when you do, the optimisation can often be found by pool-allocating the tree's nodes or similar. The usual array-of-linked-lists approach can be better for small data sets too... but a perfect hash function isn't even applicable there.<p>All that said, a perfect hash is the right solution in those exceptionally rare cases where you know your data ahead of time but need to look it up by value, rather than by a precomputed index, at run time.