This is a good-looking hash; just from the construction I can see that it should be extremely fast for bulk hashing of large objects, with strong statistical properties.<p>I do want to comment for the wider audience on why the details of this construction do not lend themselves to small keys, and how you would fix that. I have an (unpublished, I'm lazy) AES-based hash function that is very fast for both bulk hashing and small keys, which required a mostly performance-neutral tweak to the bulk hashing loop (it likely fits in an unused ALU execution port).<p>Large-key performance is entirely in the bulk loop algorithm (an AES round in this case), and small-key performance is entirely in the mixer/finalizer algorithm. If you have a very wide bulk loop with no mixing of the individual lanes, it usually requires a deep mixing stage that makes small keys very expensive, since that is a fixed overhead. The question then becomes: how do you cheaply "bank" dispersal of bits across lanes in the bulk loop to shorten the mixing stage <i>without</i> creating an operation dependency between lanes that would massively reduce throughput?<p>As an important observation, vectorizing the AES lanes makes it difficult to disperse bits because there are no good, cheap operations that move bits between lanes. However, the CPU will happily run multiple AES operations and other 128-bit operations concurrently across ALU ports if they are in independent registers. This lets you trivially create a "mixing" lane that exists solely to aid bit dispersal, because you are no longer trying to do sideways operations on a vector. So what does a mixing lane look like that doesn't create dependencies on your hashing lanes? It is so simple it is embarrassing: XOR the input data lanes.
The mixing lane is worthless as a hash but it only exists to represent all the bits from all the lanes.<p>At the mixing stage, you simply fold the mixing lane into each of the hashing lanes by running it through a round of AES -- one concurrent operation. This allows you to avoid the deep mixing stage altogether, which means small keys can be hashed very quickly since almost all of the computation is in the finalizer. Using this technique, I've been able to hash small keys in <30 cycles with concurrent AES lanes.
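The scheme described above can be sketched in plain C. To be clear, this is my own illustration, not actual code from either hash: `mix()` is a hypothetical stand-in for an AESDEC round (the real thing would use `_mm_aesdec_si128` on 128-bit lanes), and the seeds and constants are arbitrary.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-in for one AES round; any decent mixing step
   illustrates the data flow (real code would use _mm_aesdec_si128). */
static uint64_t mix(uint64_t state, uint64_t input)
{
    state ^= input;
    state *= 0x9E3779B97F4A7C15ULL;   /* arbitrary odd constant */
    return state ^ (state >> 29);
}

#define LANES 4

uint64_t hash_with_mixing_lane(const uint64_t *blocks, size_t nblocks)
{
    uint64_t lane[LANES] = {1, 2, 3, 4};  /* arbitrary seeds */
    uint64_t mixing = 0;                  /* the extra "mixing" lane */

    for (size_t i = 0; i < nblocks; ++i) {
        /* Each hashing lane advances independently... */
        lane[i % LANES] = mix(lane[i % LANES], blocks[i]);
        /* ...while the mixing lane just XORs the raw input: it never
           reads the hashing lanes, so it creates no dependency on them. */
        mixing ^= blocks[i];
    }

    /* Short finalizer: fold the mixing lane into every hashing lane.
       The folds are independent of each other, so a CPU can run them
       concurrently across ALU ports. */
    uint64_t h = 0;
    for (int j = 0; j < LANES; ++j)
        h ^= mix(lane[j], mixing);
    return h;
}
```

The point of the sketch is the shape of the dependency graph: the mixing lane costs one XOR per block, and the finalizer is one fold per lane instead of a deep cross-lane mixing cascade.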
<i>To our surprise, we found a lack of published, well-optimized, large-data hash functions. Most hash work seems to focus on small input sizes (for things like dictionary lookup) or on cryptographic quality. We wanted the fastest possible hash that would be collision-free in practice (like SHA-1 was), and we didn’t need any cryptograhic [sic] security.</i><p>This is an important passage that could be explained in more detail. I can look up a list of non-cryptographic hash functions, nine of which support 128-bit output:<p><a href="https://en.wikipedia.org/wiki/List_of_hash_functions#Non-cryptographic_hash_functions" rel="nofollow">https://en.wikipedia.org/wiki/List_of_hash_functions#Non-cry...</a><p>Given that the claim is that Meow Hash is fast, where's the benchmarking to prove it's faster than the alternatives?
Late to the party, some actual benchmarks using an older CPU with AES-NI:
Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
Kernel 4.18.14
GCC 7.3.0<p>Short version: it's ~2x as fast as anything else on my machine for data > 4k (maybe down to 1k), all the way up past cache size (where various other algorithms also max out memory bandwidth). Other non-cryptographic hashes (for comparison) on my local machine: xxHash64 -> ~13G/s, Metrohash128 -> ~13G/s, Metrohash128_crc -> ~18G/s, t1ha_crc -> ~17.5G/s, etc., but Meow doesn't break the 1G/s barrier until 64-96 bytes (the others mentioned reach that speed at 8 bytes).<p>So yeah, faster for longer keys due to the fast AES mixing + parallel pipeline usage, but slower for shorter keys due to the final mixing dependency at the end.<p>Using a hacked-up version of SMHasher for a wider range of hashing sizes (cache on this CPU is 20 megs):<p><pre><code> --- Testing Meow1_64 "parallel aes internals"
[[[ Speed Tests ]]]
Bulk speed test - 67108864-byte keys
Average - 4.556 bytes/cycle - 14337.09 MiB/sec @ 3 ghz
Bulk speed test - 16777216-byte keys
Average - 7.882 bytes/cycle - 24805.52 MiB/sec @ 3 ghz
Bulk speed test - 4194304-byte keys
Average - 9.813 bytes/cycle - 30884.30 MiB/sec @ 3 ghz
Bulk speed test - 1048576-byte keys
Average - 9.657 bytes/cycle - 30390.48 MiB/sec @ 3 ghz
Bulk speed test - 262144-byte keys
Average - 10.642 bytes/cycle - 33490.19 MiB/sec @ 3 ghz
Bulk speed test - 65536-byte keys
Average - 11.871 bytes/cycle - 37360.76 MiB/sec @ 3 ghz
Bulk speed test - 16384-byte keys
Average - 11.336 bytes/cycle - 35675.98 MiB/sec @ 3 ghz
Bulk speed test - 4096-byte keys
Average - 8.241 bytes/cycle - 25936.55 MiB/sec @ 3 ghz
Bulk speed test - 1024-byte keys
Average - 4.056 bytes/cycle - 12765.26 MiB/sec @ 3 ghz</code></pre>
I just ran a simple collision test against it with real life, prod data. It resulted in 13 collisions on 116275 files in 512b mode.<p><a href="https://github.com/cmuratori/meow_hash/issues/7#issuecomment-431657191" rel="nofollow">https://github.com/cmuratori/meow_hash/issues/7#issuecomment...</a>
How does it compare to:<p>FarmHash<p>CityHash<p>SMHasher<p>MetroHash<p>FNV1a_YT<p>Note that if you're just using it for a hash table, where you'll compare the full strings to resolve hash collisions, you don't actually need good randomness. FNV1a_YT is faster than most and has worse collision properties, yet provides a faster overall hash table, because the faster hash (paid on every operation) outweighs the extra full-string checks (rare).<p><a href="https://github.com/rurban/smhasher" rel="nofollow">https://github.com/rurban/smhasher</a>
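For reference on why an FNV-style hash can win in a hash table despite weak dispersion: it is about as small as a hash gets, one XOR and one multiply per byte. (FNV1a_YT in SMHasher is a vectorized variant; this is the plain scalar 64-bit FNV-1a.)

```c
#include <stdint.h>
#include <stddef.h>

/* 64-bit FNV-1a: XOR each byte into the state, then multiply by the
   FNV prime. Weak avalanche behavior, but only two cheap ops per byte. */
uint64_t fnv1a64(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint64_t h = 14695981039346656037ULL;  /* FNV-1a 64-bit offset basis */
    for (size_t i = 0; i < len; ++i) {
        h ^= p[i];
        h *= 1099511628211ULL;             /* FNV 64-bit prime */
    }
    return h;
}
```

With a loop this short, the per-lookup cost is dominated by the table probe itself, which is the commenter's point about total hash-table throughput.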
Yep, unrolled AES instructions are the way to go if you need to hash gigs of stuff and you're on a CPU that supports them.<p>-Austin, SMHasher author.
Never heard of these guys, but they have several hundred hours of video footage of building a game from scratch: <a href="https://www.youtube.com/user/handmadeheroarchive/videos" rel="nofollow">https://www.youtube.com/user/handmadeheroarchive/videos</a>. Looks interesting.
I wonder if speed will still be a reason for doing something like this going forward now that Intel is starting to include SHA acceleration instructions in its processors.
I just happen to need a fast, non-cryptographic checksum for a large amount of data. I assumed CRC32C would be the fastest thing available on x86_64, since SSE 4.2 includes an intrinsic for it.<p>Can I expect Meow Hash to be faster than that on x86_64? What about on other platforms, such as common ARM chips? Which one is more collision resistant?<p>Also: the author seems to be Casey Muratori, who also made the Handmade Hero series! Kudos.
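For anyone comparing against it: CRC32C is the Castagnoli polynomial (reflected form 0x82F63B78). Below is a portable bit-at-a-time reference version; the SSE 4.2 intrinsics (`_mm_crc32_u8`/`_mm_crc32_u64`) compute the same function in hardware, several bytes per cycle.

```c
#include <stdint.h>
#include <stddef.h>

/* CRC-32C (Castagnoli), bitwise reference implementation. A
   hardware-accelerated version would feed 8 bytes at a time to
   _mm_crc32_u64 instead of looping over bits. */
uint32_t crc32c(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; ++i) {
        crc ^= p[i];
        for (int k = 0; k < 8; ++k)
            crc = (crc >> 1) ^ ((crc & 1) ? 0x82F63B78u : 0);
    }
    return ~crc;
}
```

The standard check value is `crc32c("123456789") == 0xE3069283`, which is a quick way to verify any implementation, hardware or software, is really CRC-32C and not plain CRC-32.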
Will this hash be useful for checking whether the content of files of all sizes has changed? I'm thinking about storing the hash of a file in a database to detect when changes have taken place.
<a href="https://github.com/cmuratori/meow_hash/blob/master/meow_hash.h" rel="nofollow">https://github.com/cmuratori/meow_hash/blob/master/meow_hash...</a><p>What do AESRotate(foo,bar) and AESMerge(foo,bar) do?<p>[edit]<p>So it's a simple cyclic rotation preceded by A.Q0 = _mm512_aesdec_epi128(A.Q0, B.Q0)<p>From Googling, _mm512_aesdec_epi128 looks like an Intel intrinsic.
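Going by the name and the edit above, the step appears to be: merge input into each wide lane with an AES round, then cyclically rotate the lanes so every lane eventually sees every input position. A rough plain-C reconstruction of that shape (this is a hypothetical sketch, not the actual source; 64-bit lanes and a stand-in round function replace the real 128-bit `aesdec` lanes):

```c
#include <stdint.h>
#include <stddef.h>

#define LANES 4

/* Hypothetical stand-in for the AES decrypt round the real code uses. */
static uint64_t aes_round_stub(uint64_t state, uint64_t key)
{
    state ^= key;
    state *= 0xFF51AFD7ED558CCDULL;  /* arbitrary mixing constant */
    return state ^ (state >> 33);
}

/* Merge input into each lane, then rotate the lanes by one position so
   that, over successive calls, every lane absorbs every input slot. */
void aes_merge_rotate(uint64_t lane[LANES], const uint64_t in[LANES])
{
    uint64_t merged[LANES];
    for (int i = 0; i < LANES; ++i)
        merged[i] = aes_round_stub(lane[i], in[i]);

    for (int i = 0; i < LANES; ++i)
        lane[i] = merged[(i + 1) % LANES];  /* cyclic rotation */
}
```

The rotation is what slowly disperses bits across lanes without any cross-lane instruction inside a single step.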
> Don’t use the Meow hash for tiny inputs. The minimum block size for the Meow hash is 256 bytes, so if you’re feeding it 8 byte data, 16 byte data, etc., it’s going to waste all of its time padding the input<p>This seems like an important caveat that means it will only be used in specialized applications.
I think releasing it under yet another open-source license, instead of one of the well-known ones, severely reduces the chances of this code being used by companies.