It looks like the speedup is coming from two main changes.<p>The first is reducing the number of rounds from 10 to 7. Think of it like making a smoothie: you add bits of fruit to the drink (the input data), then pulse the blades to blend it up (producing the output hash). This change runs the blades for 7 seconds instead of 10 each time fruit is added. They cite evidence that the extra 3 seconds aren't doing much - once the fruit's fully liquid, extra blending doesn't help - but I worry that this reduces the security margin. Maybe those extra 3 rounds aren't useful against current attacks, but they may be useful against unknown future attacks.<p>The other change is to break the input into 1KiB chunks and hash each chunk independently, then combine the individual chunk hashes into a single root hash using a binary tree. The benefit is that if you have 4KiB of data, you can use 4-way SIMD instructions to process all four chunks simultaneously. The more data you have, the more parallelism you can unlock, unlike traditional hash functions that process everything sequentially. Separately, modern SIMD units can do two 32-bit operations as fast as one 64-bit operation, so building the algorithm out of 32-bit arithmetic costs nothing there, while giving a big boost to low-end 32-bit CPUs that struggle with 64-bit arithmetic. The tree structure is a big win overall.
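For the curious, the chunk-and-tree idea can be sketched in a few lines of Python. This uses hashlib.blake2s as a stand-in compression function and is emphatically not real BLAKE3, which also mixes in chunk counters, domain-separation flags, and its own IVs, and merges subtrees in a specific order:

```python
import hashlib

CHUNK_SIZE = 1024  # 1 KiB chunks, as in BLAKE3

def chunk_hash(chunk: bytes) -> bytes:
    # Stand-in for BLAKE3's chunk compression; the real thing also
    # mixes in a chunk counter and domain-separation flags.
    return hashlib.blake2s(chunk).digest()

def tree_hash(data: bytes) -> bytes:
    # Hash each 1 KiB chunk independently -- this is the step that
    # can run 4/8/16-wide with SIMD, or across threads.
    leaves = [chunk_hash(data[i:i + CHUNK_SIZE])
              for i in range(0, len(data), CHUNK_SIZE)] or [chunk_hash(b"")]
    # Combine pairs of subtree hashes up a binary tree; an odd
    # leftover subtree is carried up unmerged (simplified here).
    while len(leaves) > 1:
        leaves = [hashlib.blake2s(leaves[i] + leaves[i + 1]).digest()
                  if i + 1 < len(leaves) else leaves[i]
                  for i in range(0, len(leaves), 2)]
    return leaves[0]
```

With 4KiB of input you get four independent `chunk_hash` calls, which is exactly the work a SIMD or multi-threaded implementation runs in parallel.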
Another benchmark:<p>time openssl sha256 /tmp/bigfile<p><pre><code> real 0m28.160s
user 0m27.750s
sys 0m0.272s
</code></pre>
time shasum -a 256 /tmp/bigfile<p><pre><code> real 0m6.146s
user 0m5.407s
sys 0m0.560s
</code></pre>
time b2sum /tmp/bigfile<p><pre><code> real 0m1.732s
user 0m1.450s
sys 0m0.244s
</code></pre>
time b3sum /tmp/bigfile<p><pre><code> real 0m0.212s
user 0m0.996s
sys 0m0.379s
</code></pre>
TIL the openssl sha256 invocation is really slow compared to the shasum program. Also, BLAKE3 is <i>really</i> fast.<p>Edit: bigfile is 1GB of /dev/random
So this is bao + blake2?<p>I remember watching a talk a while ago on Bao, "a general purpose cryptographic tree hash, and perhaps the fastest hash function in the world": <a href="https://www.youtube.com/watch?v=Dya9c2DXMqQ" rel="nofollow">https://www.youtube.com/watch?v=Dya9c2DXMqQ</a><p>Nice job!
Looks like they've taken <i>Too Much Crypto</i> to heart[1] and dropped the number of rounds from Blake2B's 12 down to 7 for Blake3:<p><a href="https://github.com/BLAKE3-team/BLAKE3/blob/master/reference_impl/reference_impl.rs#L83-L95" rel="nofollow">https://github.com/BLAKE3-team/BLAKE3/blob/master/reference_...</a><p><a href="https://github.com/BLAKE2/BLAKE2/blob/master/ref/blake2b-ref.c#L200-L211" rel="nofollow">https://github.com/BLAKE2/BLAKE2/blob/master/ref/blake2b-ref...</a><p>Which, yeah, that alone will get you a significant improvement over Blake2B, but it definitely doesn't account for the huge improvement they're showing. Most of that, I think, is the ability to take advantage of AVX512 parallelism; the difference should be more incremental on AVX2-only amd64 and other platforms.<p>[1]: Well, TMC recommended 8 rounds for Blake2B and 7 for Blake2S.
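To make the round-count tradeoff concrete, here's a sketch of the 32-bit ARX quarter-round G shared (up to message scheduling and constants) by BLAKE2s and BLAKE3, written from memory, so treat the details as illustrative rather than authoritative. The point is that compression cost is linear in the round count, so 7 rounds vs 12 is directly ~40% less mixing work:

```python
MASK32 = 0xFFFFFFFF

def rotr32(x: int, n: int) -> int:
    # 32-bit rotate right.
    return ((x >> n) | (x << (32 - n))) & MASK32

def g(state, a, b, c, d, x, y):
    # The add-rotate-xor quarter-round; rotation amounts 16/12/8/7
    # are as I recall them from the BLAKE2s spec.
    state[a] = (state[a] + state[b] + x) & MASK32
    state[d] = rotr32(state[d] ^ state[a], 16)
    state[c] = (state[c] + state[d]) & MASK32
    state[b] = rotr32(state[b] ^ state[c], 12)
    state[a] = (state[a] + state[b] + y) & MASK32
    state[d] = rotr32(state[d] ^ state[a], 8)
    state[c] = (state[c] + state[d]) & MASK32
    state[b] = rotr32(state[b] ^ state[c], 7)

def toy_permute(words, rounds):
    # Toy loop (column step only, zero message words): each extra
    # round is one more full pass of G over the 16-word state, so
    # total cost scales linearly with `rounds`.
    state = list(words)
    for _ in range(rounds):
        for i in range(4):
            g(state, i, i + 4, i + 8, i + 12, 0, 0)
    return state
```

A real round also does a "diagonal" step and feeds in scheduled message words; the omitted parts don't change the linear cost-per-round picture.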
That's impressive speedup. I just installed it and holy moly it really is fast. All those extra cores can finally get busy. ;)<p>b3sum -l 256 big-2.6Gfile<p><pre><code> real 0m0.384s
user 0m2.302s
sys 0m0.175s
</code></pre>
b2sum -l 256 big-2.6Gfile<p><pre><code> real 0m3.616s
user 0m3.360s
sys 0m0.256s
</code></pre>
(Intel® Core™ i7-8550U CPU @ 1.80GHz × 8 )<p>EDIT: ah, the catch. blake3 targets 128 bit security. It competes with SipHash for speed and security<p>EDIT2 scratch the previous edit.
I can't seem to find any non-Rust implementations in the works yet, so I may sit down and adapt the reference implementation to C# this weekend. Does anyone know how single-threaded (or few-thread) performance holds up without AVX512?
smhasher results, without the Rust version yet (which should be ~2x faster):<p><a href="http://rurban.github.io/smhasher/doc/table.html" rel="nofollow">http://rurban.github.io/smhasher/doc/table.html</a><p>It's of course much faster than most of the other crypto hashes, but not faster than the hardware variants, SHA1-NI and SHA256-NI.
About 4x faster than blake2.<p>Faster than SipHash, but not faster than SipHash13.<p>The tests fail dramatically on MomentChi2, which describes how well a user-provided random seed is mixed in. I tried mixing a seed into IV[0], as with all other hardened crypto hashes, and into all 8 IVs, which didn't help. So I'm not convinced that a seeded IV is properly mixed in. That's outside the usage pattern of a crypto or digest hash (b3sum), but inside normal usage.<p>The Rust staticlib is still in the works; it would parallelize the hashing in chunks for big keys. For small keys it should even be a bit slower. b3sum is so much faster because it uses many more tricks, such as mmap.
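For what it's worth, the designed-in way to mix a seed into the BLAKE family is keyed mode rather than patching IV words: the key gets run through a full compression before any input is absorbed. A sketch with Python's hashlib BLAKE2 bindings (BLAKE3's keyed mode is analogous, but hashlib doesn't ship it):

```python
import hashlib

def seeded_digest(data: bytes, seed: bytes) -> bytes:
    # blake2s accepts a key of up to 32 bytes; keyed mode pads the
    # key to a full block and compresses it first, which is stronger
    # mixing than overwriting an IV word.
    return hashlib.blake2s(data, key=seed).digest()
```

So for smhasher-style seeding, routing the seed through the key (or salt) parameter is the supported path; ad-hoc IV patching is off the documented map.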
Why are there no AES-like hashing algorithms out there? The AES design is very suitable as a building block in a hash if you remove the "add round key" operation.
Is it possible to benchmark against blake2 etc. with the same number of rounds - both reducing blake2's rounds and increasing blake3's? Also, in that vein, could a version with more rounds win over the "paranoid" crowd, by being essentially a faster Blake2 thanks to SIMD, with extra features thanks to the Merkle tree?
<p><pre><code> Benchmark #1: cat b1
Time (mean ± σ): 1.076 s ± 0.007 s [User: 5.3 ms, System: 1069.4 ms]
Range (min … max): 1.069 s … 1.093 s 10 runs
Benchmark #2: sha256sum b1
Time (mean ± σ): 6.583 s ± 0.064 s [User: 5.440 s, System: 1.137 s]
Range (min … max): 6.506 s … 6.695 s 10 runs
Benchmark #3: sha1sum b1
Time (mean ± σ): 6.322 s ± 0.086 s [User: 5.212 s, System: 1.103 s]
Range (min … max): 6.214 s … 6.484 s 10 runs
Benchmark #4: b2sum b1
Time (mean ± σ): 13.184 s ± 0.108 s [User: 12.090 s, System: 1.080 s]
Range (min … max): 13.087 s … 13.382 s 10 runs
Benchmark #5: b3sum b1
Time (mean ± σ): 577.0 ms ± 5.4 ms [User: 12.276 s, System: 0.669 s]
Range (min … max): 572.4 ms … 587.0 ms 10 runs
Benchmark #6: md5sum b1
Time (mean ± σ): 14.851 s ± 0.175 s [User: 13.717 s, System: 1.117 s]
Range (min … max): 14.495 s … 15.128 s 10 runs
Summary
'b3sum b1' ran
1.86 ± 0.02 times faster than 'cat b1'
10.96 ± 0.18 times faster than 'sha1sum b1'
11.41 ± 0.15 times faster than 'sha256sum b1'
22.85 ± 0.28 times faster than 'b2sum b1'
25.74 ± 0.39 times faster than 'md5sum b1'
</code></pre>
gotdang that's some solid performance. (Here running against 10GiB of random bytes; the machine has the SHA ASM extensions, which is why sha256/sha1 perform so well.)<p>edit: actually not a straight algorithm comparison, as b3sum here benefits heavily from multi-threading; without that it looks more like this:<p><pre><code> Benchmark #1: cat b1
Time (mean ± σ): 1.090 s ± 0.007 s [User: 2.9 ms, System: 1084.8 ms]
Range (min … max): 1.071 s … 1.096 s 10 runs
Benchmark #2: sha256sum b1
Time (mean ± σ): 6.480 s ± 0.097 s [User: 5.359 s, System: 1.115 s]
Range (min … max): 6.346 s … 6.587 s 10 runs
Benchmark #3: sha1sum b1
Time (mean ± σ): 6.120 s ± 0.090 s [User: 5.027 s, System: 1.082 s]
Range (min … max): 5.979 s … 6.233 s 10 runs
Benchmark #4: b2sum b1
Time (mean ± σ): 12.866 s ± 0.208 s [User: 11.722 s, System: 1.133 s]
Range (min … max): 12.549 s … 13.124 s 10 runs
Benchmark #5: b3sum b1
Time (mean ± σ): 5.813 s ± 0.079 s [User: 4.606 s, System: 1.202 s]
Range (min … max): 5.699 s … 5.933 s 10 runs
Benchmark #6: md5sum b1
Time (mean ± σ): 14.355 s ± 0.184 s [User: 13.305 s, System: 1.039 s]
Range (min … max): 14.119 s … 14.605 s 10 runs
Summary
'cat b1' ran
5.33 ± 0.08 times faster than 'b3sum b1'
5.62 ± 0.09 times faster than 'sha1sum b1'
5.95 ± 0.10 times faster than 'sha256sum b1'
11.81 ± 0.21 times faster than 'b2sum b1'
13.17 ± 0.19 times faster than 'md5sum b1'
</code></pre>
still beating the dedicated sha extensions, but not nearly as dramatically.
Where is this useful?<p>I'm guessing not for password hashes simply because a fast hash is bad for passwords (makes brute forcing/rainbow tables easier).<p>So is this mostly just for file signing?
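Right - for password storage you want the opposite of fast. Fast hashes like this are for integrity checks, content addressing (dedup, Merkle trees, verified streaming), and key derivation from high-entropy input. A sketch of the split using Python's hashlib, with blake2b standing in for a fast content hash (hashlib has no BLAKE3) and scrypt as the deliberately expensive password side; the parameters here are illustrative, not a tuning recommendation:

```python
import hashlib

# Fast hash: file/content integrity -- speed is a feature.
def content_id(data: bytes) -> str:
    return hashlib.blake2b(data, digest_size=32).hexdigest()

# Password storage: speed is a bug; use a deliberately expensive,
# memory-hard KDF with a per-password salt instead.
def password_hash(password: bytes, salt: bytes) -> bytes:
    return hashlib.scrypt(password, salt=salt, n=2**14, r=8, p=1, dklen=32)
```

Same hashlib module, opposite design goals: the first should saturate your disk, the second should make each guess cost real CPU and memory.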