> Try zstd --adapt. This feature automatically adapts the compression ratio as the IO conditions change to make the current optimal tradeoff between CPU and “keeping the pipe fed”.

I didn't know about that, that's neat.

These types of articles come up often, and it's good to proselytize about better algorithms. However, the end of the article hints at an issue: most of the hashing and compression in my life is done embedded in some system or protocol that I can't easily change. Yeah, Docker and Debian and Firefox should use zstd, but there's not much I can do about it. I may reach for zstd when I'm moving a big file between systems, but I'd have to install it first, and much of the time that's not worthwhile.
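Not from the comment itself, just a sketch of how one might wire zstd --adapt into a one-off transfer from Python; the directory, host, and remote command are made up, and zstd picks its own compression level based on how fast the pipe is draining.

    # Hypothetical pipeline: tar -> zstd --adapt -> ssh.
    # zstd raises or lowers its level depending on output backpressure.
    import subprocess

    src_dir = "/data/big-dataset"     # made-up path
    remote = "user@backup-host"       # made-up host

    tar = subprocess.Popen(["tar", "-cf", "-", src_dir], stdout=subprocess.PIPE)
    zstd = subprocess.Popen(["zstd", "--adapt", "-T0", "-c"],
                            stdin=tar.stdout, stdout=subprocess.PIPE)
    ssh = subprocess.Popen(["ssh", remote, "zstd -d -c > backup.tar"],
                           stdin=zstd.stdout)
    tar.stdout.close()   # let SIGPIPE propagate if a downstream stage exits early
    zstd.stdout.close()
    ssh.wait()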
> For example, one situation where you want an intentionally slow algorithm is when dealing with passwords.

This is true, but there's also more to the story. Modern password hashes like Argon2 force attackers into a "time-memory tradeoff", to try to reduce the advantage that specialized hardware has over the general-purpose computers that human beings use. I find that a lot of folks have memorized a summary like "slow hashes are good", and that's doubly unfortunate, because 1) like the article says, fast hashes are good, and 2) slow hashes in and of themselves are no longer the best we can do for password security. I often wish that password hashes weren't even called "hashes", because it's just a very different problem that they're solving.

> or even just a high number of rounds of SHA-512

Please god no :)
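To make the "memory-hard, not just slow" point concrete, here's a minimal sketch using the argon2-cffi package (assuming it's installed); the cost parameters are illustrative, not a recommendation.

    # Sketch: Argon2 password hashing via argon2-cffi.
    # time_cost / memory_cost / parallelism values here are illustrative.
    from argon2 import PasswordHasher
    from argon2.exceptions import VerifyMismatchError

    ph = PasswordHasher(time_cost=3, memory_cost=64 * 1024, parallelism=4)  # 64 MiB

    encoded = ph.hash("correct horse battery staple")  # salt is generated internally

    try:
        ph.verify(encoded, "correct horse battery staple")
        print("password ok")
    except VerifyMismatchError:
        print("wrong password")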
There are a lot of interesting algorithms, but as the author points out, he had to move the data to a RAM drive to avoid the disk access being the limiting factor. For a lot of use cases, it's not the CPU that is going to limit you.

I did love the anecdote about adding gzip moving the bottleneck from the network to the CPU, and actually slowing down the whole system.
These are some of my favorite tricks. SHA2-256 is never “fast” but it can be really slow if you’re using a non-SIMD implementation (openssl uses SIMD).

A trick not mentioned here: for Python devs, import “orjson” instead of the json standard library; it is usually a drop-in replacement.
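The one gotcha with "drop-in": orjson.dumps returns bytes rather than str, so code that expects a str needs an explicit decode. A tiny sketch:

    import json
    import orjson  # pip install orjson

    data = {"user": "alice", "scores": [1, 2, 3]}

    std = json.dumps(data)     # -> str
    fast = orjson.dumps(data)  # -> bytes, typically much faster

    assert json.loads(std) == orjson.loads(fast)

    # If downstream code expects a str, decode explicitly:
    fast_str = orjson.dumps(data).decode("utf-8")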
There's always a huge focus on speed in these types of articles, and rightfully so for most applications.

However, I often find myself in situations where I care about neither speed nor any of the other traditional performance metrics (memory consumption, latency, bandwidth, parallelizability, etc.).

In these situations, what I actually care about is:

  a) code size (as in: fits compiled in 512 bytes)
  b) code simplicity (as in: fits in head, takes up around 20 C++ LOC)
Very unfortunately, there are very, very few algorithms that are designed to optimize along these lines.

Exceptions:

TEA: https://en.wikipedia.org/wiki/Tiny_Encryption_Algorithm

Speck: https://en.wikipedia.org/wiki/Speck_(cipher)

Both of these are ciphers, and both can be perverted into becoming hashes for various scenarios.

But there aren't any compression or *native* crypto-hard hashing algorithms that I know of that are specifically designed to optimize along that particular dimension.

Blake3 and zstd are *large* pieces of code.
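For a sense of just how small TEA is, here's the reference algorithm transcribed into Python (the C original is about ten lines); note this is the raw block cipher, not one of the hash constructions mentioned above.

    # Tiny Encryption Algorithm: 64-bit block (v0, v1), 128-bit key k, 32 rounds.
    MASK = 0xFFFFFFFF
    DELTA = 0x9E3779B9

    def tea_encrypt(v0, v1, k):
        total = 0
        for _ in range(32):
            total = (total + DELTA) & MASK
            v0 = (v0 + ((((v1 << 4) & MASK) + k[0]) ^ ((v1 + total) & MASK) ^ ((v1 >> 5) + k[1]))) & MASK
            v1 = (v1 + ((((v0 << 4) & MASK) + k[2]) ^ ((v0 + total) & MASK) ^ ((v0 >> 5) + k[3]))) & MASK
        return v0, v1

    def tea_decrypt(v0, v1, k):
        total = (DELTA * 32) & MASK
        for _ in range(32):
            v1 = (v1 - ((((v0 << 4) & MASK) + k[2]) ^ ((v0 + total) & MASK) ^ ((v0 >> 5) + k[3]))) & MASK
            v0 = (v0 - ((((v1 << 4) & MASK) + k[0]) ^ ((v1 + total) & MASK) ^ ((v1 >> 5) + k[1]))) & MASK
            total = (total - DELTA) & MASK
        return v0, v1

    # Round-trip sanity check with an arbitrary key and block.
    key = (0x01234567, 0x89ABCDEF, 0xFEDCBA98, 0x76543210)
    assert tea_decrypt(*tea_encrypt(0xDEADBEEF, 0xCAFEBABE, key), key) == (0xDEADBEEF, 0xCAFEBABE)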
This is a great list. To add two categories:

If you need Bloom filters, then use split block Bloom filters [1], which use SIMD to increase speed by 30%-450%.

If you need erasure coding, then use Cauchy Reed-Solomon instead of plain Reed-Solomon. Cauchy Reed-Solomon uses pure XOR, so you can do erasure coding at the speed of per-core memory bandwidth.

[1] https://arxiv.org/abs/2101.01719
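For anyone curious what "split block" means structurally, here is a scalar Python sketch of the layout from the paper: the filter is an array of 256-bit blocks, and each key sets exactly one bit in each of a block's eight 32-bit lanes (the SIMD win comes from doing all eight lanes at once). The salt constants below are illustrative odd multipliers, not a spec, and the 64-bit hash is assumed to come from whatever fast hash you already use.

    MASK32 = 0xFFFFFFFF
    # Illustrative odd 32-bit multipliers; real implementations fix specific salts.
    SALTS = [0x47B6137B, 0x44974D91, 0x8824AD5B, 0xA2B7289D,
             0x705495C7, 0x2DF1424B, 0x9EFC4947, 0x5C6BFB31]

    class SplitBlockBloom:
        def __init__(self, num_blocks):
            self.num_blocks = num_blocks
            self.blocks = [[0] * 8 for _ in range(num_blocks)]  # 8 x 32-bit lanes per block

        def _locate(self, h64):
            block = (h64 >> 32) % self.num_blocks   # upper hash bits choose the block
            h32 = h64 & MASK32                      # lower bits choose one bit per lane
            bits = [((h32 * s) & MASK32) >> 27 for s in SALTS]  # top 5 bits -> 0..31
            return block, bits

        def add(self, h64):
            block, bits = self._locate(h64)
            for lane, bit in enumerate(bits):
                self.blocks[block][lane] |= 1 << bit

        def might_contain(self, h64):
            block, bits = self._locate(h64)
            return all((self.blocks[block][lane] >> bit) & 1
                       for lane, bit in enumerate(bits))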
AWS S3 right now supports CRC32, CRC32C, SHA-1, and SHA-256 (https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html), which is interesting given that the announcement adding support for the four wasn't long ago (https://aws.amazon.com/blogs/aws/new-additional-checksum-algorithms-for-amazon-s3/). It seems like they went with what's more widely used rather than what's actually fastest?
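If you want S3 to store and verify one of those checksums on upload, the boto3 call looks roughly like this (bucket and key names are placeholders, and this is going from the documented parameters rather than the announcement):

    import boto3

    s3 = boto3.client("s3")

    # Upload with a server-verified CRC32C checksum.
    with open("backup.tar.zst", "rb") as f:
        s3.put_object(
            Bucket="my-example-bucket",
            Key="backups/backup.tar.zst",
            Body=f,
            ChecksumAlgorithm="CRC32C",  # also accepts CRC32, SHA1, SHA256
        )

    # Retrieve the stored checksum later without downloading the object.
    head = s3.head_object(
        Bucket="my-example-bucket",
        Key="backups/backup.tar.zst",
        ChecksumMode="ENABLED",
    )
    print(head.get("ChecksumCRC32C"))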
We are considering using MD5 for user id + salt hashing for randomizing users for A/B testing. Should blake3 or xxhash work as a replacement for that use case?
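For deterministic bucketing you only need a stable, well-distributed mapping, not any cryptographic property, so a fast hash like xxhash works fine. A sketch of what that might look like (the salt and split below are made up):

    # Deterministic A/B assignment from (experiment salt, user id) with xxhash.
    import xxhash

    EXPERIMENT_SALT = "checkout-flow-2024"  # made-up experiment identifier

    def assign_variant(user_id: str, treatment_fraction: float = 0.5) -> str:
        h = xxhash.xxh64_intdigest(f"{EXPERIMENT_SALT}:{user_id}".encode("utf-8"))
        bucket = h % 10_000                 # 10k buckets for fine-grained splits
        return "treatment" if bucket < treatment_fraction * 10_000 else "control"

    print(assign_variant("user-12345"))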