Nice! (I've been meaning to write up this Apple M1 ~60GB/s version, which I think is similar: <a href="https://gist.github.com/dougallj/66151f1c509484a42fe0abd0d84d056d" rel="nofollow">https://gist.github.com/dougallj/66151f1c509484a42fe0abd0d84...</a> )
Here's another SIMD implementation, with commentary:
<a href="https://github.com/google/wuffs/blob/main/std/adler32/common_up_x86_sse42.wuffs" rel="nofollow">https://github.com/google/wuffs/blob/main/std/adler32/common...</a><p>Like the fpng implementation, it's SSE (128-bit registers), but the inner loop eats 32 bytes at a time, not 16.<p>"Wuffs’ Adler-32 implementation is around 6.4x faster (11.3GB/s vs 1.76GB/s) than the one from zlib-the-library", which IIUC is roughly comparable to the article's defer32.
<a href="https://nigeltao.github.io/blog/2021/fastest-safest-png-decoder.html#adler-32" rel="nofollow">https://nigeltao.github.io/blog/2021/fastest-safest-png-deco...</a>
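For anyone who wants the shape of the trick without reading the whole file: as far as I can tell, Wuffs, fpng, libdeflate and zlib-ng all build on the same core idea, using PSADBW for the plain byte sum, PMADDUBSW/PMADDWD for the position-weighted sum, and deferring the expensive % 65521 until the 32-bit accumulators would overflow. Here's a minimal SSSE3 sketch of that idea (16 bytes per iteration rather than 32, not anyone's actual code, names are mine; compile with -mssse3 and call with adler = 1 like zlib's adler32()):

    #include <tmmintrin.h>  /* SSSE3: _mm_maddubs_epi16 (plus the SSE2 bits) */
    #include <stdint.h>
    #include <stddef.h>

    #define ADLER_MOD  65521u
    #define ADLER_NMAX 5552u   /* zlib's bound: max bytes before uint32 overflow */

    static uint32_t hsum_epi32(__m128i v)
    {
        v = _mm_add_epi32(v, _mm_shuffle_epi32(v, _MM_SHUFFLE(2, 3, 0, 1)));
        v = _mm_add_epi32(v, _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2)));
        return (uint32_t)_mm_cvtsi128_si32(v);
    }

    uint32_t adler32_ssse3_sketch(uint32_t adler, const uint8_t *p, size_t len)
    {
        uint32_t s1 = adler & 0xffff;
        uint32_t s2 = adler >> 16;

        const __m128i zero = _mm_setzero_si128();
        /* Byte i of a 16-byte chunk is counted (16 - i) times by s2 within
           that chunk, hence weights 16, 15, ..., 1. */
        const __m128i taps = _mm_setr_epi8(16, 15, 14, 13, 12, 11, 10, 9,
                                            8,  7,  6,  5,  4,  3,  2, 1);
        const __m128i ones = _mm_set1_epi16(1);

        while (len >= 16) {
            size_t n = len < ADLER_NMAX ? len : ADLER_NMAX;
            n &= ~(size_t)15;              /* whole 16-byte chunks only */
            len -= n;

            __m128i v_s1 = zero;           /* running byte sums              */
            __m128i v_s2 = zero;           /* running in-chunk weighted sums */
            __m128i v_ps = zero;           /* one v_s1 snapshot per chunk    */

            for (size_t i = 0; i < n; i += 16) {
                const __m128i bytes = _mm_loadu_si128((const __m128i *)(p + i));
                v_ps = _mm_add_epi32(v_ps, v_s1);
                /* PSADBW: two 8-byte sums -> s1. */
                v_s1 = _mm_add_epi32(v_s1, _mm_sad_epu8(bytes, zero));
                /* PMADDUBSW + PMADDWD: 16*b0 + 15*b1 + ... + 1*b15 -> s2. */
                const __m128i w = _mm_maddubs_epi16(bytes, taps);
                v_s2 = _mm_add_epi32(v_s2, _mm_madd_epi16(w, ones));
            }
            p += n;

            /* Each earlier chunk's byte sum is counted 16 more times per later
               chunk (hence v_ps << 4), and the incoming s1 counts once per byte. */
            s2 += (uint32_t)n * s1 + hsum_epi32(v_s2) + (hsum_epi32(v_ps) << 4);
            s1 += hsum_epi32(v_s1);
            s1 %= ADLER_MOD;
            s2 %= ADLER_MOD;
        }

        while (len--) {                    /* scalar tail */
            s1 += *p++;
            s2 += s1;
        }
        s1 %= ADLER_MOD;
        s2 %= ADLER_MOD;
        return (s2 << 16) | s1;
    }

IIUC the 32-bytes-per-iteration variants are essentially the same loop unrolled once with a second set of taps (32..17) and the v_ps term shifted by 5 instead of 4.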
Ooh now that is very interesting. I would really love to see how this speeds up the run-time of fpng as a whole, if you have any numbers. It looks like fjxl [0] and fpnge [1] (which also uses AVX2) are on the Pareto front for lossless image compression right now [2], but if this speeds things up significantly, it's possible there'll be a huge shakeup!<p>[0] <a href="https://github.com/libjxl/libjxl/tree/main/experimental/fast_lossless" rel="nofollow">https://github.com/libjxl/libjxl/tree/main/experimental/fast...</a><p>[1] <a href="https://github.com/veluca93/fpnge" rel="nofollow">https://github.com/veluca93/fpnge</a><p>[2] <a href="https://twitter.com/richgel999/status/1485976101692358656" rel="nofollow">https://twitter.com/richgel999/status/1485976101692358656</a>
Note that libdeflate has used essentially the same method since 2016 (<a href="https://github.com/ebiggers/libdeflate/blob/v0.4/lib/adler32_impl.h#L97" rel="nofollow">https://github.com/ebiggers/libdeflate/blob/v0.4/lib/adler32...</a>), though I recently switched it to use a slightly different method (<a href="https://github.com/ebiggers/libdeflate/blob/v1.12/lib/x86/adler32_impl.h#L175" rel="nofollow">https://github.com/ebiggers/libdeflate/blob/v1.12/lib/x86/ad...</a>) that performs more consistently across different families of x86 CPUs.
Does anyone have any recommendations for checksumming algorithms in greenfield systems? There seems to be a lot of innovation in cryptographically secure hash functions, but I have a greenfield project where I just need checksums and don't care about cryptographic properties. Is CRC32c still a good choice, or has the industry moved on?
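For what it's worth, one point in CRC32c's favour is that x86 has had a dedicated instruction for it since SSE4.2 (ARMv8 has an equivalent), so even a naive loop runs at several GB/s. A minimal sketch, assuming x86 and -msse4.2 (the function name is mine):

    /* CRC-32C (Castagnoli) via the SSE4.2 CRC32 instruction. */
    #include <nmmintrin.h>   /* _mm_crc32_u64 / _mm_crc32_u8 */
    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    uint32_t crc32c_sse42(const void *data, size_t len)
    {
        const uint8_t *p = (const uint8_t *)data;
        uint64_t crc = 0xFFFFFFFFu;            /* conventional initial value */

        while (len >= 8) {                     /* 8 bytes per instruction */
            uint64_t v;
            memcpy(&v, p, 8);
            crc = _mm_crc32_u64(crc, v);
            p += 8;
            len -= 8;
        }
        while (len--)                          /* byte-at-a-time tail */
            crc = _mm_crc32_u8((uint32_t)crc, *p++);

        return (uint32_t)crc ^ 0xFFFFFFFFu;    /* conventional final XOR */
    }

If that's not fast enough, the usual next steps are interleaving the CRC over multiple independent streams to hide the instruction's latency, or carry-less-multiply (PCLMULQDQ) folding; and if you don't specifically need a CRC, non-cryptographic hashes like xxHash are another common answer.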
While micro-optimizations are interesting, there are two questions left unanswered:<p>- Does this change noticeably affect the total runtime? The checksum seems simple enough that the slight difference here wouldn't show up in PNG benchmarks.<p>- The proposed solution uses AVX2, which is not currently used in the original codebase. Would any other part of the processing benefit from using newer instructions?
>diminishing returns especially due to it working faster than the speed of my RAM (2667MT/s * 8 = ~21 GB/s).<p>That sounds kinda slow; is there only 1 DIMM in the slots? I remember benchmarking 40GiB/s read speed on an older system that had 2 dual-rank DIMMs (4 ranks in total).<p>I'd expect 3200 MT/s * 8 bytes (64 data lines) * 2 memory channels ≈ 51 GB/s (~48 GiB/s) on a typical DDR4 desktop, and a lot more with overclocked RAM.<p>Great writeup either way.
I hope this brilliant work has been merged into the relevant open source libraries.<p>Something that’s unfair about the world is that work like this could reach billions of people and save a million dollars’ worth of time and electricity annually, yet it’s being done gratis.<p>It would be amazing if there were charities that rewarded high-impact open source contributions like this in proportion to their benefit to humanity…
zlib-ng also has adler32 implementations optimized for various architectures: <a href="https://github.com/zlib-ng/zlib-ng" rel="nofollow">https://github.com/zlib-ng/zlib-ng</a><p>Might be interesting to benchmark their implementation too to see how it compares.
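If anyone wants a quick number on their own machine, a rough harness like this is enough; it assumes a zlib-compatible adler32() from <zlib.h> (which zlib-ng provides when built in compat mode), so just link against whichever library you want to measure:

    /* Rough adler32 throughput micro-benchmark sketch. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <zlib.h>

    int main(void)
    {
        const size_t len = 1u << 26;               /* 64 MiB test buffer */
        unsigned char *buf = malloc(len);
        if (!buf) return 1;
        for (size_t i = 0; i < len; i++)
            buf[i] = (unsigned char)(i * 2654435761u >> 24);  /* arbitrary data */

        const int iters = 64;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        unsigned long sum = 0;
        for (int i = 0; i < iters; i++)
            sum ^= adler32(1, buf, (unsigned int)len);  /* 1 = initial adler */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("checksum %08lx, %.2f GB/s\n", sum,
               (double)len * iters / secs / 1e9);
        free(buf);
        return 0;
    }

Swapping which library you link (or LD_PRELOAD) then gives directly comparable GB/s figures on the same machine.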