TechEcho
Computing Adler32 Checksums at 41 GB/s

98 points by wooosh almost 3 years ago

10 comments

dougall almost 3 years ago
Nice! (I've been meaning to write up this Apple M1 ~60GB/s version, which I think is similar: https://gist.github.com/dougallj/66151f1c509484a42fe0abd0d84d056d )
nigeltao almost 3 years ago
Here's another SIMD implementation, with commentary: https://github.com/google/wuffs/blob/main/std/adler32/common_up_x86_sse42.wuffs

Like the fpng implementation, it's SSE (128-bit registers), but the inner loop eats 32 bytes at a time, not 16.

"Wuffs’ Adler-32 implementation is around 6.4x faster (11.3GB/s vs 1.76GB/s) than the one from zlib-the-library", which IIUC is roughly comparable to the article's defer32. https://nigeltao.github.io/blog/2021/fastest-safest-png-decoder.html#adler-32
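For readers who have not seen the trick these implementations build on, here is a minimal scalar sketch of Adler-32 with the modulo deferred to once per block rather than once per byte. It is an illustration only, not the Wuffs, fpng, or article code, and it assumes that "defer32" refers to this kind of deferral with 32-bit accumulators. The name `adler32_deferred` is hypothetical; `NMAX = 5552` is the block bound zlib uses so that 32-bit sums cannot overflow (Python integers would not overflow anyway, the bound is kept to mirror a C version).

```python
import zlib

MOD = 65521   # largest prime below 2**16, the Adler-32 modulus
NMAX = 5552   # zlib's bound: bytes that can be summed before 32-bit sums could overflow


def adler32_deferred(data: bytes, value: int = 1) -> int:
    """Adler-32 with the expensive % MOD applied once per NMAX-byte block."""
    a = value & 0xFFFF           # running sum of bytes (starts at 1)
    b = (value >> 16) & 0xFFFF   # running sum of the successive values of a
    for start in range(0, len(data), NMAX):
        for byte in data[start:start + NMAX]:
            a += byte
            b += a
        # Deferred reduction: once per block instead of once per byte.
        a %= MOD
        b %= MOD
    return (b << 16) | a


if __name__ == "__main__":
    sample = bytes(range(256)) * 100
    # Cross-check against the standard library implementation.
    print(adler32_deferred(sample) == zlib.adler32(sample))
```

The SIMD versions linked in this thread keep the same structure but compute the per-block sums over many bytes at once.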
pizza almost 3 years ago
Ooh now that is very interesting. I would really love to see how this speeds up the run-time of fpng as a whole, if you have any numbers. It looks like fjxl [0] and fpnge [1] (which also uses AVX2) are at the Pareto front for lossless image compression right now [2], but if this speeds things up significantly then it's possible there'll be a huge shakeup!

[0] https://github.com/libjxl/libjxl/tree/main/experimental/fast_lossless

[1] https://github.com/veluca93/fpnge

[2] https://twitter.com/richgel999/status/1485976101692358656
ebiggers almost 3 years ago
Note that libdeflate has used essentially the same method since 2016 (https://github.com/ebiggers/libdeflate/blob/v0.4/lib/adler32_impl.h#L97), though I recently switched it to use a slightly different method (https://github.com/ebiggers/libdeflate/blob/v1.12/lib/x86/adler32_impl.h#L175) that performs more consistently across different families of x86 CPUs.
josephg almost 3 years ago
Does anyone have any recommendations for checksumming algorithms in greenfield systems? It seems like there’s lots of innovation in crypto secure hashing functions. But I have a greenfield project where I need checksums but don’t care about crypto properties. Is CRC32c still a good choice or has the industry moved on?
TAForObvReasons almost 3 years ago
While micro-optimizations are interesting, there are two questions left unanswered:

- Does this change noticeably affect the total runtime? The checksum seems simple enough that the slight difference here wouldn't show up in PNG benchmarks.

- The proposed solution uses AVX2, which is not currently used in the original codebase. Would any other part of the processing benefit from using newer instructions?
NavinF almost 3 years ago
> diminishing returns especially due to it working faster than the speed of my RAM (2667MT/s * 8 = ~21 GB/s).

That sounds kinda slow; is there only 1 DIMM in the slots? I remember benchmarking 40GiB/s read speed on an older system that had 2 dual-rank DIMMs (4 ranks in total).

I'd expect 3200 Mbit/s * (64 data lines) * (2 memory channels) = ~48 GiB/s on a typical DDR4 desktop and a lot more with overclocked RAM.

Great writeup either way.
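The two bandwidth figures in the comment above can be checked with a few lines of arithmetic. This is only a sketch of theoretical peak throughput (transfer rate times bus width times channel count), ignoring refresh, command overhead, and anything else that reduces real-world numbers. The function name and its parameters are hypothetical, and a 64-bit data bus per channel is assumed, as in the comment.

```python
def peak_bandwidth_bytes_per_s(mega_transfers_per_s: int,
                               bus_width_bits: int = 64,
                               channels: int = 1) -> int:
    """Theoretical peak DRAM bandwidth in bytes per second."""
    bytes_per_transfer = bus_width_bits // 8
    return mega_transfers_per_s * 1_000_000 * bytes_per_transfer * channels


# The quoted figure: 2667 MT/s on one 64-bit channel.
single = peak_bandwidth_bytes_per_s(2667)
# The commenter's expectation: 3200 MT/s on two 64-bit channels.
dual = peak_bandwidth_bytes_per_s(3200, channels=2)

print(f"single channel: {single / 1e9:.1f} GB/s")
print(f"dual channel:   {dual / 1e9:.1f} GB/s = {dual / 2**30:.1f} GiB/s")
```

Under these assumptions the single-channel case comes to about 21.3 GB/s and the dual-channel case to 51.2 GB/s, which is roughly 47.7 GiB/s, in line with the "~48 GiB/s" above.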
jiggawatts almost 3 years ago
I hope this brilliant work has been merged into the relevant open source libraries.

Something that’s unfair about the world is that work like this could reach billions of people and save a million dollars worth of time and electricity annually but is being done gratis.

It would be amazing if there were charities that rewarded high-impact open source contributions like this proportionally to the benefits to humanity…
daniel-cussen almost 3 years ago
I love this kind of writeup. This is my idea of fun: speedups.
profquail almost 3 years ago
zlib-ng also has adler32 implementations optimized for various architectures: https://github.com/zlib-ng/zlib-ng

Might be interesting to benchmark their implementation too to see how it compares.