Show HN: Thread-Parallel Decompression and Random Access to Gzip Files (Pragzip)

15 points by mxmlnkn almost 3 years ago
Hello HN,

I'm very excited to have finished a gzip decoder that can speed up decompression using threads. On my Ryzen 3900X, I measured an 8x speedup over standard gzip, reaching 1.6 GB/s for a synthetic file with a consistent compression ratio of 1.3.

A functional decompressor like this is kind of a first, which is why I am excited. Pragzip implements the two-stage decompression idea put forward by pugz, which unfortunately only works with gzipped text files, not arbitrary files, and has many other limitations. I think my main contribution over pugz might be a fast (~10 MB/s) data-agnostic deflate block finder (which, btw, might also be used to rescue corrupted gzip files). Note that pigz does compress files in parallel, but it is effectively unable to decompress in parallel, not even files it produced itself.

You can try out pragzip via PyPI or by building the C++ pragzip tool from source:

    python3 -m pip install --user pragzip
    pragzip --version  # 0.2.0

Here is a quick comparison with a very Huffman-intensive workload tested on a 12-core Ryzen 3900X:

    base64 /dev/urandom | head -c $(( 4 * 1024 * 1024 * 1024 )) > 4GiB
    gzip 4GiB  # compresses to a 3.1 GiB large file called 4GiB.gz
    time gzip -d -c 4GiB.gz | wc -c           # real ~21.6 s (~200 MB/s)
    time pigz -d -c 4GiB.gz | wc -c           # real ~12.9 s (~332 MB/s)
    time pragzip -P 0 -d -c 4GiB.gz | wc -c   # real ~2.7 s (~1.6 GB/s decompression bandwidth)

I have unit tests for files produced with gzip, bgzip, igzip, pigz, Python's gzip, and python-pgzip. It should therefore work for any "normal" gzip file. It is feature-complete but needs a lot of testing and polishing. Note that it is very memory-intensive, depending on the archive's compression factor and, of course, the number of cores being used. This will be subject to further improvements.

Bug reports, feature requests, or anything else are very welcome!
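For anyone who wants to sanity-check the benchmark setup without creating a 4 GiB file, here is a minimal sketch using only the Python standard library. It builds a small base64-of-random sample, roughly what `base64 /dev/urandom` emits, measures its gzip compression ratio, and redoes the bandwidth arithmetic from the timings quoted above. The helper names (`make_base64_sample`, `compression_ratio`, `bandwidth_mb_per_s`) and the 1 MiB sample size are assumptions made for illustration; none of this is part of pragzip, and it does not measure pragzip itself.

```python
import base64
import gzip
import os


def make_base64_sample(raw_size: int) -> bytes:
    """Line-wrapped base64 text of random bytes, similar to `base64 /dev/urandom`."""
    return base64.encodebytes(os.urandom(raw_size))


def compression_ratio(data: bytes) -> float:
    """Uncompressed size divided by gzip-compressed size (level 6, the gzip CLI default)."""
    return len(data) / len(gzip.compress(data, compresslevel=6))


def bandwidth_mb_per_s(size_bytes: int, seconds: float) -> float:
    """Decompressed bytes per second, expressed in MB/s (1 MB = 10**6 bytes)."""
    return size_bytes / seconds / 1e6


if __name__ == "__main__":
    # Hypothetical small sample; the post uses 4 GiB of the same kind of data.
    sample = make_base64_sample(1024 * 1024)
    print(f"compression ratio: {compression_ratio(sample):.2f}")

    # Timings copied from the benchmark in the post.
    four_gib = 4 * 1024**3
    for tool, seconds in [("gzip", 21.6), ("pigz", 12.9), ("pragzip", 2.7)]:
        print(f"{tool}: {bandwidth_mb_per_s(four_gib, seconds):.0f} MB/s")
```

Base64 text carries 6 bits of information per 8-bit character, so almost all of the compression has to come from Huffman coding rather than from back-references. That is why the ratio lands near 1.3 and why this workload is Huffman-intensive.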

2 comments

intelVISA almost 3 years ago
This is solid, I liked the pseudo-jthread; kinda wild it took til C++20 to be added.
killingtime74 almost 3 years ago
Very nice!