I posted a much earlier version of this over a year ago [0].<p>Since then, a lot has changed. Obviously, the name has changed; this happened for the paper publication [1].<p>I have also optimized the speed, integrated ISA-L for special cases, limited the compression-ratio-dependent maximum memory consumption, and finally added parallelized CRC32 computation, which adds ~5% overhead no matter the number of cores used. At this point, I am leaning towards calling it production-ready, although there are still many ideas for improvement.<p>Redoing the benchmarks from the older Show HN looks like this:<p><pre><code> time pigz -d -c 4GiB-base64.gz | wc -c # real ~13.4 s -> ~320 MB/s
time rapidgzip -d -c 4GiB-base64.gz | wc -c # real ~1.26 s -> ~3.4 GB/s
</code></pre>
However, at this point, the piping itself becomes a problem. Rapidgzip is actually slightly faster than cat when comparing the piped bandwidth! E.g., compare these additional benchmarks:<p><pre><code> time cat 4GiB-base64.gz | wc -c # real ~1.06 s -> ~3.1 GB/s
time fcat 4GiB-base64.gz | wc -c # real ~0.41 s -> ~8.0 GB/s
time rapidgzip -o /dev/null -d 4GiB-base64.gz # real ~0.68 s -> ~6.5 GB/s
</code></pre>
fcat is an alternative cat implementation that uses vmsplice to speed up piping. According to its ReadMe, it is currently broken, but it works fine on my system, and piping it to md5sum yields consistent results [2].<p>So, at this point, I/O and even allocations have become limiting factors. If you want full speed, you would have to use the rapidgzip library interface directly (in C++ or via the Python bindings) and process the decompressed data in memory; a sketch with the Python bindings follows after the links below.<p>The project ReadMe contains further benchmarks with Silesia and FASTQ data and scaling up to 128 cores, for which rapidgzip achieves 12 GB/s for Silesia and 24 GB/s when an index has been created with --export-index and is reused with --import-index.<p>It can also be tested with ratarmount 0.14.0, which now uses rapidgzip as a backend by default for .gz and .tar.gz files [3].<p>[0] <a href="https://news.ycombinator.com/item?id=32366959">https://news.ycombinator.com/item?id=32366959</a>
[1] <a href="https://dl.acm.org/doi/10.1145/3588195.3592992">https://dl.acm.org/doi/10.1145/3588195.3592992</a>
[2] <a href="https://github.com/mre/fcat">https://github.com/mre/fcat</a>
[3] <a href="https://github.com/mxmlnkn/ratarmount">https://github.com/mxmlnkn/ratarmount</a>
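<p>To illustrate that last point, here is a minimal sketch of processing the decompressed data in memory via the Python bindings. It assumes the rapidgzip package exposes rapidgzip.open() with a parallelization argument, as its ReadMe suggests, so treat the exact names as approximate:<p><pre><code>import os
import rapidgzip  # pip install rapidgzip

# Read the decompressed stream in fixed-size chunks and process it in RAM,
# avoiding the pipe entirely. Here we only count bytes, like `wc -c`.
def decompressed_size(path):
    total = 0
    # parallelization: number of decoder threads (argument name per the ReadMe).
    with rapidgzip.open(path, parallelization=os.cpu_count()) as file:
        while True:
            chunk = file.read(4 * 1024 * 1024)  # 4 MiB per read keeps memory bounded
            if not chunk:
                break
            total += len(chunk)
    return total

print(decompressed_size("4GiB-base64.gz"))
</code></pre>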
Does this work with files compressed using the usual gzip command line program?<p>I thought you had to write "sync points" inside a gzip file to support reading from locations other than the start. If you somehow bypass that requirement, can you explain how?
A long time ago (probably around 2010) I had a similar idea [1] and concluded that it was infeasible due to the LZ77 window, but... wow, I never thought that starting speculative parsing at <i>every</i> possible position could be good enough! Great job!<p>Just in case, my own idea relied on the facts that there is exactly one end-of-block symbol per block and that the longest codes in dynamic Huffman trees are lexicographically last due to the canonical encoding. So if you see a long run of 1 bits, there is a good chance that it codes the EOB symbol (because it should be the rarest symbol and any reasonable compressor will assign it the longest code) and that a new block follows; a rough sketch of this heuristic is below. I don't know whether this is better than the current brute-force approach.<p>[1] <a href="https://news.ycombinator.com/item?id=29586865">https://news.ycombinator.com/item?id=29586865</a>
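<p>For what it's worth, a toy sketch of that heuristic (purely illustrative; the function name and run-length threshold are made up, and this is not what rapidgzip does): scan the deflate bit stream LSB-first and record the bit offsets right after long runs of 1 bits as candidate block boundaries, which would then still have to be verified by actually trying to decode a block header there:<p><pre><code># Toy illustration of the EOB heuristic: a long run of 1 bits is a plausible
# canonical longest-code EOB symbol, so the bit right after such a run is a
# candidate start of the next deflate block.
def candidate_block_starts(data: bytes, min_run: int = 12):
    candidates = []
    run = 0
    for byte_index, byte in enumerate(data):
        for bit_index in range(8):  # deflate bit streams are read LSB-first
            if (byte >> bit_index) & 1:
                run += 1
            else:
                if run >= min_run:  # threshold is a guess; deflate codes are at most 15 bits
                    candidates.append(byte_index * 8 + bit_index)
                run = 0
    return candidates
</code></pre>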