
pigz: A parallel implementation of gzip for multi-core machines

289 points · by firloop · over 2 years ago

22 comments

jiggawatts · over 2 years ago
Funny this comes up again so soon after I needed it! I recently did a proof-of-concept related to bioinformatics (gene assembly, etc.), and one quirk of that space is that they work with *enormous* text files. Think tens of gigabytes being a "normal" size. Just compressing and copying these around is a pain.

One trick I discovered is that tools like pigz can be used to both accelerate the compression step and also copy to cloud storage in parallel! E.g.:

    pigz input.fastq -c | azcopy copy --from-to PipeBlob "https://myaccountname.blob.core.windows.net/inputs/input.fastq.gz?..."

There is a similar pipeline available for s3cmd as well, with the same benefit of overlapping the compression and the copy.

However, if your tools support zstd, then it's more efficient to use that instead. Try the "zstd -T0" option or the "pzstd" tool for even higher throughput, but with some minor caveats.

PS: In case anyone here is working on the above tools, I have a small request! What would be awesome is to *automatically* tune the compression ratio to match the available output bandwidth. With the '-c' output option, this is easy: just keep increasing the compression level by one notch whenever the output buffer is full, and reduce it by one level whenever the output buffer is empty. This will automatically tune the system to get the maximum total throughput given the available CPU performance and network bandwidth.
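A minimal sketch of the zstd variant of the same pipeline, assuming the same azcopy PipeBlob destination used above; the account name and SAS token are placeholders:

    # compress on all cores and stream straight to blob storage, no intermediate file
    zstd -T0 -c input.fastq | azcopy copy --from-to PipeBlob "https://myaccountname.blob.core.windows.net/inputs/input.fastq.zst?<SAS-token>"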
omoikane · over 2 years ago
The bit I found most interesting was actually:

https://github.com/madler/pigz/blob/master/try.h
https://github.com/madler/pigz/blob/master/try.c

which implements try/catch for C99.
sitkack · over 2 years ago
If you really want to enable all cores for compression and decompression, give pbzip2 a try. pigz isn't as parallel as pbzip2.

http://compression.ca/pbzip2/

*edit: as ac29 mentions below, just use zstdmt. In my quick testing it is approximately 8x faster than pbzip2 and gives better compression ratios. Wall clock time went from 41s to 3.5s for a 3.6 GB tar of source, pdfs and images, AND the resulting file was smaller.

    megs
    3781 test.tar
    3041 test.tar.zstd (default compression 3, 3.5s)
    3170 test.tar.bz2  (default compression, 8 threads, 40s)
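For reference, a sketch of roughly what that comparison runs; the directory names are stand-ins, and zstdmt is simply zstd with multithreading enabled by default (plain `zstd -T0` is equivalent):

    # build the test tarball, then compress it two ways
    tar -cf test.tar source/ pdfs/ images/
    zstdmt -k test.tar        # all cores, default level 3, writes test.tar.zst
    pbzip2 -k -p8 test.tar    # 8 threads, writes test.tar.bz2, for comparison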
ericbarrett · over 2 years ago
We used this to great effect at Facebook for MySQL backups in the early 2010s. The backup hosts had far more CPU than needed so it was a very nice speed-up over gzip. Eventually we switched to zstd, of course, but pigz never failed us.
rcarmo · over 2 years ago
I chuckled at the name, since out-of-order results are a typical output of parallelization. Kudos.
walrus01 · over 2 years ago
Would not recommend using this in 2022; use zstandard or xz instead.

zstandard is faster and compresses slightly better at speed settings equivalent to gzip's, and it can optionally compress at a much higher ratio if you allow it more time and CPU resources.

https://gregoryszorc.com/blog/2017/03/07/better-compression-with-zstandard/
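A small sketch of the trade-off being described, using standard zstd flags on a placeholder file:

    # roughly gzip-class speed and ratio, but parallel
    zstd -T0 -3 data.tar -o data.tar.zst
    # much higher ratio, much more CPU time
    zstd -T0 -19 data.tar -o data.tar.high.zst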
josnyder · over 2 years ago
This was great in 2012. In 2022, most use-cases should be using parallelized zstd.
bbertelsen · over 2 years ago
Warning for the uninitiated: be cautious using this on a production machine. I recently caused a production system to crash because disk throughput was so high that it started delaying reads/writes on a PostgreSQL server. There was panic!
ananonymoususer · over 2 years ago
I use this all the time. It's a big time saver on multi-core machines (which is pretty much every desktop made in the past 20 years). It's available in all the repos, but not included by default (at least in Ubuntu/Mint). It is most useful for compressing disk images on-the-fly while backing them up to network storage. It's usually a good idea to zero unused space first:

(unprivileged commands follow)

    dd if=/dev/zero of=~/zeros bs=1M; sync; rm ~/zeros

Compressing on the fly can be slower than your network bandwidth depending on your network speed, your processor(s) speed, and the compression level, so you typically tune the compression level (because the other two variables are not so easy to change). Example backup:

(privileged commands follow)

    pv < /dev/sda | pigz -9 | ssh user@remote.system dd of=compressed.sda.gz bs=1M

(Note that on slower systems the ssh encryption can also slow things down.)

Some sharp people may notice that it's not necessarily a good idea to back up a live system this way because the filesystem is changing while the system runs. It's usually just fine on an unloaded system that uses a journaling filesystem.
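For completeness, a sketch of the restore direction under the same assumptions (same placeholder host, file, and device; the target machine booted from other media):

(privileged commands follow)

    ssh user@remote.system 'cat compressed.sda.gz' | pigz -d | pv | dd of=/dev/sda bs=1M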
lxe · over 2 years ago
Protip: if you're on a massively multi-core system and need to tar/gzip a directory full of node_modules, use pigz via `tar -I pigz` or a pipe. The performance increase is incredible.
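Both forms the comment mentions, sketched with a placeholder directory:

    # let GNU tar invoke pigz as the compressor
    tar -I pigz -cf node_modules.tar.gz node_modules/
    # equivalent pipe form, works with any tar
    tar -cf - node_modules/ | pigz > node_modules.tar.gz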
xfalcox · over 2 years ago
An interesting bit of trivia: since ~2020, Docker will transparently use pigz for decompressing container image layers if it's available on the host. This was a nice speedup for us, since we use large container images and automatic scaling for incoming traffic surges.
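Since Docker only picks this up when the binary is present on the host, enabling it is just a package install; a sketch for Debian/Ubuntu hosts (package names differ on other distros):

    # make pigz (and its unpigz decompressor) available to the Docker daemon
    sudo apt-get install -y pigz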
fintler · over 2 years ago
If you ever run into the limitations of a single machine, dbz2 is also a fun little app for this sort of thing. You can run it across multiple machines and it'll automatically balance the workload across them.

https://github.com/hpc/mpifileutils/blob/master/man/dbz2.1
gww · over 2 years ago
There is another nice multi-core gzip-based library called BGZF [1]. It is commonly used in bioinformatics. BGZF has the added advantage that it is block-compressed, with a built-in indexing method to permit seeking in compressed files.

[1] https://github.com/samtools/htslib
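A sketch of typical BGZF usage with the bgzip and tabix tools that ship alongside htslib; the thread count and file names are assumptions:

    # block-compress with several threads
    bgzip -@ 8 variants.vcf
    # index the blocks so regions can be read without decompressing the whole file
    tabix -p vcf variants.vcf.gz
    # fetch one region directly from the compressed file
    tabix variants.vcf.gz chr1:100000-200000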
necovek · over 2 years ago
Any comparative benchmarks or a write-up on the approach (other than "uses zlib and pthreads" from the README)?
kristianp · over 2 years ago
Pigz has been around for a while. Since 2007, if the copyright on this [1] page is any indication.

[1] https://docs.oracle.com/cd/E88353_01/html/E37839/pigz-1.html
jaimehrubiks · over 2 years ago
I used this recently with -0 (no compression) to pack billions of files into a tar file before sending them over the network. It worked amazingly well.
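A guess at what that pipeline looks like: tar does the packing, `pigz -0` only adds gzip framing without spending CPU on compression, and the result streams over the network (host and paths are placeholders):

    tar -cf - many_small_files/ | pigz -0 | ssh user@destination 'cat > bundle.tar.gz'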
_joel · over 2 years ago
Use this all the time (or did when I was doing more sysadminy stuff). Useful in all sorts of backup pipelines
taf2 · over 2 years ago
Pretty sure we used or still use pigz when it's time to create a db replica...
LeoPanthera · over 2 years ago
For maximum compression, plzip offers LZMA compression in parallel: https://www.nongnu.org/lzip/plzip.html
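A sketch of a maximum-effort plzip run; the thread option is taken from plzip's documented flags, and the file name is a placeholder:

    # highest LZMA level, one worker per core
    plzip -9 -n "$(nproc)" big_archive.tar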
ByThyGrace · over 2 years ago
On Linux, would it Just Work™ if you aliased pigz to gzip as a drop-in replacement?
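The naive version is just a shell alias, which only covers interactive use; programs that exec gzip directly would need the binary shadowed earlier in PATH instead. A sketch, assuming ~/bin comes before /usr/bin in PATH:

    # interactive shells only
    alias gzip='pigz'
    alias gunzip='unpigz'
    # everything else: shadow the real binaries
    ln -s "$(command -v pigz)" ~/bin/gzip
    ln -s "$(command -v unpigz)" ~/bin/gunzip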
powerverwirrt · over 2 years ago
Funny, I just read about this yesterday. Time to try it on my pile of archived research data.
soulmachine · over 2 years ago
I used pigz for a few years; now I've replaced it with `xz -T0`.