It's good that people are getting interested in the subject, but this is very odd and has some errors. For example, xz requires a lot more memory than bzip2 (see the Mem column in the benchmarks below).<p><a href="http://mattmahoney.net/dc/text.html" rel="nofollow">http://mattmahoney.net/dc/text.html</a><p><a href="http://mattmahoney.net/dc/uiq/" rel="nofollow">http://mattmahoney.net/dc/uiq/</a><p>Matt Mahoney maintains the best benchmarks on text and generic compression. Some of the best people in the field (Matt included) usually hang out at encode.ru.
Is this decompressing a single stream on multiple processors? My knowledge of gzip is very limited, but I would have thought sequential processing was required. What's the trick here? (TFA doesn't explain anything, and neither does, e.g., the pigz homepage.)
Had to try this on my quad-core laptop, as I'd never heard of these tools.<p><pre><code> josh@snoopy:~/Downloads $ grep -m2 -i intel /proc/cpuinfo
vendor_id : GenuineIntel
model name : Intel(R) Core(TM) i5-2410M CPU @ 2.30GHz
josh@snoopy:~/Downloads $ ls -l test
-rw-r--r-- 1 josh josh 1073741824 2012-03-07 20:06 test
josh@snoopy:~/Downloads $ time gzip test
real 0m16.430s
user 0m10.210s
sys 0m0.490s
josh@snoopy:~/Downloads $ time pigz test
real 0m5.028s
user 0m16.040s
sys 0m0.620s
</code></pre>
Looks good, although the man page describes it as "an almost compatible replacement for the gzip program".
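If anyone wants to poke at it further, a couple of invocations worth timing as well (file names follow the transcript above; -p and -d are documented pigz options, so treat this as a sketch rather than a transcript):<p><pre><code> time pigz -p 2 test      # cap compression at 2 threads
 time pigz -d test.gz     # decompress; equivalent to running unpigz test.gz
</code></pre>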
Is xz less resource-intensive than bzip2? My testing (admittedly two years ago or so) showed significant differences: a better compression ratio with xz, but significantly longer run times and/or more memory used.
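For anyone wanting to repeat that comparison, GNU time's verbose output reports peak memory, so something like the following gives a rough side-by-side (the file name is just a placeholder; -9 and -k are standard xz/bzip2 flags):<p><pre><code> /usr/bin/time -v xz    -9 -k test.txt 2>&1 | grep 'Maximum resident'
 /usr/bin/time -v bzip2 -9 -k test.txt 2>&1 | grep 'Maximum resident'
</code></pre>
The same -v output also includes wall-clock and CPU time, so one run covers both halves of the question.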
If you're handling a lot of data it makes sense to hash-partition it on some key and spread it out across a large number of files.<p>In that case you might have, say, 512 partitions, and you can farm out compression, decompression and other tasks to as many CPUs as you want, even to other machines in a cluster.
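A toy sketch of that shape on a single machine (the tab-separated input, the key in the first column, the 512 buckets and gzip are all illustrative assumptions, not a description of any particular system): hash-partition with gawk, then compress the partitions in parallel with xargs.<p><pre><code> mkdir -p parts

 gawk -F'\t' '
 BEGIN {
     # build a char -> code table so the key can be hashed portably in awk
     for (i = 1; i < 256; i++) ord[sprintf("%c", i)] = i
 }
 {
     # toy multiplicative hash of the key (column 1) into 512 buckets
     h = 0
     n = split($1, c, "")
     for (i = 1; i <= n; i++) h = (h * 31 + ord[c[i]]) % 512
     print > ("parts/part-" h ".tsv")
 }' input.tsv

 # compress the partitions 8 at a time
 find parts -name "*.tsv" -print0 | xargs -0 -P 8 -n 1 gzip
</code></pre>
In a cluster you would ship each bucket to whichever node owns that partition instead of compressing everything locally.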
I like to use PPMd (via 7-Zip) for large volumes of text, but it seems to be single-threaded only, which is a shame. It cuts a good 30% more off the size of the .xml.bz2 dumps that Wikipedia provides.
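For reference, this is roughly the 7-Zip invocation for that (archive and input names are placeholders; -m0=PPMd selects the PPMd method and -mx=9 the maximum preset):<p><pre><code> # pack an XML dump with the PPMd method at the highest preset
 7z a -m0=PPMd -mx=9 dump.7z dump.xml
</code></pre>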