"18.7 seconds for bzip2, and… wait for it… 3.5 seconds for pbzip2. That’s an increase of over 80%!"

Er, not really. How about...

"pbzip2 reduced running time by 80%."

"pbzip2 took only 20% as long as bzip2 did."

"pbzip2 is five times faster."
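For the record, the arithmetic behind those phrasings, using the article's 18.7 s and 3.5 s figures:

    # Same two numbers, three phrasings.
    bzip2_s, pbzip2_s = 18.7, 3.5
    print(f"running time reduced by {1 - pbzip2_s / bzip2_s:.0%}")  # ~81%
    print(f"takes {pbzip2_s / bzip2_s:.0%} as long")                # ~19%
    print(f"{bzip2_s / pbzip2_s:.1f}x faster")                      # ~5.3x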
Since our move to multicore over faster processors, I'm sure we'll see a lot of this sort of thing: people suddenly realizing that their code can be some multiple faster if they find a way to do operations in parallel. I imagine the compression itself might be slightly less optimal, though, since similar blocks that could have been compressed together end up on different threads? I didn't dig into whether that is or isn't a concern with this project. The long and short of it is that parallel is the reality. In theory one could arbitrarily split the file, compress each of the splits, and get a roughly proportional speedup?
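One quick way to check that ratio question empirically (just a sketch; "archive.tar" is a placeholder for whatever file you're testing): compress the whole buffer in one go, then in independent ~900 kB chunks, and compare the sizes. Since bzip2 itself compresses in blocks of at most 900 kB, the chunked total should normally come out only slightly larger.

    import bz2

    data = open("archive.tar", "rb").read()
    CHUNK = 900_000  # bzip2's maximum block size at -9

    whole = len(bz2.compress(data))
    chunked = sum(len(bz2.compress(data[i:i + CHUNK]))
                  for i in range(0, len(data), CHUNK))
    print(f"single stream: {whole} bytes, chunked: {chunked} bytes")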
Parallel gzip, in case anyone wanted it: http://zlib.net/pigz/

I've used it to great effect during incident response when I needed to search through hundreds of gigs of logs at a time.
"The results: 18.7 seconds for bzip2, and… wait for it… 3.5 seconds for pbzip2. That’s an increase of over 80%!"<p>File cache effect? He should cold reboot first (not sure how you force the file cache out on OSX/linux, on Windows I do it with SysInternals RamMap) and try in different order.<p>It could still be faster, but he could really be measuring I/O that was done in the first case, and not in the second.<p>It's also strange that .tar files are used, not tar.bz2 or .tbz (if such extension makes sense)
I used to use pbzip2 before I learned about lbzip2 (http://lacos.hu/).

lbzip2 is able to decompress single streams using multiple threads, which apparently pbzip2 cannot do. See the thread beginning with http://lists.debian.org/debian-mentors/2009/02/msg00098.html
bzip2 has always been parallelizable. At one point a few years ago I was working on a compressed file format that included compressed block metadata, because bzip2 is most efficient when it gets about 900 kB to compress at a time. In effect, you split the file up into ~900 kB chunks, compress them in parallel, and recombine them into one file at the end.
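That scheme is easy to sketch with Python's bz2 module and a process pool (file names are placeholders; this illustrates the idea, not pbzip2's actual code). The output is just a concatenation of independent .bz2 streams, which stock bzip2 is documented to decompress back to the original data.

    import bz2
    from concurrent.futures import ProcessPoolExecutor

    CHUNK = 900_000  # roughly bzip2's block size at compression level 9

    def parallel_bzip2(in_path, out_path):
        data = open(in_path, "rb").read()
        chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
        # Compress the chunks as independent bzip2 streams, one per worker.
        with ProcessPoolExecutor() as pool:
            blocks = list(pool.map(bz2.compress, chunks))
        # Recombine: concatenated streams decompress back to the original file.
        with open(out_path, "wb") as out:
            for block in blocks:
                out.write(block)

    if __name__ == "__main__":
        parallel_bzip2("archive.tar", "archive.tar.bz2")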