
Massive Speed Gains via Parallelized BZIP2 Compression

43 points by SnowLprd, almost 13 years ago

11 comments

aphyr, almost 13 years ago

"18.7 seconds for bzip2, and… wait for it… 3.5 seconds for pbzip2. That’s an increase of over 80%!"

Er, not really. How about...

"pbzip2 reduced running time by 80%."

"pbzip2 took only 20% as long as bzip2 did."

"pbzip2 is five times faster."
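
For concreteness, the arithmetic behind those three rephrasings, using the timings quoted from the article:

```python
t_bzip2, t_pbzip2 = 18.7, 3.5  # timings quoted in the article

print(f"running time reduced by {1 - t_pbzip2 / t_bzip2:.0%}")  # 81%
print(f"took {t_pbzip2 / t_bzip2:.0%} as long as bzip2 did")    # 19%
print(f"{t_bzip2 / t_pbzip2:.1f}x faster")                      # 5.3x
```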

wmf, almost 13 years ago
BTW, bz2 is kinda over. Check out xz and the parallel version pxz.

th0ma5, almost 13 years ago
Since our move to multicore over faster processors, I'm sure we'll see a lot of this sort of thing: people suddenly realizing their code can be some multiple faster if they find a way to do operations in parallel. I imagine the compression itself might be slightly less optimal, though, since similar blocks that could have been compressed together end up on different threads? I didn't dig into whether that's a concern with this project. Long and short of it, parallel is the reality. In theory one could arbitrarily split the file, compress each of the splits, and get a roughly comparable speedup?
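
For illustration, here is a minimal Python sketch of that split-compress-concatenate idea. It is a toy under my own assumptions, not pbzip2's actual implementation; the function name, worker count, and chunk handling are invented for the example:

```python
import bz2
from concurrent.futures import ProcessPoolExecutor

BLOCK = 900 * 1024  # bzip2 compresses blocks of at most ~900 kB

def parallel_bzip2(src, dst, workers=4):
    # For simplicity, read every chunk up front; a real tool would
    # stream to keep memory bounded.
    with open(src, "rb") as f:
        chunks = list(iter(lambda: f.read(BLOCK), b""))
    # Compress chunks in separate processes. Concatenated .bz2 streams
    # form a valid multi-stream file that bunzip2 can decompress.
    with ProcessPoolExecutor(max_workers=workers) as pool, open(dst, "wb") as out:
        for block in pool.map(bz2.compress, chunks):
            out.write(block)

if __name__ == "__main__":  # guard required for process-based parallelism
    parallel_bzip2("input.tar", "input.tar.bz2")
```

Because each chunk is compressed independently, redundancy that spans a chunk boundary is lost, which is exactly the slight ratio penalty th0ma5 anticipates.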

sciurus, almost 13 years ago
For parallel gzip there's pigz (pronounced pig-zee): http://www.zlib.net/pigz/

dguido, almost 13 years ago
Parallel gzip, in case anyone wanted it: http://zlib.net/pigz/

I've used it to great effect during incident response when I needed to search through hundreds of gigs of logs at a time.

malkia, almost 13 years ago
"The results: 18.7 seconds for bzip2, and… wait for it… 3.5 seconds for pbzip2. That’s an increase of over 80%!"

File cache effect? He should cold reboot first (not sure how you force the file cache out on OS X/Linux; on Windows I do it with SysInternals RamMap) and try in a different order.

It could still be faster, but he could really be measuring I/O that was done in the first case and not in the second.

It's also strange that .tar files are used, not .tar.bz2 or .tbz (if such an extension makes sense).
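
As a sketch of the cache-aware measurement malkia is suggesting: the snippet below drops the Linux page cache between runs so neither tool benefits from warm I/O. The file name is a placeholder, the drop step needs root, and on OS X the rough equivalent is the `purge` command.

```python
import subprocess, time

def drop_caches():
    """Flush dirty pages and drop the Linux page cache (needs root)."""
    subprocess.run(["sync"], check=True)
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3")

def timed(cmd):
    drop_caches()  # cold-cache start for every run
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start

# -k keeps the input; -f overwrites output left over from the previous run
for cmd in (["bzip2", "-kf", "input.tar"], ["pbzip2", "-kf", "input.tar"]):
    print(cmd[0], f"{timed(cmd):.1f}s")
```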

mattst88, almost 13 years ago
I used to use pbzip2 before I learned about lbzip2 (http://lacos.hu/).

lbzip2 is able to decompress single streams using multiple threads, which apparently pbzip2 cannot do. See the thread beginning with http://lists.debian.org/debian-mentors/2009/02/msg00098.html

juiceandjuice, almost 13 years ago
bzip2 has always been parallelizable. At one point a few years ago I was working on a compressed file format that included compressed-block metadata, because bzip2 is most efficient when it gets ~900 kB to compress at a time. In effect, you split the file into 900 kB chunks, compress them in parallel, and recombine them into one file at the end.

Inufu, almost 13 years ago
Is there a reason this is not the default?

BrainInAJar, almost 13 years ago
Is there a pbzip2 that doesn't eat *all* your memory?

rorrr, almost 13 years ago
A GPU implementation would be cool.