It seems somewhat suspicious that the benchmarks don't compare against zstd.

It's not entirely clear to me what the selling point is. "Better than bzip2" isn't exactly a convincing pitch, given that bzip2 is mostly of historic interest these days.

Right now the modern compression field is basically covered by xz (if you mostly care about the best compression ratio) and zstd (if you want decent compression and very good speed), so when someone pitches a new compressor they should say where it stands relative to those.
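Edit: for what it's worth, here is roughly how I'd get those numbers myself — a minimal Python sketch, assuming the third-party zstandard package is installed alongside the stdlib bz2 and lzma modules (it only measures compression, not decompression, and the levels are arbitrary):

    # Rough single-file comparison of ratio and compression speed.
    # Assumes: pip install zstandard (bz2 and lzma ship with Python).
    import bz2, lzma, sys, time
    import zstandard

    def bench(name, compress, data):
        t0 = time.perf_counter()
        out = compress(data)
        dt = time.perf_counter() - t0
        print(f"{name:>8}: {len(out) / len(data):6.1%} of original, "
              f"{len(data) / dt / 1e6:6.1f} MB/s")

    data = open(sys.argv[1], "rb").read()
    bench("bzip2-9", lambda d: bz2.compress(d, 9), data)
    bench("xz-6", lambda d: lzma.compress(d, preset=6), data)
    bench("zstd-19", lambda d: zstandard.ZstdCompressor(level=19).compress(d), data)

Nothing like a proper benchmark suite (Squash or lzbench do this properly), but it makes the ratio/speed trade-off visible in a minute.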
Looks interesting, but my main objections to general adoption are the same as for bzip2, LZMA and other context-modelling-based codecs: decompression speed.

Take compressed logs, for instance: a decompression speed of 23 MB/s per core is simply too slow when you need to grep through gigabytes of data. The same goes for data analysis; you don't want your input speed to be this limited when analysing gigabytes of data.

I am also not sure how I feel about you "stealing" the bzip name. While the author of bzip2 doesn't seem to plan a follow-up, it feels like bad manners to take over a name like this.
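Edit: to put the 23 MB/s figure in context, a quick back-of-the-envelope in Python (the archive sizes and the 500 MB/s "fast codec" figure are just illustrative assumptions):

    # Time to scan a compressed log archive when decompression is the bottleneck.
    def scan_minutes(archive_gb, mb_per_s_per_core, cores=1):
        return archive_gb * 1024 / (mb_per_s_per_core * cores) / 60

    print(scan_minutes(10, 23))      # ~7.4 minutes for 10 GB on one core
    print(scan_minutes(10, 23, 8))   # ~0.9 minutes across 8 cores
    print(scan_minutes(10, 500))     # ~0.3 minutes at LZ4/zstd-class speeds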
If anyone cares more about speed than compression ratio, I'd recommend lz4 [1]. I only recently started using it. Its speed is almost comparable to memcpy.

[1] https://github.com/lz4/lz4
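If you want to try it from a script rather than the CLI, a minimal sketch assuming the python-lz4 bindings (pip install lz4):

    # Round-trip through the LZ4 frame format; speed comes at the cost of ratio.
    import lz4.frame

    data = b"the quick brown fox jumps over the lazy dog " * 10000
    compressed = lz4.frame.compress(data)
    assert lz4.frame.decompress(compressed) == data
    print(f"{len(data)} -> {len(compressed)} bytes")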
The Burrows-Wheeler transform, which was the main innovation of bzip2 over gzip, and which bzip3 retains, is one of the most fascinating algorithms to study: https://en.wikipedia.org/wiki/Burrows-Wheeler_transform

It hasn't been used much lately because of the computational overhead, but it's interesting, and I'm glad that there's still work in this area. For anyone interested in algorithms it's a great one to wrap your head around.
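If you want to play with it, a deliberately naive Python sketch (quadratic, toy-sized inputs only; real implementations build a suffix array instead of sorting rotations):

    # Naive Burrows-Wheeler transform and its inverse.
    # A 0x00 sentinel is appended, so the input must not contain that byte.
    def bwt(s: bytes) -> bytes:
        s = s + b"\x00"
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return bytes(r[-1] for r in rotations)

    def ibwt(last: bytes) -> bytes:
        table = [b""] * len(last)
        for _ in range(len(last)):
            table = sorted(bytes([c]) + row for c, row in zip(last, table))
        original = next(row for row in table if row.endswith(b"\x00"))
        return original[:-1]

    text = b"banana_bandana_cabana"
    transformed = bwt(text)
    print(transformed)   # equal characters tend to cluster, which helps the later stages
    assert ibwt(transformed) == text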
Here are some other BWT compressors in the Large Text Compression Benchmark (look for "BWT" in the "Alg" column): http://mattmahoney.net/dc/text.html

And here is a BWT library with benchmarks: https://github.com/IlyaGrebnov/libsais#benchmarks
From their own benchmarks it seems more like bzip3 is geared towards a different compression/speed trade-off than bzip2, rather than an unambiguous all-around improvement. Am I misreading it?
From the "disclaimers" section:<p><i>> Every compression of a file implies an assumption that the compressed file can be decompressed to reproduce the original. Great efforts in design, coding and testing have been made to ensure that this program works correctly.</i><p><i>> However, the complexity of the algorithms, and, in particular, the presence of various special cases in the code which occur with very low but non-zero probability make it impossible to rule out the possibility of bugs remaining in the program.</i><p>That got me thinking: I've always implicitly assumed that authors of lossless compression algorithms write mathematical proofs that D o C = id[1]. However, now that I've started looking, I can't seem to find that even for Deflate. What is the norm?<p>[1]: C being the compression function, D being the decompression function, and o being function composition.
Good work!

I was also confused by the claims of being faster than bzip2, and then I saw the discussion in this issue: https://github.com/kspalaiologos/bzip3/issues/2
One of the things that's cool about bzip is that it makes use of algorithmic techniques developed by theoretical computer scientists in order to perform the Burrows-Wheeler transform efficiently. It's a great example of theory and practice working symbiotically.
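Concretely, the link is that the last column of the sorted rotations can be read straight off a suffix array, so a linear-time suffix array construction (SA-IS, as used by libsais) gives a linear-time BWT. A toy Python sketch with a naive, slow suffix array, just to show the relationship:

    # BWT read directly from a suffix array instead of sorting rotations.
    def suffix_array(s: bytes) -> list[int]:
        # Naive construction by sorting suffixes; SA-IS does this in O(n).
        return sorted(range(len(s)), key=lambda i: s[i:])

    def bwt_from_sa(s: bytes) -> bytes:
        s = s + b"\x00"   # unique smallest sentinel
        # The BWT character for each sorted suffix is the byte just before it.
        return bytes(s[i - 1] for i in suffix_array(s))

    print(bwt_from_sa(b"banana_bandana"))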
>better, faster<p>If I'm reading the benchmarks correctly, it gets higher compression but is slower and has higher memory usage. Thus cannot call it better.<p>>spiritual successor to BZip2<p>What does that mean? If it isn't related to bzip2, why choose this name?
Hmm, I see LZ77, PPM and entropy coding in the description, and obviously Burrows-Wheeler.

Has anyone tried running zstd at the end instead of LZ77 and entropy coding?

Does the idea even make sense? (I'm a layman.)
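As an experiment the idea makes sense: keep BWT (plus something like move-to-front) as the front-end and let zstd's LZ and entropy stages be the back-end. A toy Python sketch, assuming the third-party zstandard package, with a naive quadratic BWT and no LZP or RLE, so it says nothing about what bzip3 actually does:

    # Toy: does a BWT + move-to-front front-end help zstd on a small text sample?
    import zstandard

    def bwt(s: bytes) -> bytes:
        s = s + b"\x00"
        order = sorted(range(len(s)), key=lambda i: s[i:])
        return bytes(s[i - 1] for i in order)

    def mtf(data: bytes) -> bytes:
        alphabet = list(range(256))
        out = bytearray()
        for b in data:
            i = alphabet.index(b)
            out.append(i)
            alphabet.insert(0, alphabet.pop(i))
        return bytes(out)

    data = open(__file__, "rb").read()   # this script itself, as a small text sample
    z = zstandard.ZstdCompressor(level=19)
    print("plain zstd:", len(z.compress(data)))
    print("bwt+mtf+zstd:", len(z.compress(mtf(bwt(data)))))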
It doesn't compare itself against bsc, which feels a bit poor IMO given that it's using Grebnov's libsais and LZP algorithm (he's the author of libbsc).

In my own benchmarks, it produces basically comparable sizes (about 0.1% smaller than bsc), has comparable encode speeds, and about half the decode speed. Plus bsc has better multi-threading capability when dealing with large blocks.

Also see https://quixdb.github.io/squash-benchmark/unstable/ (and without /unstable for more system types) for various charts. No bzip3 there yet, though.
There comes a point where the complexity itself becomes too much of a liability. It's important to be able to trust these algorithms as well as all popular implementations with your data.
Will bzip3 be added to the Squash benchmarks?

https://quixdb.github.io/squash-benchmark/

I note that the "Calgary Corpus" that bzip3 prominently advertises is obsolete, dating back to the late 80s:

https://en.wikipedia.org/wiki/Calgary_corpus
I'm really interested in GPU-based compression and decompression.

Does anyone know what the current SOTA GPU-based algorithms are, and why they haven't taken off?

Brotli has gotten browser support, so it seems to my naive self that a GPU-based algorithm is just waiting to take over.
Why is there such a big disclaimer/warning on the front?

Shouldn't the program just check that decompress(compress(x)) = x as it goes, and then it can be sure that compress(x) has not lost any data?
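Some tools do let you check after the fact (bzip2 -t tests an archive's integrity), and a compressor could indeed verify the round trip before writing anything. A sketch of that self-check in Python, using the stdlib bz2 module purely as a stand-in:

    # Compress, verify the round trip, and only then write the output.
    import bz2, sys

    src = sys.argv[1]
    data = open(src, "rb").read()
    compressed = bz2.compress(data, 9)

    if bz2.decompress(compressed) != data:
        sys.exit("round-trip verification failed; refusing to write output")

    with open(src + ".bz2", "wb") as f:
        f.write(compressed)

The catch the disclaimer hints at: this roughly doubles the CPU cost, and it only proves the round trip for the files you actually compressed, not for every possible input.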