It seems somewhat suspicious that the benchmarks don't compare against zstd.

It's not entirely clear to me what the selling point is. "Better than bzip2" isn't exactly a convincing pitch, given that bzip2 is mostly of historic interest these days.

Right now the modern compression field is basically covered by xz (if you mostly care about the best compression ratio) and zstd (if you want decent compression and very good speed), so when someone pitches a new compressor they should say where it stands relative to those.
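Edit: for what it's worth, here is roughly how I'd get those numbers myself — a minimal Python sketch, assuming the third-party zstandard package is installed alongside the stdlib bz2 and lzma modules (it only measures compression, not decompression, and the levels are arbitrary):

    # Rough single-file comparison of ratio and compression speed.
    # Assumes: pip install zstandard (bz2 and lzma ship with Python).
    import bz2, lzma, sys, time
    import zstandard

    def bench(name, compress, data):
        t0 = time.perf_counter()
        out = compress(data)
        dt = time.perf_counter() - t0
        print(f"{name:>8}: {len(out) / len(data):6.1%} of original, "
              f"{len(data) / dt / 1e6:6.1f} MB/s")

    data = open(sys.argv[1], "rb").read()
    bench("bzip2-9", lambda d: bz2.compress(d, 9), data)
    bench("xz-6", lambda d: lzma.compress(d, preset=6), data)
    bench("zstd-19", lambda d: zstandard.ZstdCompressor(level=19).compress(d), data)

Nothing like a proper benchmark suite (Squash or lzbench do this properly), but it makes the ratio/speed trade-off visible in a minute.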
Looks interesting, but my main objections to general adoption are the same as for bzip2, LZMA and other context-modelling-based codecs: decompression speed.

Take compressed logs, for instance: a decompression speed of 23 MB/s per core is simply too slow when you need to grep through gigabytes of data. The same goes for data analysis; you don't want your input speed to be this limited when analysing gigabytes of data.

I am also not sure how I feel about you "stealing" the bzip name. While the author of bzip2 doesn't seem to plan a follow-up, it feels like bad manners to take over a name like this.
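Edit: to put the 23 MB/s figure in context, a quick back-of-the-envelope in Python (the archive sizes and the 500 MB/s "fast codec" figure are just illustrative assumptions):

    # Time to scan a compressed log archive when decompression is the bottleneck.
    def scan_minutes(archive_gb, mb_per_s_per_core, cores=1):
        return archive_gb * 1024 / (mb_per_s_per_core * cores) / 60

    print(scan_minutes(10, 23))      # ~7.4 minutes for 10 GB on one core
    print(scan_minutes(10, 23, 8))   # ~0.9 minutes across 8 cores
    print(scan_minutes(10, 500))     # ~0.3 minutes at LZ4/zstd-class speeds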
If anyone cares more about speed than compression ratio, I'd recommend lz4 [1]. I only recently started using it. Its speed is almost comparable to memcpy.

[1] https://github.com/lz4/lz4
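If you want to try it from a script rather than the CLI, a minimal sketch assuming the python-lz4 bindings (pip install lz4):

    # Round-trip through the LZ4 frame format; speed comes at the cost of ratio.
    import lz4.frame

    data = b"the quick brown fox jumps over the lazy dog " * 10000
    compressed = lz4.frame.compress(data)
    assert lz4.frame.decompress(compressed) == data
    print(f"{len(data)} -> {len(compressed)} bytes")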
The Burrows-Wheeler transform, which was the main innovation of bzip2 over gzip, and which bzip3 retains, is one of the most fascinating algorithms to study: https://en.wikipedia.org/wiki/Burrows-Wheeler_transform

It hasn't been used much lately because of the computational overhead, but it's interesting, and I'm glad that there's still work in this area. For anyone interested in algorithms it's a great one to wrap your head around.
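If you want to play with it, a deliberately naive Python sketch (quadratic, toy-sized inputs only; real implementations build a suffix array instead of sorting rotations):

    # Naive Burrows-Wheeler transform and its inverse.
    # A 0x00 sentinel is appended, so the input must not contain that byte.
    def bwt(s: bytes) -> bytes:
        s = s + b"\x00"
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return bytes(r[-1] for r in rotations)

    def ibwt(last: bytes) -> bytes:
        table = [b""] * len(last)
        for _ in range(len(last)):
            table = sorted(bytes([c]) + row for c, row in zip(last, table))
        original = next(row for row in table if row.endswith(b"\x00"))
        return original[:-1]

    text = b"banana_bandana_cabana"
    transformed = bwt(text)
    print(transformed)   # equal characters tend to cluster, which helps the later stages
    assert ibwt(transformed) == text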
Here are some other BWT compressors in the Large Text Compression Benchmark (look for "BWT" in the "Alg" column): http://mattmahoney.net/dc/text.html

And here is a BWT library with benchmarks: https://github.com/IlyaGrebnov/libsais#benchmarks
From their own benchmarks it seems more like bzip3 is geared towards a different compression/speed trade-off than bzip2, rather than an unambiguous all-around improvement. Am I misreading it?
From the "disclaimers" section:<p><i>> Every compression of a file implies an assumption that the compressed file can be decompressed to reproduce the original. Great efforts in design, coding and testing have been made to ensure that this program works correctly.</i><p><i>> However, the complexity of the algorithms, and, in particular, the presence of various special cases in the code which occur with very low but non-zero probability make it impossible to rule out the possibility of bugs remaining in the program.</i><p>That got me thinking: I've always implicitly assumed that authors of lossless compression algorithms write mathematical proofs that D o C = id[1]. However, now that I've started looking, I can't seem to find that even for Deflate. What is the norm?<p>[1]: C being the compression function, D being the decompression function, and o being function composition.
Good work!

I was also confused by the claims of being faster than bzip2, and then I saw the discussion in this issue: https://github.com/kspalaiologos/bzip3/issues/2
One of the things that's cool about bzip is that it makes use of algorithmic techniques developed by theoretical computer scientists in order to perform the Burrows-Wheeler transform efficiently. It's a great example of theory and practice working symbiotically.
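Concretely, the link is that the last column of the sorted rotations can be read straight off a suffix array, so a linear-time suffix array construction (SA-IS, as used by libsais) gives a linear-time BWT. A toy Python sketch with a naive, slow suffix array, just to show the relationship:

    # BWT read directly from a suffix array instead of sorting rotations.
    def suffix_array(s: bytes) -> list[int]:
        # Naive construction by sorting suffixes; SA-IS does this in O(n).
        return sorted(range(len(s)), key=lambda i: s[i:])

    def bwt_from_sa(s: bytes) -> bytes:
        s = s + b"\x00"   # unique smallest sentinel
        # The BWT character for each sorted suffix is the byte just before it.
        return bytes(s[i - 1] for i in suffix_array(s))

    print(bwt_from_sa(b"banana_bandana"))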
>better, faster<p>If I'm reading the benchmarks correctly, it gets higher compression but is slower and has higher memory usage. Thus cannot call it better.<p>>spiritual successor to BZip2<p>What does that mean? If it isn't related to bzip2, why choose this name?
Hmm, I see LZ77, PPM and entropy coding in the description, and obviously Burrows-Wheeler.

Has anyone tried running zstd at the end instead of LZ77 and entropy coding?

Does the idea even make sense? (I'm a layman.)
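As an experiment the idea makes sense: keep BWT (plus something like move-to-front) as the front-end and let zstd's LZ and entropy stages be the back-end. A toy Python sketch, assuming the third-party zstandard package, with a naive quadratic BWT and no LZP or RLE, so it says nothing about what bzip3 actually does:

    # Toy: does a BWT + move-to-front front-end help zstd on a small text sample?
    import zstandard

    def bwt(s: bytes) -> bytes:
        s = s + b"\x00"
        order = sorted(range(len(s)), key=lambda i: s[i:])
        return bytes(s[i - 1] for i in order)

    def mtf(data: bytes) -> bytes:
        alphabet = list(range(256))
        out = bytearray()
        for b in data:
            i = alphabet.index(b)
            out.append(i)
            alphabet.insert(0, alphabet.pop(i))
        return bytes(out)

    data = open(__file__, "rb").read()   # this script itself, as a small text sample
    z = zstandard.ZstdCompressor(level=19)
    print("plain zstd:", len(z.compress(data)))
    print("bwt+mtf+zstd:", len(z.compress(mtf(bwt(data)))))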
It doesn't compare itself against bsc, which feels a bit poor IMO given that it's using Grebnov's libsais and LZP algorithm (he's the author of libbsc).

In my own benchmarks, it produces basically comparable sizes (about 0.1% smaller than bsc), has comparable encode speeds, and about half the decode speed. Plus bsc has better multi-threading capability when dealing with large blocks.

Also see https://quixdb.github.io/squash-benchmark/unstable/ (and without /unstable for more system types) for various charts. No bzip3 there yet, though.
There comes a point where the complexity itself becomes too much of a liability. It's important to be able to trust these algorithms as well as all popular implementations with your data.
Will bzip3 be added to the Squash benchmarks?

https://quixdb.github.io/squash-benchmark/

I note that the "Calgary Corpus" that bzip3 prominently advertises is obsolete, dating back to the late 80s:

https://en.wikipedia.org/wiki/Calgary_corpus
I'm really interested in GPU-based compression and decompression.

Does anyone know what the current SOTA GPU-based algorithms are, and why they haven't taken off?

Brotli has gotten browser support, so it seems to my naive self that a GPU-based algorithm is just waiting to take over.
Why is there such a big disclaimer/warning on the front?

Shouldn't the program just check that decompress(compress(x)) = x as it goes, and then it can be sure that compress(x) has not lost any data?
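Some tools do let you check after the fact (bzip2 -t tests an archive's integrity), and a compressor could indeed verify the round trip before writing anything. A sketch of that self-check in Python, using the stdlib bz2 module purely as a stand-in:

    # Compress, verify the round trip, and only then write the output.
    import bz2, sys

    src = sys.argv[1]
    data = open(src, "rb").read()
    compressed = bz2.compress(data, 9)

    if bz2.decompress(compressed) != data:
        sys.exit("round-trip verification failed; refusing to write output")

    with open(src + ".bz2", "wb") as f:
        f.write(compressed)

The catch the disclaimer hints at: this roughly doubles the CPU cost, and it only proves the round trip for the files you actually compressed, not for every possible input.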