In other compression news, Apple open sourced their implementation of lzfse yesterday: https://github.com/lzfse/lzfse. It's based on a relatively new type of coding: asymmetric numeral systems. Huffman coding is only optimal if you consider one bit as the smallest unit of information. ANS (and more broadly, arithmetic coding) allows for fractional bits and gets closer to the Shannon limit. It's also simpler to implement than (real-world) Huffman.

Unfortunately, most open source implementations of ANS are not highly optimized and are quite division-heavy, so they lag on speed benchmarks. Apple's implementation looks pretty good (they're using it in OS X, err, macOS, and iOS), and there's some promising academic work being done on better implementations (optimizing Huffman for x86, ARM, and FPGA is a pretty well-studied problem). The compression story is still being written.
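To make the fractional-bit point concrete, here's a toy Python sketch (not from the lzfse source; the skewed two-symbol alphabet is made up purely for illustration) comparing the Shannon limit with the best any prefix code like Huffman can do:

```python
import math

# Toy illustration: for a heavily skewed two-symbol source, Huffman must
# spend at least 1 whole bit per symbol, while the Shannon limit (which
# ANS and arithmetic coders can approach) is far lower.
probs = {"a": 0.95, "b": 0.05}

# Shannon entropy in bits per symbol: H = -sum(p * log2(p))
entropy = -sum(p * math.log2(p) for p in probs.values())

# Best a prefix (Huffman) code can do here: 1 bit for each symbol.
huffman_bits = sum(p * 1 for p in probs.values())

print(f"Shannon limit: {entropy:.3f} bits/symbol")      # ~0.286
print(f"Huffman code : {huffman_bits:.3f} bits/symbol")  # 1.000
```

With probabilities this skewed, an integer-bit code wastes over two thirds of every bit; fractional-bit coders close most of that gap.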
Not only is this a great read, but the follow-up asking for citations is answered with "I am the reference".

If this were reddit I'd post the hot fire gif. Eh, here it is anyway: http://i.imgur.com/VQLGJOL.gif
It's an annoyingly common pattern: the OP doesn't mark this answer as accepted, or even acknowledge how amazing it is coming from one of the technology's creators; instead they just go on to ask a follow-up.
It seems like it wouldn't be that hard to create an indexed tar.gz format that's backwards compatible.

One way would be to use the last file in the tar as the index. As files are added, you remove the index, append the new file, record some basic file metadata and the compressed offset (maybe of the deflate chunk) in the index, update the index size in bytes in a small footer at the end of the index, and append the updated index to the compressed tar.

You can retrieve the index by starting at the end of the compressed archive and reading backwards until you find a deflate header (at most 65k plus a few more bytes, since that's the size of a deflate chunk). If it's an indexed tar, the last file will be the index, and the end of the index will be a footer with the index size (so you know the maximum you'll need to seek back from the end). This isn't extremely efficient, but it is limited in scope, and it's helped by knowing the index size; a rough sketch of such a footer follows below.

You could verify the index by checking some or all of the reported file byte offsets. The worst case is small files, with one or more per deflate chunk, where you would have to visit each chunk. That makes the worst case equivalent to listing the files of an un-indexed tar.gz, plus the (relatively small) overhead of locating and reading the index.

Uncompressing the archive as a regular tar.gz would work normally, just with an additional file (the index) included.

I imagine this isn't popular not because it hasn't been done, but because most people don't really need an index.
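No such format exists as a standard as far as I know, so everything below is hypothetical: the magic string and footer layout are invented for illustration. A minimal Python sketch of the self-describing footer might look like:

```python
import struct

# Hypothetical footer for the indexed-tar.gz idea above. The index is the
# last file in the archive, and its final bytes record the index's own
# size so a reader can seek back from the end of the archive.
# Layout (made up): 8-byte magic string + little-endian uint64 index size.
FOOTER_MAGIC = b"TGZINDEX"
FOOTER = struct.Struct("<8sQ")

def add_footer(index_bytes: bytes) -> bytes:
    # Append the footer to the serialized index before tarring it.
    return index_bytes + FOOTER.pack(FOOTER_MAGIC, len(index_bytes))

def parse_footer(tail: bytes):
    # Given the trailing bytes of the decompressed last file, return the
    # index size, or None if this archive carries no index.
    if len(tail) < FOOTER.size:
        return None
    magic, size = FOOTER.unpack(tail[-FOOTER.size:])
    return size if magic == FOOTER_MAGIC else None
```

A plain gunzip/untar never looks at the footer, so the archive stays backwards compatible: the index simply extracts as one more file.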
A worthwhile read on the coolest kid on the block, the xz compression algorithm (of LZMA fame), plus tar.gz vs. tar.xz scenarios and discussion:

http://stackoverflow.com/questions/6493270/why-is-tar-gz-still-much-more-common-than-tar-xz
The last few days I find myself wondering if there needs to be some kind of org set up to preserve this sort of info.

Right now it seems to be strewn across a myriad of blogs, forums, and whatnot that risk going poof. And even if the Internet Archive picks them up, it is anything but curated (unlike, say, Wikipedia, even with all the warts).
My father teaching me to type PKUNZIP on files that "ended with .zip" in the DOS shell (not long before the Norton Commander kind of GUI arrived on our computer) is one of my earliest memories as a toddler. I would ask him "What does it mean?" and he would simply not know. It was 1990 and I was three and a half, I think. When I finally learned what it stood for, it was kind of epic for me.
It is rare to be able to have a question answered so completely and from such a first-hand source. This post is gold and tickles me in all the right places.

StackOverflow is sitting on a veritable treasure trove of knowledge.
Reminds me of the very sad zip story:

https://www.youtube.com/watch?v=_zvFeHtcxuA

The whole "BBS Documentary" is great, and I recommend starting at the beginning if you're interested in it:

https://www.youtube.com/watch?v=dRap7uw9iWI
One important difference in practice is that zip files need to be saved to disk to be extracted, whereas gzip files can be stream-unzipped: curl http://example.com/foo.tar.gz | tar zxvf - works, but the same isn't possible with zip files. I am not sure if this is a limitation of the unzip tool or of the format itself (the zip central directory lives at the end of the file). I would love to know if there is a workaround.
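The root cause is the format: gzip is a pure forward stream, while a zip reader is expected to seek to the central directory at the end of the file. A small Python sketch (the chunked framing is mine, not from the comment) shows the difference:

```python
import zlib

def stream_gunzip(chunks):
    # Decompress a gzip stream chunk by chunk; no seeking is ever needed,
    # which is exactly why `curl ... | tar zxvf -` works on a pipe.
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header/trailer.
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for chunk in chunks:
        yield d.decompress(chunk)
    yield d.flush()

# zipfile, by contrast, must seek to the end of its input to read the
# central directory, so handing it a non-seekable pipe fails up front:
#   zipfile.ZipFile(sys.stdin.buffer)  # raises, stdin can't seek/tell
```

There are partial workarounds: each zip entry is also preceded by a local file header, so libarchive's bsdtar can stream-extract a zip from a pipe, and Info-ZIP's funzip will decompress the first member from stdin; you just lose whatever metadata only the central directory carries.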
I love the discussion in the comments:

> This post is packed with so much history and information that I feel like some citations need be added incase people try to reference this post as an information source. Though if this information is reflected somewhere with citations like Wikipedia, a link to such similar cited work would be appreciated. - ThorSummoner

> I am the reference, having been part of all of that. This post could be cited in Wikipedia as an original source. - Mark Adler
When I read "I am the reference" it reminded me of "I am the danger":

https://www.youtube.com/watch?v=3v_zlyHgazs