We're talking about long-term archiving here. That means centuries.<p>My brother, an archaeological archivist of ancient (~2000 BCE) Mesopotamian artifacts, has a lot to say about archival formats. His raw material is mostly fired clay tablets. Those archives keep working, partially, even if broken. That's good, because many of them are in fact broken when found.<p>But their ordering and other metadata about where they were found is written in archaeologists' notebooks, and many of those notebooks are now over a century old. Paper deteriorates. If a lost flake of paper from a page in the notebook rendered the whole notebook useless, that would be a disastrous outcome for that archive.<p>A decade ago I suggested digitizing the notebooks and storing the bits on CD-ROMs. He laughed, saying "we don't know enough about the long-term viability of CD-ROMs and their readers."<p>Now, when they get around to it, they're digitizing the notebooks and storing them as PDFs on backed-up RAID10 volumes. But they're also printing them on acid-free paper and putting them with the originals in the vaults where they store old books.<p>My point: planning for centuries-long archiving is difficult. Formats with redundancy, or at least forward error correction codes, are very helpful. Formats that can be rendered useless by a few bit-flips, not so much.
Correct, xz is no longer particularly useful, mostly annoying.<p>For read-only, long-term, filesystem-like archives, use squashfs with whatever compression option is convenient. The cool thing about squashfs is that you can efficiently list the contents and extract single files. That makes it about the best option there is for long-term, filesystem-like archives.<p><a href="https://en.m.wikipedia.org/wiki/SquashFS" rel="nofollow">https://en.m.wikipedia.org/wiki/SquashFS</a><p>For everything else, or if you want (very) fast compression/decompression and/or general high-ratio compression, use zstd.<p><a href="https://github.com/facebook/zstd" rel="nofollow">https://github.com/facebook/zstd</a><p>I've used this combination professionally to great effect. You're welcome :)
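A rough sketch of the squashfs workflow mentioned above, assuming squashfs-tools (mksquashfs/unsquashfs) is installed and new enough to support zstd compression; the directory and file names are placeholders, and you should check the flags against your installed version:<p><pre><code>import subprocess

# Build a read-only, zstd-compressed image from a directory tree.
subprocess.run(
    ["mksquashfs", "photos/", "photos.squashfs", "-comp", "zstd"],
    check=True,
)

# List the contents without unpacking anything.
subprocess.run(["unsquashfs", "-l", "photos.squashfs"], check=True)

# Extract a single file (path is relative to the image root); only the
# blocks that file needs get decompressed.
subprocess.run(
    ["unsquashfs", "-d", "restored/", "photos.squashfs", "2023/trip/img_0001.jpg"],
    check=True,
)
</code></pre><p>You can also loop-mount the image and browse it like any other read-only filesystem.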
It was really disappointing when dpkg actively deprecated support (which I feel they should <i>never</i> do) for lzma-format archives and went all-in on xz. The decompressor now needs the annoying flexibility mentioned in this article, and the only benefit of the format--random access into the file--is almost entirely defeated by dpkg using it to compress a tar file, which barely supports any form of accelerated access even when uncompressed: the best you can do is skip through file headers, which only helps if the files in the archive are large enough. To add insult to injury, the files are now all slightly larger to account for the extra headers :/.<p>Regardless, this is a pretty old article, and if you search for it you will find a number of earlier discussions of it, each with a bunch of comments.<p><a href="https://news.ycombinator.com/item?id=20103255" rel="nofollow">https://news.ycombinator.com/item?id=20103255</a><p><a href="https://news.ycombinator.com/item?id=16884832" rel="nofollow">https://news.ycombinator.com/item?id=16884832</a><p><a href="https://news.ycombinator.com/item?id=12768425" rel="nofollow">https://news.ycombinator.com/item?id=12768425</a>
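To illustrate the "skip through file headers" point above, a minimal sketch with Python's tarfile module; the archive and member names are made up for illustration. Even on an uncompressed tar, locating one member means walking every header before it:<p><pre><code>import tarfile

# Each next() reads one 512-byte header and seeks past that member's data,
# so lookup is linear in the number of members -- there is no index.
with tarfile.open("some-archive.tar") as tar:
    member = tar.next()
    while member is not None and member.name != "path/inside/archive.txt":
        member = tar.next()
    if member is not None:
        data = tar.extractfile(member).read()
</code></pre>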
I disagree with the premise of the article. Archive formats are all inadequate for long-term resilience and making them adequate would be a violation of the “do one thing and do it right” principle.<p>To support resilience, you don’t need an alternative to xz, you need hashes and forward error correction. Specifically, compress your file using xz for high compression ratio, optionally encrypt it, then take a SHA-256 hash to be used for <i>detecting</i> errors, then generate parity files using PAR[1] or zfec[2] to be used for <i>correcting</i> errors.<p>[1] <a href="https://wiki.archlinux.org/title/Parchive" rel="nofollow">https://wiki.archlinux.org/title/Parchive</a><p>[2] <a href="https://github.com/tahoe-lafs/zfec" rel="nofollow">https://github.com/tahoe-lafs/zfec</a>
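A rough sketch of that pipeline in Python, using only the standard library for compression and hashing and shelling out to par2 for the parity step; the file names are placeholders, and the par2 flags should be checked against your installed version (or swap in zfec):<p><pre><code>import hashlib
import lzma
import subprocess

# 1. Compress for ratio (xz container via liblzma).
with open("notes.tar", "rb") as f:
    compressed = lzma.compress(f.read(), preset=9)
with open("notes.tar.xz", "wb") as f:
    f.write(compressed)

# 2. Record a hash for *detecting* corruption later.
digest = hashlib.sha256(compressed).hexdigest()
with open("notes.tar.xz.sha256", "w") as f:
    f.write(f"{digest}  notes.tar.xz\n")

# 3. Generate parity files for *correcting* corruption
#    (-r10 asks for roughly 10% redundancy).
subprocess.run(["par2", "create", "-r10", "notes.tar.xz"], check=True)
</code></pre><p>The nice property is that each layer does one job: xz only compresses, the hash only detects, and the parity files only repair.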
Folks seem to be comparing xz to zstd, but if I'm understanding correctly, the true competitor to xz is the article author's “lzip” format, which uses the same LZMA compression as xz but with a much better designed container format (at least according to the author).
The vast majority of the discussion is around xz's inability to deal with corrupted data. That said, couldn't you argue that this needs to be solved at a lower level (storage, transport)? I'm not convinced the compression algorithm is the right place to tackle it.<p>Just use a file system that does proper integrity checking/resilvering. Also use TLS to transfer data over the network.
The article spends a lot of time discussing XZ's behavior when reading corrupt archives, but in practice this is not a case that will ever happen.<p>Say you have a file `m4-1.4.19.tar.xz` in your source archive collection. The version I just downloaded from ftp.gnu.org has the SHA-256 checksum `63aede5c6d33b6d9b13511cd0be2cac046f2e70fd0a07aa9573a04a82783af96`. The GNU site doesn't have checksums for archives, but it does have a PGP signature file, so it's possible to (1) verify the archive signature and (2) store the checksum of the file.<p>If that file is later corrupted, the corruption will be detected via the SHA-256 checksum. There's no reason to worry about whether the file itself has an embedded CRC-whatever, or the behavior with regards to variable-length integer fields. If a bit gets flipped then the checksum won't match, and that copy of the file will be discarded (= replaced with a fresh copy from replicated storage) before it ever hits the XZ code.<p>If this is how the lzip format was designed -- optimized for a use case that doesn't exist -- it's no wonder it sees basically no adoption in favor of more pragmatic formats like XZ or Zstd.
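For what it's worth, the "check it before it ever hits the XZ code" step is tiny — a sketch using the filename and checksum quoted above:<p><pre><code>import hashlib
import sys

EXPECTED = "63aede5c6d33b6d9b13511cd0be2cac046f2e70fd0a07aa9573a04a82783af96"

# Hash the archive in chunks so large files don't need to fit in memory.
h = hashlib.sha256()
with open("m4-1.4.19.tar.xz", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        h.update(chunk)

if h.hexdigest() != EXPECTED:
    sys.exit("checksum mismatch: fetch a fresh copy before decompressing")
</code></pre>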
Archival formats have always been of interest to me, given the very practical need to store a large number of backups across any number of storage media - documents, pictures, music, sometimes particularly good movies, even the occasional software or game installer.<p>Right now, I've personally settled on using the 7z format: <a href="https://en.wikipedia.org/wiki/7z" rel="nofollow">https://en.wikipedia.org/wiki/7z</a><p>The decompression speeds feel good, the compression ratios also seem better than ZIP, and somehow it still feels like a widely supported format, with the 7-Zip program in particular being nice to use: <a href="https://en.wikipedia.org/wiki/7-Zip" rel="nofollow">https://en.wikipedia.org/wiki/7-Zip</a><p>Of course, various archivers on *nix systems also seem to support it, so so far everything feels good. Though the chance of an archive getting corrupted and no longer decompressing properly, losing all of those files at once (versus just using the filesystem, where something like that might only affect a single file), still sometimes bothers me.<p>Then again, on a certain level, I guess nothing is permanent, and at least it's possible to occasionally test the archives for errors and look into restoring them from backups, should something like that ever occur. Might just have to automate those tests, though.<p>Yet, for the most part, going with an exceedingly boring option like that seems like a good idea, though the space could definitely use more projects and new algorithms for even better compression ratios, so at the very least it's nice to see attempts to do so!
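A sketch of what automating those tests could look like, assuming the 7z command-line tool is on PATH (its "t" command runs an integrity test) and the backup directory is a made-up path:<p><pre><code>import pathlib
import subprocess

# Run 7-Zip's integrity test over every .7z archive found, and collect
# the ones that fail so they can be restored from a backup copy.
bad = []
for archive in pathlib.Path("/backups").rglob("*.7z"):
    result = subprocess.run(["7z", "t", str(archive)], capture_output=True)
    if result.returncode != 0:
        bad.append(archive)

for archive in bad:
    print(f"integrity test failed: {archive}")
</code></pre><p>Dropped into a cron job, that at least turns silent bit rot into a loud report.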
A format for archives that is contractually built to last, has an impressive test suite, is easily browsable, lets blobs be retrieved individually if needed, and is already known everywhere?<p>Sounds like SQLite, yet again: <a href="https://www2.sqlite.org/sqlar.html" rel="nofollow">https://www2.sqlite.org/sqlar.html</a>
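A minimal sketch of reading an sqlar archive with nothing but the standard library; the table layout comes from the spec linked above (blobs are zlib-compressed unless storing them uncompressed was smaller, in which case sz equals the stored length), and the file names here are placeholders:<p><pre><code>import sqlite3
import zlib

con = sqlite3.connect("archive.sqlar")

# Browse the contents: just a SQL query.
for name, sz in con.execute("SELECT name, sz FROM sqlar"):
    print(f"{sz:10d}  {name}")

# Retrieve a single blob by name.
name = "docs/readme.txt"
sz, data = con.execute(
    "SELECT sz, data FROM sqlar WHERE name = ?", (name,)
).fetchone()
contents = data if len(data) == sz else zlib.decompress(data)
</code></pre>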
I routinely distribute a ~5 MB .xz (~20 MB uncompressed) to 20k+ servers across multiple physical data centers on a regular basis. I haven't seen a single failure. It ends up being 1.85 MB smaller than the tgz version. Unless someone comes up with a better solution (i.e., smaller), I probably won't change that any time soon.
I'd recommend checking out zpaq[1]; it's purpose-built for backups and has great compression (even on low settings) for large 100GB+ file collections. However, for smaller stuff I use zstd at level 22 in a tar for most things, since it's much faster, though the output is a little larger.<p>[1] <a href="http://mattmahoney.net/dc/zpaq.html" rel="nofollow">http://mattmahoney.net/dc/zpaq.html</a>
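For the "zstd at level 22 in a tar" case, a sketch using the third-party zstandard Python bindings (pip install zstandard); the directory and file names are placeholders:<p><pre><code>import tarfile
import zstandard

# Stream a directory into a .tar.zst at zstd's maximum level (22).
cctx = zstandard.ZstdCompressor(level=22)
with open("project.tar.zst", "wb") as raw:
    with cctx.stream_writer(raw) as compressed:
        with tarfile.open(mode="w|", fileobj=compressed) as tar:
            tar.add("project/")

# Decompression stays fast regardless of the level used to compress.
dctx = zstandard.ZstdDecompressor()
with open("project.tar.zst", "rb") as raw:
    with dctx.stream_reader(raw) as decompressed:
        with tarfile.open(mode="r|", fileobj=decompressed) as tar:
            tar.extractall("restored/")
</code></pre>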
One thing that seems to be unmentioned so far in the conversation: xz is public domain, while lzip is subject to the full-blown GPL (v2 or later).<p>In any case, I don't really bother with compression for my own archival needs. Storage is cheap, and encrypted data is kinda hard to reasonably compress anyway.
This is fairly old. When it came up last time, there were robust arguments that xz was characterized unfairly, and that the author’s format wasn’t very good at recovering in most cases either.
How about just not compressing things for archival? A few bit errors in uncompressed files would end up as just a few bad characters, whereas a few errors in an uncorrectable compression format might render the entire content useless. Sure, the files are huge, but we're talking about long-term archival. In fact, if the documents are that important, have RAID-style redundancy and multiple-bit ECC in multiple geographic locations as well.
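A toy illustration of the difference, using only the standard library: flip one bit in plain text and you lose one character; flip one bit in an xz stream and decompression typically fails its integrity check outright.<p><pre><code>import lzma

text = ("An important archival record. " * 1000).encode()

# One flipped bit in the raw text: a single garbled character, rest intact.
damaged_plain = bytearray(text)
damaged_plain[5000] ^= 0x01
print(damaged_plain[4990:5020])  # still readable around the damage

# One flipped bit in the compressed stream: decompression raises an error
# (or at best produces output that fails the embedded check).
compressed = bytearray(lzma.compress(text))
compressed[len(compressed) // 2] ^= 0x01
try:
    lzma.decompress(bytes(compressed))
except lzma.LZMAError as exc:
    print("decompression failed:", exc)
</code></pre>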
Could anyone recommend a broad-scoped evaluation of current compression/archiving formats and algorithms that explores their various merits and failings?
xz got a bit of hype about 10 years ago. I used it until a couple of years ago, when I noticed how slow it was with huge DB dumps and how much faster zstd was while still having decent compression.<p>So I have no idea about all this low-level stuff; I just know that zstd is overall better for sysadmins.<p>But next time I'm doing any sort of scripting that involves compression I'll take a look at squashfs, thanks to this thread.
Here's a thought: vinyl.<p>While I haven't done intensive research on this, it occurs to me that plastic lasts a long time. Vinyl records are a format that seems fit for long-term archiving. The format is so obvious that it could be reverse engineered by any future civilization.<p>So at least they'll know something about our taste in music.
About 4 years ago I had to choose a compression format for streaming database backups, so I compared every option supported by 7z, and xz was the best compromise between performance and compression ratio.
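For anyone wanting to repeat that kind of comparison, a rough harness using just the codecs in the Python standard library (swap in your own sample data; timings and ratios will differ from 7z's implementations):<p><pre><code>import bz2
import lzma
import time
import zlib

with open("sample.dump", "rb") as f:  # placeholder sample file
    data = f.read()

codecs = {
    "zlib (gzip-like)": lambda d: zlib.compress(d, 9),
    "bz2": lambda d: bz2.compress(d, 9),
    "xz/lzma": lambda d: lzma.compress(d, preset=6),
}

# Report compression ratio and wall-clock time for each codec.
for name, compress in codecs.items():
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    print(f"{name:18s} ratio={len(data) / len(out):5.2f} time={elapsed:6.2f}s")
</code></pre>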
This serves as another example to me that governance and conflict resolution in the Debian project are really poor.<p>Maintainers are free to do whatever they want, even if it doesn't make any sense at all.