We're talking about long-term archiving here. That means centuries.<p>My brother, an archaeological archivist of ancient (~2000 BCE) Mesopotamian artifacts, has a lot to say about archival formats. His raw material is mostly fired clay tablets. Those archives keep working, partially, even if broken. That's good, because many of them are in fact broken when found.<p>But their ordering and other metadata about where they were found is written in archaeologists' notebooks, and many of those notebooks are now over a century old. Paper deteriorates. If a lost flake of paper from a page in the notebook rendered the whole notebook useless, that would be a disastrous outcome for that archive.<p>A decade ago I suggested digitizing the notebooks and storing the bits on CD-ROMs. He laughed, saying "we don't know enough about the long-term viability of CD-ROMs and their readers."<p>Now, when they get around to it, they're digitizing the notebooks and storing them as PDFs on backed-up RAID10 volumes. But they're also printing them on acid-free paper and putting them with the originals in the vaults where they store old books.<p>My point: planning for centuries-long archiving is difficult. Formats with redundancy, or at least forward error correction codes, are very helpful. Formats that can be rendered useless by a few bit-flips, not so much.
Correct, xz is no longer particularly useful, mostly annoying.<p>For read-only, long-term, filesystem-like archives, use squashfs with whatever compression option is convenient. The cool thing about squashfs is that you can efficiently list the contents and extract single files. That makes it about the best option there is for long-term, filesystem-like archives.<p><a href="https://en.m.wikipedia.org/wiki/SquashFS" rel="nofollow">https://en.m.wikipedia.org/wiki/SquashFS</a><p>For everything else, or if you want (very) fast compression/decompression and/or general high-ratio compression, use zstd.<p><a href="https://github.com/facebook/zstd" rel="nofollow">https://github.com/facebook/zstd</a><p>I've used this combination professionally to great effect. You're welcome :)
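A rough sketch of the squashfs workflow mentioned above, assuming squashfs-tools (mksquashfs/unsquashfs) is installed and new enough to support zstd compression; the directory and file names are placeholders, and you should check the flags against your installed version:<p><pre><code>import subprocess

# Build a read-only, zstd-compressed image from a directory tree.
subprocess.run(
    ["mksquashfs", "photos/", "photos.squashfs", "-comp", "zstd"],
    check=True,
)

# List the contents without unpacking anything.
subprocess.run(["unsquashfs", "-l", "photos.squashfs"], check=True)

# Extract a single file (path is relative to the image root); only the
# blocks that file needs get decompressed.
subprocess.run(
    ["unsquashfs", "-d", "restored/", "photos.squashfs", "2023/trip/img_0001.jpg"],
    check=True,
)
</code></pre><p>You can also loop-mount the image and browse it like any other read-only filesystem.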
It was really disappointing when dpkg actively deprecated support (which I feel they should <i>never</i> do) for lzma-format archives and went all-in on xz. The decompressor now needs the annoying flexibility mentioned in this article, and the only benefit of the format--random access into the file--is almost entirely defeated by dpkg using it to compress a tar file, which barely supports any form of accelerated access even when uncompressed: the best you can do is skip through file headers, which only helps if the files in the archive are large enough. To add insult to injury, the files are now all slightly larger to account for the extra headers :/.<p>Regardless, this is a pretty old article, and if you search for it you will find a number of earlier discussions of it, each with a bunch of comments.<p><a href="https://news.ycombinator.com/item?id=20103255" rel="nofollow">https://news.ycombinator.com/item?id=20103255</a><p><a href="https://news.ycombinator.com/item?id=16884832" rel="nofollow">https://news.ycombinator.com/item?id=16884832</a><p><a href="https://news.ycombinator.com/item?id=12768425" rel="nofollow">https://news.ycombinator.com/item?id=12768425</a>
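To illustrate the "skip through file headers" point above, a minimal sketch with Python's tarfile module; the archive and member names are made up for illustration. Even on an uncompressed tar, locating one member means walking every header before it:<p><pre><code>import tarfile

# Each next() reads one 512-byte header and seeks past that member's data,
# so lookup is linear in the number of members -- there is no index.
with tarfile.open("some-archive.tar") as tar:
    member = tar.next()
    while member is not None and member.name != "path/inside/archive.txt":
        member = tar.next()
    if member is not None:
        data = tar.extractfile(member).read()
</code></pre>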
I disagree with the premise of the article. Archive formats are all inadequate for long-term resilience and making them adequate would be a violation of the “do one thing and do it right” principle.<p>To support resilience, you don’t need an alternative to xz, you need hashes and forward error correction. Specifically, compress your file using xz for high compression ratio, optionally encrypt it, then take a SHA-256 hash to be used for <i>detecting</i> errors, then generate parity files using PAR[1] or zfec[2] to be used for <i>correcting</i> errors.<p>[1] <a href="https://wiki.archlinux.org/title/Parchive" rel="nofollow">https://wiki.archlinux.org/title/Parchive</a><p>[2] <a href="https://github.com/tahoe-lafs/zfec" rel="nofollow">https://github.com/tahoe-lafs/zfec</a>
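A rough sketch of that pipeline in Python, using only the standard library for compression and hashing and shelling out to par2 for the parity step; the file names are placeholders, and the par2 flags should be checked against your installed version (or swap in zfec):<p><pre><code>import hashlib
import lzma
import subprocess

# 1. Compress for ratio (xz container via liblzma).
with open("notes.tar", "rb") as f:
    compressed = lzma.compress(f.read(), preset=9)
with open("notes.tar.xz", "wb") as f:
    f.write(compressed)

# 2. Record a hash for *detecting* corruption later.
digest = hashlib.sha256(compressed).hexdigest()
with open("notes.tar.xz.sha256", "w") as f:
    f.write(f"{digest}  notes.tar.xz\n")

# 3. Generate parity files for *correcting* corruption
#    (-r10 asks for roughly 10% redundancy).
subprocess.run(["par2", "create", "-r10", "notes.tar.xz"], check=True)
</code></pre><p>The nice property is that each layer does one job: xz only compresses, the hash only detects, and the parity files only repair.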
Folks seem to be comparing xz to zstd, but if I'm understanding correctly, the true competitor to xz is the article author's “lzip” format, which uses the same LZMA compression as xz but with a much better designed container format (at least according to the author).
The vast majority of the discussion is around xz's inability to deal with corrupted data. That said, couldn't you argue that this needs to be solved at a lower level (storage, transport)? I'm not convinced the compression algorithm is the right place to tackle it.<p>Just use a file system that does proper integrity checking/resilvering. Also use TLS to transfer data over the network.
The article spends a lot of time discussing XZ's behavior when reading corrupt archives, but in practice this is not a case that will ever happen.<p>Say you have a file `m4-1.4.19.tar.xz` in your source archive collection. The version I just downloaded from ftp.gnu.org has the SHA-256 checksum `63aede5c6d33b6d9b13511cd0be2cac046f2e70fd0a07aa9573a04a82783af96`. The GNU site doesn't have checksums for archives, but it does have a PGP signature file, so it's possible to (1) verify the archive signature and (2) store the checksum of the file.<p>If that file is later corrupted, the corruption will be detected via the SHA-256 checksum. There's no reason to worry about whether the file itself has an embedded CRC-whatever, or the behavior with regards to variable-length integer fields. If a bit gets flipped then the checksum won't match, and that copy of the file will be discarded (= replaced with a fresh copy from replicated storage) before it ever hits the XZ code.<p>If this is how the lzip format was designed -- optimized for a use case that doesn't exist -- it's no wonder it sees basically no adoption in favor of more pragmatic formats like XZ or Zstd.
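For what it's worth, the "check it before it ever hits the XZ code" step is tiny — a sketch using the filename and checksum quoted above:<p><pre><code>import hashlib
import sys

EXPECTED = "63aede5c6d33b6d9b13511cd0be2cac046f2e70fd0a07aa9573a04a82783af96"

# Hash the archive in chunks so large files don't need to fit in memory.
h = hashlib.sha256()
with open("m4-1.4.19.tar.xz", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        h.update(chunk)

if h.hexdigest() != EXPECTED:
    sys.exit("checksum mismatch: fetch a fresh copy before decompressing")
</code></pre>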
Archival formats have always been of interest to me, given the very practical need to store a large number of backups across any number of storage media - documents, pictures, music, sometimes particularly good movies, even the occasional software or game installer.<p>Right now, I've personally settled on using the 7z format: <a href="https://en.wikipedia.org/wiki/7z" rel="nofollow">https://en.wikipedia.org/wiki/7z</a><p>The decompression speeds feel good, the compression ratios also seem better than ZIP, and somehow it still feels like a widely supported format, with the 7-Zip program in particular being nice to use: <a href="https://en.wikipedia.org/wiki/7-Zip" rel="nofollow">https://en.wikipedia.org/wiki/7-Zip</a><p>Of course, various archivers on *nix systems also seem to support it, so so far everything feels good. Though the chance of an archive getting corrupted and no longer decompressing properly, losing all of those files at once (versus just using the filesystem, where something like that might only affect a single file), still sometimes bothers me.<p>Then again, on a certain level, I guess nothing is permanent, and at least it's possible to occasionally test the archives for errors and look into restoring them from backups, should something like that ever occur. Might just have to automate those tests, though.<p>Yet, for the most part, going with an exceedingly boring option like that seems like a good idea, though the space could definitely use more projects and new algorithms for even better compression ratios, so at the very least it's nice to see attempts to do so!
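A sketch of what automating those tests could look like, assuming the 7z command-line tool is on PATH (its "t" command runs an integrity test) and the backup directory is a made-up path:<p><pre><code>import pathlib
import subprocess

# Run 7-Zip's integrity test over every .7z archive found, and collect
# the ones that fail so they can be restored from a backup copy.
bad = []
for archive in pathlib.Path("/backups").rglob("*.7z"):
    result = subprocess.run(["7z", "t", str(archive)], capture_output=True)
    if result.returncode != 0:
        bad.append(archive)

for archive in bad:
    print(f"integrity test failed: {archive}")
</code></pre><p>Dropped into a cron job, that at least turns silent bit rot into a loud report.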
A format for archives that is contractually built to last, has an impressive test suite, is easily browsable, lets blobs be retrieved individually if needed, and is already known everywhere?<p>Sounds like SQLite, yet again: <a href="https://www2.sqlite.org/sqlar.html" rel="nofollow">https://www2.sqlite.org/sqlar.html</a>
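A minimal sketch of reading an sqlar archive with nothing but the standard library; the table layout comes from the spec linked above (blobs are zlib-compressed unless storing them uncompressed was smaller, in which case sz equals the stored length), and the file names here are placeholders:<p><pre><code>import sqlite3
import zlib

con = sqlite3.connect("archive.sqlar")

# Browse the contents: just a SQL query.
for name, sz in con.execute("SELECT name, sz FROM sqlar"):
    print(f"{sz:10d}  {name}")

# Retrieve a single blob by name.
name = "docs/readme.txt"
sz, data = con.execute(
    "SELECT sz, data FROM sqlar WHERE name = ?", (name,)
).fetchone()
contents = data if len(data) == sz else zlib.decompress(data)
</code></pre>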
I routinely distribute a ~5 MB .xz (~20 MB uncompressed) to 20k+ servers across multiple physical data centers on a regular basis. I haven't seen a single failure. It ends up being 1.85 MB smaller than the tgz version. Unless someone comes up with a better solution (i.e., smaller), I probably won't change that any time soon.
I'd recommend checking out zpaq[1]; it's purpose-built for backups and has great compression (even on low settings) for large 100GB+ file collections. However, for smaller stuff I use zstd at level 22 in a tar for most things, since it's much faster, though the output is a little larger.<p>[1] <a href="http://mattmahoney.net/dc/zpaq.html" rel="nofollow">http://mattmahoney.net/dc/zpaq.html</a>
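For the "zstd at level 22 in a tar" case, a sketch using the third-party zstandard Python bindings (pip install zstandard); the directory and file names are placeholders:<p><pre><code>import tarfile
import zstandard

# Stream a directory into a .tar.zst at zstd's maximum level (22).
cctx = zstandard.ZstdCompressor(level=22)
with open("project.tar.zst", "wb") as raw:
    with cctx.stream_writer(raw) as compressed:
        with tarfile.open(mode="w|", fileobj=compressed) as tar:
            tar.add("project/")

# Decompression stays fast regardless of the level used to compress.
dctx = zstandard.ZstdDecompressor()
with open("project.tar.zst", "rb") as raw:
    with dctx.stream_reader(raw) as decompressed:
        with tarfile.open(mode="r|", fileobj=decompressed) as tar:
            tar.extractall("restored/")
</code></pre>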
One thing that seems to be unmentioned so far in the conversation: xz is public domain, while lzip is subject to the full-blown GPL (v2 or later).<p>In any case, I don't really bother with compression for my own archival needs. Storage is cheap, and encrypted data is kinda hard to reasonably compress anyway.
This is fairly old. When it came up last time, there were robust arguments that xz was characterized unfairly, and that the author’s format wasn’t very good at recovering in most cases either.
How about just not compressing things for archival? A few bit errors in uncompressed files would end up as just a few bad characters, whereas a few errors in an uncorrectable compression format might render the entire content useless. Sure, the files are huge, but we're talking about long-term archival. In fact, if the documents are that important, have RAID-style redundancy and multiple-bit ECC in multiple geographic locations as well.
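A toy illustration of the difference, using only the standard library: flip one bit in plain text and you lose one character; flip one bit in an xz stream and decompression typically fails its integrity check outright.<p><pre><code>import lzma

text = ("An important archival record. " * 1000).encode()

# One flipped bit in the raw text: a single garbled character, rest intact.
damaged_plain = bytearray(text)
damaged_plain[5000] ^= 0x01
print(damaged_plain[4990:5020])  # still readable around the damage

# One flipped bit in the compressed stream: decompression raises an error
# (or at best produces output that fails the embedded check).
compressed = bytearray(lzma.compress(text))
compressed[len(compressed) // 2] ^= 0x01
try:
    lzma.decompress(bytes(compressed))
except lzma.LZMAError as exc:
    print("decompression failed:", exc)
</code></pre>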
Could anyone recommend a broad-scoped evaluation of current compression/archiving formats and algorithms that explores their various merits and failings?
xz got a bit of hype about 10 years ago. I used it until a couple of years ago, when I noticed how slow it was with huge DB dumps and how much faster zstd was while still having decent compression.<p>So I have no idea about all this low-level stuff; I just know that zstd is overall better for sysadmins.<p>But next time I'm doing any sort of scripting that involves compression I'll take a look at squashfs, thanks to this thread.
Here's a thought: vinyl.<p>While I haven't done intensive research on this, it occurs to me that plastic lasts a long time. Vinyl records are a format that seems fit for long-term archiving. The format is so obvious that it could be reverse engineered by any future civilization.<p>So at least they'll know something about our taste in music.
About 4 years ago I had to choose a compression format for streaming database backups, so I compared every option supported by 7z, and xz was the best compromise between performance and compression ratio.
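For anyone wanting to repeat that kind of comparison, a rough harness using just the codecs in the Python standard library (swap in your own sample data; timings and ratios will differ from 7z's implementations):<p><pre><code>import bz2
import lzma
import time
import zlib

with open("sample.dump", "rb") as f:  # placeholder sample file
    data = f.read()

codecs = {
    "zlib (gzip-like)": lambda d: zlib.compress(d, 9),
    "bz2": lambda d: bz2.compress(d, 9),
    "xz/lzma": lambda d: lzma.compress(d, preset=6),
}

# Report compression ratio and wall-clock time for each codec.
for name, compress in codecs.items():
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    print(f"{name:18s} ratio={len(data) / len(out):5.2f} time={elapsed:6.2f}s")
</code></pre>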
This serves as another example to me that governance and conflict resolution in the Debian project are really poor.<p>Maintainers are free to do whatever they want, even if it doesn't make any sense at all.