There are compressors specialized for FASTQ that are faster and denser than zstd. FASTQ is the most common format for storing DNA sequencing data: a text file including metadata, sequence, confidence scores, etc. <a href="http://kirr.dyndns.org/sequence-compression-benchmark/" rel="nofollow">http://kirr.dyndns.org/sequence-compression-benchmark/</a><p>fastqz (not the best name!) supports reference-based compression too, which yields smaller files when the reads are stored relative to a reference genome.<p>Here's a 2013 paper considering the problem: <a href="https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0059190" rel="nofollow">https://journals.plos.org/plosone/article?id=10.1371/journal...</a><p>Still, zstd is widely available and a simple drop-in replacement for gzip.
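As a rough illustration of the drop-in usage (filenames here are hypothetical), the zstd command line mirrors gzip's:<p><pre><code>  # example file names only
  zstd -k reads.fastq            # compress, keep the original -> reads.fastq.zst
  zstd -d reads.fastq.zst        # decompress (or: unzstd reads.fastq.zst)
  zstd -dc reads.fastq.zst | head   # stream to stdout, gzip -dc style</code></pre>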
Yep, I've been advocating zstd since I started working in the field. It compresses and decompresses so much faster than xz and is much much more compact than gzip.
All public SARS-CoV-2 consensus sequences are ~300GB uncompressed. If you compress using gzip you end up with ~30GB. With xz/zstd you can get it down to ~2GB. However xz takes ~40min to uncompress, whereas zstd can do it in ~8min.
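For anyone who wants to reproduce that kind of comparison on their own data, it's just a matter of timing the decompression step; a sketch with hypothetical filenames:<p><pre><code>  # hypothetical file names
  time xz -dk sequences.fasta.xz
  time zstd -dk sequences.fasta.zst</code></pre>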
Genome storage and compression is interesting. Most long sequences have a lot of internal redundancy/repetition that can be effectively compressed by standard algorithms, but something tuned specifically to the task can do better. Then there is the question of storing many sequences, for example whole-genome sequences of thousands of bacteria from an experiment, each almost identical to the others. This benefits from an algorithm that compresses efficiently in terms of space but also allows any part to be recalled random-access reasonably efficiently in terms of time. It'd also be nice to be able to add more genes to the database without having to re-compress the entire thing.<p>Wikipedia has a page on the topic: <a href="https://en.wikipedia.org/wiki/Compression_of_genomic_sequencing_data" rel="nofollow">https://en.wikipedia.org/wiki/Compression_of_genomic_sequenc...</a>
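One existing zstd feature that gets partway there is dictionary training: train a dictionary on a sample of the genomes, then compress each genome as its own file against that dictionary, so individual genomes stay randomly accessible and new ones can be added without recompressing the rest. A hedged sketch, with hypothetical file names:<p><pre><code>  # hypothetical file layout
  zstd --train genomes/*.fa -o bacteria.dict      # train a shared dictionary
  zstd -D bacteria.dict genomes/sample001.fa      # compress one genome against it
  zstd -D bacteria.dict -d genomes/sample001.fa.zst   # decompress just that genome</code></pre>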
I have some questions for folks who are working in this field. In particular, are you holding onto your FASTQs for a long time (> 1 year), and if so, why? Is max compression or ease of analysis more important (i.e., do you access the data during the retention period, do you ETL it to another format, etc.)? How much of your budget goes to storage costs that could be affected by compression?<p>I'm curious because in the past I've seen that people are very cost-sensitive but also want to keep lots of data that seems to go mostly unused.
I'm not sure why gzip still pops up for FASTQ data, as it is quite easy to bin the quality scores, align the reads against a reference genome and compress the result as e.g. CRAM [1,2].<p>With 8 bins, the variant calling accuracy seems to be preserved, while drastically reducing the file size.<p>[1]: <a href="https://en.wikipedia.org/wiki/CRAM_%28file_format%29" rel="nofollow">https://en.wikipedia.org/wiki/CRAM_%28file_format%29</a><p>[2]: <a href="https://lh3.github.io/2020/05/25/format-quality-binning-and-file-sizes" rel="nofollow">https://lh3.github.io/2020/05/25/format-quality-binning-and-...</a>
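For reference, the CRAM conversion itself is a one-liner with samtools (file and reference names are hypothetical; the lossy quality binning is a separate, optional step):<p><pre><code>  # hypothetical file names; BAM to CRAM, compressed against the reference
  samtools view -C -T ref.fa -o sample.cram sample.bam</code></pre>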
I'm unfamiliar with the bioinformatics scene, but I can imagine there's a lot of "legacy" hardware and software lying around that can't easily be updated to support superior formats.<p>Perhaps an interim solution would be something like "zstd+metadata", where the metadata is sufficient to transparently and efficiently reconstruct a gzip-compressed file on-demand. (similarly to how JPEG-XL allows oldschool JPEGs to be recompressed without losing any data)<p>This could have a bit of compute overhead, but since gzip decompression is so single-threaded I think you could do the conversion in parallel without actually slowing down the hot path. So, the performance would be approximately the same as using gzip (assuming decompression is compute-bound, not io-bound), but with all the storage benefits of using zstd.
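In the bioinformatics case there's a partial version of this already: bgzip output is valid gzip, so a zstd archive could be re-expanded into a gzip-compatible stream on demand for legacy consumers. This sketch doesn't reproduce the original .gz byte-for-byte (that's where the stored metadata would come in), and the filenames are hypothetical:<p><pre><code>  # hypothetical file names; serve a gzip-compatible stream from the zstd archive
  zstd -dc reads.fastq.zst | bgzip -@ 8 -c > reads.fastq.gz</code></pre>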
The best storage format for genomics data is DNA. At some point in the future, it might actually be cheaper to just re-sequence than to store the fastq. Instead of optimizing the digital infrastructure, it might be better to just optimize the lab infrastructure for storing the physical DNA.
One thing that's missing from the article when comparing to `pigz` is that you can use the `-T0` flag in `zstd` and it will parallelize according to the number of CPUs. On some limited benchmarks, I found it to be much faster and worth using. Some `zstd` installations come with `zstdmt`[0] as an alias to `zstd -T0`.<p>[0]: <a href="https://manpages.debian.org/bullseye/zstd/zstdmt.1.en.html" rel="nofollow">https://manpages.debian.org/bullseye/zstd/zstdmt.1.en.html</a>
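For example (level and filename are just placeholders):<p><pre><code>  # hypothetical file name
  zstd -T0 -19 reads.fastq       # use all cores
  zstdmt -19 reads.fastq         # equivalent, where the alias is installed</code></pre>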
Glad to see zstd getting some love here. At a previous job I needed to capture and store large amounts of JSON about social media. I did an extensive compression bake-off taking into account both compute and storage costs, and for us the winner was zstd with the compression dial turned up a fair bit. (Sorry, I don't have the numbers here; they're all in the hands of the previous employer. But I think we settled on -13 as cost-optimal for data stored on AWS for a year.)
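zstd's built-in benchmark mode is handy for this kind of bake-off; something like the following sweeps a range of levels on a sample file (filename hypothetical):<p><pre><code>  # hypothetical file name; benchmark levels 3 through 19
  zstd -b3 -e19 sample.json</code></pre>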
Who would implement this suggestion? Is it an appeal to folks at large writing tools that interact with genomics data?<p>Due to higher decompression cost, the opportunity here seems localized to long-term storage. It feels like it would make more sense as a project (or product!) that implemented an efficient long-term archive (perhaps with a less compressed LRU cache in front).
zstd -10 is so very fast and so much better than gz that I am surprised every time I find someone still using gz for large files.<p>When I can get away with -16 for large file long term storage, I use it.
I'm not in the bioinformatics domain, but maybe there are some legacy tools that depend on some specifics of gzip.<p>I was thinking of something like the ability to seek through a compressed file (after having built an index). Your DNA file is probably stored on a shared folder somewhere (if it's big you probably don't want to copy it to every workstation), and you point your software at the gz file directly; it creates a seek index on first use, and then when you need to view only a small section of the file you don't have to extract the whole thing.<p>More advanced compressors like zstd may rely on a much larger window or an explicit dictionary for compression/decompression (the LZ77 scheme used by gzip looks back over only a 32 KB sliding window, whereas zstd's window is megabytes by default and can reach gigabytes with --long), and that extra state may be something you have to transfer.
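For what it's worth, this seek-with-an-index pattern already exists in the bioinformatics gzip world: bgzip writes gzip-compatible blocks that tools like samtools can index and query without decompressing the whole file. A sketch with hypothetical filenames:<p><pre><code>  # hypothetical file names
  bgzip genome.fa                                  # block-gzip; output is still gzip-readable
  samtools faidx genome.fa.gz                      # build the index
  samtools faidx genome.fa.gz chr1:10000-10500     # random access to a small region</code></pre>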
A few years ago I built some tools <a href="https://github.com/tf318/tamtools">https://github.com/tf318/tamtools</a> to store alignments against two different reference assemblies in an efficient way (taking advantage of the fact that the majority of each alignment to different assemblies would in fact be the same, just shifted in position).<p>The intent was to enhance this to store alignments against multiple references as new references are published, and probably to rewrite it in Rust or C rather than the initial Python version.<p>In retrospect I would be interested to know whether this domain-specific compression effort, with zstd applied to the resulting "hybrid" alignment, would be more efficient than just letting zstd do its own thing with a full set of individual alignments against the different references.
zstd has a long range mode, which lets it find redundancies a gigabyte away. Try --long and --long=31 for very long range mode.<p>zstd has delta / patch mode, which creates a file that stores the "patch" to create a new file from an old (reference) file. See <a href="https://github.com/facebook/zstd/wiki/Zstandard-as-a-patching-engine">https://github.com/facebook/zstd/wiki/Zstandard-as-a-patchin...</a><p>See the man page: <a href="https://github.com/facebook/zstd/blob/dev/programs/zstd.1.md">https://github.com/facebook/zstd/blob/dev/programs/zstd.1.md</a>
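In CLI terms (filenames hypothetical; note the decompressor needs the matching --long flag for windows above the default limit):<p><pre><code>  # hypothetical file names
  zstd --long=31 -19 consensus.fasta -o consensus.fasta.zst
  zstd -d --long=31 consensus.fasta.zst</code></pre>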
If you have re-sequencing data of model species (which applies to >80% of generated sequencing data), the storage issue is often solved using the CRAM/BAM formats. The FASTQ can be reconstructed if unmapped reads are stored in the file.<p>More general (pre-alignment) sequence compression methods never really took off (e.g. <a href="https://github.com/BEETL/BEETL">https://github.com/BEETL/BEETL</a>), probably because it helps so much to have a common format that most workflows can start with. Here, replacing gzip with zstd would be the lower-hanging fruit.
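A hedged sketch of reconstructing the FASTQ from a CRAM/BAM (filenames hypothetical; assumes unmapped reads were kept and the CRAM's reference is available):<p><pre><code>  # hypothetical file names; for paired data, collate/name-group the reads first
  samtools fastq sample.cram > sample.fastq
  samtools fastq -1 R1.fastq -2 R2.fastq sample.cram</code></pre>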
Compression of FASTQ files can be greatly improved by sorting/clustering the reads. I use clumpify from BBMap for that. The bad: clumpify does not support zstd at this point.
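A sketch of that pipeline (BBMap-style in=/out= arguments, file names hypothetical; since clumpify doesn't emit zstd, the re-compression is a separate step):<p><pre><code>  # hypothetical file names
  clumpify.sh in=reads.fastq.gz out=clumped.fastq.gz
  zcat clumped.fastq.gz | zstd -T0 -19 -c > clumped.fastq.zst</code></pre>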
Many specialized DNA formats store a delta from a "baseline" reference to the DNA strand actually being described.<p>`zstd` has a `--patch-from` mode, which does essentially the same thing: it compresses a new file using another file (typically an older version) as a reference. When the similarities are large, this leads to a huge reduction in compressed size.<p>I wonder what the performance of `zstd --patch-from=` would be when employing the same reference as these specialized DNA compressors.
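A sketch of that experiment (filenames hypothetical; the zstd docs suggest combining --patch-from with --long for large references, and the same reference must be supplied at decompression time):<p><pre><code>  # hypothetical file names; compress a sample against a shared reference
  zstd --patch-from=reference.fa --long=31 -19 sample.fa -o sample.fa.zst
  # decompression needs the same reference
  zstd -d --patch-from=reference.fa --long=31 sample.fa.zst -o sample.fa</code></pre>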
I don't know anything about DNA files, but I wonder how gzip (deflate) would fare if one of the non-default compression strategies were used.<p>e.g. the RLE strategy is good for data with many compressible patterns, while the filtered or Huffman-only strategies are good for data with few compressible patterns.<p>The default strategy is a balance between the two extremes, which isn't tuned for a specific kind of input.
I'm slightly surprised that speed and file size would be the only considerations. For an archival format wouldn't you at least want a format with very robust error detection, even error correction? None of the proposed formats have this, basically just having a simple 4 byte CRC or XXHASH. Some kind of random access to the compressed file might be useful too (which xz has).
The Zarr format is used in some genomics workflows (see <a href="https://github.com/zarr-developers/community/issues/19">https://github.com/zarr-developers/community/issues/19</a>) and supports a wide range of modern compressors (e.g. Zstd, Zlib, BZ2, LZMA, ZFPY, Blosc, as well as many filters.)
This seems silly. Any company that is in the business of storing tons and tons of DNA data will probably be recompressing it for archive purposes using better algorithms already. If they're not, their loss.<p>For smaller players where maximum tool interoperability is important, gzip seems good enough.
As someone who helped build a fastq data processing flow for a genetics company during the sequencing price drop era, iirc we had issues with data corruption with some of the other formats in tests. Funny to think, hey, I got my first taste of very large data management because of this!
I wonder if making use of zstd's long mode (`--long`) and/or multithreading support (`-T0` to use all threads) would close the displayed performance gap with `pigz`. It doesn't seem like either was used, which is odd considering the comparison being made and the file used.
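A more apples-to-apples comparison would be something along these lines (levels and filename are placeholders):<p><pre><code>  # hypothetical file name
  time pigz -p "$(nproc)" -9 -k reads.fastq
  time zstd -T0 --long -19 -k reads.fastq</code></pre>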
Are these files already just diffs from baseline/standard full DNA scans, or are they including <i>all</i> the data? Seems like most scans will only differ a little…
Is there any good work on NN based compression for DNA?<p>For those not aware one can, in general, use a predictive model to compress data using arithmetic coding.
A much bigger problem than the compression format is the fact that biologists over-sequence. You don't need 500x coverage, trust me, please learn to design your experiments.