There are compressors specialized for FASTQ that are faster and denser than zstd. FASTQ is the most common format for storing DNA sequencing data: a text file including metadata, sequence, confidence scores, etc. <a href="http://kirr.dyndns.org/sequence-compression-benchmark/" rel="nofollow">http://kirr.dyndns.org/sequence-compression-benchmark/</a><p>fastqz (not the best name!) supports reference-based compression too, which yields smaller files when the reads are stored relative to a reference genome.<p>Here's a 2013 paper considering the problem: <a href="https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0059190" rel="nofollow">https://journals.plos.org/plosone/article?id=10.1371/journal...</a><p>Still, zstd is widely available and a simple drop-in replacement for gzip.
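As a rough illustration of the drop-in usage (filenames here are hypothetical), the zstd command line mirrors gzip's:<p><pre><code>  # example file names only
  zstd -k reads.fastq            # compress, keep the original -> reads.fastq.zst
  zstd -d reads.fastq.zst        # decompress (or: unzstd reads.fastq.zst)
  zstd -dc reads.fastq.zst | head   # stream to stdout, gzip -dc style</code></pre>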
Yep, I've been advocating zstd since I started working in the field. It compresses and decompresses so much faster than xz and is much much more compact than gzip.
All public SARS-CoV-2 consensus sequences are ~300GB uncompressed. If you compress using gzip you end up with ~30GB. With xz/zstd you can get it down to ~2GB. However xz takes ~40min to uncompress, whereas zstd can do it in ~8min.
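For anyone who wants to reproduce that kind of comparison on their own data, it's just a matter of timing the decompression step; a sketch with hypothetical filenames:<p><pre><code>  # hypothetical file names
  time xz -dk sequences.fasta.xz
  time zstd -dk sequences.fasta.zst</code></pre>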
Genome storage and compression is interesting. Most long sequences have a lot of internal redundancy/repetition that can be effectively compressed by standard algorithms, but something tuned specifically to the task can do better. Then there is the question of storing many sequences, for example whole-genome sequences of thousands of bacteria from an experiment, each almost identical to the others. This benefits from an algorithm that compresses efficiently in terms of space but also allows any part to be recalled random-access reasonably efficiently in terms of time. It'd also be nice to be able to add more genes to the database without having to re-compress the entire thing.<p>Wikipedia has a page on the topic: <a href="https://en.wikipedia.org/wiki/Compression_of_genomic_sequencing_data" rel="nofollow">https://en.wikipedia.org/wiki/Compression_of_genomic_sequenc...</a>
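One existing zstd feature that gets partway there is dictionary training: train a dictionary on a sample of the genomes, then compress each genome as its own file against that dictionary, so individual genomes stay randomly accessible and new ones can be added without recompressing the rest. A hedged sketch, with hypothetical file names:<p><pre><code>  # hypothetical file layout
  zstd --train genomes/*.fa -o bacteria.dict      # train a shared dictionary
  zstd -D bacteria.dict genomes/sample001.fa      # compress one genome against it
  zstd -D bacteria.dict -d genomes/sample001.fa.zst   # decompress just that genome</code></pre>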
I have some questions for folks who are working in this field. In particular, are you holding onto your FASTQs for a long time (> 1 year), and if so, why? Is max compression or ease of analysis more important (i.e., do you access the data during the retention period, do you ETL it to another format, etc.)? How much of your budget goes to storage costs that could be affected by compression?<p>I'm curious because in the past I've seen that people are very cost-sensitive but also want to keep lots of data that seems to go mostly unused.
I'm not sure why gzip still pops up for FASTQ data, as it is quite easy to bin the quality scores, align the reads against a reference genome and compress the result as e.g. CRAM [1,2].<p>With 8 bins, the variant calling accuracy seems to be preserved, while drastically reducing the file size.<p>[1]: <a href="https://en.wikipedia.org/wiki/CRAM_%28file_format%29" rel="nofollow">https://en.wikipedia.org/wiki/CRAM_%28file_format%29</a><p>[2]: <a href="https://lh3.github.io/2020/05/25/format-quality-binning-and-file-sizes" rel="nofollow">https://lh3.github.io/2020/05/25/format-quality-binning-and-...</a>
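For reference, the CRAM conversion itself is a one-liner with samtools (file and reference names are hypothetical; the lossy quality binning is a separate, optional step):<p><pre><code>  # hypothetical file names; BAM to CRAM, compressed against the reference
  samtools view -C -T ref.fa -o sample.cram sample.bam</code></pre>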
I'm unfamiliar with the bioinformatics scene, but I can imagine there's a lot of "legacy" hardware and software lying around that can't easily be updated to support superior formats.<p>Perhaps an interim solution would be something like "zstd+metadata", where the metadata is sufficient to transparently and efficiently reconstruct a gzip-compressed file on-demand. (similarly to how JPEG-XL allows oldschool JPEGs to be recompressed without losing any data)<p>This could have a bit of compute overhead, but since gzip decompression is so single-threaded I think you could do the conversion in parallel without actually slowing down the hot path. So, the performance would be approximately the same as using gzip (assuming decompression is compute-bound, not io-bound), but with all the storage benefits of using zstd.
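In the bioinformatics case there's a partial version of this already: bgzip output is valid gzip, so a zstd archive could be re-expanded into a gzip-compatible stream on demand for legacy consumers. This sketch doesn't reproduce the original .gz byte-for-byte (that's where the stored metadata would come in), and the filenames are hypothetical:<p><pre><code>  # hypothetical file names; serve a gzip-compatible stream from the zstd archive
  zstd -dc reads.fastq.zst | bgzip -@ 8 -c > reads.fastq.gz</code></pre>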
The best storage format for genomics data is DNA. At some point in the future, it might actually be cheaper to just re-sequence than to store the fastq. Instead of optimizing the digital infrastructure, it might be better to just optimize the lab infrastructure for storing the physical DNA.
One thing that's missing from the article when comparing to `pigz` is that you can use the `-T0` flag in `zstd` and it will parallelize according to the number of CPUs. On some limited benchmarks, I found it to be much faster and worth using. Some `zstd` installations come with `zstdmt`[0] as an alias to `zstd -T0`.<p>[0]: <a href="https://manpages.debian.org/bullseye/zstd/zstdmt.1.en.html" rel="nofollow">https://manpages.debian.org/bullseye/zstd/zstdmt.1.en.html</a>
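For example (level and filename are just placeholders):<p><pre><code>  # hypothetical file name
  zstd -T0 -19 reads.fastq       # use all cores
  zstdmt -19 reads.fastq         # equivalent, where the alias is installed</code></pre>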
Glad to see zstd getting some love here. At a previous job I needed to capture and store large amounts of JSON about social media. I did an extensive compression bake-off taking into account both compute and storage costs, and for us the winner was zstd with the compression dial turned up a fair bit. (Sorry, I don't have the numbers here; they're all in the hands of the previous employer. But I think we settled on -13 as cost-optimal for data stored on AWS for a year.)
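zstd's built-in benchmark mode is handy for this kind of bake-off; something like the following sweeps a range of levels on a sample file (filename hypothetical):<p><pre><code>  # hypothetical file name; benchmark levels 3 through 19
  zstd -b3 -e19 sample.json</code></pre>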
Who would implement this suggestion? Is it an appeal to folks at large writing tools that interact with genomics data?<p>Due to higher decompression cost, the opportunity here seems localized to long-term storage. It feels like it would make more sense as a project (or product!) that implemented an efficient long-term archive (perhaps with a less compressed LRU cache in front).
zstd -10 is so very fast and so much better than gz that I am surprised every time I find someone still using gz for large files.<p>When I can get away with -16 for large file long term storage, I use it.
I'm not in the bioinformatics domain, but maybe there are some legacy tools that depend on some specifics of gzip.<p>I was thinking of something like the ability to seek through a compressed file (after having built an index). Your DNA file is probably stored on a shared folder somewhere (if it's big you probably don't want to copy it to every workstation), and you point your software at the gz file directly; it creates a seek index on first use, and then when you need to view only a small section of the file you don't have to extract the whole thing.<p>More advanced compressors like zstd may rely on a much larger window or an explicit dictionary for compression/decompression (the LZ77 scheme used by gzip looks back over only a 32 KB sliding window, whereas zstd's window is megabytes by default and can reach gigabytes with --long), and that extra state may be something you have to transfer.
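For what it's worth, this seek-with-an-index pattern already exists in the bioinformatics gzip world: bgzip writes gzip-compatible blocks that tools like samtools can index and query without decompressing the whole file. A sketch with hypothetical filenames:<p><pre><code>  # hypothetical file names
  bgzip genome.fa                                  # block-gzip; output is still gzip-readable
  samtools faidx genome.fa.gz                      # build the index
  samtools faidx genome.fa.gz chr1:10000-10500     # random access to a small region</code></pre>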
A few years ago I built some tools <a href="https://github.com/tf318/tamtools">https://github.com/tf318/tamtools</a> to store alignments against two different reference assemblies in an efficient way (taking advantage of the fact that the majority of each alignment to different assemblies would in fact be the same, just shifted in position).<p>The intent was to enhance this to store alignments against multiple references as new references are published, and probably to rewrite it in Rust or C rather than the initial Python version.<p>In retrospect I would be interested to know whether this domain-specific compression effort, with zstd applied to the resulting "hybrid" alignment, would be more efficient than just letting zstd do its own thing with a full set of individual alignments against the different references.
zstd has a long range mode, which lets it find redundancies a gigabyte away. Try --long and --long=31 for very long range mode.<p>zstd has delta / patch mode, which creates a file that stores the "patch" to create a new file from an old (reference) file. See <a href="https://github.com/facebook/zstd/wiki/Zstandard-as-a-patching-engine">https://github.com/facebook/zstd/wiki/Zstandard-as-a-patchin...</a><p>See the man page: <a href="https://github.com/facebook/zstd/blob/dev/programs/zstd.1.md">https://github.com/facebook/zstd/blob/dev/programs/zstd.1.md</a>
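In CLI terms (filenames hypothetical; note the decompressor needs the matching --long flag for windows above the default limit):<p><pre><code>  # hypothetical file names
  zstd --long=31 -19 consensus.fasta -o consensus.fasta.zst
  zstd -d --long=31 consensus.fasta.zst</code></pre>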
If you have re-sequencing data of model species (which applies to >80% of generated sequencing data), the storage issue is often solved using the CRAM/BAM formats. The FASTQ can be reconstructed if unmapped reads are stored in the file.<p>More general (pre-alignment) sequence compression methods never really took off (e.g. <a href="https://github.com/BEETL/BEETL">https://github.com/BEETL/BEETL</a>), probably because it helps so much to have a common format that most workflows can start with. Here, replacing gzip with zstd would be the lower-hanging fruit.
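A hedged sketch of reconstructing the FASTQ from a CRAM/BAM (filenames hypothetical; assumes unmapped reads were kept and the CRAM's reference is available):<p><pre><code>  # hypothetical file names; for paired data, collate/name-group the reads first
  samtools fastq sample.cram > sample.fastq
  samtools fastq -1 R1.fastq -2 R2.fastq sample.cram</code></pre>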
Compression of FASTQ files can be greatly improved by sorting/clustering the reads. I use clumpify from BBMap for that. The bad: clumpify does not support zstd at this point.
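A sketch of that pipeline (BBMap-style in=/out= arguments, file names hypothetical; since clumpify doesn't emit zstd, the re-compression is a separate step):<p><pre><code>  # hypothetical file names
  clumpify.sh in=reads.fastq.gz out=clumped.fastq.gz
  zcat clumped.fastq.gz | zstd -T0 -19 -c > clumped.fastq.zst</code></pre>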
Many specialized DNA formats store a delta from a "baseline" reference to the DNA strand actually being described.<p>`zstd` has a `--patch-from` mode, which does essentially the same thing: it compresses a new file using another file (typically an older version) as a reference. When the similarities are large, this leads to a huge reduction in compressed size.<p>I wonder what the performance of `zstd --patch-from=` would be when employing the same reference as these specialized DNA compressors.
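A sketch of that experiment (filenames hypothetical; the zstd docs suggest combining --patch-from with --long for large references, and the same reference must be supplied at decompression time):<p><pre><code>  # hypothetical file names; compress a sample against a shared reference
  zstd --patch-from=reference.fa --long=31 -19 sample.fa -o sample.fa.zst
  # decompression needs the same reference
  zstd -d --patch-from=reference.fa --long=31 sample.fa.zst -o sample.fa</code></pre>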
I don't know anything about DNA files, but I wonder how gzip (deflate) would fare if one of the non-default compression strategies were used.<p>e.g. the RLE strategy is good for data with many compressible patterns, while the filtered or Huffman-only strategies are good for data with few compressible patterns.<p>The default strategy is a balance between the two extremes, which isn't tuned for a specific kind of input.
I'm slightly surprised that speed and file size would be the only considerations. For an archival format wouldn't you at least want a format with very robust error detection, even error correction? None of the proposed formats have this, basically just having a simple 4 byte CRC or XXHASH. Some kind of random access to the compressed file might be useful too (which xz has).
The Zarr format is used in some genomics workflows (see <a href="https://github.com/zarr-developers/community/issues/19">https://github.com/zarr-developers/community/issues/19</a>) and supports a wide range of modern compressors (e.g. Zstd, Zlib, BZ2, LZMA, ZFPY, Blosc, as well as many filters.)
This seems silly. Any company that is in the business of storing tons and tons of DNA data will probably be recompressing it for archive purposes using better algorithms already. If they're not, their loss.<p>For smaller players where maximum tool interoperability is important, gzip seems good enough.
As someone who helped build a fastq data processing flow for a genetics company during the sequencing price drop era, iirc we had issues with data corruption with some of the other formats in tests. Funny to think, hey, I got my first taste of very large data management because of this!
I wonder if making use of zstd's long mode (`--long`) and/or multithreading support (`-T0` to use all threads) would close the displayed performance gap with `pigz`. It doesn't seem like either was used, which is odd considering the comparison being made and the file used.
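A more apples-to-apples comparison would be something along these lines (levels and filename are placeholders):<p><pre><code>  # hypothetical file name
  time pigz -p "$(nproc)" -9 -k reads.fastq
  time zstd -T0 --long -19 -k reads.fastq</code></pre>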
Are these files already just diffs from baseline/standard full DNA scans, or are they including <i>all</i> the data? Seems like most scans will only differ a little…
Is there any good work on NN based compression for DNA?<p>For those not aware one can, in general, use a predictive model to compress data using arithmetic coding.
A much bigger problem than the compression format is the fact that biologists over-sequence. You don't need 500x coverage, trust me, please learn to design your experiments.