> I started working on DwarFS in 2013 and my main use case and major motivation was that I had several hundred different versions of Perl that were taking up something around 30 gigabytes of disk space, and I was unwilling to spend more than 10% of my hard drive keeping them around for when I happened to need them.<p>It fills me with joy that someone has been coding a fs for 7 years because Perl installs were taking up too much space. Necessity is the mother of invention.
It looks like the benefit is some kind of block or file deduplication.<p>@OP: Can you please explain why you keep 50 gigs of perl around? :-)<p>I use compressed read-only file systems all the time to save space on my travel laptop. I have one squashfs for firefox, one for the TeX base install, one for LLVM, one for qemu, one for my cross compiler collection. I suspect the gains over squashfs will be far less pronounced than for the pathological "400 Perl versions" case.
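The dedup intuition above is easy to sanity-check on your own data: hash every file and group by digest to get a lower bound on what file-level deduplication could reclaim. A rough sketch (DwarFS reportedly also exploits redundancy below whole-file granularity, so its savings can be larger than what this reports):

```python
import hashlib
import os
from collections import defaultdict

def duplicate_report(root):
    """Group files under `root` by SHA-256 digest and estimate how many
    bytes file-level deduplication could reclaim (keep one copy per
    unique content)."""
    by_digest = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            by_digest[h.hexdigest()].append(path)
    redundant = sum(
        os.path.getsize(paths[0]) * (len(paths) - 1)
        for paths in by_digest.values()
        if len(paths) > 1
    )
    return by_digest, redundant
```

Running this over a tree of near-identical Perl installs would show how much of the 30GB is literal whole-file duplication versus similar-but-not-identical content that needs smarter compression.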
Whew! It was easy to find out how you actually initialize this thing, given that it's read-only:<p><a href="https://github.com/mhx/dwarfs/blob/main/man/mkdwarfs.md" rel="nofollow">https://github.com/mhx/dwarfs/blob/main/man/mkdwarfs.md</a>
Perhaps not strictly on-topic, but is there any equivalent FS/program in Windows that will allow users to have read-only access to files that are deduplicated in some way?<p>My use case is the MAME console archives, which are now full of copies of games from different localisations with 99% identical content. 7Z will compress them together and deduplicate, but breaks once the archive exceeds a few gigs.<p>These archives are already compressed (CHD format, which is 7Z + FLAC for ISOs), but what I'm struggling with is deduplication on top of these already-compressed files.<p>Sorry for the off-topic ask!
Neat! I'd like to see benchmarks for more typical squashfs payloads: embedded root filesystems totalling under 100MB. Small docker images like alpine would be a decent proxy. The given corpus of thousands of Perl versions is more appropriate for comparison against git.
I wish there were a semi-compressed transparent filesystem layer that slowly compresses the least recently used files in the background and un-compresses files upon use. That way you could store much more mostly-unused content than you have space on disk, without sacrificing accessibility.
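Existing filesystems like btrfs and ZFS offer transparent compression, but not this kind of recency-based tiering. The compression half of the idea, at least, is easy to approximate with a periodic job. A minimal sketch, assuming atimes are tracked; the 30-day threshold and the `foo` → `foo.gz` convention are illustrative, and a truly transparent version would need a FUSE layer to decompress on open:

```python
import gzip
import os
import shutil
import time

# Threshold is an assumption; tune to taste.
STALE_AFTER = 30 * 24 * 3600  # 30 days in seconds

def compress_stale(root, stale_after=STALE_AFTER):
    """gzip regular files whose atime is older than `stale_after`,
    replacing `foo` with `foo.gz`. This only handles the background
    compression half; decompress-on-open would need a FUSE layer."""
    now = time.time()
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".gz"):
                continue  # already compressed on a previous pass
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            if now - os.stat(path).st_atime < stale_after:
                continue  # recently used, leave it alone
            with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
                shutil.copyfileobj(src, dst)
            os.remove(path)
```

Run from cron, this gets you the "slowly compress in the background" behavior; the accessibility half is the hard part.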
mksquashfs supports gzip, xz, lzo, lz4 and zstd too; you can also compile it to use any of those as the default instead of gzip.<p>Does the performance benchmark show DwarFS versus single-threaded gzip-compressed SquashFS?
Is this viable as a backup/archive format? Would it make sense to e.g. have an incremental backup as a DwarFS file, referring to the base backup in another DwarFS file?
This could be awesome for compressing Docker image layers. After all, they can be huge (hundreds of MB) and, if the Dockerfile is well organized, each step should contain a fairly homogeneous set of files (apt-get artifacts, for example).
It would be amazing to see this work on OpenWRT; I think it would fit perfectly, using fewer resources than squashfs.
Another good fit would be a Raspberry Pi, for scenarios where power can be cut at any time.
Does anyone remember back in the 90s when we'd install DoubleSpace to get on-the-fly compression? And then they built it into MS-DOS 6 and that was a major game changer?
Oh wow. This would be excellent for language dependencies - ruby gems, node_modules, etc. Integrating this with something like pnpm [1], which already keeps a global store of dependencies, would be excellent.
[1] - <a href="https://pnpm.js.org" rel="nofollow">https://pnpm.js.org</a>
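For context, pnpm's space savings come from a content-addressed store plus hard links, which is complementary to what a compressed image would add. A simplified sketch of the idea (the flat digest-keyed store layout here is invented for illustration and is not pnpm's actual format):

```python
import hashlib
import os
import shutil

def install_via_store(src_file, store_dir, dest_file):
    """pnpm-style content addressing, simplified: keep one copy per
    unique content hash in `store_dir` and hard-link each install to
    it, so identical files across projects cost no extra data blocks."""
    with open(src_file, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    os.makedirs(store_dir, exist_ok=True)
    stored = os.path.join(store_dir, digest)
    if not os.path.exists(stored):
        shutil.copy2(src_file, stored)  # first time we see this content
    os.link(stored, dest_file)  # same inode, zero additional space
```

A DwarFS-style layer on top would then also compress the unique copies themselves, not just collapse exact duplicates.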
So I tried it out on my 17GB of Perl builds (just on my laptop, not on my big machine).<p>mkdwarfs crashed on recursive links (1-level, just pointing to itself), and also when I removed directories that were part of the input path while mkdwarfs was running. Which is fair, I assume.
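The self-referential symlink case is easy to screen for before handing a tree to an image builder. A small sketch (a hypothetical pre-flight helper, not part of mkdwarfs):

```python
import os

def find_self_symlinks(root):
    """Return symlinks under `root` whose target resolves back to the
    link itself (the 1-level recursive case: `ln -s loop loop`)."""
    bad = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                continue
            # readlink may be relative; resolve it against the link's dir.
            target = os.path.join(dirpath, os.readlink(path))
            if os.path.abspath(target) == os.path.abspath(path):
                bad.append(path)
    return bad
```

Deeper cycles (a → b → a) would need full loop detection, but this catches the exact crash reported above.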
I noticed that enabling compression on ZFS made a <i>huge</i> difference to the stored size of some of my largely text-file partitions. I never turned on deduplication because I don’t want to bother with the memory overhead, but I bet that would help even further.
I'm curious: why do you have so many perl installations around? I thought I had a fair number of python venvs kicking around for each of the repos I'm dealing with, but nowhere near that many.
Circa 2 years ago, I was working on a side project and got so annoyed with SquashFS tooling that I decided to fix it instead. After getting stuck with the spaghetti code behind mksquashfs, I decided to start from scratch, having learnt enough about SquashFS to roughly understand the on-disk format.<p>Because squashfs-tools seemed pretty unmaintained in late 2018 (no activity on the official site & git tree for years, and only one mailing list post "can you do a release?" which got a very annoyed response) I released my tooling as "squashfs-tools-ng" and it is currently packaged by a handful of distros, including Debian & Ubuntu.[1]<p>I also thoroughly documented the on-disk format after reverse engineering it,[2] and made a few benchmarks.[3]<p>For my benchmarks I used an image I extracted from the Debian XFCE LiveDVD (~6.5GiB as tar archive, ~2GiB as XZ-compressed SquashFS image). By playing around a bit, I also realized that the compressed metadata is "amazingly small" compared to the actual image file data, and the resulting images are very close to the tarball compressed with the same compressor settings.<p>I can accept a claim of being a little smaller than SquashFS, but the claimed difference makes me very suspicious. From the README, I'm not quite sure: does the Raspbian image comparison compare XZ compression against SquashFS with Zstd?<p>I have cloned the git tree and installed the dozens of libraries that this folly thingy needs, but I'm currently swamped in CMake errors (haven't touched CMake in 8+ years, so I'm a bit rusty there) and the build fails with some <i>still</i> missing headers.
I hope to have more luck later today and produce a comparison on my end using my trusty Debian reference image, which I will definitely add to my existing benchmarks.<p>Also, is there any documentation on how the on-disk format for DwarFS and its packing works which might explain the incredible size difference?<p>[1] <a href="https://github.com/AgentD/squashfs-tools-ng" rel="nofollow">https://github.com/AgentD/squashfs-tools-ng</a><p>[2] <a href="https://github.com/AgentD/squashfs-tools-ng/blob/master/doc/format.txt" rel="nofollow">https://github.com/AgentD/squashfs-tools-ng/blob/master/doc/...</a><p>[3] <a href="https://github.com/AgentD/squashfs-tools-ng/tree/master/doc" rel="nofollow">https://github.com/AgentD/squashfs-tools-ng/tree/master/doc</a>
> You can pick either clang or g++, but at least recent clang versions will produce substantially faster code<p>Have you investigated why this might be the case?