For those interested, I did a lot of lessfs testing and published the results on my professional blog a while ago:<p>* first post:
<a href="http://blogs.intellique.com/tech/2010/12/22#dedupe" rel="nofollow">http://blogs.intellique.com/tech/2010/12/22#dedupe</a><p>* detailed setup and benchmark results:
<a href="http://blogs.intellique.com/tech/2011/01/03#dedupe-config" rel="nofollow">http://blogs.intellique.com/tech/2011/01/03#dedupe-config</a><p>After more than 9 months running lessfs, I recommend it.
Required reading from my course on Advanced Storage Systems at CMU: <a href="http://www.cs.cmu.edu/~15-610/READINGS/optional/zhu2008.pdf" rel="nofollow">http://www.cs.cmu.edu/~15-610/READINGS/optional/zhu2008.pdf</a><p>A really good paper that describes in detail how the deduplication works.
So, from what I understand, this is great but more of a proof of concept, since FUSE performance kills it. As far as putting it in production goes, there are a few unresolved questions I haven't seen picked apart:<p>- Can dedup be integrated into the VFS layer, like unionfs is shooting for, or does it have to be integrated with the underlying filesystem?<p>- Is online dedup possible, and does the answer change when running on SSDs?<p>- What's the best granularity (block-level? inode-level? block-extent-level?) and how badly can it randomize the i/o? I imagine one would have to do a lot of real-world benchmarking to find this out (see the toy sketch below for the fixed-block baseline).<p>- Are there possible privacy issues (e.g. inferring through i/o patterns whether someone else has a given block or file stored), and how would you deal with them?
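To make the granularity and i/o questions a bit more concrete, here is a toy, purely hypothetical sketch of fixed-size block-level dedup (the names and the 4 KiB block size are made up, not anything lessfs or ZFS actually does). A file becomes an ordered list of block hashes, and the blocks themselves can land anywhere in a shared store, which is where the read-side i/o randomization comes from.

    import hashlib

    BLOCK_SIZE = 4096  # assumed fixed block size; real systems may use extents

    class BlockStore:
        """Toy block-level dedup: unique blocks go into one append-only log,
        and each file is just an ordered list of block hashes (a "recipe")."""

        def __init__(self):
            self.log = bytearray()   # every unique block stored exactly once
            self.where = {}          # block hash -> (offset, length) in the log
            self.recipes = {}        # file name -> list of block hashes

        def write_file(self, name, data):
            recipe = []
            for i in range(0, len(data), BLOCK_SIZE):
                block = data[i:i + BLOCK_SIZE]
                h = hashlib.sha256(block).digest()
                if h not in self.where:              # only new content is stored
                    self.where[h] = (len(self.log), len(block))
                    self.log += block
                recipe.append(h)
            self.recipes[name] = recipe

        def read_file(self, name):
            # The blocks of one file can live at arbitrary offsets in the log,
            # so a sequential file read turns into scattered block lookups;
            # that is the i/o randomization cost of dedup.
            out = bytearray()
            for h in self.recipes[name]:
                off, length = self.where[h]
                out += self.log[off:off + length]
            return bytes(out)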
Bup is also a pretty cool git-based dedup backup utility:<p><a href="https://github.com/apenwarr/bup#readme" rel="nofollow">https://github.com/apenwarr/bup#readme</a>
I was wondering: with the current amount of abstraction and similar (sometimes redundant) metadata on almost everything, what percentage of duplicate blocks could be found on a standard desktop system?<p>I don't think it would be useful; I'm just curious about the level of "standard" data duplication.
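Out of the same curiosity, here is a rough sketch one could run to get a number. The 4 KiB block size and fixed alignment are assumptions, and the resulting ratio depends heavily on both.

    import hashlib, os, sys
    from collections import Counter

    BLOCK = 4096  # assumed block size; the ratio changes a lot with this choice

    def duplicate_block_ratio(root):
        """Hash every aligned 4 KiB block under root and return the fraction
        of blocks that duplicate an already-seen block."""
        counts = Counter()
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                try:
                    with open(path, 'rb') as f:
                        while True:
                            block = f.read(BLOCK)
                            if not block:
                                break
                            counts[hashlib.sha256(block).digest()] += 1
                except OSError:
                    continue  # unreadable files, special files, etc.
        total = sum(counts.values())
        return 0.0 if total == 0 else 1 - len(counts) / total

    if __name__ == '__main__':
        root = sys.argv[1] if len(sys.argv) > 1 else os.path.expanduser('~')
        print("duplicate block ratio: {:.1%}".format(duplicate_block_ratio(root)))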
btrfs also has a deduplication feature in the works: <a href="http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg07720.html" rel="nofollow">http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg0...</a>
I tested it and I don't recommend it (that was about a year ago, though).
It was really slow, and some blog posts about the reliability of the data storage backend were a little bit scary.<p>I would recommend using zfs-fuse instead. You avoid the FUSE -> file on a filesystem -> hard disk indirection (and thus get more speed), and additionally you get all the cool ZFS features!
If you need even more speed, there is a native ZFS kernel module for Linux and a dedup patch for btrfs. I don't think either of those is production-ready yet, though.
I don't see why the complication of a database is needed. The sensible approach would be something like BMDiff with [page] indexing on top for random access.
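A minimal sketch of the page-index part of that idea, with zlib standing in for BMDiff (the class name and 64 KiB page size are arbitrary choices of mine): each page is compressed independently and a small index maps page numbers to offsets, so a random read only has to decompress the pages it actually touches.

    import zlib

    PAGE = 64 * 1024  # assumed uncompressed page size

    class PageIndexedStore:
        """Compress fixed-size pages independently and keep a per-page
        (offset, length) index, so random reads decompress only the
        pages they touch."""

        def __init__(self, data):
            self.blob = bytearray()
            self.index = []                      # page number -> (offset, length)
            for i in range(0, len(data), PAGE):
                comp = zlib.compress(data[i:i + PAGE])
                self.index.append((len(self.blob), len(comp)))
                self.blob += comp

        def read(self, pos, size):
            out = bytearray()
            while size > 0 and pos < len(self.index) * PAGE:
                page, in_page = divmod(pos, PAGE)
                off, length = self.index[page]
                plain = zlib.decompress(bytes(self.blob[off:off + length]))
                piece = plain[in_page:in_page + size]
                if not piece:
                    break
                out += piece
                pos += len(piece)
                size -= len(piece)
            return bytes(out)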
lessfs appears to do block-level deduplication (like ZFS). That means that if I copy a huge file but add a few bytes at the start, I won't get any benefit from deduplication, because the data no longer aligns with the original block boundaries.<p>I wonder if there is a way to improve on that?
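One common answer, as I understand it, is content-defined chunking (roughly the trick behind rsync's rolling checksum and bup's hashsplitting): cut a chunk wherever a rolling hash over the last few dozen bytes hits a magic pattern, so boundaries depend on nearby content rather than absolute offsets, and an insertion at the start only disturbs the chunks around it. A rough sketch, where the window size, mask and chunk limits are arbitrary choices of mine:

    import hashlib

    WINDOW = 48                      # rolling-hash window in bytes (assumed)
    BASE, MOD = 257, 1 << 32
    POW = pow(BASE, WINDOW - 1, MOD) # coefficient of the byte leaving the window
    MASK = (1 << 13) - 1             # cut when low 13 bits are set: ~8 KiB chunks
    MIN_CHUNK, MAX_CHUNK = 2 * 1024, 64 * 1024

    def chunk_boundaries(data):
        """Yield (start, end) offsets of content-defined chunks. A boundary
        depends on the WINDOW bytes before it (plus min/max size limits),
        not on absolute file offsets, so data inserted at the front shifts
        boundaries only locally."""
        start, h = 0, 0
        for i, b in enumerate(data):
            if i >= WINDOW:
                h = (h - data[i - WINDOW] * POW) % MOD  # drop the outgoing byte
            h = (h * BASE + b) % MOD                    # add the incoming byte
            length = i + 1 - start
            if (length >= MIN_CHUNK and (h & MASK) == MASK) or length >= MAX_CHUNK:
                yield start, i + 1
                start = i + 1
        if start < len(data):
            yield start, len(data)

    def store_file(data, store):
        """Store each chunk once, keyed by its SHA-256; return the recipe."""
        recipe = []
        for s, e in chunk_boundaries(data):
            key = hashlib.sha256(data[s:e]).digest()
            store.setdefault(key, data[s:e])
            recipe.append(key)
        return recipe

With fixed blocks, prepending a few bytes changes every block hash; with this kind of chunking, only the first chunk or two change and the rest still dedup against the original copy.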