
Data Deduplication with Linux

50 points by chintanp over 13 years ago

9 comments

wazoox over 13 years ago
For those interested, I've done lots of lessfs testing, published on my professional blog a while ago:

* first post: http://blogs.intellique.com/tech/2010/12/22#dedupe

* detailed setup and benchmark results: http://blogs.intellique.com/tech/2011/01/03#dedupe-config

After more than 9 months running lessfs, I recommend it.
chintanp over 13 years ago
Required reading from my course on Advanced Storage Systems at CMU: http://www.cs.cmu.edu/~15-610/READINGS/optional/zhu2008.pdf

A really good paper that describes in detail how the deduplication works.
ak217 over 13 years ago
From what I understand, this is great but more of a proof of concept, since FUSE performance kills it. As far as putting it in production goes, there are a few unresolved questions I haven't seen picked apart:

- Can dedup be integrated into the VFS layer, as unionfs is shooting for, or does it have to be integrated with the underlying filesystem?

- Is online dedup possible, and does the answer change when running on SSDs?

- What's the best granularity (block-level? inode-level? block-extent-level?), and how badly can it randomize the I/O? I imagine one would have to do a lot of real-world benchmarking to find this out.

- Are there possible privacy issues (i.e., inferring through I/O patterns whether someone else has a given block or file stored), and how should they be dealt with?
res0nat0r over 13 years ago
Bup is also a pretty cool git-based dedup backup utility: https://github.com/apenwarr/bup#readme
viraptor over 13 years ago
I was wondering: with the current amount of abstraction and similar (sometimes redundant) metadata on almost everything, what percentage of duplicate blocks could be found on a standard desktop system?

I don't think it would be useful; I'm just interested in the level of "standard" data duplication.
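
One way to put a number on that curiosity is to hash fixed-size blocks across a directory tree and count repeats. Below is a back-of-the-envelope sketch (an editorial addition, not from the thread): it assumes 4 KiB blocks aligned to file offsets, holds every digest in memory, and is therefore only practical for a modest subtree, but it gives a rough lower bound on what block-level dedup could reclaim.

```python
# Estimate the fraction of duplicate 4 KiB blocks under a directory tree.
# Hypothetical sketch: unoptimized, keeps all digests in memory.
import hashlib
import os
import sys

BLOCK_SIZE = 4096  # a typical filesystem block size

def duplicate_ratio(root):
    seen = set()
    total = dupes = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    while True:
                        block = f.read(BLOCK_SIZE)
                        if not block:
                            break
                        digest = hashlib.sha1(block).digest()
                        total += 1
                        if digest in seen:
                            dupes += 1  # this exact block was seen before
                        else:
                            seen.add(digest)
            except OSError:
                continue  # skip unreadable files
    return dupes / total if total else 0.0

if __name__ == "__main__":
    print(f"duplicate block ratio: {duplicate_ratio(sys.argv[1]):.2%}")
```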
makmanalp over 13 years ago
btrfs also has a deduplication feature in the works: http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg07720.html
tobias3 over 13 years ago
I tested it and I don't recommend it (that was about a year ago, though). It was really slow, and some blog posts about the reliability of the data storage backend were a little bit scary.

I would recommend using zfs-fuse instead. You don't have the FUSE -> file on a filesystem -> hard disk indirection (thus more speed), and additionally you get all the cool ZFS features! If you need even more speed, there is a ZFS kernel module for Linux and a dedup patch for btrfs; I don't think those are production-ready, though.
alecco over 13 years ago
I don't understand the complication of using a database. The sensible approach would be something like BMDiff with [page] indexing on top for random access.
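
To make that concrete, here is a toy sketch (an editorial addition, with invented names `encode`/`decode`) of the shape alecco is describing: an in-memory index of block fingerprints, with later data encoded as copy references or literals. Real BMDiff (Bentley-McIlroy) samples rolling fingerprints and matches at arbitrary alignment; this fixed-alignment version only illustrates the encode/index structure.

```python
# Toy block-level delta encoder in the spirit of BMDiff: duplicate blocks
# become (copy, offset) references into already-emitted output.
import hashlib

BLOCK = 4096

def encode(data):
    index = {}   # block fingerprint -> offset of its first occurrence
    ops = []     # ("copy", offset) or ("lit", bytes)
    for pos in range(0, len(data), BLOCK):
        block = data[pos:pos + BLOCK]
        fp = hashlib.sha1(block).digest()
        prev = index.get(fp)
        # compare the actual bytes to guard against hash collisions
        if prev is not None and data[prev:prev + BLOCK] == block:
            ops.append(("copy", prev))
        else:
            index.setdefault(fp, pos)
            ops.append(("lit", block))
    return ops

def decode(ops):
    out = bytearray()
    for kind, arg in ops:
        if kind == "copy":
            out.extend(out[arg:arg + BLOCK])  # re-read earlier output
        else:
            out.extend(arg)
    return bytes(out)
```

Because every op except a trailing short literal produces exactly BLOCK bytes of output, the op list doubles as the page index alecco mentions: the byte at offset O is produced by op O // BLOCK, which gives random access without decoding from the start.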
wcoenen over 13 years ago
lessfs appears to do block-level deduplication (like ZFS). This means that if I copy a huge file but add a few bytes at the start, I won't get any benefit from deduplication, because the data no longer aligns with the original block boundaries.

I wonder if there is a way to improve on that?
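
There is: the usual answer is content-defined chunking, where boundaries are chosen by a rolling hash of the data itself rather than by fixed offsets, so an insertion only disturbs the chunks around it (this is what bup, mentioned above, relies on). Below is a minimal editorial sketch using a Rabin-Karp-style rolling hash; the constants and the `chunks` helper are illustrative, and real systems add maximum chunk sizes and stronger fingerprints (Rabin fingerprints, buzhash).

```python
# Content-defined chunking sketch: cut a chunk wherever the rolling hash
# of the last WINDOW bytes matches a fixed bit pattern.
WINDOW = 48            # rolling-hash window in bytes
MASK = (1 << 13) - 1   # boundary when hash & MASK == 0 -> ~8 KiB average chunks
BASE = 257
MOD = (1 << 31) - 1

def chunks(data):
    pow_w = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    h = 0
    start = 0
    for i, b in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * pow_w) % MOD  # drop the oldest byte
        h = (h * BASE + b) % MOD                      # add the newest byte
        # declare a boundary on the hash pattern, but never emit a chunk
        # shorter than WINDOW bytes
        if i - start + 1 >= WINDOW and (h & MASK) == 0:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]

if __name__ == "__main__":
    import os
    original = os.urandom(1 << 20)   # 1 MiB of random data
    shifted = b"xyz" + original      # same data with 3 bytes inserted at front
    a = list(chunks(original))
    b = set(chunks(shifted))
    shared = sum(1 for c in a if c in b)
    print(f"{shared}/{len(a)} chunks of the original survive the shift")
```

With fixed 4 KiB blocks the 3-byte insertion would destroy every match; with content-defined boundaries, only the first chunk or two change, and everything after them deduplicates as before.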