
OpenZFS deduplication is good now and you shouldn't use it

454 points | by type0 | 6 months ago

32 comments

Wowfunhappy, 6 months ago

I want "offline" dedupe, or "lazy" dedupe that doesn't require the pool to be fully offline, but doesn't happen immediately.

Because:

> When dedup is enabled [...] every single write and free operation requires a lookup and a then a write to the dedup table, regardless of whether or not the write or free proper was actually done by the pool.

To me, this is "obviously" the wrong approach in most cases. When I'm writing data, I want that write to complete as fast as possible, even at the cost of disk space. That's why I don't save files I'm actively working on in 7zip archives.

But later on, when the system is quiet, I would love for ZFS to go back and figure out which data is duplicated, and use the BRT or whatever to reclaim space. This could be part of a normal scrub operation.
UltraSane, 6 months ago

"And this is the fundamental issue with traditional dedup: these overheads are so outrageous that you are unlikely to ever get them back except on rare and specific workloads."

This struck me as a very odd claim. I've worked with Pure and Dell/EMC arrays, and for VMware workloads they normally got at least 3:1 dedupe/compression savings. Only storing one copy of the base VM image works extremely well. Dedupe/compression also works really well on syslog servers, where I've seen 6:1 savings.

The effectiveness of dedupe is strongly affected by the size of the blocks being hashed: the smaller, the better. As the blocks get smaller, the odds of finding a matching block grow rapidly. In my experience, 4KB is my preferred block size.
simonjgreen, 6 months ago
We used to make extensive use of, and gained huge benefit from, dedup in ZFS. The specific use case was storage for VMWare clusters where we had hundreds of Linux and Windows VMs that were largely the same content. [this was pre-Docker]
nikisweeting, 6 months ago

I'm so excited about fast dedup. I've been wanting to use ZFS dedup for ArchiveBox data for years, as I think fast dedup may finally make it viable to archive many millions of URLs in one collection and let the filesystem take care of compression across everything. So much archive data is the same jquery.min.js, bootstrap.min.css, logo images, etc. repeated over and over in thousands of snapshots. Other tools compress within a crawl to create wacz or warc.gz files, but I don't think anyone has tried to do compression across the entire database of all snapshots ever taken by a tool.

Big thank you to all the people who worked on it!

BTW, has anyone tried a probabilistic dedup approach using something like a Bloom filter, so you don't have to store the entire dedup table of hashes verbatim? Collect groups of ~100 block hashes into a bucket each, and store a hyper-compressed representation in a Bloom filter. On write, look up the hash of the block in the Bloom filter, and if a potential dedup hit is detected, walk the ~100 blocks in the matching bucket manually to look for any identical hashes. In theory you could do this with layers of Bloom filters at different resolutions and dynamically swap the heavier ones out to disk when memory pressure is too high to keep the high-resolution ones in RAM. Allowing the accuracy of the Bloom filter to be changed as a tunable parameter would let people choose their preferred trade-off between CPU time/overhead and bytes saved.
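
A minimal sketch of that idea, assuming a toy Bloom filter and an in-memory dict standing in for the on-disk bucket store (every name here is illustrative; nothing below is a real ZFS structure):

    import hashlib

    class BloomFilter:
        """Toy Bloom filter; sizes and hash counts are illustrative only."""
        def __init__(self, size_bits=1 << 20, num_hashes=4):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, item):
            for i in range(self.num_hashes):
                h = hashlib.blake2b(item, digest_size=8, salt=i.to_bytes(8, "little"))
                yield int.from_bytes(h.digest(), "little") % self.size

        def add(self, item):
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def might_contain(self, item):
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

    bloom = BloomFilter()
    buckets = {}   # bucket key -> list of full block hashes (would live on disk)

    def write_block(block: bytes) -> str:
        digest = hashlib.sha256(block).digest()
        bucket_key = digest[:2]          # groups of hashes share one bucket
        if bloom.might_contain(digest):
            # Possible duplicate: confirm against the full hashes in that bucket.
            if digest in buckets.get(bucket_key, []):
                return "dedup hit: reference the existing block"
        # Definitely new (the filter never gives false negatives): store and index it.
        bloom.add(digest)
        buckets.setdefault(bucket_key, []).append(digest)
        return "new block written"

The filter only ever answers "definitely new" or "maybe seen", so the expensive exact lookup is paid only for probable hits, at the cost of occasionally walking a bucket for a false positive.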
dark-star, 6 months ago

I wonder why they are having so much trouble getting this to work properly with smaller RAM footprints. We have been using commercial storage appliances that have been able to do this for about a decade (at least) now, even on systems with "little" RAM (compared to the amount of disk storage attached).

Just store fingerprints in a database, run through it at night, and fix up the block pointers...
nabla9, 6 months ago

You should use:

    cp --reflink=auto

You get file-level deduplication. The command above performs a lightweight copy (a ZFS clone at the file level), where the data blocks are copied only when modified. It's a copy, not a hard link. The same should work in other copy-on-write transactional filesystems as well, if they have reflink support.
BodyCulture, 6 months ago

I wanted to use ZFS badly, but of course all data must be encrypted. It was surprising to see how much more complicated usage gets than expected, and how many people just don't encrypt their data because things get wild then.

Look, even Proxmox, which I totally expected to support encryption in the default installation (it has "Enterprise" on the website), loses important features when you try to use it with encryption.

Also, please study the issue tracker; there are a few surprising things I would not have expected to exist in a production file system.
klysm, 6 months ago

I really wish we just had a completely different API for filesystems. The filesystem API surface on every OS is a complete disaster that we are locked into via backwards compatibility.
bastloing, 6 months ago

Forget dedupe, just use ZFS compression; it's a lot more bang for your buck.
rodarmor, 6 months ago

General-purpose deduplication sounds good in theory but tends not to work out in practice. IPFS uses a rolling hash with variable-sized pieces, in an attempt to deduplicate data rsync-style. In practice, however, it doesn't actually make a difference, and adds complexity for no reason.
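
For context on the technique being dismissed here: a content-defined chunker keeps a rolling hash over a small sliding window and cuts a chunk wherever the low bits of the hash hit a chosen pattern, so boundaries depend on content rather than offsets and survive insertions. A minimal sketch (the window size, base, and mask are arbitrary illustrative choices, not IPFS's actual parameters):

    import os

    def chunk_boundaries(data: bytes, window=48, mask=(1 << 13) - 1):
        """Yield end offsets of content-defined chunks.

        Maintains a polynomial rolling hash of the last `window` bytes and
        declares a boundary when its low 13 bits are all zero, giving
        roughly 8 KiB average chunks.
        """
        BASE = 257
        MOD = (1 << 61) - 1
        top = pow(BASE, window - 1, MOD)   # weight of the byte leaving the window

        h = 0
        prev_cut = 0
        for i, b in enumerate(data):
            if i >= window:
                h = (h - data[i - window] * top) % MOD
            h = (h * BASE + b) % MOD
            if i + 1 - prev_cut >= window and (h & mask) == 0:
                yield i + 1
                prev_cut = i + 1
        if prev_cut < len(data):
            yield len(data)

    # Shifting the same content by a prefix leaves most cut points aligned,
    # which is what lets rsync-style dedup recognise the repeated chunks.
    blob = os.urandom(200_000)
    prefix = b"inserted header "
    cuts = set(chunk_boundaries(blob))
    shifted_cuts = {c - len(prefix) for c in chunk_boundaries(prefix + blob)}
    print(f"shared cut points: {len(cuts & shifted_cuts)} of {len(cuts)}")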
tilt_error, 6 months ago

If write performance is critical, why bother with deduplication at write time? Why not do the deduplication afterwards, concurrently and at lower priority?
rkagerer, 6 months ago

I'd love it if the dedicated hardware in disk controllers that calculates things like ECC could be enhanced to expose block hashes to the system. Getting this for free on all your I/O would enable some pretty awesome things.
wpollock, 6 months ago

When the lookup key is a hash, there's no locality across the megabytes of the table. So don't all the extra memory accesses needed to support dedup hurt the L1 and L2 caches? Has anyone at OpenZFS measured that?

It also occurs to me that spatial locality on spinning-rust disks might be affected, which would also hurt performance.
cmiller1, 6 months ago

So if a sweet spot exists where dedup is widely beneficial:

Is there an easy way to analyze your dataset to find out whether you're in this sweet spot?

If so, is anyone working on some kind of automated partial dedup system where only portions of the filesystem are deduped, based on an analysis of how beneficial it would be?
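
On the first question: OpenZFS can simulate dedup on an existing pool with `zdb -S <poolname>`, which prints a DDT histogram and the expected ratio without enabling anything. For a rough, filesystem-agnostic ballpark you can also hash fixed-size records yourself; a sketch along these lines (128 KiB records, ignoring compression and per-file alignment, so only indicative):

    import hashlib
    import os
    import sys
    from collections import Counter

    RECORD_SIZE = 128 * 1024   # match the dataset's recordsize for a fair estimate

    def dedup_estimate(root: str) -> float:
        """Walk `root`, hash fixed-size records, and return the logical/unique ratio."""
        counts = Counter()
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                try:
                    with open(path, "rb") as f:
                        while chunk := f.read(RECORD_SIZE):
                            counts[hashlib.sha256(chunk).digest()] += 1
                except OSError:
                    continue   # unreadable file: skip it
        logical = sum(counts.values())
        unique = len(counts)
        return logical / unique if unique else 1.0

    if __name__ == "__main__":
        ratio = dedup_estimate(sys.argv[1] if len(sys.argv) > 1 else ".")
        print(f"estimated dedup ratio: {ratio:.2f}x")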
xmodem, 6 months ago

We have a read-heavy zpool with some data that's used as part of our build process, on which we see roughly 8x savings with dedup. Because of this, ZFS dedup makes it economically viable for us to store the pool on NVMe rather than spinning rust.
nobrains, 6 months ago

What are the use cases where it makes sense to use dedup? Backup comes to mind. What else?
watersb, 6 months ago

I've used ZFS dedupe for a personal archive since dedupe was first introduced.

Currently, it seems to be reducing the on-disk footprint by a factor of 3.

When I first started this project, 2TB hard drives were the largest available.

My current setup uses slow 2.5-inch hard drives; I attempt to improve things somewhat with NVMe-based Optane drives for cache.

Every few years I try to do a better job of things, but at this point the best improvement would be radical simplification.

ZFS has served very well in terms of reliability. I haven't lost data, and I've been able to catch lots of episodes of almost losing data, or writing the wrong data.

I'm not entirely sure how I'd replace it if I want something that can spot bit rot and correct it. ZFS scrub.
qwertox, 6 months ago

What happened to the issue with ZFS that occurred around half a year ago?

I never changed a thing (because it also had some cons) and keep believing that as long as a ZFS scrub shows no errors, all is OK. Could I be missing a problem?
david_draco, 6 months ago

In addition to the copy_file_range discussion at the end, it would be great to be able to apply deduplication to selected files, identified by searching the filesystem for, say, >1MB files with identical hashes.
girishso, 6 months ago

Off topic: any tool to deduplicate files across different external hard disks?

Over the years I made multiple copies of my laptop HDD to different external HDDs, and ended up with lots of duplicate copies of files.
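
Standalone tools like fdupes, rdfind, or jdupes handle exactly this. The usual approach (group by size first, hash only the candidates) also fits in a short script; the mount points below are placeholders:

    import hashlib
    import os
    from collections import defaultdict

    def file_digest(path: str, bufsize: int = 1 << 20) -> str:
        """SHA-256 of a file, read in chunks so large files don't fill RAM."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while chunk := f.read(bufsize):
                h.update(chunk)
        return h.hexdigest()

    def find_duplicates(roots):
        # Pass 1: group by size, since files of different sizes can never match.
        by_size = defaultdict(list)
        for root in roots:
            for dirpath, _dirs, files in os.walk(root):
                for name in files:
                    path = os.path.join(dirpath, name)
                    try:
                        by_size[os.path.getsize(path)].append(path)
                    except OSError:
                        pass
        # Pass 2: hash only the size-colliding candidates.
        by_hash = defaultdict(list)
        for paths in by_size.values():
            if len(paths) < 2:
                continue
            for path in paths:
                try:
                    by_hash[file_digest(path)].append(path)
                except OSError:
                    pass
        return [group for group in by_hash.values() if len(group) > 1]

    # Placeholder mount points for the external disks:
    for group in find_duplicates(["/mnt/backup_2019", "/mnt/backup_2022"]):
        print(group)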
forrestthewoods, 6 months ago

My dream Git successor would use either dedupe or a simple cache plus copy-on-write, so that repos could commit toolchains and dependencies and users wouldn't need to worry about disk drive bloat.

Maybe someday…
UltraSane, 6 months ago

Knowing that your storage has really good inline dedupe is awesome and will affect how you design systems. Solid dedupe lets you effectively treat multiple copies of data as symlinks.
teilo, 6 months ago

Why are enterprise SANs so good at dedupe, but filesystems so bad? We use HPE Nimble (yeah, they changed the name recently, but I can't be bothered to remember it), and the space savings are insane for the large filesystems we work with. And there is no performance hit.

Some of this is straight-up VM storage volumes for ESX virtual disks, some direct LUNs for our file servers. Our gains are upwards of 70%.
onnimonni, 6 months ago

I'm storing a lot of text documents (.html) which contain long, similar sections and are thus not copies but "partial copies".

Does anyone know whether fast dedup works for this too? Anything else I could be using instead?
hhdhdbdb, 6 months ago

Are any timing attacks possible on a virtualized system using dedupe?

E.g. finding out what my neighbours have installed.

Or, if the data preceding an SSH key is predictable, keep writing it out to disk while guessing the next byte, or something like that.
eek2121, 6 months ago

So many flaws. I want to see the author repeat this across 100TB of random data from multiple clients. He/she/whatever will quickly realize why this feature exists. One scenario I am aware of, using another filesystem in a cloud setup, saved 43% of disk space by using dedupe.

No, you won't save much on a client system. That isn't what the feature is made for.
nisten, 6 months ago

Can someone smarter than me explain what happens when, instead of the regular 4KB block size, we build kernels with a 16KB or 64KB block size? Or is that only for the memory side of things? I am confused. Will a larger block size make this thing better or worse?
tiffanyh, 6 months ago

OT: does anyone have a good way to dedupe iCloud Photos, or my Dropbox photos?
merpkz, 6 months ago

I don't get it. Many people in this thread claim that deduplicating VM base images is a great use case for this. So let's assume there are a couple hundred VMs on a ZFS dataset with dedupe on, each run by different people for entirely different purposes: some databases, some web frontends / backends, minio S3 storage, backups, etc. This might save you the measly hundreds of megabytes of Linux system files those VMs have in common (even that is unlikely, given how many Linux versions are out there at different patch levels), but it still won't be worth it, considering ZFS will keep tracking each user's individual files (databases, backup files and whatnot), data which is almost guaranteed to be unique between users, so it completely misses the point of ZFS deduplication. What am I missing?
tjwds, 6 months ago
Edit: disregard this, I was wrong and missed the comment deletion window.
kderbe, 6 months ago

I clicked because of the bait-y title, but ended up reading pretty much the whole post, even though I have no reason to be interested in ZFS. (I skipped most of the stuff about logs...) Everything was explained clearly, I enjoyed the writing style, and the mobile CSS theme was particularly pleasing to my eyes. (It appears to be the Pixyll theme with text set to the all-important #000, although I shouldn't derail this discussion with opinions on contrast ratios...)

For less patient readers, note that the concise summary is at the bottom of the post, not the top.
burnt-resistor, 6 months ago

I already don't use ZoL because of their history of shrug-level support coupled with a lack of QA. ZoL != Solaris ZFS; it is mostly an aspirational cargo cult. Only a few filesystems, like XFS and ext4, have meaningful real-world enterprise deployment hours. Technically, btrfs has significant deployment exposure (web ops rather than IT ops) due to its use on 10M boxes at Meta. Many non-mainstream filesystems also aren't assured to be trustworthy, because of their low usage and prevalent lack of thorough, formalized QA. There's nothing wrong with experimentation, but it's necessary to have an accurate understanding of the risk budget for a given technology and use case.