
Craziest thing I ever used SQLite for: partial file deduplication (2022)

343 points · by ics · about 2 years ago

16 comments

megous · about 2 years ago

I used a similar trick to stuff 14 Linux distros onto a 6 GiB SD card image using the btrfs filesystem. The distros share a lot of common files, so this works well to save space. (btrfs also supports the same data-block-sharing CoW feature as APFS.)

https://xnux.eu/p-boot-demo/

I had to write a custom tool to do it efficiently during image creation.
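For context, Linux exposes this kind of block sharing to userspace through the FICLONE and FICLONERANGE ioctls. A minimal Python sketch of whole-file cloning on btrfs (the paths are hypothetical, and this is not the custom image-creation tool mentioned above):

```python
import fcntl

# FICLONE asks the filesystem to share all of src's data blocks with dst
# via copy-on-write; btrfs supports it (as does XFS with reflink enabled).
FICLONE = 0x40049409  # _IOW(0x94, 9, int) on Linux

def reflink(src_path: str, dst_path: str) -> None:
    """Create dst_path as a CoW clone of src_path; no file data is copied."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())

# Both paths must be on the same btrfs filesystem.
reflink("distro-a/usr/lib/libfoo.so", "distro-b/usr/lib/libfoo.so")
```

From the shell, `cp --reflink=always` requests the same kind of clone.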
mastax · about 2 years ago

That's really cool. There are a bunch of tools that will let you deduplicate files with symlinks or hard links, but being able to do block-level dedupe simply by leaning on the filesystem is nice.

It sometimes feels like games are made to thwart this type of thing. They often use packfiles: basically filesystems within files, optimized to look up assets quickly. They may also be a holdover from when consoles had slow spinning hard drives and benefited from carefully laid-out data. The upshot is that a tiny patch inserting a line of code in a script may shift hundreds of megabytes of other data within the packfiles, so the block hashes no longer match up. Do any filesystems model inserts in some way? I'm pretty sure Steam updates can handle situations like that: I frequently see updates which download a tiny amount (kilobytes) but write a huge amount to disk (gigabytes), and I can't think of any other cause. (Assuming developers aren't using hilariously uncompressed assets.)
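Backup tools like borg and restic sidestep the insert problem with content-defined chunking: chunk boundaries are chosen by a rolling hash of the bytes themselves rather than by fixed offsets, so an insert only disturbs the chunks around the edit. A rough Python sketch of the idea (the Gear-style hash and every constant here are illustrative, not any particular product's algorithm):

```python
import hashlib
import random

random.seed(0)
GEAR = [random.getrandbits(32) for _ in range(256)]  # per-byte random values

MASK = (1 << 13) - 1  # boundary when the low 13 bits are zero => ~8 KiB chunks
MIN_CHUNK, MAX_CHUNK = 2048, 65536

def cdc_chunks(data: bytes):
    """Split at content-defined boundaries. The shifting hash only 'sees'
    the last ~32 bytes, so boundaries re-synchronize soon after an edit
    and unmodified regions keep producing identical chunks."""
    h, start = 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]

base = random.randbytes(1 << 20)  # stand-in for a packfile
patched = base[:5000] + b"one inserted line\n" + base[5000:]
before = {hashlib.sha256(c).digest() for c in cdc_chunks(base)}
after = {hashlib.sha256(c).digest() for c in cdc_chunks(patched)}
print(f"{len(before & after)} of {len(after)} chunks survive the insert")
```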
evmar · about 2 years ago

Coincidentally, I just saw that the "USENIX Test of Time Award" for 2023 went to an analysis of real-world files for deduplication purposes. I found Figure 1 particularly interesting: in practice, partial deduplication didn't save much over whole-file deduplication.

https://www.usenix.org/conferences/test-of-time-awards
hultner · about 2 years ago

Very cool and probably a fun exercise, but I would probably put the data on a ZFS volume with dedupe instead, which, from reading this implementation, seems to be pretty much the same thing from my layman's perspective. I could also add compression to the same dataset.
sgtnoodle · about 2 years ago

That's neat! In retrospect, did the database make the problem more tractable from an algorithmic or practical standpoint, or was it mostly just using the tools you're familiar with? If I were to approach the same problem, I likely would have tried to keep all the data in RAM and serialize it out to a file for incremental runs.
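For scale, the in-RAM alternative is only a few lines; a minimal sketch, with hypothetical paths and a 64-bit hash truncation like the article's (the trade-off being that the index must fit in memory, and a crash loses the whole scan):

```python
import hashlib
import os
import pickle

STATE_FILE = "dedupe_index.pickle"  # hypothetical persistence file
BLOCK_SIZE = 4096

# block hash -> list of (path, offset) places that block was seen
index: dict[bytes, list[tuple[str, int]]] = {}
if os.path.exists(STATE_FILE):
    with open(STATE_FILE, "rb") as f:
        index = pickle.load(f)

def scan(path: str) -> None:
    """Add every block of one file to the in-memory index."""
    with open(path, "rb") as f:
        offset = 0
        while block := f.read(BLOCK_SIZE):
            key = hashlib.sha256(block).digest()[:8]
            index.setdefault(key, []).append((path, offset))
            offset += len(block)

scan("/some/large/file")  # hypothetical input

# Serialize the index out for the next incremental run.
with open(STATE_FILE, "wb") as f:
    pickle.dump(index, f)
```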
leetbulb · about 2 years ago

"There is no file path, because macOS lets you look up files by id. The hash values are cryptographic hashes truncated to 64 bits and reinterpreted as integers."

Is the author implying that APFS or HFS uses this method to calculate the file ID? I am unable to find any information regarding this. From what I understand, w.r.t. APFS, the file ID is a combination of the inode OID and genID.
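The truncated-hash part, at least, reads like a storage trick rather than anything filesystem-specific: an 8-byte slice of a digest reinterpreted as a signed integer fits SQLite's native 8-byte INTEGER column. A guess at the encoding (the hash function and byte order here are assumptions, not from the article):

```python
import hashlib

def hash64(block: bytes) -> int:
    """Truncate a cryptographic hash to 64 bits and reinterpret it as a
    signed integer so it fits SQLite's 8-byte INTEGER type directly."""
    digest = hashlib.sha256(block).digest()
    return int.from_bytes(digest[:8], "little", signed=True)

print(hash64(b"example block"))  # may be negative; that's fine as a key
```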
maxyurk · about 2 years ago

https://en.m.wikipedia.org/wiki/Content-addressable_storage
rurban · about 2 years ago

Using a proper hash table with a Bloom filter would save you the useless pass through a B-tree, though. Much faster and much less memory.
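For context: most blocks in a scan are unique, and a Bloom filter can answer "definitely never seen" without touching the main index, so only probable duplicates pay for a real lookup. A toy sketch (sizes and hash count picked arbitrarily):

```python
import hashlib

class BloomFilter:
    """Answers 'definitely new' or 'maybe seen before'; only the 'maybe'
    cases need a lookup in the real hash table or database."""

    def __init__(self, num_bits: int = 1 << 24, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: bytes):
        # Derive independent bit positions by salting one keyed hash.
        for i in range(self.num_hashes):
            d = hashlib.blake2b(key, salt=i.to_bytes(8, "little")).digest()
            yield int.from_bytes(d[:8], "little") % self.num_bits

    def add(self, key: bytes) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

bf = BloomFilter()
bf.add(b"block-hash-1")
print(bf.might_contain(b"block-hash-1"), bf.might_contain(b"block-hash-2"))
```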
JackSlateur · about 2 years ago

Looks like https://github.com/markfasheh/duperemove
fbdab103 · about 2 years ago
Does anyone else have any other unorthodox use cases? I love SQLite, and am always happy to ram this square peg into a round hole.
anyfoo · about 2 years ago

Oh sweet. I've definitely used SQLite (in a zsh script) for *file* deduplication. Very simple, mostly just rows consisting of paths and file content hashes.

But *partial* file deduplication is something else...
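That whole-file variant is small enough to sketch end to end; roughly (in Python rather than zsh, with a hypothetical root directory):

```python
import hashlib
import os
import sqlite3

db = sqlite3.connect("files.db")
db.execute("CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, hash BLOB)")

def index_tree(root: str) -> None:
    """Record a content hash for every file under root."""
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).digest()
            db.execute("INSERT OR REPLACE INTO files VALUES (?, ?)", (path, digest))
    db.commit()

index_tree("/some/tree")  # hypothetical root

# Any hash shared by more than one path is a whole-file duplicate.
dupes = db.execute(
    "SELECT group_concat(path, '  ') FROM files GROUP BY hash HAVING count(*) > 1"
)
for (paths,) in dupes:
    print(paths)
```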
knagy · about 2 years ago

Reminds me of this article from Riot Games about how they deliver patches: https://technology.riotgames.com/news/supercharging-data-delivery-new-league-patcher
zubairq · about 2 years ago

This is quite amazing. I actually built something even crazier with SQLite, where I broke files up into parts and then hashed the parts, so in total it used less space than the sum of the files' sizes. I used it for a similarity engine, where I would try to see how similar different files were.
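One plausible way to turn shared part-hashes into a similarity score is Jaccard similarity over each file's set of hashes; a sketch under the assumption of fixed-size parts (the comment doesn't say how the files were actually split):

```python
import hashlib

def part_hashes(data: bytes, part_size: int = 4096) -> set[bytes]:
    """Hash fixed-size parts; identical parts across files collapse to one hash."""
    return {
        hashlib.sha256(data[i:i + part_size]).digest()
        for i in range(0, len(data), part_size)
    }

def similarity(a: bytes, b: bytes) -> float:
    """Jaccard similarity: shared distinct parts over total distinct parts."""
    ha, hb = part_hashes(a), part_hashes(b)
    return len(ha & hb) / len(ha | hb) if (ha or hb) else 1.0

x = b"A" * 20000
y = b"A" * 16000 + b"B" * 4000
print(f"{similarity(x, y):.2f}")
```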
skerit · about 2 years ago

I don't get it. How does this SQLite database interact with the APFS volume?
tripleo1 · about 2 years ago

Slightly off topic -> there was a project I saw on GitHub that claimed to support system administration using relational tables, something like "everything in SQL". I thought it might be a cool idea.
MithrilTuxedo · about 2 years ago

This sounds like it could be bolted onto/into rsync on the server side to present filesystems larger than the server can actually store.