
Craziest thing I ever used SQLite for: partial file deduplication (2022)

343 points · by ics · about 2 years ago

16 comments

megous · about 2 years ago

I used a similar trick to stuff 14 Linux distros onto a 6 GiB SD card image using the btrfs filesystem. The distros share a lot of common files, so this works well to save space. (btrfs also supports the same data-block-sharing CoW feature as APFS.)

https://xnux.eu/p-boot-demo/

I had to write a custom tool to do it efficiently during image creation.
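For context, Linux exposes this kind of block sharing to userspace through the FICLONE and FICLONERANGE ioctls. A minimal Python sketch of whole-file cloning on btrfs (the paths are hypothetical, and this is not the custom image-creation tool mentioned above):

```python
import fcntl

# FICLONE asks the filesystem to share all of src's data blocks with dst
# via copy-on-write; btrfs supports it (as does XFS with reflink enabled).
FICLONE = 0x40049409  # _IOW(0x94, 9, int) on Linux

def reflink(src_path: str, dst_path: str) -> None:
    """Create dst_path as a CoW clone of src_path; no file data is copied."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())

# Both paths must be on the same btrfs filesystem.
reflink("distro-a/usr/lib/libfoo.so", "distro-b/usr/lib/libfoo.so")
```

From the shell, `cp --reflink=always` requests the same kind of clone.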
mastax · about 2 years ago

That's really cool. There are a bunch of tools that will let you deduplicate files with symlinks or hard links, but being able to do block-level dedupe simply by leaning on the filesystem is nice.

It sometimes feels like games are made to thwart this type of thing. They often use packfiles: basically filesystems within files, optimized to look up assets quickly. They may also be a holdover from when consoles had slow spinning hard drives and benefited from carefully laid-out data. The upshot is that a tiny patch inserting a line of code in a script may shift hundreds of megabytes of other data within the packfiles, so the block hashes no longer match up. Do any filesystems model inserts in some way? I'm pretty sure Steam updates can handle situations like that: I frequently see updates which download a tiny amount (kilobytes) but write a huge amount to disk (gigabytes), and I can't think of any other cause. (Assuming developers aren't using hilariously uncompressed assets.)
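Backup tools like borg and restic sidestep the insert problem with content-defined chunking: chunk boundaries are chosen by a rolling hash of the bytes themselves rather than by fixed offsets, so an insert only disturbs the chunks around the edit. A rough Python sketch of the idea (the Gear-style hash and every constant here are illustrative, not any particular product's algorithm):

```python
import hashlib
import random

random.seed(0)
GEAR = [random.getrandbits(32) for _ in range(256)]  # per-byte random values

MASK = (1 << 13) - 1  # boundary when the low 13 bits are zero => ~8 KiB chunks
MIN_CHUNK, MAX_CHUNK = 2048, 65536

def cdc_chunks(data: bytes):
    """Split at content-defined boundaries. The shifting hash only 'sees'
    the last ~32 bytes, so boundaries re-synchronize soon after an edit
    and unmodified regions keep producing identical chunks."""
    h, start = 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]

base = random.randbytes(1 << 20)  # stand-in for a packfile
patched = base[:5000] + b"one inserted line\n" + base[5000:]
before = {hashlib.sha256(c).digest() for c in cdc_chunks(base)}
after = {hashlib.sha256(c).digest() for c in cdc_chunks(patched)}
print(f"{len(before & after)} of {len(after)} chunks survive the insert")
```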
evmar · about 2 years ago

Coincidentally, I just saw that the "USENIX Test of Time Award" for 2023 went to an analysis of real-world files for deduplication purposes. I found Figure 1 particularly interesting: in practice, partial deduplication didn't save much over whole-file deduplication.

https://www.usenix.org/conferences/test-of-time-awards
hultner · about 2 years ago

Very cool and probably a fun exercise, but I would probably put the data on a ZFS volume with dedupe instead, which, from reading this implementation, seems to be pretty much the same thing from my layman's perspective. I could also add compression to the same dataset.
sgtnoodle · about 2 years ago

That's neat! In retrospect, did the database make the problem more tractable from an algorithmic or practical standpoint, or was it mostly just using the tools you're familiar with? If I were to approach the same problem, I likely would have tried to keep all the data in RAM and serialize it out to a file for incremental runs.
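For scale, the in-RAM alternative is only a few lines; a minimal sketch, with hypothetical paths and a 64-bit hash truncation like the article's (the trade-off being that the index must fit in memory, and a crash loses the whole scan):

```python
import hashlib
import os
import pickle

STATE_FILE = "dedupe_index.pickle"  # hypothetical persistence file
BLOCK_SIZE = 4096

# block hash -> list of (path, offset) places that block was seen
index: dict[bytes, list[tuple[str, int]]] = {}
if os.path.exists(STATE_FILE):
    with open(STATE_FILE, "rb") as f:
        index = pickle.load(f)

def scan(path: str) -> None:
    """Add every block of one file to the in-memory index."""
    with open(path, "rb") as f:
        offset = 0
        while block := f.read(BLOCK_SIZE):
            key = hashlib.sha256(block).digest()[:8]
            index.setdefault(key, []).append((path, offset))
            offset += len(block)

scan("/some/large/file")  # hypothetical input

# Serialize the index out for the next incremental run.
with open(STATE_FILE, "wb") as f:
    pickle.dump(index, f)
```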
leetbulb · about 2 years ago

"There is no file path, because macOS lets you look up files by id. The hash values are cryptographic hashes truncated to 64 bits and reinterpreted as integers."

Is the author implying that APFS or HFS uses this method to calculate the file ID? I am unable to find any information regarding this. From what I understand, w.r.t. APFS, the file ID is a combination of the inode OID and genID.
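The truncated-hash part, at least, reads like a storage trick rather than anything filesystem-specific: an 8-byte slice of a digest reinterpreted as a signed integer fits SQLite's native 8-byte INTEGER column. A guess at the encoding (the hash function and byte order here are assumptions, not from the article):

```python
import hashlib

def hash64(block: bytes) -> int:
    """Truncate a cryptographic hash to 64 bits and reinterpret it as a
    signed integer so it fits SQLite's 8-byte INTEGER type directly."""
    digest = hashlib.sha256(block).digest()
    return int.from_bytes(digest[:8], "little", signed=True)

print(hash64(b"example block"))  # may be negative; that's fine as a key
```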
maxyurk · about 2 years ago

https://en.m.wikipedia.org/wiki/Content-addressable_storage
rurban · about 2 years ago

Using a proper hash table with a Bloom filter would save you the useless pass through a B-tree, though. Much faster and much less memory.
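For context: most blocks in a scan are unique, and a Bloom filter can answer "definitely never seen" without touching the main index, so only probable duplicates pay for a real lookup. A toy sketch (sizes and hash count picked arbitrarily):

```python
import hashlib

class BloomFilter:
    """Answers 'definitely new' or 'maybe seen before'; only the 'maybe'
    cases need a lookup in the real hash table or database."""

    def __init__(self, num_bits: int = 1 << 24, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: bytes):
        # Derive independent bit positions by salting one keyed hash.
        for i in range(self.num_hashes):
            d = hashlib.blake2b(key, salt=i.to_bytes(8, "little")).digest()
            yield int.from_bytes(d[:8], "little") % self.num_bits

    def add(self, key: bytes) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

bf = BloomFilter()
bf.add(b"block-hash-1")
print(bf.might_contain(b"block-hash-1"), bf.might_contain(b"block-hash-2"))
```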
JackSlateur · about 2 years ago

Looks like https://github.com/markfasheh/duperemove
fbdab103 · about 2 years ago
Does anyone else have any other unorthodox use cases? I love SQLite, and am always happy to ram this square peg into a round hole.
anyfoo · about 2 years ago

Oh sweet. I've definitely used SQLite (in a zsh script) for *file* deduplication. Very simple, mostly just rows consisting of paths and file content hashes.

But *partial* file deduplication is something else...
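That whole-file variant is small enough to sketch end to end; roughly (in Python rather than zsh, with a hypothetical root directory):

```python
import hashlib
import os
import sqlite3

db = sqlite3.connect("files.db")
db.execute("CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, hash BLOB)")

def index_tree(root: str) -> None:
    """Record a content hash for every file under root."""
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).digest()
            db.execute("INSERT OR REPLACE INTO files VALUES (?, ?)", (path, digest))
    db.commit()

index_tree("/some/tree")  # hypothetical root

# Any hash shared by more than one path is a whole-file duplicate.
dupes = db.execute(
    "SELECT group_concat(path, '  ') FROM files GROUP BY hash HAVING count(*) > 1"
)
for (paths,) in dupes:
    print(paths)
```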
knagy · about 2 years ago

Reminds me of this article from Riot Games about how they deliver patches: https://technology.riotgames.com/news/supercharging-data-delivery-new-league-patcher
zubairq · about 2 years ago

This is quite amazing. I actually built something even crazier with SQLite, where I broke files up into parts and then hashed the parts, so in total it used less space than the sum of the files' sizes. I used it for a similarity engine, where I would try to see how similar different files were.
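One plausible way to turn shared part-hashes into a similarity score is Jaccard similarity over each file's set of hashes; a sketch under the assumption of fixed-size parts (the comment doesn't say how the files were actually split):

```python
import hashlib

def part_hashes(data: bytes, part_size: int = 4096) -> set[bytes]:
    """Hash fixed-size parts; identical parts across files collapse to one hash."""
    return {
        hashlib.sha256(data[i:i + part_size]).digest()
        for i in range(0, len(data), part_size)
    }

def similarity(a: bytes, b: bytes) -> float:
    """Jaccard similarity: shared distinct parts over total distinct parts."""
    ha, hb = part_hashes(a), part_hashes(b)
    return len(ha & hb) / len(ha | hb) if (ha or hb) else 1.0

x = b"A" * 20000
y = b"A" * 16000 + b"B" * 4000
print(f"{similarity(x, y):.2f}")
```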
skerit · about 2 years ago

I don't get it. How does this SQLite database interact with the APFS volume?
tripleo1 · about 2 years ago

Slightly off topic -> there was a project I saw on GitHub that claimed to support system administration using relational tables, something like "everything in SQL". I thought it might be a cool idea.
MithrilTuxedo · about 2 years ago

This sounds like it could be bolted onto/into rsync on the server side to present filesystems larger than the server can actually store.