Show HN: VectorVFS, your filesystem as a vector database

269 points by perone 3 days ago

22 comments

If I understand correctly, this is attaching metadata to files in a format that LLMs (or any tool that can understand the semantic embedding vector) can leverage to understand what a file is without having to actually read the contents of the file.

That obviously has a lot of interesting use cases, but my first assumption was that this could be used to quickly/easily search your filesystem with some prompt like "Play the video from last month where we went camping and saw a flock of turkeys". But that would require having an actual vector DB running on your system which you could use to quickly look up files using an embedding of your query, no?

malcolmgreaves 3 days ago

Fun idea storing embeddings in inodes! Very clever!

I want to point out that this isn't suitable for any kind of actual things you'd use a vector database for. There's no notion of a search index. It's always an O(N) linear search through all of your files: https://github.com/perone/vectorvfs/blob/main/vectorvfs/cli.py

Still, fun idea :)
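
To make the O(N) point concrete, here is a minimal sketch of what a brute-force scan over per-file embeddings looks like. It is not taken from the VectorVFS source: the little-endian float32 packing, the function names, and the `user.embedding` attribute name mentioned in the comments are all assumptions for illustration.

```python
import math
import struct
from typing import Iterable, List, Tuple


def pack_embedding(vec: List[float]) -> bytes:
    """Serialize a vector as little-endian float32 (an assumed format)."""
    return struct.pack(f"<{len(vec)}f", *vec)


def unpack_embedding(blob: bytes) -> List[float]:
    """Inverse of pack_embedding: 4 bytes per float32 component."""
    count = len(blob) // 4
    return list(struct.unpack(f"<{count}f", blob))


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def linear_search(
    query: List[float],
    entries: Iterable[Tuple[str, bytes]],
    top_k: int = 3,
) -> List[Tuple[float, str]]:
    """Score every (path, blob) pair against the query: one pass, O(N).

    On Linux the blobs could come from something like
    os.getxattr(path, "user.embedding"), where the attribute name is
    hypothetical. Without an index, every file is touched on every query.
    """
    scored = [(cosine(query, unpack_embedding(blob)), path) for path, blob in entries]
    scored.sort(reverse=True)
    return scored[:top_k]
```

Every query pays for reading and scoring all N blobs, which is the cost an approximate nearest-neighbour index is meant to avoid.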

anotherpaul 3 days ago

Great idea indeed. The documentation needs a bit more information to be useful. What GPU backends are supported for example? How do I delete the embedding information after I decide to uninstall it? Will give it a try though.

thirdtrigger 3 days ago

Might be interesting to add an optional embedded Weaviate [1] with a flat-index [2] to the project. It wouldn't use external services and is fully disk-based. Would allow you to search the whole filesystem (about 1.5 KB per file (384 dimensions) which would be added to the metadata as well).

[1] https://weaviate.io/developers/weaviate/installation/embedded
[2] https://weaviate.io/developers/academy/py/vector_index/flat
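
The per-file figure can be checked with a quick calculation, assuming each of the 384 dimensions is stored as a 4-byte float32 (the comment does not state the element type, so that is an assumption):

```python
import struct

DIMENSIONS = 384          # from the comment above
BYTES_PER_FLOAT32 = 4     # assumed storage type

# Pack a dummy vector the way a raw float32 blob would be laid out.
vector = [0.0] * DIMENSIONS
blob = struct.pack(f"<{DIMENSIONS}f", *vector)

size_bytes = len(blob)            # 384 * 4
size_kib = size_bytes / 1024
print(f"{size_bytes} bytes = {size_kib} KiB per file")
```

Under that assumption the raw vector is 1536 bytes, which matches the "about 1.5 KB" estimate; any serialization header would add to it.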

ndsipa_pomu 2 days ago

I've long wanted to have a Linux filesystem that robustly supported "tags" for files so that I didn't have to rely on the filesystem hierarchy to represent media files etc. For example, I might want to tag a particular film as "Scifi" and also "Horror". Of course, for films, NFO files are typically used for this kind of metadata, but I'd like a similar facility that could be applied to any type of file.

quantadev 3 days ago

I've been wondering for about 20 years why file systems basically died and stopped innovating. For example, we have lots of hierarchical data structures in the world, and no one seems to have figured out how to let a folder be the storage, instead of always just databases.

For example, if we simply had the ability to have "ordered" files inside folders, that would instantly make it practical for a folder structure to represent "Documents". After all, documents are nothing but a list of paragraphs and images, so if we simply had ordering in file systems we could have document editors that use individual files for each paragraph of text or image. It would be amazing.

Also think about use cases like Jupyter Notebooks. We could stop using the JSON file format, and just make it a folder structure instead, with each cell (node) in its own file. All social media messages and chatbot conversations could be easily saved as folder structures.

I've heard many file copy tools ignore XATTR, so I've never tried to use it for this purpose. Maybe we've had the capability all along and nobody has yet used it in a big way that became popular. Maybe I should consider XATTR and take it seriously.

yencabulator 1 day ago

> Zero-overhead indexing: Embeddings are stored as extended attributes (xattrs) on each file, eliminating the need for external index files or services.

Ain't no such thing as zero-overhead indexing. Just because you can't articulate where the overhead is doesn't make it disappear.

b0a04gl 3 days ago

If VectorVFS obscures retrieval logic behind opaque embeddings, how do users debug why a file surfaced—or worse, why one didn’t?

PeterZaitsev 3 days ago

I think comparing it to a vector database is confusing, as "database" would typically mean indexes and some sort of query support.

Storing embeddings with the file is an interesting concept... we already do it for some file formats (e.g. EXIF), where this one is generalized... yet you would need to have some actual database to load this data into to process at scale.

Another issue I see is support for different models and embedding formats to make this data really portable - like I can take my file, drop it into any system, and its embedding "seamlessly" integrates.

gitroom 3 days ago

Gotta say, the old school debate on filesystems vs databases will never get old for me - I always end up with more questions than answers after reading stuff like this.

bullen 3 days ago

I did something similar, but I use these EXT4 requirements:

 - hard links (only tar works for backup)
 - small file size (or inodes run out before disk space)

http://root.rupy.se

It's very useful for global distributed real-time data that doesn't need the P in CAP for writes. (No new data can be created if one node is offline = you can log in, but not register.)

natas 3 days ago

this is actually a great idea

PeterStuer 3 days ago

If there is no indexing, how will your search time not increase linearly or worse with the number of files?

esafak 3 days ago

Files-as-vector stores is LanceDB's value proposition. How do you compare in performance, etc.?

asadawadia 3 days ago

Is the embedding for the whole file, or for each 1024/512-byte chunk?

javier2 3 days ago

I looked into something similar a few years ago, where I stored embeddings in xattrs.

adenta 3 days ago

I wonder if I could use this locally on my MacBook. The Finder application's built-in search is kinda meh.

badmonster 3 days ago

interesting

pseudosavant 3 days ago

This immediately made me nostalgic for BeOS's BeFS or Windows Longhorn's WinFS database filesystems, and how this kind of thing would have fit them perfectly. So much cool stuff you could do with vectors for everything. Smart folders that include files for a project based on a description of the project. Show me all of my config files for appXYZ. Images of a black dog at the beach. At the OS level, for any other app to easily tap into.

I'd be surprised if cloud storage services like OneDrive don't already do some kind of vector for every file you store. But an online web service isn't the same as being built into the core of the OS.

tzury 3 days ago

I've found that starting with a plain old filesystem often outperforms fancy services - just as the Unix philosophy ("everything is a file" [1]) has preached for decades [2].

When BigQuery was still in alpha I had to ingest ~15 billion HTTP requests a day (headers, bodies, and metadata). None of the official tooling was ready, so I wrote a tiny bash script that:

 1. uploaded the raw logs to Cloud Storage, and
 2. tracked state with three folders: `pending/`, `processing/`, `done/`.

A cron job cycled through those directories and quietly pushed petabytes every week without dropping a byte. Later, Google's own pipelines, and third-party stacks like Logstash, never matched that script's throughput or reliability.

Lesson: reach for the filesystem first; add services only once you've proven you actually need them.

[1] https://en.wikipedia.org/wiki/Everything_is_a_file
[2] https://en.wikipedia.org/wiki/Unix_philosophy
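
The three-folder pattern described above can be sketched roughly as follows. This is not the original bash script, which is not shown in the thread: it is a local-only Python illustration, and the function names (`init_queue`, `claim_next`, `run_once`) and the `handler` callback are hypothetical. It relies on `os.rename` being atomic when source and destination are on the same POSIX filesystem, so a file is always in exactly one state.

```python
import os
from pathlib import Path
from typing import Callable, Optional

STATES = ("pending", "processing", "done")


def init_queue(root: Path) -> None:
    """Create the three state directories under root if they are missing."""
    for state in STATES:
        (root / state).mkdir(parents=True, exist_ok=True)


def claim_next(root: Path) -> Optional[Path]:
    """Move the first pending file into processing/ and return its new path.

    If another worker renamed the file first, os.rename raises
    FileNotFoundError and we simply try the next candidate.
    """
    for candidate in sorted((root / "pending").iterdir()):
        target = root / "processing" / candidate.name
        try:
            os.rename(candidate, target)
        except FileNotFoundError:
            continue
        return target
    return None


def run_once(root: Path, handler: Callable[[Path], None]) -> int:
    """Drain pending/: claim each file, run handler, then move it to done/.

    If handler raises, the file stays in processing/ where a later
    pass or an operator can inspect and retry it.
    """
    processed = 0
    while (item := claim_next(root)) is not None:
        handler(item)  # stand-in for the upload step (an assumption)
        os.rename(item, root / "done" / item.name)
        processed += 1
    return processed
```

A cron entry calling `run_once` on a schedule would play the role of the cron job in the comment; the directory a file sits in is the only state that needs tracking.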

Ericson2314 3 days ago

The idea that filesystems are not just a flavor of database management systems was always a mistake.

Maybe with micro-kernels we'll finally fix this.

colordrops 3 days ago