If I understand correctly, this is attaching metadata to files in a format that LLMs (or any tool that can understand the semantic embedding vector) can leverage to understand what a file is without having to actually read the contents of the file.<p>That obviously has a lot of interesting use cases, but my first assumption was that this could be used to quickly/easily search your filesystem with some prompt like "Play the video from last month where we went camping and saw a flock of turkeys". But that would require having an actual vector DB running on your system which you could use to quickly look up files using an embedding of your query, no?
Fun idea storing embeddings in inodes! Very clever!<p>I want to point out that this isn’t suitable for the kind of workloads you’d actually use a vector database for. There’s no notion of a search index, so every query is an O(N) linear scan through all of your files: <a href="https://github.com/perone/vectorvfs/blob/main/vectorvfs/cli.py">https://github.com/perone/vectorvfs/blob/main/vectorvfs/cli....</a><p>Still, fun idea :)
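To make the O(N) point concrete, here is a minimal sketch of what a linear scan over per-file embeddings amounts to. It is not the code from the linked cli.py; the function names, file names, and toy 3-dimensional vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def linear_search(query, file_embeddings, top_k=3):
    """Score every file against the query: O(N) in the number of files.

    file_embeddings is a hypothetical dict of path -> embedding, standing in
    for vectors that would be read back from each file's metadata.
    """
    scored = [
        (cosine_similarity(query, embedding), path)
        for path, embedding in file_embeddings.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:top_k]


if __name__ == "__main__":
    files = {
        "camping.mp4": [0.9, 0.1, 0.0],
        "taxes.pdf": [0.0, 0.2, 0.9],
        "turkeys.jpg": [0.8, 0.3, 0.1],
    }
    for score, path in linear_search([1.0, 0.0, 0.0], files, top_k=2):
        print(f"{score:.3f}  {path}")
```

Every query touches every file, which is fine for a few thousand files and painful for millions; an index exists precisely to avoid that full pass.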
Great idea indeed.
The documentation needs a bit more information to be useful.
For example, which GPU backends are supported?
How do I delete the embedding information after I decide to uninstall it?
Will give it a try though.
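On the uninstall question: if the embeddings live in extended attributes, removing them comes down to deleting those attributes from each file. Here is a rough sketch using Python's Linux-only `os.listxattr` and `os.removexattr`. The `user.vectorvfs` prefix is a guess for illustration, not taken from the project, so check the real attribute names first (for example with `getfattr -d <file>`).

```python
import os

# Hypothetical attribute prefix; the real name used by the tool may differ.
ATTR_PREFIX = "user.vectorvfs"


def matching_attrs(names, prefix=ATTR_PREFIX):
    """Return the attribute names that start with the given prefix."""
    return [name for name in names if name.startswith(prefix)]


def strip_embeddings(root, prefix=ATTR_PREFIX):
    """Walk root and remove matching xattrs; returns (path, attr) pairs removed."""
    removed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                names = os.listxattr(path)
            except OSError:
                continue  # unreadable file or filesystem without xattr support
            for attr in matching_attrs(names, prefix):
                os.removexattr(path, attr)
                removed.append((path, attr))
    return removed
```

Calling `strip_embeddings("/path/to/media")` would return the list of attributes it deleted, which doubles as a log of what was cleaned up.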
Might be interesting to add an optional embedded Weaviate [1] with a flat index [2] to the project. It wouldn't use external services and is fully disk-based. That would let you search the whole filesystem, at a cost of about 1.5 KB per file for 384 dimensions, which would be added to the metadata as well.<p>1. <a href="https://weaviate.io/developers/weaviate/installation/embedded" rel="nofollow">https://weaviate.io/developers/weaviate/installation/embedde...</a>
2. <a href="https://weaviate.io/developers/academy/py/vector_index/flat" rel="nofollow">https://weaviate.io/developers/academy/py/vector_index/flat</a>
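The 1.5 KB figure is consistent with 384 dimensions stored as 32-bit floats. The float width is an assumption here, since the comment does not state it. A quick check:

```python
import struct

DIMENSIONS = 384        # dimension count quoted in the comment
BYTES_PER_FLOAT32 = 4   # assumed storage type: 32-bit float

embedding = [0.0] * DIMENSIONS
packed = struct.pack(f"<{DIMENSIONS}f", *embedding)  # little-endian float32

print(len(packed))         # 1536 bytes
print(len(packed) / 1024)  # 1.5 KiB
```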
I've long wanted a Linux filesystem that robustly supported "tags" for files, so that I didn't have to rely on the filesystem hierarchy to organize media files and the like. For example, I might want to tag a particular film as both "Scifi" and "Horror". Of course, for films, NFO files are typically used for this kind of metadata, but I'd like a similar facility that could be applied to any type of file.
I've been wondering for about 20 years why file systems basically died and stopped innovating. For example, we have lots of hierarchical data structures in the world, and no one seems to have figured out how to let a folder be the storage, instead of always reaching for databases.<p>For example, if we simply had the ability to have "ordered" files inside folders, that would instantly make it practical for a folder structure to represent "Documents". After all, documents are nothing but a list of paragraphs and images, so if we simply had ordering in file systems we could have document editors that use an individual file for each paragraph of text or image. It would be amazing.<p>Also think about use cases like Jupyter Notebooks. We could stop using the JSON file format and just make it a folder structure instead, with each cell (node) in its own file. All social media messages and chatbot conversations could easily be saved as folder structures.<p>I've heard that many file copy tools ignore xattrs, so I've never tried to use them for this purpose. Maybe we've had the capability all along and nobody has used it in a big way that became popular yet. Maybe I should take xattrs seriously.
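Without filesystem-level ordering, the usual workaround is to encode the order in the file names. Here is a small sketch of a "document as a folder" along those lines. The zero-padded naming scheme and the function names are invented for this example.

```python
import os
import tempfile


def write_document(folder, paragraphs):
    """Store each paragraph as its own file, ordered by a zero-padded prefix."""
    os.makedirs(folder, exist_ok=True)
    for index, text in enumerate(paragraphs):
        name = f"{index:04d}.txt"
        with open(os.path.join(folder, name), "w", encoding="utf-8") as handle:
            handle.write(text)


def read_document(folder):
    """Reassemble the paragraphs by sorting file names lexicographically."""
    paragraphs = []
    for name in sorted(os.listdir(folder)):
        with open(os.path.join(folder, name), encoding="utf-8") as handle:
            paragraphs.append(handle.read())
    return paragraphs


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        doc = os.path.join(tmp, "notes")
        write_document(doc, ["Intro.", "Details.", "Conclusion."])
        print(read_document(doc))
```

The weakness is visible right away: inserting a paragraph in the middle means renaming every later file, which is exactly the pain that native ordering would remove.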
> Zero-overhead indexing Embeddings are stored as extended attributes (xattrs) on each file, eliminating the need for external index files or services.<p>Ain't no such thing as zero-overhead indexing. Just because you can't articulate where the overhead is doesn't make it disappear.
I think comparing it to a vector database is confusing, since a database typically implies indexes and some sort of query support.<p>Storing embeddings with the file is an interesting concept. We already do something similar for some file formats (e.g. EXIF), and this generalizes it. Yet you would still need an actual database to load this data into in order to process it at scale.<p>Another issue I see is support for different models and embedding formats, which is needed to make this data really portable, so that I can take my file, drop it into any system, and have its embedding integrate "seamlessly".
Gotta say, the old school debate on filesystems vs databases will never get old for me - I always end up with more questions than answers after reading stuff like this.
I did something similar, but I use these EXT4 requirements:<p><pre><code> - hard links (only tar works for backup)
- small file size (or inodes run out before disk space)
</code></pre>
<a href="http://root.rupy.se" rel="nofollow">http://root.rupy.se</a><p>It's very useful for globally distributed real-time data that doesn't need the P in CAP for writes.<p>(No new data can be created if one node is offline, so you can log in but not register.)
This immediately made me nostalgic for BeOS's BeFS or Windows Longhorn's WinFS database filesystems, and how perfectly this kind of thing would have fit them. So much cool stuff you could do with vectors for everything. Smart folders that include files for a project based on a description of the project. Show me all of my config files for appXYZ. Images of a black dog at the beach. All at the OS level, for any other app to easily tap into.<p>I'd be surprised if cloud storage services like OneDrive don't already compute some kind of vector for every file you store. But an online web service isn't the same as being built into the core of the OS.
I’ve found that starting with a plain old filesystem often outperforms fancy services - just as the Unix philosophy (“everything is a file” [1]) has preached for decades [2].<p>When BigQuery was still in alpha I had to ingest ~15 billion HTTP requests a day (headers, bodies, and metadata). None of the official tooling was ready, so I wrote a tiny bash script that:<p><pre><code> 1. uploaded the raw logs to Cloud Storage, and
2. tracked state with three folders: `pending/`, `processing/`, `done/`.
</code></pre>
A cron job cycled through those directories and quietly pushed petabytes every week without dropping a byte. Later, neither Google’s own pipelines nor third-party stacks like Logstash ever matched that script’s throughput or reliability.<p>Lesson: reach for the filesystem first; add services only once you’ve proven you actually need them.<p>[1] <a href="https://en.wikipedia.org/wiki/Everything_is_a_file" rel="nofollow">https://en.wikipedia.org/wiki/Everything_is_a_file</a>
[2] <a href="https://en.wikipedia.org/wiki/Unix_philosophy" rel="nofollow">https://en.wikipedia.org/wiki/Unix_philosophy</a>
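The three-folder scheme can be sketched in a few lines. This is not the original bash script. It is a Python approximation in which `upload` is a placeholder callable and the file names are made up. It leans on the fact that a rename within one filesystem is atomic on POSIX, so a file is always in exactly one state.

```python
import os
import tempfile

STATES = ("pending", "processing", "done")


def init_queue(root):
    """Create the three state directories under root."""
    for state in STATES:
        os.makedirs(os.path.join(root, state), exist_ok=True)


def process_pending(root, upload):
    """Move each pending file through processing/ to done/, calling upload on it."""
    finished = []
    pending_dir = os.path.join(root, "pending")
    for name in sorted(os.listdir(pending_dir)):
        claimed = os.path.join(root, "processing", name)
        try:
            # Atomic claim: if another worker already moved the file, skip it.
            os.rename(os.path.join(pending_dir, name), claimed)
        except FileNotFoundError:
            continue
        upload(claimed)  # if this raises, the file stays in processing/ for a retry
        os.rename(claimed, os.path.join(root, "done", name))
        finished.append(name)
    return finished


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        init_queue(root)
        for name in ("a.log", "b.log"):
            with open(os.path.join(root, "pending", name), "w") as handle:
                handle.write("raw log lines\n")
        print(process_pending(root, upload=lambda path: None))
        print(sorted(os.listdir(os.path.join(root, "done"))))
```

The directory listing is the state: anything left in `processing/` after a crash is exactly the set of files that need a retry, with no separate bookkeeping to corrupt.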
The idea that filesystems are not just a flavor of database management systems was always a mistake.<p>Maybe with micro-kernels we'll finally fix this.