Hi everyone!<p>Over the years, I've found myself building hacky solutions to serve and manage my embeddings. I’m excited to share Embeddinghub, an open-source vector database for ML embeddings. It is built with four goals in mind:<p>Store embeddings durably and with high availability<p>Allow for approximate nearest neighbor operations<p>Enable other operations like partitioning, sub-indices, and averaging<p>Manage versioning, access control, and rollbacks painlessly<p>It's still in the early stages, and before we commit more dev time to it, we wanted to get your feedback. Let us know what you think and what you'd like to see!<p>Repo: <a href="https://github.com/featureform/embeddinghub" rel="nofollow">https://github.com/featureform/embeddinghub</a><p>Docs: <a href="https://docs.featureform.com/" rel="nofollow">https://docs.featureform.com/</a><p>What's an Embedding? The Definitive Guide to Embeddings: <a href="https://www.featureform.com/post/the-definitive-guide-to-embeddings" rel="nofollow">https://www.featureform.com/post/the-definitive-guide-to-emb...</a>
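To make the "approximate nearest neighbor operations" goal concrete, here's a minimal, brute-force sketch of the lookup such a store answers, in plain Python. This is not Embeddinghub's API (see the docs for that); it's an O(n·d) illustration of the query that an index like HNSW answers approximately in roughly logarithmic time.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(store, query_key, num):
    """Return the `num` keys whose embeddings are most similar to
    `query_key`'s. Brute force: scans every vector in the store."""
    query = store[query_key]
    scored = [(k, cosine_similarity(query, v))
              for k, v in store.items() if k != query_key]
    scored.sort(key=lambda kv: kv[1], reverse=True)
    return [k for k, _ in scored[:num]]

# Toy store: key -> embedding (made-up 3-d vectors for illustration)
store = {
    "apple": [1.0, 0.1, 0.0],
    "pear":  [0.9, 0.2, 0.1],
    "car":   [0.0, 1.0, 0.9],
}
print(nearest_neighbors(store, "apple", 1))  # → ['pear']
```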
Where can I find documentation on versioning? My first use case would be to version different embeddings and use it more like a storage backend than to search for KNN. Would it be possible to skip building the NN graph and just use it for versioned storage? We currently use Opendistro, which nicely allows pre- and post-filtering based on other document fields (other than the embedding). So I think this could never be a full replacement without figuring out how to combine it with the rest of the document structure.
Cool! Nice work! Do you have any performance numbers you could share?<p>Specifically around nearest-neighbor computation latency, plain embedding-lookup latency, and the read/write rate achieved on a single machine?
This is really great! It speaks very much to my use-case (building user embeddings and serving them both to analysts + other ML models).<p>I was wondering if there was a reasonable way to store raw data next to the embeddings such that:
1. Analysts can run queries to filter down to a space they understand (the raw data).
2. Nearest neighbors can be run on top of their selection on the embedding space.<p>Our main use case is segmentation, so giving analysts access to the raw feature space is very important.
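For what it's worth, until a store supports this natively, the two-step flow above can be done in application code by keeping the raw fields next to each embedding (field names and data below are hypothetical). A sketch: pre-filter on the raw fields, then run exact KNN inside the surviving subset.

```python
# Each record carries raw fields analysts understand, plus an embedding.
# (Hypothetical example data.)
records = [
    {"id": 1, "country": "US", "age": 34, "emb": [0.9, 0.1]},
    {"id": 2, "country": "US", "age": 29, "emb": [0.8, 0.3]},
    {"id": 3, "country": "DE", "age": 41, "emb": [0.1, 0.9]},
    {"id": 4, "country": "US", "age": 52, "emb": [0.2, 0.8]},
]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def filtered_knn(records, predicate, query, k):
    """Pre-filter on raw fields, then exact KNN on the subset."""
    candidates = [r for r in records if predicate(r)]
    candidates.sort(key=lambda r: sq_dist(r["emb"], query))
    return [r["id"] for r in candidates[:k]]

# Analysts filter to US users; KNN runs only inside that segment.
print(filtered_knn(records, lambda r: r["country"] == "US", [1.0, 0.0], 2))
# → [1, 2]
```

The tradeoff: exact KNN after a filter is fine for small segments, but an ANN index (HNSW etc.) is typically built over the whole collection, which is why combining filtering with ANN is a genuinely hard feature and worth asking for.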
Nice, are there any benchmarks?<p>Would be interesting to see how it compares to Postgres or LevelDB for read/write of exact values<p>And how it compares to Faiss/Annoy for KNN
Great work! Looks like you are using HNSWLIB. From what I understand, the HNSW graph-based approach can be memory-intensive compared to the PQ-code-based approach. FAISS supports both HNSW and PQ codes. Any plans to extend your work to support a PQ-code-based index in the future?
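To illustrate why PQ is so memory-friendly, here is a toy product quantizer in plain Python. Real PQ (e.g. FAISS's IndexPQ) trains each sub-codebook with k-means; this sketch just uses random centroids to show the encode/decode mechanics and the compression: each subvector collapses to one small centroid index, so an 8-float vector (32 bytes as float32) becomes 4 codes of 2 bits each, at the cost of approximate reconstruction. HNSW, by contrast, keeps full vectors plus graph links per node.

```python
import random

random.seed(0)
d, m = 8, 4          # vector dim, number of subquantizers
sub = d // m         # dims per subvector
k = 4                # centroids per sub-codebook -> 2 bits per code

# Toy codebooks: random centroids. In practice these are learned
# with k-means over a training sample of real vectors.
codebooks = [[[random.uniform(-1, 1) for _ in range(sub)]
              for _ in range(k)]
             for _ in range(m)]

def encode(v):
    """Replace each subvector by the index of its nearest centroid."""
    codes = []
    for i in range(m):
        part = v[i * sub:(i + 1) * sub]
        dists = [sum((x - c) ** 2 for x, c in zip(part, cent))
                 for cent in codebooks[i]]
        codes.append(dists.index(min(dists)))
    return codes

def decode(codes):
    """Approximate reconstruction: concatenate the chosen centroids."""
    return [x for i, c in enumerate(codes) for x in codebooks[i][c]]

v = [random.uniform(-1, 1) for _ in range(d)]
codes = encode(v)
print(codes)           # m small integers instead of d floats
print(len(decode(codes)))  # reconstruction has the original dimension
```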