Hi HN. Peter here. As a machine learning engineer, I mostly think in terms of feature vectors, embeddings, and matrices. One of the most useful byproducts of deep neural networks is embeddings, because they allow us to represent high-dimensional data as lower-dimensional latent vectors. These feature vectors can be used for downstream applications like similarity search, recommendation systems, and near-duplicate detection.

As an ML engineer, I was frustrated by the lack of a datastore in which vectors are first-class citizens. As a result, most ML engineers, including myself, end up using awkward workarounds to store vectors: packing them into arrays in SQL/NoSQL databases, stringifying them and storing them as text in in-memory caches such as Redis, etc. Furthermore, these systems don't allow for vector-based query operations such as nearest-neighbor search. Consequently, engineers have to deploy additional approximate nearest-neighbor search systems such as Facebook's FAISS or Spotify's ANNOY. These systems, while nifty and fast, are difficult to install and costly to maintain.
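To make the workaround concrete, here is a minimal sketch of the stringify-into-Redis pattern described above, using numpy and redis-py (the key scheme and dimensions are my own illustration):

```python
import numpy as np
import redis

r = redis.Redis(host="localhost", port=6379)

def put_vector(key: str, vec: np.ndarray) -> None:
    # Serialize the float32 vector to raw bytes and store it as a Redis string.
    r.set(key, vec.astype(np.float32).tobytes())

def get_vector(key: str, dim: int) -> np.ndarray:
    # Read the bytes back and reinterpret them as a float32 vector.
    raw = r.get(key)
    return np.frombuffer(raw, dtype=np.float32, count=dim)

put_vector("embedding:item:42", np.random.rand(768))
print(get_vector("embedding:item:42", dim=768).shape)  # (768,)
```

Note that Redis has no idea these bytes are vectors; any nearest-neighbor query still means pulling everything back to the client and scanning it yourself.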
To address these issues, I built NNext, a managed vector datastore in which vectors are first-class citizens. NNext allows you to store vectors along with arbitrary JSON-blob metadata. Furthermore, NNext comes with fast approximate nearest-neighbor (ANN) search built in.

I would love to get feedback on your experience as a Data Scientist or ML engineer storing feature vectors and working with ANN systems. Please shoot me an email at [redacted].
I can't wait to see the vector DBs we're going to have 5 years from now when everyone needs to serve their embeddings. It's clearly early but frothy right now.

Also check out this similar co I ran into: https://www.pinecone.io/ (the CEO, EL, has some classic sketching and feature-hashing papers).
Every time I've needed something like this (3 times in my career, though none in the past 5 years) I've ended up having to implement it from scratch.

Depending on the topology of the data, different metrics are used and you can make different performance tradeoffs, and differing business use cases make it hard to find a one-size-fits-all. As I'm sure you're aware, there are dozens of fast ANN approaches, and the code implementing the ANN is often the tightest loop, making it hard to be both pluggable and performant.

TileDB is also quite interesting.

Best of luck though. I agree with you there's a "missing product" to be invented; I'm not sure this is exactly it, but I don't think I know any better.
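To illustrate the metric point (my own sketch, not from the comment): the same candidate vectors can rank in opposite order under cosine versus Euclidean distance, which is one reason a one-size-fits-all index is hard.

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    # 1 - cosine similarity; ignores magnitude, compares direction only.
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Straight-line distance; sensitive to magnitude.
    return float(np.linalg.norm(a - b))

a = np.array([1.0, 0.0])
b = np.array([10.0, 0.1])  # same direction as a, much larger magnitude
c = np.array([0.0, 1.0])   # same magnitude as a, orthogonal direction

print(cosine_distance(a, b), cosine_distance(a, c))        # ~0.00005 vs 1.0 -> b wins
print(euclidean_distance(a, b), euclidean_distance(a, c))  # ~9.0 vs ~1.41 -> c wins
```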
I love this idea, but there's no way a managed service like this is going to work, for two reasons: performance and compliance.

Compliance is pretty obvious: even if you don't store the feature dictionaries and only the vectors, that's a hard conversation to have with the compliance team. I GET IT, without the feature dictionaries the vectors are useless, I KNOW this is how it works from a technical point of view, but the compliance team still won't sign off on it. That's just the way of the world.

And much, much more important is performance. Uncompressed high-dimensional features are huge, and even with run-length encoding or sparse vector storage in the protocol plus some lossless compression, I have trouble keeping the GPU fed from disk, let alone over the network; it's going to be multiple orders of magnitude too slow. If the claimed benefit is not streaming but fast vector similarity, keep in mind I can do cosine similarity on literally millions of vectors a second on a single CPU core using vanilla numpy. This was fast enough for me to implement realtime face-recognition vector search for Dubai airport, so pretty high-scale operations.

I'd love a self-hosted version of this, optimised for I/O throughput to the GPU. That would be great.
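For reference, a minimal sketch of the brute-force vanilla-numpy approach the commenter is describing (array sizes and dimensions are illustrative):

```python
import numpy as np

# 1M database vectors, 128-dim, pre-normalized so cosine similarity is a dot product.
db = np.random.rand(1_000_000, 128).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)

def top_k(query: np.ndarray, k: int = 10) -> np.ndarray:
    q = query / np.linalg.norm(query)
    sims = db @ q                       # one matrix-vector product over all 1M vectors
    # argpartition finds the k best in O(n) without a full sort.
    idx = np.argpartition(-sims, k)[:k]
    return idx[np.argsort(-sims[idx])]  # sort only the k winners

print(top_k(np.random.rand(128).astype(np.float32)))
```

The scan is a single memory-bandwidth-bound matvec, which is why exact search over millions of vectors can be this fast before any ANN index enters the picture.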
Not on topic for the actual product... but *please* remove that terrible scrolling behaviour. I hate when websites do this; it makes the user experience (imo) much worse to just "jump" sections if I try to scroll a tiny bit.
What I've done for most of my research projects is just pickle a PyTorch dataset object that contains all my embeddings. The .pkl file can then just be uploaded anywhere and becomes plug-and-play with any Torch model.

What advantages would this bring for a user like me? I guess it might make more sense for people working closer to production?
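A minimal sketch of that workflow (file name, shapes, and the use of TensorDataset are my own placeholders):

```python
import pickle
import torch
from torch.utils.data import TensorDataset

# Wrap precomputed embeddings (and their ids) in a dataset object.
embeddings = torch.randn(10_000, 768)
ids = torch.arange(10_000)
dataset = TensorDataset(ids, embeddings)

with open("embeddings.pkl", "wb") as f:
    pickle.dump(dataset, f)

# Later, or on another machine: load and iterate like any Torch dataset.
with open("embeddings.pkl", "rb") as f:
    dataset = pickle.load(f)
print(dataset[0])  # (tensor(0), tensor of shape (768,))
```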
Why not use an open-source vector search engine/database like Milvus? https://github.com/milvus-io/milvus
It's probably one of the most popular solutions now.
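For flavor, a rough sketch of what using it looks like with the pymilvus client (exact API details vary by Milvus version, so treat this as an assumption rather than a verified snippet):

```python
import numpy as np
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")

# Create a collection of 768-dim vectors.
client.create_collection(collection_name="embeddings", dimension=768)

# Insert vectors with ids and arbitrary metadata fields.
rows = [{"id": i, "vector": np.random.rand(768).tolist(), "label": "demo"}
        for i in range(100)]
client.insert(collection_name="embeddings", data=rows)

# Approximate nearest-neighbor search for the 5 closest vectors.
hits = client.search(collection_name="embeddings",
                     data=[np.random.rand(768).tolist()],
                     limit=5)
print(hits)
```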
Milvus is really easy to use, and it's OSS.

https://github.com/milvus-io/milvus