Let me half-hijack this to ask a related question.

I'm building a RAG system for my personal use. I have a lot of notes on various topics that I've compiled over the years, scattered across many text files (and org nodes). I want to be able to ask questions in natural language and have the system query my notes and give me an answer.

The approach I'm going for is to store those notes in a vector DB. When I ask a question, a similarity search is run and, say, the top 5 chunks are sent to GPT along with my query. GPT then comes back with an answer (rough sketch of the pipeline below).

I can build something like this, but I'm struggling to figure out metrics for how *good* the system is. There are many variables (e.g. how much content goes into each chunk, how much the chunks overlap, how many chunks to send to GPT, and many more). I'd like to tweak them, but I also want some objective way to compare different setups. Right now all I do is ask a question, look at the answer, and subjectively gauge whether it did a good job.

Any tips on how people measure the performance/effectiveness of these kinds of systems?
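
For concreteness, here's roughly the shape of the pipeline I have in mind. This is a minimal sketch, assuming chromadb for the vector store and the openai Python client; the collection name, model name, and chunking are placeholders, not decisions I've settled on:

```python
# Rough sketch of the pipeline described above. Assumes the `chromadb` and
# `openai` packages are installed and OPENAI_API_KEY is set.
from openai import OpenAI
import chromadb

chroma = chromadb.PersistentClient(path="./notes_index")
collection = chroma.get_or_create_collection("notes")  # uses Chroma's default embedder

def index_notes(chunks: list[str]) -> None:
    # `chunks` are pre-split pieces of my text/org files; how big each chunk
    # should be is exactly one of the knobs I don't know how to evaluate.
    collection.add(documents=chunks, ids=[f"chunk-{i}" for i in range(len(chunks))])

def ask(question: str, top_k: int = 5) -> str:
    # Retrieve the top_k most similar chunks and hand them to GPT with the question.
    hits = collection.query(query_texts=[question], n_results=top_k)
    context = "\n---\n".join(hits["documents"][0])
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Answer using only the provided notes. Say so if they don't contain the answer."},
            {"role": "user", "content": f"Notes:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```

Things like `top_k`, chunk size/overlap, and the prompt are exactly the variables I'd like to sweep, which is why I want an objective score to compare runs rather than eyeballing individual answers.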