If you can do it in GPU memory, do it in GPU memory.

If it takes quantization + buying a 1 TB RAM server ($4k of RAM + parts), do that in memory with the raw tensors (see the sketch at the end) and shed a small tear -- partly for the cost, partly out of joy for all the pain you are saving yourself, your team, and everyone around you.

If you need more, then tread lightly and extremely carefully. Very few mega LLM pretraining datasets are even on this order of magnitude, though some reach a few terabytes IIRC. If you are exceeding this, your business use case is likely specialized indeed.

This message brought to you by the "cost reduction by not adding dumb complexity" group. I try to maintain a reputation for aggressively fighting unnecessary complexity, which IMO is the true cost measure of any system.
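To make "raw tensors in memory" concrete, here is a minimal Python sketch; the file name, shape, and int8 quantization are placeholder assumptions, not a prescription:

    import numpy as np

    # Hypothetical file/shape -- substitute your own quantized dump.
    PATH = "embeddings_int8.bin"
    N_VECTORS, DIM = 1_000_000_000, 1024  # ~1 TB at int8

    # Read-only memory map: the OS pages the file into RAM on demand,
    # and on a 1 TB box the whole thing just stays resident.
    embeddings = np.memmap(PATH, dtype=np.int8, mode="r",
                           shape=(N_VECTORS, DIM))

    # Lookups are plain array indexing -- no database, no RPC, no cache tier.
    batch = embeddings[[42, 7, 123_456]]

The entire "serving stack" is an array index, which is the whole point.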