
Vector Search with OpenAI Embeddings: Lucene Is All You Need

92 points by kwindla over 1 year ago

15 comments

moonchrome over 1 year ago
I'd say Postgres + pgvector is even simpler if you're doing small-scale document search (e.g. internal knowledge bases, documentation sources, codebase indexing, etc.).

pgvector is even supported out of the box on Azure and AWS RDS.

Just spin up a Docker container [1], add a vector column to your table, and you're ready for embedding search.

[1] https://hub.docker.com/r/ankane/pgvector

If you're starting out with a prototype, do yourself a favour and steer clear of the chromadb examples with langchain. In fact, steer clear of langchain in general :) Just go for the OpenAI API and PostgreSQL + pgvector. You'll have to write some boilerplate, but the stuff in langchain is just terrible; you'll have to rewrite it and do the boilerplate at some point anyway, and this stack is super simple to deploy.
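To make that concrete, here is a minimal sketch of the stack moonchrome describes, assuming the pgvector extension is installed and the current openai Python client; the table name, model choice, and connection string are illustrative, not from the comment:

```python
# Minimal OpenAI + pgvector sketch (assumes pgvector is installed;
# names are illustrative).
import psycopg2
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> list[float]:
    # text-embedding-3-small returns 1536-dimensional vectors
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding

conn = psycopg2.connect("dbname=docs")
cur = conn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id serial PRIMARY KEY,
        body text,
        embedding vector(1536)
    )
""")

# Index a document; pgvector accepts the '[x, y, ...]' string format,
# so str() on a Python list works without a custom adapter
body = "Postgres covers small-scale embedding search out of the box."
cur.execute(
    "INSERT INTO documents (body, embedding) VALUES (%s, %s::vector)",
    (body, str(embed(body))),
)
conn.commit()

# Query: <=> is pgvector's cosine-distance operator
cur.execute(
    "SELECT body FROM documents ORDER BY embedding <=> %s::vector LIMIT 5",
    (str(embed("what do I need for document search?")),),
)
print(cur.fetchall())
```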
jkb79 over 1 year ago
It's an opinionated blog post published on arXiv, masquerading as research.

IMHO, it's a gigantic self-own and doesn't promote Lucene in a good way. For example, it demonstrates getting only 10 QPS out of a system with 1 TB of memory and 96 vCPUs (after 4 warmups).

The HNSW implementation in Lucene is fair, and within the same order of magnitude as others. But to get comparable performance, you must merge all immutable segments into a single segment, which all Lucene-oriented benchmarks do, but which is not that realistic for many production workloads where docs are updated/added in near real-time.
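For readers who want to reproduce that single-segment condition, here is a hedged sketch assuming Lucene is deployed behind Elasticsearch or OpenSearch (the index name is illustrative); standalone Lucene would call IndexWriter.forceMerge(1) instead:

```python
# Force-merge an index down to one segment, the setup Lucene-oriented
# HNSW benchmarks use before measuring query throughput.
import requests

ES = "http://localhost:9200"
INDEX = "passages"  # illustrative index name

resp = requests.post(
    f"{ES}/{INDEX}/_forcemerge",
    params={"max_num_segments": 1},
    timeout=3600,  # merging a large index can take a long time
)
resp.raise_for_status()
print(resp.json())
```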
Version467 over 1 year ago
I gotta be honest, I find it almost a little disrespectful that everyone started naming their shit "x is all you need", even for very mundane stuff.

"Attention Is All You Need" was a breakthrough paper. It fundamentally changed the ML landscape and got us past a huge roadblock with RNNs.

If you seriously think you have something similarly impactful on your hands, then sure, go ahead with that name. But there have been a bunch of papers where I found it distasteful. At best it's just not funny. But this isn't even really much of a paper. I've seen blog posts with more substance. Hell, even YouTube videos.

I don't know, I guess I just don't really get the joke.
dmezzetti over 1 year ago
In terms of "All You Need" for vector search, ANN Benchmarks (https://ann-benchmarks.com/) is a good site to review when deciding what you need. As with anything complex, there often isn't a universal solution.

txtai (https://github.com/neuml/txtai) can build indexes with Faiss, Hnswlib and Annoy. All 3 libraries have been around at least 4 years and are mature. txtai also supports storing metadata in SQLite and DuckDB, and the next release will support any JSON-capable database supported by SQLAlchemy (Postgres, MariaDB/MySQL, etc.).
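A minimal txtai sketch showing the backend switch the comment describes (the embedding model choice is illustrative):

```python
# txtai with an explicit ANN backend; "backend" accepts "faiss"
# (default), "hnsw" (Hnswlib), or "annoy".
from txtai.embeddings import Embeddings

embeddings = Embeddings({
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    "backend": "faiss",
    "content": True,  # store text/metadata in SQLite alongside the index
})

data = [
    "Lucene adds HNSW-based vector search",
    "ANN Benchmarks compares approximate nearest neighbor libraries",
]
embeddings.index([(uid, text, None) for uid, text in enumerate(data)])

print(embeddings.search("vector search in Lucene", 1))
```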
ftkftk over 1 year ago
I think this depends entirely on scale and performance metrics. For many smaller use cases, using Lucene (or Postgres, or Elasticsearch, or whatever else you already have running in your stack) is perfectly adequate, as this paper shows. But as soon as you add a large dataset or high index/search volume, you are likely better served by an actual vector datastore. The paper even acknowledges slow indexing performance and a low 9.8 queries per second on decent hardware. Will it perform fine for your couple hundred pages of internal wiki? Sure. But at scale, I think your time is likely better spent learning to deploy and manage a new piece of tech in your stack than figuring out how to work around these significant limitations.
TuringNYC over 1 year ago
The article says:

"We provide a reproducible, end-to-end demonstration of vector search with OpenAI embeddings using Lucene on the popular MS MARCO passage ranking test collection... This suggests that, from a simple cost-benefit analysis, there does not appear to be a compelling reason to introduce a dedicated vector store into a modern 'AI stack' for search, since such applications have already received substantial investments in existing, widely deployed infrastructure."

Curious why they stop there: why even use OpenAI embeddings and not, say, LLaMA embeddings, and create a truly open stack?
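For illustration, here is a sketch of what a fully open embedding step could look like; a local sentence-transformers model stands in for the LLaMA embeddings the commenter mentions (the model choice is an assumption, not from the thread):

```python
# Local, fully open embedding step: no hosted API involved.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-base-v2")  # runs locally

# e5 models expect "query: " / "passage: " prefixes
docs = ["passage: Lucene supports HNSW vector indexes",
        "passage: pgvector adds a vector column type to Postgres"]
query = "query: which systems support vector search?"

doc_vecs = model.encode(docs, normalize_embeddings=True)
query_vec = model.encode(query, normalize_embeddings=True)

# Cosine similarity reduces to a dot product on normalized vectors
scores = doc_vecs @ query_vec
print(scores)
```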
tomhamer over 1 year ago
Marqo lets you use state-of-the-art e5 embeddings (which are significantly more performant in retrieval than the OpenAI embeddings), and will handle the embedding generation and retrieval on Lucene indexes: https://www.marqo.ai/

It's also available open source: https://github.com/marqo-ai/marqo
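A minimal sketch against the marqo Python client, assuming a local Marqo server (the index name is illustrative); Marqo handles the embedding generation server-side:

```python
import marqo

mq = marqo.Client(url="http://localhost:8882")
mq.create_index("wiki", model="hf/e5-base-v2")  # e5 embeddings

mq.index("wiki").add_documents(
    [{"title": "Lucene", "body": "Lucene now ships HNSW vector search"}],
    tensor_fields=["body"],  # fields to embed as vectors
)

print(mq.index("wiki").search("which engine has vector search?"))
```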
bob1029 over 1 year ago
Every week I feel like we get a few papers closer to "SQLite is all you need".

A voice in my head seems adamant that the solution to this whole space of problems is neatly managed by one clever schema and minimal computational resources. It has only been growing louder and more confident in this over time.
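In that spirit, a hedged sketch of the brute-force version: embeddings stored as BLOBs in SQLite and scanned with NumPy (schema and helper names are illustrative). At small scale, an exact scan like this sidesteps ANN index tuning entirely:

```python
import sqlite3
import numpy as np

db = sqlite3.connect("vectors.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS docs (id INTEGER PRIMARY KEY, body TEXT, emb BLOB)"
)

def put(body: str, vec: np.ndarray) -> None:
    # store the embedding as raw float32 bytes
    db.execute("INSERT INTO docs (body, emb) VALUES (?, ?)",
               (body, vec.astype(np.float32).tobytes()))

def search(query_vec: np.ndarray, k: int = 5):
    rows = db.execute("SELECT body, emb FROM docs").fetchall()
    bodies = [b for b, _ in rows]
    mat = np.stack([np.frombuffer(e, dtype=np.float32) for _, e in rows])
    # exact cosine similarity against every stored vector
    sims = mat @ query_vec / (np.linalg.norm(mat, axis=1) * np.linalg.norm(query_vec))
    top = np.argsort(-sims)[:k]
    return [(bodies[i], float(sims[i])) for i in top]
```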
runeblaze over 1 year ago
From this preprint, I think I am convinced that Lucene works well for embeddings + retrieval tasks. I hope the paper adds a direct comparison against Pinecone, Chroma, etc. With enough budget you could probably do a user study too.

Finer points:

1. I remember FAISS is heavily accelerated on GPUs. How does Lucene compare there?

2. "we're not convinced that enterprises will make the (single, large) leap from an existing solution to a fully managed service" → Fair point, but not everyone uses Lucene. It feels odd that this "existing solution" (Lucene) is assumed to be already adopted.
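On point 1, a minimal FAISS sketch of the GPU path being asked about (requires the faiss-gpu build; dimensions and data are illustrative):

```python
import numpy as np
import faiss

d = 768
xb = np.random.rand(100_000, d).astype("float32")  # database vectors
xq = np.random.rand(10, d).astype("float32")       # query vectors

cpu_index = faiss.IndexFlatIP(d)                   # exact inner-product search
res = faiss.StandardGpuResources()
gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index)  # move index to GPU 0

gpu_index.add(xb)
scores, ids = gpu_index.search(xq, 10)             # batched k-NN on the GPU
print(ids[0])
```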
kordlessagain over 1 year ago
FeatureBase is all you need: https://github.com/FeatureBaseDB/DoctorGPT

We'll be demoing an embedding service that uses Instructor Large/XL embeddings + GPT-4 keyterm extraction this coming week.
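For reference, a minimal sketch of generating Instructor embeddings with the InstructorEmbedding package (the instruction string is illustrative, not from the linked project):

```python
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR("hkunlp/instructor-large")

# Instructor pairs each text with a task instruction
pairs = [["Represent the medical document for retrieval:",
          "Patient presents with persistent cough and fever."]]
embeddings = model.encode(pairs)
print(embeddings.shape)
```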
catlover76 over 1 year ago
Postgres is also a viable vector store.
idosh over 1 year ago
We're using Redis for vector search. It's pretty rad in terms of performance and other capabilities.
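A hedged sketch of what that looks like with redis-py and the RediSearch module (index name, field names, and dimensions are illustrative):

```python
import numpy as np
import redis
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.query import Query

r = redis.Redis()

# HNSW vector field with cosine distance over 768-dim float32 vectors
r.ft("docs").create_index([
    TextField("body"),
    VectorField("emb", "HNSW", {"TYPE": "FLOAT32", "DIM": 768,
                                "DISTANCE_METRIC": "COSINE"}),
])

vec = np.random.rand(768).astype(np.float32)
r.hset("doc:1", mapping={"body": "hello vectors", "emb": vec.tobytes()})

# KNN query: top 3 nearest neighbors to $v
q = Query("*=>[KNN 3 @emb $v AS score]").sort_by("score").dialect(2)
print(r.ft("docs").search(q, query_params={"v": vec.tobytes()}))
```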
cpill over 1 year ago
Umm, how is it going to scale? How do you handle millions of vectors per client, and multiple clients? Vector stores, like any DB, exist to simplify managing data at scale.
acedTrex over 1 year ago
I never read papers like this, so excuse my ignorance, but are sentences like this the norm?

"We had to incorporate logic for error handling in our code, given the high-volume nature of our API calls"

This just seems like an asinine thing to add to a technical paper. "We had to handle errors..."
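For what it's worth, the quoted sentence presumably reduces to something like retry-with-backoff around rate-limited embedding calls; a generic sketch under that assumption, not the paper's actual code:

```python
import random
import time
from openai import OpenAI, RateLimitError, APIError

client = OpenAI()

def embed_with_retry(text: str, max_retries: int = 5) -> list[float]:
    for attempt in range(max_retries):
        try:
            resp = client.embeddings.create(
                model="text-embedding-3-small", input=text)
            return resp.data[0].embedding
        except (RateLimitError, APIError):
            # exponential backoff with jitter before retrying
            time.sleep(2 ** attempt + random.random())
    raise RuntimeError("embedding request failed after retries")
```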
m1117 over 1 year ago
I think Lucene might be using Pinecone in the backend or something.