
On Hybrid Search with Qdrant

28 points | by shutty | about 2 years ago

2 comments

trengrj, about 2 years ago
I work at Weaviate; a few comments on why we implemented hybrid search [1].

- Using two separate systems for traditional BM25 and vector search and keeping them in sync is pretty difficult from an operations perspective. A combined system is much easier to manage and will have better end-to-end latency.

- For combining scores, a linear combination like this article suggests is not recommended; instead, rank fusion (https://rodgerbenham.github.io/bc17-adcs.pdf), where you care what each method ranks first rather than the absolute score, is used.

- The point of adding both search methods is to deal with what researchers term "out-of-domain data": datasets that the model producing the vectors was not trained on. Research from Google (https://arxiv.org/abs/2201.10582) suggests hybrid search with rank fusion helps in this case by around 20.4%. For "in-domain" data, the model (usually transformer-based) will outperform BM25.

- A cross-encoder [2] is a good component to add to improve relevance. It only reranks the final results, though, so if the initial search returns 100 garbage results, the cross-encoder won't be able to help.

[1] https://weaviate.io/blog/hybrid-search-explained

[2] https://www.sbert.net/examples/applications/cross-encoder/README.html
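The rank-fusion idea above can be sketched in a few lines. This is a minimal reciprocal rank fusion (RRF) implementation, not Weaviate's actual code; the document IDs are hypothetical, and k=60 is the constant commonly used in the RRF literature:

```python
# Reciprocal Rank Fusion: fuse ranked lists by rank position rather than raw
# score, so the incomparable BM25 and cosine-similarity scales never mix.
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d3", "d1", "d7"]    # hypothetical BM25 ranking
dense_hits = ["d1", "d9", "d3"]   # hypothetical vector-search ranking
fused = rrf([bm25_hits, dense_hits])
# d1 wins: it is ranked near the top by both methods.
```

Note that a document's absolute BM25 or cosine score never appears in the fused score, which is exactly the property the linked paper argues for.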
CShorten, about 2 years ago
Hey everyone, I also work at Weaviate.

Weaviate has implemented hybrid search because it helps with search performance in a few ways (zero-shot, out-of-domain, continual learning). Members of the Qdrant team are arguing against implementing hybrid search in vector databases with three main points that I believe are incorrect: 1. That there are no comparative benchmarks on hybrid search. 2. That "multi-tool" systems are generally flawed by design in favor of specialization. 3. That cross-encoder inference does not need additional processing modules; this one is also more related to the particular design differences of Weaviate and Qdrant, where the Qdrant team is again arguing that you don't need to implement the thing you implemented.

TL;DR:

1. Such benchmarks exist, as in trengrj's initial response, and we are working on them as well.

2. There are a few arguments why adding sparse search doesn't require too much extra specialization, how it is already used in filtered vector search to begin with, and why it makes sense to apply the rank fusion in the database where each scoring method happens.

3. Cross-encoder inference generally doesn't happen in the database itself, so it makes sense to use modules to process the additional ranking logic. There are several other examples of inference in search pipelines that this kind of design enables.

More detail:

1. "You don't publish comparative benchmarks"

Firstly, trengrj has responded with exactly this request.
I think it's actually better when it comes from a third party as well, since clearly Weaviate is biased in having implemented hybrid search and Qdrant is biased in not having implemented it.

However, here is a quick overview of benchmarking efforts at Weaviate so far.

The focus at Weaviate has been primarily on understanding approximate nearest neighbor (ANN) vector search, for which very thorough benchmarks have been published ablating HNSW hyperparameters such as maxConnections, efConstruction, and ef. This is done to measure recall with respect to the approximation.

ANN benchmarks: https://weaviate.io/developers/weaviate/benchmarks/ann

Podcast about this :) https://www.youtube.com/watch?v=kG3ji89AFyQ

With respect to comparative benchmarks that report IR metrics such as nDCG, hits@K, recall, precision, etc., we are beginning this using the BEIR benchmarks. The BEIR benchmarks are much more of an industry standard for reporting the performance of BM25, dense retrieval, hybrid, cross-encoders, and so on.

The Qdrant team has taken two rather arbitrary e-commerce datasets as examples. Their results also conclude by advocating for hybrid search, although differently: aggregating results to send to a cross-encoder rather than rank-fusing each result list. The key challenge with that is that cross-encoder inference is very slow (more on that in point 3).

I think there is value in benchmarking e-commerce search datasets because of the way they capture multimodality, but this isn't really much of a standard yet.
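For reference, the nDCG metric cited throughout these benchmarks can be sketched briefly. This is a simplified, binary-relevance version (the multi-level relevance issue mentioned below for TREC-COVID and SCIDOCS would use graded gain values instead of 0/1); it is an illustration, not BEIR's evaluation code:

```python
import math

# Discounted cumulative gain of the first k results: each relevant hit is
# discounted by the log of its rank position.
def dcg(gains, k):
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

# nDCG@k: DCG of the retrieved ranking, normalized by the DCG of the ideal
# (perfectly sorted) ranking, so 1.0 means a perfect ordering.
def ndcg(retrieved, relevant, k=10):
    gains = [1.0 if doc in relevant else 0.0 for doc in retrieved]
    ideal = sorted(gains, reverse=True)
    return dcg(gains, k) / dcg(ideal, k) if any(ideal) else 0.0
```

Because the discount is logarithmic in rank, burying a relevant document one position deeper costs less the further down the list it happens, which is why nDCG rewards getting the very top of the ranking right.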
Comparatively, many independent companies and researchers have reproduced the BEIR metrics.

Our findings so far support that there is no free lunch here: none of BM25, dense, or hybrid consistently outperforms the others.

Here is a quick preview of our BEIR nDCG results so far. Note these are subject to change; some have been tested with WAND scoring and others have not (* denotes BM25 scoring with WAND). Hybrid is tested with alpha = 0.5, and vector embeddings are produced with the sentence-transformers model `all-MiniLM-L6-v2`:

NFCorpus:    BM25 = 0.224, Hybrid = 0.280, Vector only = 0.265
*FiQA:       BM25 = 0.284, Hybrid = 0.428, Vector only = 0.434
SCIFACT:     BM25 = 0.678, Hybrid = 0.714, Vector only = 0.683
ArguAna:     BM25 = 0.368, Hybrid = 0.408, Vector only = 0.411
*Touche2020: BM25 = 0.351, Hybrid = 0.364, Vector only = 0.249
*Quora:      BM25 = 0.770, Hybrid = 0.867, Vector only = 0.887

These BM25 results are similar to Vespa's BM25 results:

https://blog.vespa.ai/improving-zero-shot-ranking-with-vespa-part-two/

The primary reason the scores differ is that I am not accounting for multi-level relevance, which is just due to my lack of understanding of the dataset to begin with. I will correct this when we officially publish the Weaviate BEIR benchmarks.

Vespa BM25 vs. Weaviate BM25:

NFCorpus:   Vespa = 0.313, Weaviate = 0.224
FiQA:       Vespa = 0.244, Weaviate = 0.284
SciFact:    Vespa = 0.673, Weaviate = 0.678
ArguAna:    Vespa = 0.393, Weaviate = 0.368
Touche2020: Vespa = 0.413, Weaviate = 0.351
Quora:      Vespa = 0.761, Weaviate = 0.770

This is of course very incomplete; these are only 6 of the 14 BEIR datasets. As an update for those interested in Weaviate's progress with these benchmarks: TREC-COVID and SCIDOCS really need to be updated with the multi-level relevance scores, otherwise the numbers give a bad picture of the performance.
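The alpha-weighted hybrid scoring used in the numbers above can be sketched as follows. This is a minimal illustration of score-based fusion, not Weaviate's implementation: each score list is min-max normalized so the BM25 and cosine scales become comparable, then blended by alpha (alpha = 0.5 weighting both equally, as in the benchmarks); the function names and document IDs are hypothetical:

```python
# Min-max normalize a {doc_id: score} map into [0, 1].
def normalize(scores):
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores tie
    return {d: (s - lo) / span for d, s in scores.items()}

# Blend normalized sparse and dense scores: alpha = 1.0 is pure vector
# search, alpha = 0.0 is pure BM25, 0.5 weights them equally.
def hybrid(bm25, dense, alpha=0.5):
    b, v = normalize(bm25), normalize(dense)
    docs = set(b) | set(v)  # a doc may appear in only one result list
    combined = {d: alpha * v.get(d, 0.0) + (1 - alpha) * b.get(d, 0.0)
                for d in docs}
    return sorted(combined, key=combined.get, reverse=True)

ranked = hybrid({"a": 2.0, "b": 1.0, "c": 0.0},   # hypothetical BM25 scores
                {"c": 3.0, "a": 2.0, "b": 1.0})   # hypothetical dense scores
```

Unlike rank fusion, this blend does depend on the absolute scores, which is why the normalization step matters and why the two approaches can disagree.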
FEVER, Climate-FEVER, HotpotQA, EntityDB, and MS MARCO have been vectorized but still need to be imported and then backed up in Weaviate for the sake of reproducibility.

The key point here is that there is no free lunch: hybrid helps catch the cases where BM25 works well AND the cases where vector search works well. For Touche2020, SCIFACT, and NFCorpus we get better results with hybrid. In alignment with the underlying rank fusion algorithm, there isn't a case where hybrid search is dramatically outperformed by either BM25 or vector search alone.

This is very important for vector databases, because zero-shot performance, the ability to cover 80% of the use cases, is a huge enabler in our ability to evangelize the technology. Collectively, vector databases need to illustrate the potential value of searching through all sorts of domains, from code documentation to emails, personal notes, e-commerce (as you mention), etc. The big point here is that hybrid provides another performance layer to help avoid cases where vector search fails and people are put off the technology.

So once people are interested in using vector search, we have more of a deep learning problem: continual learning of the embeddings. For example, if you are using it for code documentation search and Weaviate introduces a new feature like ref2vec, the dense model will not have a semantic embedding for it until the model is optimized with the new data. This is another enormous application of hybrid search: keyword scoring adapts to new terms faster than the deep learning models can be retrained to do so.

2. Multi-tool system

This argument completely lacks substance for our particular conversation. Vector databases already integrate inverted indexing for filtered vector search.
It makes a ton of sense to adopt these same building blocks for the sparse indexing, and it further makes a ton of sense to apply the rank fusion in the same database rather than networking ranked lists at scale. The scalability patterns overlap.

Plus, you failed to acknowledge the key point that "it's easier to manage".

Generally, this is just ugly communication. It reads like someone who is pissed off rather than someone who wants an honest discussion of the technology.

3. Cross-encoders

3A. The most important thing here is that cross-encoder inference is very slow. I don't agree at all with "In the case of document retrieval, we care more about the search result quality and time is not a huge constraint". Further, cross-encoders generally need to run on GPUs, which is expensive.

3B. The module system is used to process the logic of cross-encoder inference, whether self-hosted or via OpenAI, etc.

It is most likely that rerankers will come from OpenAI, Cohere, HuggingFace Inference Endpoints, and so on: model-as-API, generally. Or things like Metarank that host XGBoost APIs. They send predictions over network requests that you process with an external module (i.e., these predictions don't happen in the database directly).

Of course, there are more kinds of model inference we want to use in search pipelines than just cross-encoders (question answering, summarization, ...), and thus the module system handles the nuances of each respective model inference.
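The rerank-only nature of a cross-encoder, raised in both comments, can be made concrete with a small sketch. The `score` callable below stands in for a real cross-encoder (e.g. sentence-transformers' CrossEncoder, which scores (query, document) pairs); the toy token-overlap scorer and all document strings are hypothetical:

```python
# A reranker can only reorder the candidates the first stage hands it; it can
# never surface a document the initial retrieval missed. That is the "100
# garbage results" failure mode from the thread.
def rerank(query, candidates, score, top_k=10):
    ranked = sorted(candidates, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:top_k]

# Toy stand-in for cross-encoder relevance: count of shared tokens.
def overlap(query, doc):
    return len(set(query.split()) & set(doc.split()))

hits = ["vector databases store embeddings",
        "bm25 ranks by term frequency"]
best = rerank("how do vector databases work", hits, overlap, top_k=1)
```

In practice the expensive part is `score`: a cross-encoder runs a full transformer forward pass per (query, document) pair, which is why it is applied only to a small candidate set and, per the thread, typically outside the database via a module or an external API.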