Understanding the BM25 full text search algorithm

305 points by rrampage 6 months ago

8 comments

DavidPP 6 months ago
We use Typesense (https://typesense.org/) for regular search, but it now supports hybrid search. Curious if anyone has tried it yet?
hubraumhugo 6 months ago
Given the recent advances in vector-based semantic search, what's the SOTA search stack that people are using for hybrid keyword + semantic search these days?
jll29 6 months ago
Nice write-up.

A few more details/background that are harder to find: "BM25" stands for "Best Matching 25". "Best matching" because it is a formula for ranking and term weighting (the matching refers to the terms in the query versus the document), and the number 25 is simply a running number (there were 24 earlier formula variants and some later ones, but #25 turned out to work best, so it was the one that was published).

It was conceived by Stephen Robertson and Karen Spärck Jones (the latter of IDF fame) and first implemented in the former's OKAPI information retrieval (research) system. The OKAPI system was benchmarked for a number of years at the annual US NIST TREC (Text Retrieval Conference), the international "World Championship" of search engine methods. (The event is not about winning but about comparing notes and learning from each other; it is a highly recommended annual event held every November in Gaithersburg, Maryland, attended by global academic and industry teams that conduct research on improving search. See trec.nist.gov.)

Besides the "bag of words" Vector Space Model (sparse vectors of terms) and the Probabilistic Models (to which BM25 belongs), there is a surprising and still growing number of other theoretical frameworks for ranking a set of documents given a query ("Divergence from Randomness", "Statistical Language Modeling", "Learning to Rank", "Quantum Information Retrieval", "Neural Ranking" etc.). Conferences like ICTIR and SIGIR still occasionally publish entirely new paradigms for search. Note that the "Statistical Language Modeling" paradigm is not about the Large Language Models that are in vogue now (those are covered under the "Neural Retrieval" umbrella), and that searching for "Quantum IR" is not going to get you to a tutorial about Quantum Information Retrieval but to methods of infrared spectroscopy or a company of the same name that produces cement; such are the intricacies of search technology, even in the 21st century.

If you want to play with BM25 and compare it with some of the alternatives, I recommend the research platform Terrier, an open-source search engine developed at the University of Glasgow (today, perhaps the epicenter of search research).

BM25 is over a quarter century old, but has proven to be a hard baseline to beat (it is still often used as a reference point for comparing new methods against), and a more recent variant, BM25F, can deal with multiple fields and hypertext (e.g. title, body of documents, hyperlinks).

The recommended paper to read is: Spärck Jones, K.; Walker, S.; Robertson, S. E. (2000). "A probabilistic model of information retrieval: Development and comparative experiments: Part 1". Information Processing & Management 36(6): 779–808, and its successor, Part 2. (Sadly they are not open access.)
jankovicsandras 6 months ago
Shameless plug:

https://github.com/jankovicsandras/plpgsql_bm25

https://github.com/jankovicsandras/bm25opt
MPSimmons 6 months ago
Does anyone know if the average document length used in the document length normalization is the median? It seems like it would need to be in order to properly down-weight excessively long documents; otherwise the excessively long documents would skew the average, right?
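For reference, in the usual formulation of BM25 the average document length (often written avgdl) is the arithmetic mean of the document lengths in the collection, not the median. The sketch below is only an illustration of the concern raised above: the corpus lengths are made up, the helper names `length_norm` and `bm25_tf` are hypothetical, and `k1=1.2`, `b=0.75` are commonly quoted defaults rather than values taken from the article.

```python
from statistics import mean, median


def length_norm(doc_len, avgdl, b=0.75):
    """BM25 length-normalization factor: 1 - b + b * (doc_len / avgdl)."""
    return 1.0 - b + b * (doc_len / avgdl)


def bm25_tf(tf, doc_len, avgdl, k1=1.2, b=0.75):
    """Saturating term-frequency part of BM25 for one term in one document."""
    return tf * (k1 + 1.0) / (tf + k1 * length_norm(doc_len, avgdl, b))


# Hypothetical corpus: four short documents and one very long outlier.
doc_lengths = [100, 120, 90, 110, 5000]

avg_mean = mean(doc_lengths)      # what standard BM25 uses
avg_median = median(doc_lengths)  # the alternative the question raises

# Compare how the long document is scored under each notion of "average".
for label, avgdl in (("mean", avg_mean), ("median", avg_median)):
    print(label, avgdl, round(bm25_tf(3, 5000, avgdl), 4))
```

With the mean, the outlier pulls avgdl upward, so the long document is penalized less than it would be with the median, and typical short documents receive a normalization factor below 1. Whether that matters in practice depends on the corpus; lowering or raising `b` is the usual knob for controlling how strongly length is penalized.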
sidcool 6 months ago
Good article. I am genuinely interested in learning how to think about problems in such a mathematical form, and how to test the result. Any resources?
tselvaraj 6 months ago
Hybrid search addresses the long-standing challenge of search result relevance. We can use rank fusion between keyword and vector results to create a hybrid search that works in most scenarios.
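As an illustration of the rank fusion idea, here is a minimal sketch of reciprocal rank fusion (RRF), one common fusion method. The comment does not say which method is meant, so RRF is an assumption here, as are the document ids and the function name; `k=60` is the constant commonly used with RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document scores the sum of 1 / (k + rank) over the lists
    it appears in, with ranks starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


keyword_hits = ["doc_a", "doc_b", "doc_c"]  # hypothetical BM25 ranking
vector_hits = ["doc_c", "doc_a", "doc_d"]   # hypothetical embedding ranking

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

RRF only looks at rank positions, so it avoids having to put BM25 scores and vector similarities on a comparable scale; documents ranked well by both retrievers rise to the top.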
RA_Fisher 6 months ago
BM25 is an ancient algo developed in the 1970s. It's basically a crappy statistical model and statisticians can do far better today. Search is strictly dominated by learning (that yes, can use search as an input). Not many folks realize that yet, and/or are incentivized to keep the old tech going as long as possible, but market pressures will change that.