“Relevant Search” by Doug Turnbull and John Berryman, published by Manning, is THE best book to get started with tuning search engines.<p>I’ve been a search engineer for >10 years and this is always the first book I recommend.<p><a href="https://www.manning.com/books/relevant-search" rel="nofollow">https://www.manning.com/books/relevant-search</a>
Three reference textbooks are openly available:<p>* Introduction to Information Retrieval, <a href="http://informationretrieval.org/" rel="nofollow">http://informationretrieval.org/</a><p>* Search Engines: Information Retrieval in Practice, <a href="http://www.search-engines-book.com/" rel="nofollow">http://www.search-engines-book.com/</a><p>* Entity-Oriented Search, <a href="https://eos-book.org/" rel="nofollow">https://eos-book.org/</a><p>Modern Information Retrieval is also a classic reference. It is not openly available, but some of its contents are (were?) available online. Their site seems to be down but the Internet Archive has a copy.<p>Additional resources here:<p>* <a href="https://nlp.stanford.edu/IR-book/information-retrieval.html" rel="nofollow">https://nlp.stanford.edu/IR-book/information-retrieval.html</a>
<a href="http://web.archive.org/web/20220708135205/http://grupoweb.upf.es/mir2ed/" rel="nofollow">http://web.archive.org/web/20220708135205/http://grupoweb.up...</a>
At a general audience level, "Index" is on my list to read. It covers the invention of the index up to digital search engines. <a href="https://www.nytimes.com/2022/02/09/books/review-index-history-of-dennis-duncan.html" rel="nofollow">https://www.nytimes.com/2022/02/09/books/review-index-histor...</a><p>"Introduction to Information Retrieval" is a textbook which is available online <a href="https://nlp.stanford.edu/IR-book/" rel="nofollow">https://nlp.stanford.edu/IR-book/</a> Here's a review: <a href="http://glinden.blogspot.com/2009/02/book-review-introduction-to-information.html" rel="nofollow">http://glinden.blogspot.com/2009/02/book-review-introduction...</a><p>Another textbook which IMHO is a bit lower level is "Information Retrieval: Implementing and Evaluating Search Engines". The book website is down for me right now, but you can find it on Amazon here: <a href="https://www.amazon.com/Information-Retrieval-Implementing-Evaluating-Engines/dp/0262026511" rel="nofollow">https://www.amazon.com/Information-Retrieval-Implementing-Ev...</a><p>Another commenter linked to "Relevant Search", which is great if you want to learn how to effectively use a search engine to improve relevance (as opposed to how to implement a search engine). It's old, but another book in that vein that was really helpful for me earlier in my career is Lucene in Action: <a href="https://www.amazon.com/Lucene-Action-Second-Covers-Apache/dp/1933988177/" rel="nofollow">https://www.amazon.com/Lucene-Action-Second-Covers-Apache/dp...</a>
Not a book, but this paper from 2019 covers a lot of ground and reviews the different topics extensively: <a href="https://tonellotto.github.io/publication/fntir/fntir_main.pdf" rel="nofollow">https://tonellotto.github.io/publication/fntir/fntir_main.pd...</a>
Take a look at my post “Lucene: The Good Parts”—<p><a href="https://blog.parse.ly/lucene/" rel="nofollow">https://blog.parse.ly/lucene/</a><p>The book mentioned there is Lucene in Action.<p>And then this YouTube presentation by a Lucene/Elasticsearch committer will give you a nice overview of some related algorithms—<p><a href="https://youtu.be/eQ-rXP-D80U" rel="nofollow">https://youtu.be/eQ-rXP-D80U</a>
Not a book but Hellerstein’s CS186 from 2015 starting with Lecture 17 gave me a basic understanding (I think).<p>Playlist <a href="https://youtube.com/playlist?list=PLhMnuBfGeCDPtyC9kUf_hG_QwjYzZ0Am1" rel="nofollow">https://youtube.com/playlist?list=PLhMnuBfGeCDPtyC9kUf_hG_Qw...</a><p>Also from that lecture series, the low level is always IO. One disk read tends to dwarf n^2 in-memory algorithms.<p>And IO is all about tuning caches and hardware for the specific structural relationships in the data, the way in which it is accessed, and the hardware everything runs on.<p>Good luck.
Check the reading lists of open courses on text retrieval, e.g. <a href="https://stanford.edu/class/cs276/" rel="nofollow">https://stanford.edu/class/cs276/</a>
A series of tutorials and comparisons that aims to teach the foundations of vector search:<p><a href="https://vectorsearch.dev/" rel="nofollow">https://vectorsearch.dev/</a>
It's all in the Nim programming language, but if you prefer reading code or running diffs then you might get a vague sense of (some) low level nuts & bolts from: <a href="https://github.com/c-blake/nimsearch" rel="nofollow">https://github.com/c-blake/nimsearch</a>
Is there some better alternative to Knuth-Morris-Pratt or Boyer-Moore? Both can easily be adapted to regular expression matching and as far as I know there’s no faster algorithm that doesn’t do preprocessing.
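For anyone following along, here is a minimal sketch of plain Knuth-Morris-Pratt for exact substring search. Note that it preprocesses only the pattern (the failure table), never the text. The function names `build_failure_table` and `kmp_search` are made up for illustration, and this is not the regular-expression adaptation mentioned above, just the basic algorithm.

```python
def build_failure_table(pattern: str) -> list[int]:
    """failure[i] is the length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it."""
    failure = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        # On a mismatch, fall back to the next shorter candidate prefix.
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure


def kmp_search(text: str, pattern: str) -> list[int]:
    """Return the start index of every match, including overlapping ones."""
    if not pattern:
        return []  # assumption: an empty pattern yields no matches
    failure = build_failure_table(pattern)
    matches = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        # The text index i never moves backwards; only k falls back.
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = failure[k - 1]  # keep going to find overlapping matches
    return matches


print(kmp_search("abababa", "aba"))
```

The scan is linear in the text length plus the pattern length because each fallback step undoes an earlier increment of `k`. Boyer-Moore takes the other route and skips ahead in the text, which is often faster in practice on large alphabets.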
Just use Postgres full-text search, it's good enough: <a href="http://rachbelaid.com/postgres-full-text-search-is-good-enough/" rel="nofollow">http://rachbelaid.com/postgres-full-text-search-is-good-enou...</a>