Introduction to Information Retrieval by Manning et al. is a great text on the subject: <a href="http://nlp.stanford.edu/IR-book/information-retrieval-book.html" rel="nofollow">http://nlp.stanford.edu/IR-book/information-retrieval-book.h...</a>
You can take a look at Managing Gigabytes (<a href="http://www.amazon.com/Managing-Gigabytes-Compressing-Multimedia-Information/dp/1558605703/ref=sr_1_1?ie=UTF8&s=books&qid=1272964874&sr=8-1" rel="nofollow">http://www.amazon.com/Managing-Gigabytes-Compressing-Multime...</a>)<p>It is a nice book, but might be a little bit outdated.
The basic data structure behind most text databases and search engines is an inverted index.<p>Basically, it's a map from each word to the list of documents containing that word.<p>e.g.
{"hello": [1, 2], "world": [1,3,4], ...}<p>(the numbers are document IDs)<p>So, for example, the word 'hello' occurs in documents 1 and 2, and 'world' occurs in documents 1, 3 and 4.<p>Boolean queries are also really easy with an inverted index: get the document set for each word, then take the union of the sets for an OR query or the intersection for an AND query.<p>Pretty cool, right?
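To make the idea concrete, here is a minimal sketch in Python. The helper names (`build_index`, `query_and`, `query_or`) and the four toy documents are made up for illustration; they are chosen so the index matches the 'hello'/'world' example above. Real engines also tokenize more carefully and compress the posting lists, which this sketch ignores.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():  # naive whitespace tokenizer
            index[word].add(doc_id)
    return index


def query_and(index, *words):
    """Documents containing every word: intersection of the posting sets."""
    sets = [index.get(w, set()) for w in words]
    return set.intersection(*sets) if sets else set()


def query_or(index, *words):
    """Documents containing at least one word: union of the posting sets."""
    return set().union(*(index.get(w, set()) for w in words))


# Hypothetical toy corpus, picked to reproduce the example index.
docs = {1: "hello world", 2: "hello there", 3: "world news", 4: "small world"}
index = build_index(docs)

print(sorted(index["hello"]))                      # [1, 2]
print(sorted(index["world"]))                      # [1, 3, 4]
print(sorted(query_and(index, "hello", "world")))  # [1]
print(sorted(query_or(index, "hello", "world")))   # [1, 2, 3, 4]
```

Using sets for the posting lists keeps AND and OR down to one set operation each; sorted lists of IDs are the more common on-disk layout, since they can be merged and compressed.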
The book by Manning (freely available online) has already been recommended. I would start with that.<p>In addition, there is a wealth of online video lectures that may inspire you:
<a href="http://www.datawrangling.com/hidden-video-courses-in-math-science-and-engineering" rel="nofollow">http://www.datawrangling.com/hidden-video-courses-in-math-sc...</a>
and
<a href="http://videolectures.net/mlss04_hofmann_irtm/" rel="nofollow">http://videolectures.net/mlss04_hofmann_irtm/</a>
and
<a href="http://videolectures.net/Top/Computer_Science/" rel="nofollow">http://videolectures.net/Top/Computer_Science/</a><p>As far as search engines go, it's certainly worth playing around with Lucene. It's well implemented and you'll learn a lot about what really matters when it comes to indexing and retrieval.<p>For the text processing (classification, data extraction) side, it may also be worth brushing up on your stats (a good excuse to learn R) and checking out Mahout <a href="http://lucene.apache.org/mahout/" rel="nofollow">http://lucene.apache.org/mahout/</a>
For a high level overview I'd recommend Tim Bray's On Search series: <a href="http://www.tbray.org/ongoing/When/200x/2003/07/30/OnSearchTOC" rel="nofollow">http://www.tbray.org/ongoing/When/200x/2003/07/30/OnSearchTO...</a>
I would echo the recommendations for Introduction to Information Retrieval. If you want something with the same concepts but a little less math, I also liked Search Engines: Information Retrieval in Practice by Bruce Croft et al. If you happen to be a Python programmer, Natural Language Processing with Python by Steven Bird has some great examples of text processing.
I'm literally in the library right now, working on my last homework for a Search Engines course taught by Distinguished Professor Bruce Croft. We're using his new book, Search Engines: Information Retrieval in Practice. You can find the slides that accompany the book here:<p><a href="http://www.search-engines-book.com" rel="nofollow">http://www.search-engines-book.com</a> - Slides, Data Sets<p><a href="http://www.pearsonhighered.com/croft1epreview/toc.html" rel="nofollow">http://www.pearsonhighered.com/croft1epreview/toc.html</a> - Book Table of Contents<p>The book expands on the slides and includes homework problems, some of which require using or modifying the open-source Galago Search Toolkit.<p><a href="http://www.galagosearch.org/quick-start.html" rel="nofollow">http://www.galagosearch.org/quick-start.html</a>
I've been deep into building a geocoder for the past month. While we may get rid of Solr eventually, it was a great foot in the door to information retrieval. It helps that I have a problem to solve and a deadline, so I'm motivated to read and work through these books. These three texts have been very helpful. The last book is an excellent overview of text processing and some real-world problems you may encounter writing your search engine.<p>Solr 1.4 Enterprise Search Server
<a href="http://www.amazon.com/Solr-1-4-Enterprise-Search-Server/dp/1847195881" rel="nofollow">http://www.amazon.com/Solr-1-4-Enterprise-Search-Server/dp/1...</a><p>Programming Collective Intelligence
<a href="http://www.amazon.com/Programming-Collective-Intelligence-Building-Applications/dp/0596529325" rel="nofollow">http://www.amazon.com/Programming-Collective-Intelligence-Bu...</a><p>Building Search Applications: Lucene, LingPipe, and Gate
<a href="http://www.amazon.com/Building-Search-Applications-Lucene-LingPipe/dp/0615204252" rel="nofollow">http://www.amazon.com/Building-Search-Applications-Lucene-Li...</a>
Whatever web app framework you favor, there should be plugins for Solr and Sphinx that make full-text indexing with reasonable defaults pretty easy, e.g. Thinking Sphinx for Rails. I used to use acts_as_solr (I think a lot of people use Sunspot now, and Xapian).<p>Play with a database or docs in a filesystem, and compare Solr and Sphinx results as you change parameters like stopwords, token separators, stemmers, and UTF-8 and ISO-Latin to ASCII mappings. See if you can get decent precision/recall metrics. There are quite a few degrees of freedom, depending on the database.<p><a href="http://www.computationalmedicine.org/challenge/cmcChallengeDetails.pdf" rel="nofollow">http://www.computationalmedicine.org/challenge/cmcChallengeD...</a><p><a href="http://stackoverflow.com/questions/tagged/sphinx" rel="nofollow">http://stackoverflow.com/questions/tagged/sphinx</a>
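As a rough illustration of the precision/recall experiment described above, here is a small Python sketch. It does not use Solr or Sphinx; the `tokenize`, `search` and `precision_recall` helpers, the three toy documents, and the hand-picked relevance judgments are all assumptions made up for the example. It only shows how changing one parameter (the stopword list) moves the metrics.

```python
def tokenize(text, stopwords=frozenset()):
    """Lowercase, split on whitespace, and drop any stopwords."""
    return [w for w in text.lower().split() if w not in stopwords]


def search(docs, query, stopwords=frozenset()):
    """Return IDs of documents sharing at least one token with the query."""
    terms = set(tokenize(query, stopwords))
    return {
        doc_id
        for doc_id, text in docs.items()
        if terms & set(tokenize(text, stopwords))
    }


def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical corpus and relevance judgments for the query "the cat".
docs = {1: "the cat sat", 2: "the dog barked", 3: "a cat and a dog"}
relevant = {1, 3}

no_stop = search(docs, "the cat")
with_stop = search(docs, "the cat", stopwords=frozenset({"the", "a", "and"}))

print(sorted(no_stop), precision_recall(no_stop, relevant))
print(sorted(with_stop), precision_recall(with_stop, relevant))
```

Without stopwords, "the" matches document 2 and drags precision down; with the stopword list, only the two cat documents come back. The same loop works for stemmers or token separators: change one knob, rerun the queries, and compare the numbers.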
I must cite Programming Collective Intelligence by Toby Segaran (<a href="http://oreilly.com/catalog/9780596529321" rel="nofollow">http://oreilly.com/catalog/9780596529321</a>). Although not entirely focused on search engines, it's an awesome book for anyone who wants to get their hands on some of the most useful algorithms for web apps without having to deal with the math.<p>I downloaded a torrent version, then bought the paperback version straight after.
By building a small search engine. It took about 4 months and it was definitely worth it. Some of the problems that seem simple at first glance were terribly hard (such as reliably separating out the body text of a web page), and some that I thought would be hard turned out to be relatively easy (the actual index).<p>It was a lot of fun. When I started out I was already fairly sure that I would have neither the stamina nor the funds to commercialize it, but as a learning experience it was great.