Ask HN: How did you learn about search engines/text processing?

46 points by yr, about 15 years ago

Any good videos/books/code?

14 comments

tdmackey, about 15 years ago

Introduction to Information Retrieval by Manning et al. is a great text on the subject: http://nlp.stanford.edu/IR-book/information-retrieval-book.html

dejv, about 15 years ago

You can take a look at Managing Gigabytes (http://www.amazon.com/Managing-Gigabytes-Compressing-Multimedia-Information/dp/1558605703/ref=sr_1_1?ie=UTF8&s=books&qid=1272964874&sr=8-1). It is a nice book, but might be a little bit outdated.

dhotson, about 15 years ago

The basic data structure behind most text databases and search engines is an inverted index. Basically it's a map of words to a list of documents containing that word.

e.g. {"hello": [1, 2], "world": [1, 3, 4], ...} (the numbers are document IDs)

So for example, the word 'hello' occurs in documents 1 and 2. 'world' occurs in documents 1, 3 and 4.

Doing boolean queries is also really easy with an inverted index. You basically get the document set for each word and then do a union on the sets for an OR query, or an intersection to do an AND query.

Pretty cool right?
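A minimal Python sketch of the inverted index and boolean queries described above. The function names (`build_index`, `query_or`, `query_and`) and the sample documents are made up for illustration, and tokenizing by lowercasing and splitting on whitespace is a simplifying assumption; real engines do much more here.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenization (assumption): lowercase, split on whitespace.
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def query_or(index, *words):
    """OR query: union of the document sets of all the words."""
    result = set()
    for word in words:
        result |= index.get(word, set())
    return result


def query_and(index, *words):
    """AND query: intersection of the document sets of all the words."""
    sets = [index.get(word, set()) for word in words]
    return set.intersection(*sets) if sets else set()


# Hypothetical documents chosen to reproduce the example mapping above.
docs = {
    1: "hello world",
    2: "hello there",
    3: "world news",
    4: "small world",
}
index = build_index(docs)

print(sorted(index["hello"]))                      # [1, 2]
print(sorted(index["world"]))                      # [1, 3, 4]
print(sorted(query_or(index, "hello", "world")))   # [1, 2, 3, 4]
print(sorted(query_and(index, "hello", "world")))  # [1]
```

Using sets rather than lists for the postings keeps union and intersection one-liners; production indexes typically store sorted lists instead so they can be compressed and merged efficiently.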

grrrr, about 15 years ago

The book by Manning (freely available online) has already been recommended. I would start with this.

In addition there is a wealth of online video lectures that may inspire you: http://www.datawrangling.com/hidden-video-courses-in-math-science-and-engineering and http://videolectures.net/mlss04_hofmann_irtm/ and http://videolectures.net/Top/Computer_Science/

Insofar as search engines go, it's certainly worth playing around with Lucene. It's well implemented and you'll learn a lot of what really matters when it comes to indexing and retrieval.

For the text processing (classification, data extraction) side, it may also be worth brushing up on your stats (a good excuse to learn R) and checking out Mahout: http://lucene.apache.org/mahout/

uggedal, about 15 years ago

For a high-level overview I'd recommend Tim Bray's On Search series: http://www.tbray.org/ongoing/When/200x/2003/07/30/OnSearchTOC

rmc00, about 15 years ago

I would echo the recommendations for Introduction to Information Retrieval. If you want something with the same concepts but a little less math, I liked Search Engines: Information Retrieval in Practice by Bruce Croft et al. as well. If you happen to be a Python programmer, Natural Language Processing with Python by Steven Bird has some great examples of text processing.

vlad, about 15 years ago

I'm literally in the library right now, working on my last homework for a Search Engines course taught by Distinguished Professor Bruce Croft. We're using his new book, Search Engines: Information Retrieval in Practice. You can find the slides that accompany the book here:

http://www.search-engines-book.com - Slides, Data Sets

http://www.pearsonhighered.com/croft1epreview/toc.html - Book Table of Contents

The book expands on the slides and includes homework problems, some requiring the use or modification of the open-source Galago Search Toolkit.

http://www.galagosearch.org/quick-start.html

ghotli, about 15 years ago

I've been deep into building a geocoder the past month. While we may get rid of Solr eventually, it was a great foot in the door to information retrieval. It helps that I have a problem to solve and a deadline, so I'm motivated to read and work through these books. These three texts have been very helpful. The last book is an excellent overview of text processing and some real-world problems you may encounter writing your search engine.

Solr 1.4 Enterprise Search Server: http://www.amazon.com/Solr-1-4-Enterprise-Search-Server/dp/1847195881

Programming Collective Intelligence: http://www.amazon.com/Programming-Collective-Intelligence-Building-Applications/dp/0596529325

Building Search Applications: Lucene, LingPipe, and Gate: http://www.amazon.com/Building-Search-Applications-Lucene-LingPipe/dp/0615204252

gtani, about 15 years ago

Whatever web app framework you favor, there should be plugins for Solr and Sphinx that make full-text indexing with reasonable defaults pretty easy, e.g. Thinking Sphinx for Rails. I used to use acts_as_solr (I think a lot of people use Sunspot now, and Xapian).

Play with a database or docs in a filesystem, do deltas of Solr and Sphinx, changing parameters like stopwords, token separators, stemmers, UTF-8 and ISO-Latin to ASCII mappings. See if you can get decent precision/recall metrics. There are quite a few degrees of freedom, depending on the database.

http://www.computationalmedicine.org/challenge/cmcChallengeDetails.pdf

http://stackoverflow.com/questions/tagged/sphinx
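For the precision/recall metrics mentioned above, a rough Python sketch of how they could be computed for a single query. The function name `precision_recall` and the document ids are hypothetical; it assumes you have a hand-labeled set of relevant documents to compare the engine's results against.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids the search engine returned
    relevant:  ids a human judged relevant (assumed to be available)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Precision: fraction of returned documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: fraction of relevant documents that were returned.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result: 4 documents returned, 2 of the 3 relevant ones found.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[1, 3, 5])
print(p)  # 0.5
```

Rerunning the same labeled queries after each change to stopwords or stemming shows whether the change helped precision, recall, or neither.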

probably, about 15 years ago

Text Processing in Python: http://gnosis.cx/TPiP/

gregschlom, about 15 years ago

I must cite Programming Collective Intelligence by Toby Segaran (http://oreilly.com/catalog/9780596529321). Although not entirely focused on search engines, it's an awesome book for anyone who wants to get their hands on some of the most useful algorithms for web apps, without having to deal with the math.

I downloaded a torrent version, then bought the paperback version straight after.

DrJokepu, about 15 years ago

Search and Text Processing course at university. Unfortunately I can't find anything related amongst MIT's online course materials.

jacquesm, about 15 years ago

By building a small search engine. It took about 4 months and it was definitely worth it. Some of the problems that seem simple at first glance were terribly hard (such as reliably separating out the body text of a web page), and some that I thought would be hard turned out to be relatively easy (the actual index).

It was a lot of fun. Even when I started out I was already fairly sure that I would have neither the stamina nor the funds to commercialize it, but as a learning experience it was great.

keefe, about 15 years ago

Downloading Lucene is a nice place to start if you're a Java person.