TE
TechEcho
Home24h TopNewestBestAskShowJobs
GitHubTwitter
Home

TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.

GitHubTwitter

Home

HomeNewestBestAskShowJobs

Resources

HackerNews APIOriginal HackerNewsNext.js

© 2025 TechEcho. All rights reserved.

The math and algorithms behind a search library

202 pointsby kbrover 7 years ago

8 comments

JamesMcMinnover 7 years ago
This is a an interesting way of doing text search over a collection of documents, that I suppose makes sense if you only have a small number of documents to search.<p>Generally, an inverted index [1] is used rather than a trie, and collection-based term weights are calculated using something like IDF [2]. You could then use something like cosine to compute length normalised rankings for documents.<p>There are of course many different approaches to text search, but this is the first time I&#x27;ve personally come across this one.<p>If anyone is interested in an in depth look at how text search works, an Introduction to Information Retrieval [3] is available for free and is a fantastic book.<p>[1] <a href="https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Inverted_index" rel="nofollow">https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Inverted_index</a><p>[2] <a href="https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Tf%E2%80%93idf" rel="nofollow">https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Tf%E2%80%93idf</a><p>[3] <a href="https:&#x2F;&#x2F;nlp.stanford.edu&#x2F;IR-book&#x2F;" rel="nofollow">https:&#x2F;&#x2F;nlp.stanford.edu&#x2F;IR-book&#x2F;</a>
评论 #15677315 未加载
ameliusover 7 years ago
Finally a post on search algorithms on HN again!<p>Pardon my surprise, but I think it&#x27;s kind of strange that we see so little on HN of this topic, which is one of the pillars of Computer Science, and also at the core of one of the most successful IT companies.
评论 #15677627 未加载
wycover 7 years ago
If you&#x27;re a working programmer looking for a bit more depth on text searching, check out Taming Text by Ingersoll. It was written by some real pros, and it&#x27;s very practical. It goes over everything from the first naive approaches to what drives many engines like Lucene today.<p><a href="https:&#x2F;&#x2F;www.amazon.com&#x2F;Taming-Text-Find-Organize-Manipulate&#x2F;dp&#x2F;193398838X" rel="nofollow">https:&#x2F;&#x2F;www.amazon.com&#x2F;Taming-Text-Find-Organize-Manipulate&#x2F;...</a>
stochastic_monkover 7 years ago
I&#x27;m a little surprised to not be seeing some classical search algorithms.<p>For online pattern searching, I&#x27;d want Boyer-Moore with the the Galil rule for guaranteed linear search. [Due to an unfortunate accident in C++ history, Boost and C++17 skip the Galil rule, leading to O(n+m) runtime vs. O(nm).]<p>For offline, Burrows-Wheeler has performance linear with query string and <i>independent</i> of the length of the searched document.
评论 #15677703 未加载
bootcatover 7 years ago
Certainly its worth taking a look at another javascript library which works for fuzzy search called, fuse, <a href="http:&#x2F;&#x2F;fusejs.io&#x2F;" rel="nofollow">http:&#x2F;&#x2F;fusejs.io&#x2F;</a>.
jihadjihadover 7 years ago
This is neat, and reminds me a bit of the tricks used by RediSearch. I came across it on HN recently and have been geeking out on it since! <a href="http:&#x2F;&#x2F;redisearch.io" rel="nofollow">http:&#x2F;&#x2F;redisearch.io</a>
johnhenryover 7 years ago
Lunr, named to correspond to Apache Solr, is also worth a look. <a href="https:&#x2F;&#x2F;lunrjs.com&#x2F;" rel="nofollow">https:&#x2F;&#x2F;lunrjs.com&#x2F;</a>
hnrcover 7 years ago
<a href="https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=14805395" rel="nofollow">https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=14805395</a>
评论 #15680216 未加载
评论 #15680206 未加载