There are two methods, better known in the context of substring search, that may get you something even better than your prefix + Levenshtein-distance-1 searches.<p>The first and most common is stemming. It can be useful, but it requires a lot of hand tuning and doesn't get you all that much (at least in my experience).<p>The second is using the phonetics of the words themselves to figure out what someone meant to type. The grand-daddy of these is Soundex, which works, but isn't as good as Metaphone. Metaphone and Double Metaphone are algorithms that translate words in most Latin-alphabet languages into abbreviated phoneme codes.<p>Double Metaphone, in particular, has allowed me to build spelling-agnostic prefix matching as an effectively trivial service for two different companies (the details of which you can read about in the Redis mailing list, and/or section 6.1.2 of my book, Redis in Action). The long and short of it is that with Double Metaphone, for the cost of a 2-3x larger index, you can get all of the results you are looking for in 2-3 searches instead of length(word) searches.
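To make the indexing scheme concrete, here is a minimal Go sketch. The doubleMetaphone function below is a toy stand-in, not the real algorithm (or the Redis-backed version from the book); only the index structure around it is the point:

```go
package main

import (
	"fmt"
	"strings"
)

// doubleMetaphone is a hypothetical stand-in: a real implementation
// returns the primary and alternate phonetic codes of a word. This toy
// version just uppercases and drops vowels after the first letter,
// which is NOT the real algorithm.
func doubleMetaphone(word string) (primary, alternate string) {
	w := strings.ToUpper(word)
	var b strings.Builder
	for i, r := range w {
		if i == 0 || !strings.ContainsRune("AEIOUY", r) {
			b.WriteRune(r)
		}
	}
	return b.String(), b.String()
}

// phoneticIndex maps every prefix of each phonetic code to the words
// that produce it -- roughly the 2-3x index-size cost mentioned above.
type phoneticIndex map[string][]string

func (idx phoneticIndex) add(word string) {
	p, a := doubleMetaphone(word)
	codes := []string{p}
	if a != p {
		codes = append(codes, a)
	}
	for _, code := range codes {
		for i := 1; i <= len(code); i++ {
			idx[code[:i]] = append(idx[code[:i]], word)
		}
	}
}

// search encodes the query once and does one lookup per code (so 1-2
// lookups total) instead of one search per misspelling variant.
func (idx phoneticIndex) search(query string) []string {
	p, a := doubleMetaphone(query)
	seen := map[string]bool{}
	var out []string
	for _, code := range []string{p, a} {
		for _, w := range idx[code] {
			if !seen[w] {
				seen[w] = true
				out = append(out, w)
			}
		}
	}
	return out
}

func main() {
	idx := phoneticIndex{}
	for _, w := range []string{"smith", "smyth", "smithe"} {
		idx.add(w)
	}
	fmt.Println(idx.search("smyth")) // [smith smyth smithe]
}
```

All three spellings collapse to the same code, so a single lookup on the query's code finds every variant.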
This is cool. However, I wonder why the author didn't switch to a Lucene-based server such as Elasticsearch or Solr, for which you can find decent Go clients (or use the REST API directly). The NGram tokenizer lets you do infix search out of the box. By the way, the author says Lucene is "bloated." That's just nonsense, like saying MySQL is "bloated." LinkedIn and Twitter use Lucene to handle insane numbers of QPS with real-time updates. There is no excuse for not trying a search server before deciding to hack something new.<p>My impression from reading the post is that they went from an extremely inefficient solution straight to one that seems <i>way</i> over-engineered for the company's scale. The app has only 28 ratings on iTunes, which suggests it isn't hugely popular ($150/month is not much more than my personal AWS bill).<p>My guess is that the highest ROI of this code (besides the joy of hacking) comes from the HN page views. Usually, optimizing software to this level is a bad idea for a small startup. You're not Twitter, GitHub, or Dropbox. Code less and grow your business more.
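Roughly, the ngram setup is just index settings (the names here are made up, and the exact keys vary across Elasticsearch versions):

```json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "infix_trigrams": { "type": "nGram", "min_gram": 3, "max_gram": 8 }
      },
      "analyzer": {
        "infix": {
          "type": "custom",
          "tokenizer": "infix_trigrams",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```

Point the field's analyzer at that and substring queries work with no custom code at all.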
It's nice to see suffix arrays being used in the real world. For anyone who is interested in suffix arrays, the following two papers are recommended:<p>* Suffix arrays: A new method for on-line string searches, U. Manber, G. Myers, 1993<p>* Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus, M. Yamamoto, K. W. Church, 2001<p>The second paper actually uses an 'inverted index' for matching sequences to documents.
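Incidentally, Go ships a suffix array in the standard library (index/suffixarray), so a toy version of this kind of search is only a few lines:

```go
package main

import (
	"fmt"
	"index/suffixarray"
)

func main() {
	// Build the suffix array once over the whole corpus; after that,
	// arbitrary-substring lookups are fast (binary search over suffixes).
	data := []byte("banana bandana cabana")
	index := suffixarray.New(data)

	// Lookup returns the byte offsets of every occurrence of "ana"
	// (-1 means no limit on the number of results).
	fmt.Println(index.Lookup([]byte("ana"), -1))
}
```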
Related: Russ Cox wrote and documented a miniature version of Google Code Search (a fast regex search engine) in Go: <a href="http://swtch.com/~rsc/regexp/regexp4.html" rel="nofollow">http://swtch.com/~rsc/regexp/regexp4.html</a><p>Using a Suffix Array <i>should</i> make it possible to do fuzzy matching easily, but I haven't examined precisely how they combined it with an inverted index.
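The core trick in that article is an inverted index over trigrams: the regex is compiled into a boolean query over 3-byte substrings, the posting lists are intersected, and the real matcher only runs over the surviving candidates. A minimal sketch of the literal-string case (names are mine, not Cox's):

```go
package main

import (
	"fmt"
	"strings"
)

// trigramIndex maps each 3-byte substring to the set of document IDs
// containing it (the posting list).
type trigramIndex map[string]map[int]bool

func build(docs []string) trigramIndex {
	idx := trigramIndex{}
	for id, doc := range docs {
		for i := 0; i+3 <= len(doc); i++ {
			t := doc[i : i+3]
			if idx[t] == nil {
				idx[t] = map[int]bool{}
			}
			idx[t][id] = true
		}
	}
	return idx
}

// candidates intersects the posting lists of every trigram in the query.
// Queries shorter than 3 bytes would need a fallback (e.g., a full scan).
func candidates(idx trigramIndex, query string) map[int]bool {
	var result map[int]bool
	for i := 0; i+3 <= len(query); i++ {
		postings := idx[query[i:i+3]]
		if result == nil {
			result = map[int]bool{}
			for id := range postings {
				result[id] = true
			}
			continue
		}
		for id := range result {
			if !postings[id] {
				delete(result, id)
			}
		}
	}
	return result
}

func main() {
	docs := []string{"google code search", "suffix arrays", "code review"}
	idx := build(docs)
	// The index only narrows the search; a final pass (Contains here,
	// the actual regex in Cox's engine) removes false positives.
	for id := range candidates(idx, "code") {
		if strings.Contains(docs[id], "code") {
			fmt.Println(docs[id])
		}
	}
}
```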
> Reading from the harddrive is very slow.<p>Why were you reading from the hard drive with Mongo? Prefix indexing is the only kind of string indexing that Mongo does, but it does a pretty good job of it. You must have a pretty large dataset if it couldn't keep the index in memory.<p>> It could be easily hit with a regex injection<p>It should be pretty easy to regex-escape the search string. Easier than building a substring search engine.<p>> It couldn't handle artists with multiple names<p>That sounds like a data format issue.<p>What I'm getting at is that, while the prefix-only issue might have been enough to justify this switch, none of your other reasons make a ton of sense to me.
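On the escaping point: Go's standard library already has regexp.QuoteMeta for exactly this, so the fix is one call:

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	userInput := "the.great*" // metacharacters straight from the user
	// QuoteMeta escapes every regex metacharacter, so the input can
	// only ever match literally -- no injection possible.
	pattern := "(?i)" + regexp.QuoteMeta(userInput)
	re, err := regexp.Compile(pattern)
	if err != nil {
		panic(err) // can't happen after QuoteMeta
	}
	fmt.Println(re.MatchString("THE.GREAT*escape")) // true
	fmt.Println(re.MatchString("theXgreatY"))       // false: dot is literal
}
```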
Amazing, servicing <i>thousands</i> of users for $150 a month.<p>I'd clearly start writing my own indexing engine instead of using PostgreSQL, or maybe switch to DigitalOcean/Hetzner.
Nice. I've been waiting to see a full-text search engine in Go for a while, to compete with the omnipresent Lucene. Things like Go's low memory footprint could give it an important advantage in this area. This looks like a first step in that direction.
Very cool. Considering that the core functionality you want is done in less than 500 LOC (<a href="https://github.com/argusdusty/Ferret/blob/master/ferret.go" rel="nofollow">https://github.com/argusdusty/Ferret/blob/master/ferret.go</a>), it seems a great way to go :) instead of depending on a large external library. OTOH, you could probably have written it just as easily in another performant language. But since Go happens to be performant and is already used in other parts of your project, things just turned out great for you.
Is this related to the Ferret search engine for Ruby? <a href="https://en.wikipedia.org/wiki/Ferret_search_library" rel="nofollow">https://en.wikipedia.org/wiki/Ferret_search_library</a>
$150/month can buy you a very, very good dedicated server.<p><a href="http://www.hetzner.de/en/hosting/produkte_rootserver/ex10" rel="nofollow">http://www.hetzner.de/en/hosting/produkte_rootserver/ex10</a>