There are two methods, better known in the context of substring search, that may get you something even better than your prefix + Levenshtein-distance-1 searches.<p>The first and most common is stemming. It can be useful, but it requires a lot of hand tuning and doesn't get you all that much (at least in my experience).<p>The second is using the phonetics of the words themselves to figure out what someone meant to type. The grand-daddy of these is Soundex, which works, but isn't as good as Metaphone. Metaphone and Double Metaphone are algorithms that translate words in most Latin-alphabet languages into abbreviated phoneme codes.<p>Double Metaphone, in particular, has allowed me to build spelling-agnostic prefix matching as an effectively trivial service for two different companies (the details of which you can read about in the Redis mailing list, and/or section 6.1.2 of my book, Redis in Action). The long and short of it is that with Double Metaphone, for the cost of a 2-3x larger index, you can get all of the results you are looking for in 2-3 searches instead of length(word) searches.
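To make the indexing scheme concrete, here is a minimal Go sketch. The doubleMetaphone function below is a toy stand-in, not the real algorithm (or the Redis-backed version from the book); only the index structure around it is the point:

```go
package main

import (
	"fmt"
	"strings"
)

// doubleMetaphone is a hypothetical stand-in: a real implementation
// returns the primary and alternate phonetic codes of a word. This toy
// version just uppercases and drops vowels after the first letter,
// which is NOT the real algorithm.
func doubleMetaphone(word string) (primary, alternate string) {
	w := strings.ToUpper(word)
	var b strings.Builder
	for i, r := range w {
		if i == 0 || !strings.ContainsRune("AEIOUY", r) {
			b.WriteRune(r)
		}
	}
	return b.String(), b.String()
}

// phoneticIndex maps every prefix of each phonetic code to the words
// that produce it -- roughly the 2-3x index-size cost mentioned above.
type phoneticIndex map[string][]string

func (idx phoneticIndex) add(word string) {
	p, a := doubleMetaphone(word)
	codes := []string{p}
	if a != p {
		codes = append(codes, a)
	}
	for _, code := range codes {
		for i := 1; i <= len(code); i++ {
			idx[code[:i]] = append(idx[code[:i]], word)
		}
	}
}

// search encodes the query once and does one lookup per code (so 1-2
// lookups total) instead of one search per misspelling variant.
func (idx phoneticIndex) search(query string) []string {
	p, a := doubleMetaphone(query)
	seen := map[string]bool{}
	var out []string
	for _, code := range []string{p, a} {
		for _, w := range idx[code] {
			if !seen[w] {
				seen[w] = true
				out = append(out, w)
			}
		}
	}
	return out
}

func main() {
	idx := phoneticIndex{}
	for _, w := range []string{"smith", "smyth", "smithe"} {
		idx.add(w)
	}
	fmt.Println(idx.search("smyth")) // [smith smyth smithe]
}
```

All three spellings collapse to the same code, so a single lookup on the query's code finds every variant.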
This is cool. However, I wonder why the author didn't switch to a Lucene-based server such as Elasticsearch or Solr, for which you can find decent Go clients (or use the REST API directly). The NGram tokenizer lets you do infix search out of the box. By the way, the author says Lucene is "bloated." That's just nonsense, like saying MySQL is "bloated." LinkedIn and Twitter use Lucene to handle insane numbers of QPS with real-time updates. There is no excuse for not trying a search server before deciding to hack something new.<p>My impression from reading the post is that they went from an extremely inefficient solution straight to one that seems <i>way</i> over-engineered for the company's scale. The app has only 28 ratings on iTunes, which suggests it isn't hugely popular ($150/month is not much more than my personal AWS bill).<p>My guess is that the highest ROI of this code (besides the joy of hacking) comes from the HN page views. Usually, optimizing software to this level is a bad idea for a small startup. You're not Twitter, GitHub, or Dropbox. Code less and grow your business more.
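Roughly, the ngram setup is just index settings (the names here are made up, and the exact keys vary across Elasticsearch versions):

```json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "infix_trigrams": { "type": "nGram", "min_gram": 3, "max_gram": 8 }
      },
      "analyzer": {
        "infix": {
          "type": "custom",
          "tokenizer": "infix_trigrams",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```

Point the field's analyzer at that and substring queries work with no custom code at all.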
It's nice to see suffix arrays being used in the real world. For anyone who is interested in suffix arrays, the following two papers are recommended:<p>* Suffix arrays: A new method for on-line string searches, U. Manber, G. Myers, 1993<p>* Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus, M. Yamamoto, K. W. Church, 2001<p>The second paper actually uses an 'inverted index' for matching sequences to documents.
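Incidentally, Go ships a suffix array in the standard library (index/suffixarray), so a toy version of this kind of search is only a few lines:

```go
package main

import (
	"fmt"
	"index/suffixarray"
)

func main() {
	// Build the suffix array once over the whole corpus; after that,
	// arbitrary-substring lookups are fast (binary search over suffixes).
	data := []byte("banana bandana cabana")
	index := suffixarray.New(data)

	// Lookup returns the byte offsets of every occurrence of "ana"
	// (-1 means no limit on the number of results).
	fmt.Println(index.Lookup([]byte("ana"), -1))
}
```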
Related: Russ Cox wrote and documented a miniature version of Google Code Search (a fast regex search engine) in Go: <a href="http://swtch.com/~rsc/regexp/regexp4.html" rel="nofollow">http://swtch.com/~rsc/regexp/regexp4.html</a><p>Using a Suffix Array <i>should</i> make it possible to do fuzzy matching easily, but I haven't examined precisely how they combined it with an inverted index.
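The core trick in that article is an inverted index over trigrams: the regex is compiled into a boolean query over 3-byte substrings, the posting lists are intersected, and the real matcher only runs over the surviving candidates. A minimal sketch of the literal-string case (names are mine, not Cox's):

```go
package main

import (
	"fmt"
	"strings"
)

// trigramIndex maps each 3-byte substring to the set of document IDs
// containing it (the posting list).
type trigramIndex map[string]map[int]bool

func build(docs []string) trigramIndex {
	idx := trigramIndex{}
	for id, doc := range docs {
		for i := 0; i+3 <= len(doc); i++ {
			t := doc[i : i+3]
			if idx[t] == nil {
				idx[t] = map[int]bool{}
			}
			idx[t][id] = true
		}
	}
	return idx
}

// candidates intersects the posting lists of every trigram in the query.
// Queries shorter than 3 bytes would need a fallback (e.g., a full scan).
func candidates(idx trigramIndex, query string) map[int]bool {
	var result map[int]bool
	for i := 0; i+3 <= len(query); i++ {
		postings := idx[query[i:i+3]]
		if result == nil {
			result = map[int]bool{}
			for id := range postings {
				result[id] = true
			}
			continue
		}
		for id := range result {
			if !postings[id] {
				delete(result, id)
			}
		}
	}
	return result
}

func main() {
	docs := []string{"google code search", "suffix arrays", "code review"}
	idx := build(docs)
	// The index only narrows the search; a final pass (Contains here,
	// the actual regex in Cox's engine) removes false positives.
	for id := range candidates(idx, "code") {
		if strings.Contains(docs[id], "code") {
			fmt.Println(docs[id])
		}
	}
}
```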
> Reading from the harddrive is very slow.<p>Why were you reading from the hard drive with Mongo? Prefix indexing is the only kind of string indexing that Mongo does, but it does a pretty good job of it. You must have a pretty large dataset if it couldn't keep the index in memory.<p>> It could be easily hit with a regex injection<p>It should be pretty easy to regex-escape the search string. Easier than building a substring search engine.<p>> It couldn't handle artists with multiple names<p>That sounds like a data format issue.<p>What I'm getting at is that, while the prefix-only issue might have been enough to justify this switch, none of your other reasons make a ton of sense to me.
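On the escaping point: Go's standard library already has regexp.QuoteMeta for exactly this, so the fix is one call:

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	userInput := "the.great*" // metacharacters straight from the user
	// QuoteMeta escapes every regex metacharacter, so the input can
	// only ever match literally -- no injection possible.
	pattern := "(?i)" + regexp.QuoteMeta(userInput)
	re, err := regexp.Compile(pattern)
	if err != nil {
		panic(err) // can't happen after QuoteMeta
	}
	fmt.Println(re.MatchString("THE.GREAT*escape")) // true
	fmt.Println(re.MatchString("theXgreatY"))       // false: dot is literal
}
```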
Amazing, servicing <i>thousands</i> of users for $150 a month.<p>I'd clearly start writing my own indexing engine instead of using PostgreSQL, or maybe switch to DigitalOcean/Hetzner.
Nice. I've been waiting to see a full-text search engine in Go for a while, to compete with the omnipresent Lucene. Things like Go's low memory footprint could give it an important advantage in this area. This looks like a first step in that direction.
Very cool. Considering that the core functionality you want is done in less than 500 LOC (<a href="https://github.com/argusdusty/Ferret/blob/master/ferret.go" rel="nofollow">https://github.com/argusdusty/Ferret/blob/master/ferret.go</a>), it seems a great way to go :) instead of depending on a large external library. OTOH, you could probably have written it just as easily in another performant language. But since Go happens to be performant and is already used in other parts of your project, things just turned out great for you.
Is this related to the Ferret search engine for Ruby? <a href="https://en.wikipedia.org/wiki/Ferret_search_library" rel="nofollow">https://en.wikipedia.org/wiki/Ferret_search_library</a>
$150/month can buy you a very, very good dedicated server.<p><a href="http://www.hetzner.de/en/hosting/produkte_rootserver/ex10" rel="nofollow">http://www.hetzner.de/en/hosting/produkte_rootserver/ex10</a>