Pre-processing a fixed data set in order to optimize search is a pretty well-known technique (the phone companies have been using "preordering", specifically Modified Preorder Tree Traversal [1], to route phone calls and diagnose network issues for a very long time).

The downside of any preorder algorithm is that you must rerun the preorder generation code any time the input (in this case a dictionary) changes, and you often have to allocate extra storage to hold the pre-processed input.

This is a really interesting algorithm, but, as always, you have to know the pros and cons and apply it where it fits (i.e. not somewhere that needs a compressed dictionary).

[1] http://www.codeproject.com/Articles/31669/Hierarchical-Tree-Represented-by-Modified-Preorder
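For context, here's a toy Python sketch of the preorder-numbering idea (the dict-based tree layout and the names are just illustrative, not taken from the linked article): one DFS assigns each node left/right numbers, so "is X under Y" becomes a cheap range check, but the whole numbering has to be rebuilt whenever the tree changes.

    def number_tree(tree, root):
        # tree: dict mapping each node to a list of its children (assumed layout)
        bounds = {}
        counter = 0

        def visit(node):
            nonlocal counter
            counter += 1
            left = counter           # number assigned on the way down
            for child in tree.get(node, []):
                visit(child)
            counter += 1             # number assigned on the way back up
            bounds[node] = (left, counter)

        visit(root)
        return bounds

    tree = {"root": ["a", "b"], "a": ["a1", "a2"]}
    bounds = number_tree(tree, "root")
    # "a1" is under "a" exactly when
    # bounds["a"][0] < bounds["a1"][0] and bounds["a1"][1] < bounds["a"][1]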
See also the Levenshtein automata used by Lucene 4.0 for fuzzy matching. Fun story: they had to treat some Python code as a "black box" when implementing it.

http://java.dzone.com/news/lucenes-fuzzyquery-100-times

Looks like some interesting research and development going on in this space at the moment.
Hmm...

They don't mention the approach I'd take for a naive spellcheck:

Generate a (giant) trie from the dictionary, then perform a DFS on the trie, allowing only <the current closest word's number of edits> edits from the input word and keeping track of the edits made so far: at each node try no change first, then a deletion, then a substitution, then an insertion. This works better with a smaller alphabet.

An optimization: a DAWG is potentially even better space-wise than a hashmap (effectively it ends up compressing the input data), but takes a while to precompute.
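Roughly what I mean, as a toy Python sketch (dict-of-dicts trie, a fixed edit budget instead of the shrinking best-so-far bound, and made-up names):

    def build_trie(words):
        # each trie node is a dict of child letters; "$" marks end-of-word
        root = {}
        for word in words:
            node = root
            for ch in word:
                node = node.setdefault(ch, {})
            node["$"] = True
        return root

    def search(node, word, budget, prefix="", results=None):
        # collect dictionary words within `budget` edits of `word`
        if results is None:
            results = set()
        if budget < 0:
            return results
        if not word and "$" in node:
            results.add(prefix)
        for ch, child in node.items():
            if ch == "$":
                continue
            if word and word[0] == ch:
                # no change: consume one input char and one trie edge
                search(child, word[1:], budget, prefix + ch, results)
            else:
                if word:
                    # substitution: consume the input char, take a different edge
                    search(child, word[1:], budget - 1, prefix + ch, results)
                # extra letter in the dictionary word: take the edge, keep the input char
                search(child, word, budget - 1, prefix + ch, results)
        if word:
            # missing letter in the dictionary word: drop an input char, stay at this node
            search(node, word[1:], budget - 1, prefix, results)
        return results

    trie = build_trie(["hello", "help", "hell", "shell"])
    print(search(trie, "helo", budget=1))   # {'hello', 'help', 'hell'}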
It's a nice approach; I implemented it in Go a while back. It's really fast, but the memory cost is huge, especially if you store bigrams and not just single words. Still, it's really elegant in its simplicity.

BTW, if anyone's interested in the Go implementation, let me know and I'll try to find it and post it somewhere.
Interesting, though I'm curious about whether this type of algorithm is state of the art for accuracy. At this time spell checkers are pretty much fast enough, but aren't always terribly good at suggesting replacements. I would imagine that systems based on n-grams are more effective.
To me, it seems to be the same as the Norvig algorithm, except that only deletes are considered, which allows them to pre-calculate the misspellings at a modest storage cost.

Am I mistaken?
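If I'm reading it right, the deletes-only precomputation (for distance 1) would look something like this in Python — a rough sketch with made-up names, not the article's actual code; in practice you'd still verify candidates with a real edit-distance check and rank them by frequency:

    from collections import defaultdict

    def deletes(word):
        # all strings obtained by removing exactly one character
        return {word[:i] + word[i + 1:] for i in range(len(word))}

    def build_index(dictionary):
        # map every word and every single-char deletion of it back to the word
        index = defaultdict(set)
        for word in dictionary:
            index[word].add(word)
            for d in deletes(word):
                index[d].add(word)
        return index

    def lookup(index, query):
        # candidates: dictionary words sharing the query or one of its deletions
        candidates = set()
        for key in {query} | deletes(query):
            candidates |= index.get(key, set())
        return candidates

    index = build_index(["hello", "help", "world"])
    print(lookup(index, "helo"))   # {'hello', 'help'}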
More or less a repost of https://news.ycombinator.com/item?id=7048225.