TE
TechEcho
Home24h TopNewestBestAskShowJobs
GitHubTwitter
Home

TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.

GitHubTwitter

Home

HomeNewestBestAskShowJobs

Resources

HackerNews APIOriginal HackerNewsNext.js

© 2025 TechEcho. All rights reserved.

Faster Spelling Correction algorithm (2012)

117 pointsby tynover 10 years ago

12 comments

smoyerover 10 years ago
Pre-processing a fixed data-set in order to optimize search is a pretty well known technique (the phone companies have been using &quot;preordering&quot; (specifically Modified Preorder List Traversal [1]) as a means of routing phone calls and diagnosing network issues for a very long time.<p>The downside to any preorder algorithm is that you must rerun the preorder generation code anytime the input changes (in this case a dictionary) and often you must allocate extra storage to hold the pre-processed input.<p>This is a really interesting algorithm but, as always, you have to know the pros and cons and apply it where if fits (i.e. Not somewhere that needs a compressed dictionary).<p>[1] <a href="http://www.codeproject.com/Articles/31669/Hierarchical-Tree-Represented-by-Modified-Preorder" rel="nofollow">http:&#x2F;&#x2F;www.codeproject.com&#x2F;Articles&#x2F;31669&#x2F;Hierarchical-Tree-...</a>
philjohnover 10 years ago
See also the Levenshtein Automatons used by Lucene 4.0 for fuzzy matching ... fun story, although they had to treat some Python code as a &quot;black box&quot; when implementing it ...<p><a href="http://java.dzone.com/news/lucenes-fuzzyquery-100-times" rel="nofollow">http:&#x2F;&#x2F;java.dzone.com&#x2F;news&#x2F;lucenes-fuzzyquery-100-times</a><p>Looks like some interesting research and dev going on in this space at the moment.
TheLoneWolflingover 10 years ago
Hmm...<p>They don&#x27;t mention the approach I&#x27;d take for a naive spellcheck:<p>Generate a (giant) trie from the dictionary. Then perform a DFS on the trie allowing only &lt;the current closest word&#x27;s number of edits&gt; edits from the input word, keeping track of the edits done so far - try no changes at each node first, then a deletion, then a change, then an insertion. This works better with a smaller alphabet size.<p>An optimization: a DAWG is potentially even better space-wise than a hashmap (effectively it ends up compressing the input data), but takes a while to precompute.
dvirskyover 10 years ago
It&#x27;s a nice approach, I&#x27;ve implemented it in Golang a while back. It&#x27;s really fast, but the cost in memory is huge, especially if you store bigrams and not just single words. But it&#x27;s really elegant in its simplicity.<p>BTW - if anyone&#x27;s interested in the Go implementation let me know, I&#x27;ll try to find it and post it somewhere.
robert_tweedover 10 years ago
Interesting, though I&#x27;m curious about whether this type of algorithm is state of the art for accuracy. At this time spell checkers are pretty much fast enough, but aren&#x27;t always terribly good at suggesting replacements. I would imagine that systems based on n-grams are more effective.
评论 #8201920 未加载
discardoramaover 10 years ago
Previous discussions:<p><a href="https://news.ycombinator.com/item?id=4080791" rel="nofollow">https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=4080791</a><p><a href="https://news.ycombinator.com/item?id=8201598" rel="nofollow">https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=8201598</a><p><a href="https://news.ycombinator.com/item?id=7048225" rel="nofollow">https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=7048225</a>
taway2012over 10 years ago
To me, it seems to be the same as the Norvig algorithm, except that only deletes are considered. Which allows them to pre-calculate misspellings with modest storage cost.<p>Am I mistaken?
mumrahover 10 years ago
Anyone know how this compares to the FST&#x2F;DFA stuff in Lucene 4.x?
dangover 10 years ago
More or less a repost of <a href="https://news.ycombinator.com/item?id=7048225" rel="nofollow">https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=7048225</a>.
piqufohover 10 years ago
Note; this article is from June 2012.
jherikoover 10 years ago
hmmm... 1000x? how does it scale though?
评论 #8202041 未加载
评论 #8201963 未加载
评论 #8202269 未加载
grogenautover 10 years ago
1000x is still O(1000N) == N. So technically not significantly faster.
评论 #8202549 未加载
评论 #8202571 未加载