TE
TechEcho
Home24h TopNewestBestAskShowJobs
GitHubTwitter
Home

TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.

GitHubTwitter

Home

HomeNewestBestAskShowJobs

Resources

HackerNews APIOriginal HackerNewsNext.js

© 2025 TechEcho. All rights reserved.

SymSpell: 1M times faster spelling correction

203 pointsby mciover 3 years ago

11 comments

LordGreyover 3 years ago
What they are calling &quot;Symmetric Delete&quot; seems to be the same as an older concept called &quot;deletion neighborhoods&quot;. It is a term coined in an academic paper written by Thomas Bocek, Ela Hunt, and Burkhard Stiller from the University of Zurich, titled &quot;Fast Similarity Search in Large Dictionaries&quot;[1]. The work described there was expanded in a paper written by Daniel Karch, Dennis Luxen, and Peter Sanders from the Karlsruhe Institute of Technology, titled &quot;Improved Fast Similarity Search in Dictionaries&quot;[2]. Both of these papers deal with efficient searching for similar string values, given a query string.<p>LookupCompound and WordSegmentation, algorithms built on Symmetric Delete&#x2F;Deletion Neighborhoods, are pretty interesting.<p>[1]<a href="https:&#x2F;&#x2F;fastss.csg.uzh.ch&#x2F;ifi-2007.02.pdf" rel="nofollow">https:&#x2F;&#x2F;fastss.csg.uzh.ch&#x2F;ifi-2007.02.pdf</a> [2]<a href="https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;1008.1191v2" rel="nofollow">https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;1008.1191v2</a>
评论 #30580553 未加载
kevincoxover 3 years ago
This seems very focused on spelling. What I find is key for a good spell correction system is how the words sound. For example I find that Firefox often can&#x27;t find the word I meant for it&#x27;s suggestion list but pasting the misspelt word into Google gets the right result 99% of the time as the one provided option.<p>I wonder how difficult it would be to adapt this to work on sounds or other frequent typos and misspelling sources instead of just characters. It seems it should he possible if you can define a decent &quot;normalization&quot; function.
评论 #30577515 未加载
评论 #30577870 未加载
评论 #30582464 未加载
评论 #30578012 未加载
评论 #30582985 未加载
cb321over 3 years ago
As jamra correctly points out in a sibling comment, the entry point to this (which gets a lot of traction on HN) is indeed attacking a strawman tutorial-written-on-an-airplane-in-Python algorithm. So, the 1M speed-up is very over-hyped.<p>That said, the technique is not wholly without merit, but does carry certain &quot;average-worst case&quot; trade offs related to latency in the memory&#x2F;storage system because of SymSpell&#x27;s reliance upon large hash tables. For details see <a href="https:&#x2F;&#x2F;github.com&#x2F;c-blake&#x2F;suggest" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;c-blake&#x2F;suggest</a><p>EDIT: Also - I am unaware of other implementations even saving the large, slow-to-compute index. The others I am aware of seem to rebuild the large index every time which seems kind of lame. EDIT2 - I guess there is a recent Rust one that is persistent as well as the &quot;mmap &amp; go&quot; Nim one. Still, what should be standard is very rare.
评论 #30581172 未加载
jamraover 3 years ago
This seems very suspicious to me. They’re comparing performance to a tutorial blog post that is extremely inefficient.<p>How about comparing it to. Levenshtein automaton or another state of the art approach?
评论 #30580279 未加载
injidupover 3 years ago
Slightly OT but trying to look up friends names using Android auto speech recognition whilst driving. I have two Austrian friend &quot;Viktor&quot; and &quot;Patric&quot;. If I say. &quot;Hey google, call Viktor&quot; google says. &quot;Sorry you have no Victor in your phone book. Same with Patric. &quot;Sorry you have no Patrick&quot; in your phone book. I&#x27;m surprised that there is not even basic scoring done when looking up names in the phone book with the most likely one offered.<p>*EDIT* I just found a solution to this problem. <a href="https:&#x2F;&#x2F;support.google.com&#x2F;assistant&#x2F;thread&#x2F;559644?hl=en&amp;msgid=58971362" rel="nofollow">https:&#x2F;&#x2F;support.google.com&#x2F;assistant&#x2F;thread&#x2F;559644?hl=en&amp;msg...</a> You can supply a phonetic name and this helps google match. This seems a bit low tech though.
danielscrubsover 3 years ago
It was a bit fun comparing the Haskell version and the C# version:<p><a href="https:&#x2F;&#x2F;github.com&#x2F;wolfgarbe&#x2F;SymSpell&#x2F;blob&#x2F;master&#x2F;SymSpell&#x2F;SymSpell.cs" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;wolfgarbe&#x2F;SymSpell&#x2F;blob&#x2F;master&#x2F;SymSpell&#x2F;S...</a><p><a href="https:&#x2F;&#x2F;github.com&#x2F;cbeav&#x2F;symspell&#x2F;blob&#x2F;master&#x2F;src&#x2F;SymSpell.hs" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;cbeav&#x2F;symspell&#x2F;blob&#x2F;master&#x2F;src&#x2F;SymSpell.h...</a>
评论 #30577344 未加载
nicoburnsover 3 years ago
I&#x27;d love to see better open source spell checking. The state of the art spell checkers (in say, MS Word or Google Search) are <i>excellent</i> and more than good enough for my needs. And yet the spell checkers in browsers (e.g. Chrome and Firefox) and other apps tend to be terrible only catching very basic cases and often not being able to suggest the correct word.<p>Does anyone have any insight into what&#x27;s holding this back?
评论 #30580662 未加载
评论 #30582644 未加载
评论 #30580689 未加载
tgvover 3 years ago
So, this is time vs memory?
评论 #30578075 未加载
wodenokotoover 3 years ago
It says it does fuzzy search as well. Wonder how it compares to fzf
tootieover 3 years ago
Is this how modern spell checkers worked? I assumed they were more heuristic at this point. For example, Google&#x27;s &quot;did you mean&quot; is based on mapping common misspellings to what people actually clicked on.
ameliusover 3 years ago
Can this be used to match DNA sequences?<p>Or sounds (like Shazam does)?