TE
科技回声
首页24小时热榜最新最佳问答展示工作
GitHubTwitter
首页

科技回声

基于 Next.js 构建的科技新闻平台,提供全球科技新闻和讨论内容。

GitHubTwitter

首页

首页最新最佳问答展示工作

资源链接

HackerNews API原版 HackerNewsNext.js

© 2025 科技回声. 版权所有。

SymSpell: 1M times faster spelling correction

203 点作者 mci大约 3 年前

11 条评论

LordGrey大约 3 年前
What they are calling &quot;Symmetric Delete&quot; seems to be the same as an older concept called &quot;deletion neighborhoods&quot;. It is a term coined in an academic paper written by Thomas Bocek, Ela Hunt, and Burkhard Stiller from the University of Zurich, titled &quot;Fast Similarity Search in Large Dictionaries&quot;[1]. The work described there was expanded in a paper written by Daniel Karch, Dennis Luxen, and Peter Sanders from the Karlsruhe Institute of Technology, titled &quot;Improved Fast Similarity Search in Dictionaries&quot;[2]. Both of these papers deal with efficient searching for similar string values, given a query string.<p>LookupCompound and WordSegmentation, algorithms built on Symmetric Delete&#x2F;Deletion Neighborhoods, are pretty interesting.<p>[1]<a href="https:&#x2F;&#x2F;fastss.csg.uzh.ch&#x2F;ifi-2007.02.pdf" rel="nofollow">https:&#x2F;&#x2F;fastss.csg.uzh.ch&#x2F;ifi-2007.02.pdf</a> [2]<a href="https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;1008.1191v2" rel="nofollow">https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;1008.1191v2</a>
评论 #30580553 未加载
kevincox大约 3 年前
This seems very focused on spelling. What I find is key for a good spell correction system is how the words sound. For example I find that Firefox often can&#x27;t find the word I meant for it&#x27;s suggestion list but pasting the misspelt word into Google gets the right result 99% of the time as the one provided option.<p>I wonder how difficult it would be to adapt this to work on sounds or other frequent typos and misspelling sources instead of just characters. It seems it should he possible if you can define a decent &quot;normalization&quot; function.
评论 #30577515 未加载
评论 #30577870 未加载
评论 #30582464 未加载
评论 #30578012 未加载
评论 #30582985 未加载
cb321大约 3 年前
As jamra correctly points out in a sibling comment, the entry point to this (which gets a lot of traction on HN) is indeed attacking a strawman tutorial-written-on-an-airplane-in-Python algorithm. So, the 1M speed-up is very over-hyped.<p>That said, the technique is not wholly without merit, but does carry certain &quot;average-worst case&quot; trade offs related to latency in the memory&#x2F;storage system because of SymSpell&#x27;s reliance upon large hash tables. For details see <a href="https:&#x2F;&#x2F;github.com&#x2F;c-blake&#x2F;suggest" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;c-blake&#x2F;suggest</a><p>EDIT: Also - I am unaware of other implementations even saving the large, slow-to-compute index. The others I am aware of seem to rebuild the large index every time which seems kind of lame. EDIT2 - I guess there is a recent Rust one that is persistent as well as the &quot;mmap &amp; go&quot; Nim one. Still, what should be standard is very rare.
评论 #30581172 未加载
jamra大约 3 年前
This seems very suspicious to me. They’re comparing performance to a tutorial blog post that is extremely inefficient.<p>How about comparing it to. Levenshtein automaton or another state of the art approach?
评论 #30580279 未加载
injidup大约 3 年前
Slightly OT but trying to look up friends names using Android auto speech recognition whilst driving. I have two Austrian friend &quot;Viktor&quot; and &quot;Patric&quot;. If I say. &quot;Hey google, call Viktor&quot; google says. &quot;Sorry you have no Victor in your phone book. Same with Patric. &quot;Sorry you have no Patrick&quot; in your phone book. I&#x27;m surprised that there is not even basic scoring done when looking up names in the phone book with the most likely one offered.<p>*EDIT* I just found a solution to this problem. <a href="https:&#x2F;&#x2F;support.google.com&#x2F;assistant&#x2F;thread&#x2F;559644?hl=en&amp;msgid=58971362" rel="nofollow">https:&#x2F;&#x2F;support.google.com&#x2F;assistant&#x2F;thread&#x2F;559644?hl=en&amp;msg...</a> You can supply a phonetic name and this helps google match. This seems a bit low tech though.
danielscrubs大约 3 年前
It was a bit fun comparing the Haskell version and the C# version:<p><a href="https:&#x2F;&#x2F;github.com&#x2F;wolfgarbe&#x2F;SymSpell&#x2F;blob&#x2F;master&#x2F;SymSpell&#x2F;SymSpell.cs" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;wolfgarbe&#x2F;SymSpell&#x2F;blob&#x2F;master&#x2F;SymSpell&#x2F;S...</a><p><a href="https:&#x2F;&#x2F;github.com&#x2F;cbeav&#x2F;symspell&#x2F;blob&#x2F;master&#x2F;src&#x2F;SymSpell.hs" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;cbeav&#x2F;symspell&#x2F;blob&#x2F;master&#x2F;src&#x2F;SymSpell.h...</a>
评论 #30577344 未加载
nicoburns大约 3 年前
I&#x27;d love to see better open source spell checking. The state of the art spell checkers (in say, MS Word or Google Search) are <i>excellent</i> and more than good enough for my needs. And yet the spell checkers in browsers (e.g. Chrome and Firefox) and other apps tend to be terrible only catching very basic cases and often not being able to suggest the correct word.<p>Does anyone have any insight into what&#x27;s holding this back?
评论 #30580662 未加载
评论 #30582644 未加载
评论 #30580689 未加载
tgv大约 3 年前
So, this is time vs memory?
评论 #30578075 未加载
wodenokoto大约 3 年前
It says it does fuzzy search as well. Wonder how it compares to fzf
tootie大约 3 年前
Is this how modern spell checkers worked? I assumed they were more heuristic at this point. For example, Google&#x27;s &quot;did you mean&quot; is based on mapping common misspellings to what people actually clicked on.
amelius大约 3 年前
Can this be used to match DNA sequences?<p>Or sounds (like Shazam does)?