Soundex

95 points by Ivoah almost 7 years ago

18 comments

drawkbox almost 7 years ago

When I worked at eMarketing / Nomadic Agency I used SOUNDEX() in Microsoft SQL Server many times. Very useful.

One big place was on all Kraftfoods sites, for search across the recipe, product and brand sites. One use was ingredient lookups from misspellings in search, 2000-2008ish; it is still there in the search function at http://www.kraftrecipes.com/. When you put in 'chiken' you'll get 'chicken', for instance. Pretty useful for misspellings back then and even today.

Fun fact: we later also used Alta Vista search and even had a Google appliance, back when they made those, for aggregated site searching that tied into all search results across their brand sites. Search would check misspellings, part of which was SOUNDEX(), and then aggregate search over ingredients, recipes, products and content across their enterprise brand sites using the AV or Google boxes.

Another fun fact: the kraftfoods sites were one of the first production uses of Microsoft .NET. We worked with Microsoft on .NET 1.0 in 2001 to coincide with the release in 2002. We switched them from a combination of Perl and Java sites, on Java / ATG Dynamo (10+ servers) and 20+ Oracle servers, to .NET with 3-4 web/app servers and 3 Microsoft SQL Servers.
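To illustrate the 'chiken' / 'chicken' lookup described above, here is a minimal sketch of the classic American Soundex rules in Python. It is not SQL Server's implementation, whose SOUNDEX() may differ in edge cases; the function name `soundex` and the table `CODES` are just names chosen for this example.

```python
# Letter groups of classic American Soundex; vowels, H, W and Y have no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word: str) -> str:
    """Return a 4-character Soundex key, or "" if the word has no A-Z letters."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = [letters[0]]              # the first letter is kept as-is
    prev = CODES.get(letters[0], "")   # its code still suppresses a repeat
    for ch in letters[1:]:
        if ch in "HW":
            continue                   # H and W do not separate equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            result.append(code)
        prev = code                    # a vowel resets prev to ""
    return "".join(result)[:4].ljust(4, "0")


print(soundex("chiken"), soundex("chicken"))
```

Both spellings reduce to the same key, so a lookup on the key column finds 'chicken' for the misspelled query. The fixed 4-character key is also why long words with a shared prefix collide.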

knadh almost 7 years ago

Metaphone [1] addresses a lot of issues Soundex has. While Soundex is aimed specifically at names, Metaphone works for all English words.

PS: Inspired by Metaphone, I wrote MLphone [2], a phonetic lib for the Malayalam (South India) language. The phonetic keys the algorithm produces are Roman characters though.

[1] https://en.wikipedia.org/wiki/Metaphone
[2] https://nadh.in/code/mlphone/

joezydeco almost 7 years ago

If you live in Illinois, Wisconsin, or Florida, the Soundex code is used to create your driver's license number. You can derive almost anyone's number if you know their full name and birthdate: http://www.highprogrammer.com/alan/numbers/dl_us_shared.html

inertiatic almost 7 years ago

I will echo the opinion of others in this thread and say that in my experience fuzzy matching based on string distance metrics is a better approach in most cases I can think of.

I do search related stuff and we use phonetic algorithms for names (in a rather interesting way as well, which I haven't seen employed elsewhere) and will occasionally get reports or inquiries about weird unexpected matches, or questions about small typos not producing any of the expected results.

I feel these approaches were maybe a better fit for a time when talking was absolutely the main means of communication. In an era where people communicate more and more by typing things into their phones, any input is frequently a) copied over instead of transcribed, or b) first seen written and then typed out by the user on a small touchscreen keyboard, with 1 to 2 typos of letters close to the actual intended letter.

I wonder, is there an approach that takes this key distance into account? (i.e. in a search for Nock, results containing Nick should rank higher than Neck)
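One way to take key distance into account is an edit distance whose substitution cost is lower for neighbouring keys. The sketch below is only an illustration under stated assumptions: it treats QWERTY as a plain grid (ignoring the real row stagger), and the 0.5 cost for adjacent keys as well as the names `sub_cost` and `keyboard_distance` are arbitrary choices for this example, not an established algorithm.

```python
# Assumed layout: QWERTY rows treated as a simple grid (row stagger ignored).
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
POS = {ch: (r, c) for r, row in enumerate(ROWS) for c, ch in enumerate(row)}


def sub_cost(a: str, b: str) -> float:
    """Substitution cost: 0 if equal, 0.5 for neighbouring keys, else 1."""
    if a == b:
        return 0.0
    if a in POS and b in POS:
        (r1, c1), (r2, c2) = POS[a], POS[b]
        if abs(r1 - r2) <= 1 and abs(c1 - c2) <= 1:
            return 0.5  # hypothetical discount for a likely fat-finger typo
    return 1.0


def keyboard_distance(s: str, t: str) -> float:
    """Levenshtein distance with keyboard-aware substitution costs."""
    s, t = s.lower(), t.lower()
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        cur = [float(i)]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1.0,                 # delete a
                           cur[j - 1] + 1.0,              # insert b
                           prev[j - 1] + sub_cost(a, b)))  # substitute
        prev = cur
    return prev[-1]


# Rank candidates for the query "Nock": O sits next to I, but not next to E.
print(sorted(["Neck", "Nick"], key=lambda w: keyboard_distance("Nock", w)))
```

With these costs 'Nick' sorts ahead of 'Neck' for the query 'Nock', which is the ordering asked for above.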

kastnerkyle almost 7 years ago

I've had good luck with Caverphone for a number of speech specific tasks [0]. There is a Python implementation directly in the pdf. I also wrote a version here [1]; no idea if it exactly matches the pdf version, but it worked for my cases.

Followups to Soundex such as Metaphone are encumbered by license issues as far as I know, but Caverphone is free and clear AFAIK.

[2] is an insanely great overview of many of these algorithms, be sure to check it out if you are into this stuff.

[0] https://caversham.otago.ac.nz/files/working/ctp150804.pdf
[1] https://gist.github.com/kastnerkyle/a697d4e762fa8f53c70eea7bc712eead/
[2] http://ntz-develop.blogspot.ca/2011/03/phonetic-algorithms.html

spc476 almost 7 years ago

I used both Soundex and Metaphone to handle URLs for a Bible website: http://literature.conman.org/bible/

You could type a URL as:

    http://bible.conman.org/kj/genasys.1:1

and it would redirect to the proper page:

    http://bible.conman.org/kj/Genesis.1:1

You have to really misspell something for it to not work properly.

rasz almost 7 years ago

If you are interested in phonetic algorithms, sorting, recognizing and filtering, you will enjoy the Talking Banana Twitch ban story by Useless Duck Company: https://www.youtube.com/watch?v=bJ5ppf0po3k

JepZ almost 7 years ago

While I like the idea of Soundex, I always had problems with its fixed length. In addition, it is limited to the English language, and for other languages you might need different algorithms (e.g. Cologne phonetics [1] for German). As others have mentioned, Metaphone [2] is another alternative which got some traction in recent years, but I haven't tried it in a real world scenario yet.

For some use-cases n-gram [3] based string comparison might be an option too. It is in no way phonetic (and therefore universal across many languages), but when it is just about finding similar words it often produces better results than the original Soundex (mostly due to Soundex's length limitation).

[1]: https://en.wikipedia.org/wiki/Cologne_phonetics
[2]: https://en.wikipedia.org/wiki/Metaphone
[3]: https://en.wikipedia.org/wiki/N-gram
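As a rough sketch of n-gram based comparison, the snippet below scores two words by the Jaccard overlap of their character bigrams. The padding with spaces, the choice of Jaccard over other overlap measures, and the names `ngrams` and `ngram_similarity` are assumptions made for this example.

```python
def ngrams(word: str, n: int = 2) -> set:
    """Set of character n-grams, padded so word boundaries also count."""
    pad = " " * (n - 1)
    padded = f"{pad}{word.lower()}{pad}"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 2) -> float:
    """Jaccard similarity of the two n-gram sets, between 0.0 and 1.0."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0  # two empty inputs are treated as identical
    return len(ga & gb) / len(ga | gb)


# A misspelling keeps most of its bigrams; an unrelated word keeps few.
for candidate in ("chicken", "kitchen"):
    print(candidate, round(ngram_similarity("chiken", candidate), 2))
```

Unlike a Soundex key, the score uses the whole word, so differences near the end of a long word still change the result.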

kbutler almost 7 years ago

Years (decades) ago, I read about soundex, and found this little language that came with a soundex module. That was my introduction to Python, which I've used for my master's thesis and in personal and professional development.

But Python has since removed the soundex module.

ACow_Adonis almost 7 years ago

In truth, I've never had much luck with the phonetic algorithms, and I've implemented Caverphone 2, Double Metaphone, and NYSIIS [0].

Totally subjective, but in my domain I've had better results using cheaper string distance/similarity metrics (Hamming, Jaro/Winkler, etc.). Or, if you're looking for some kind of resource-saving/fuzzy indexing/blocking type use, an application that uses or extracts n-grams has worked pretty well for me. Your mileage may vary...

[0] https://github.com/DJMelksham/SAS-Data-Linking-Functions
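For the blocking use mentioned above, one possible shape is an inverted index from character trigrams to record ids, so that only records sharing a few trigrams with the query are compared in detail. This is a sketch, not the linked SAS code; the names `trigrams`, `build_index`, `candidates` and the `min_shared` threshold are all hypothetical.

```python
from collections import defaultdict


def trigrams(text: str) -> set:
    """Character trigrams of the lowercased text, padded with two spaces."""
    padded = f"  {text.lower()}  "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def build_index(records):
    """Map each trigram to the ids (list positions) of records containing it."""
    index = defaultdict(set)
    for rec_id, name in enumerate(records):
        for gram in trigrams(name):
            index[gram].add(rec_id)
    return index


def candidates(index, query: str, min_shared: int = 2) -> set:
    """Ids of records sharing at least min_shared trigrams with the query."""
    counts = defaultdict(int)
    for gram in trigrams(query):
        for rec_id in index.get(gram, ()):
            counts[rec_id] += 1
    return {rec_id for rec_id, shared in counts.items() if shared >= min_shared}


records = ["Jonathan Smith", "Jonathon Smith", "Maria Garcia"]
index = build_index(records)
print(sorted(candidates(index, "Jonathan Smyth")))
```

Only the surviving candidates would then go through a more expensive comparison such as Jaro/Winkler, which is where the resource saving comes from.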

dmlittle almost 7 years ago

If you're interested in soundex, you should also check out metaphone [1].

[1] https://en.wikipedia.org/wiki/Metaphone

dfdashh almost 7 years ago

A few years back I worked on record linkage projects that relied in part on Soundex. My experience is that Soundex is on the faster side of the phonetic algorithm speed spectrum while (Double) Metaphone is on the other. In the middle are modifications to Soundex or similar approaches like Soundex2, Phonex, and NYSIIS.

For those interested, I'd highly recommend the work of Peter Christen [1], who does a ton of research in this space. If you want to see some code, check out the implementations of several of these algorithms I wrote a while back [2].

[1]: http://users.cecs.anu.edu.au/~christen/
[2]: https://github.com/antzucaro/matchr

da_chicken almost 7 years ago

I've still got some applications that use SQL Server's SOUNDEX() function for fuzzy name matching. It's not perfect, but it works pretty well for most names. I've used it in a student information system to look for duplicate student entries (which happen more often than you'd think).

vszakats almost 7 years ago

Such a function existed back in Clipper '87, dBase IV, FoxPro and their descendants. Here's a Clipper-compatible implementation in C: https://github.com/vszakats/harbour-core/blob/master/src/rtl/soundex.c

(Disclaimer: source code author here.)

endriju almost 7 years ago

A pretty easy way to discover how Soundex works is to play with http://gridoc.com/fuzzy-matching/, a tool for fuzzy record matching that supports Soundex and Levenshtein distance.

Disclaimer: I'm the author of the tool.

TeMPOraL almost 7 years ago

> The algorithm mainly encodes consonants; a vowel will not be encoded unless it is the first letter.

Interesting. Isn't this similar to how Hebrew works (or at least how the one used in the Bible worked)? I wonder about the rationale (in either case).

codemaniac almost 7 years ago

My implementation of Soundex in Python: https://gist.github.com/codemaniac/4b23ea0b324a25c580b1192cdf66327a

cakes almost 7 years ago

I was surprised (years ago) when I learned about SOUNDEX() in Microsoft SQL Server. I always wondered why SOUNDEX was in SQL Server (not that I thought it shouldn't be, was just curious).