When I worked at eMarketing / Nomadic Agency I used Soundex, via SOUNDEX() in Microsoft SQL Server, many times. Very useful.<p>One big place was search on all the Kraft Foods sites: recipes, products and brand sites. One use was ingredient lookups from misspelled search terms, roughly 2000-2008, and it is still there in the search function at <a href="http://www.kraftrecipes.com/" rel="nofollow">http://www.kraftrecipes.com/</a>. When you put in 'chiken' you'll get 'chicken', for instance. Pretty useful for misspellings back then and even today.<p>Fun fact: we later also used Alta Vista search and even had a Google appliance, back when they made those, for aggregated site search that tied together results across all their brand sites. Search would check for misspellings, part of which was SOUNDEX(), and then aggregate ingredients, recipes, products and content across their enterprise brand sites using the AV or Google boxes.<p>Another fun fact: the Kraft Foods sites were one of the first production uses of Microsoft .NET. We worked with Microsoft on .NET 1.0 in 2001 so that the launch would coincide with the release in 2002. We switched them from a combination of Perl sites and Java sites, running on 10+ Java / ATG Dynamo servers and 20+ Oracle servers, to .NET with 3-4 web/app servers and 3 Microsoft SQL Servers.
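For anyone wondering how 'chiken' ends up at 'chicken': here is a minimal sketch of American Soundex in Python. It is not the SQL Server implementation (SOUNDEX() there may differ in edge cases), and the ingredient list and index below are made up for illustration.

```python
# Sketch of American Soundex: first letter plus three digits.
# Assumption: the common variant where H and W are skipped without
# separating equal codes, while vowels and Y do separate them.
CODES = {}
for digit, group in enumerate(("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), start=1):
    for letter in group:
        CODES[letter] = str(digit)


def soundex(word: str) -> str:
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    key = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # skipped, and the previous code is kept
        code = CODES.get(ch, "")  # vowels and Y have no code
        if code and code != prev:
            key += code
        prev = code
    return (key + "000")[:4]  # pad or truncate to four characters


# Hypothetical ingredient index keyed by Soundex code.
ingredients = ["chicken", "cheese", "carrot"]
index = {}
for name in ingredients:
    index.setdefault(soundex(name), []).append(name)

print(index.get(soundex("chiken"), []))  # ['chicken']
```

Both spellings reduce to C250, so the misspelling lands in the same bucket as the real ingredient. In SQL you would do the same thing by comparing SOUNDEX() of the search term against SOUNDEX() of the stored names.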
Metaphone [1] addresses a lot of issues Soundex has. While Soundex is aimed specifically at names, Metaphone works for all English words.<p>PS: Inspired by Metaphone, I wrote MLphone [2] a phonetic lib for the Malayalam (South India) language. The phonetic keys the algorithm produces are Roman characters though.<p>[1] <a href="https://en.wikipedia.org/wiki/Metaphone" rel="nofollow">https://en.wikipedia.org/wiki/Metaphone</a><p>[2] <a href="https://nadh.in/code/mlphone/" rel="nofollow">https://nadh.in/code/mlphone/</a>
If you live in Illinois, Wisconsin, or Florida the Soundex code is used to create your drivers license number. You can derive almost anyone’s number if you know their full name and birthdate:<p><a href="http://www.highprogrammer.com/alan/numbers/dl_us_shared.html" rel="nofollow">http://www.highprogrammer.com/alan/numbers/dl_us_shared.html</a>
I will echo the opinion of others in this thread and say that, in my experience, fuzzy matching based on string distance metrics is the better approach in most cases I can think of.<p>I do search-related work, and we use phonetic algorithms for names (in a rather interesting way that I haven't seen employed elsewhere). We occasionally get reports of weird, unexpected matches, or questions about small typos not producing any of the expected results.<p>I feel these approaches were maybe a better fit for a time when talking was the main means of communication. In an era where people communicate more and more by typing into their phones, input is frequently a) copied over instead of transcribed or b) first seen written and then typed out by the user on a small touchscreen keyboard, with one or two typos on letters close to the intended one.<p>I wonder whether there is an approach that takes this key distance into account (i.e. in a search for Nock, results containing Nick should rank higher than those containing Neck)?
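One way the key-distance idea could be sketched is a Levenshtein distance where substituting a neighbouring key is cheaper than substituting a distant one. Everything here is an assumption for illustration: the QWERTY layout, the approximate row stagger, the 1.5 adjacency threshold and the 0.5 cost are arbitrary choices, not taken from any particular library.

```python
import math

# Approximate QWERTY geometry: column index plus a per-row stagger (assumed values).
ROWS = ("qwertyuiop", "asdfghjkl", "zxcvbnm")
OFFSETS = (0.0, 0.25, 0.75)
KEY_POS = {
    ch: (col + OFFSETS[row], float(row))
    for row, keys in enumerate(ROWS)
    for col, ch in enumerate(keys)
}


def substitution_cost(a: str, b: str) -> float:
    """0 for equal letters, 0.5 for neighbouring keys, 1 otherwise."""
    if a == b:
        return 0.0
    if a in KEY_POS and b in KEY_POS:
        (ax, ay), (bx, by) = KEY_POS[a], KEY_POS[b]
        if math.hypot(ax - bx, ay - by) < 1.5:
            return 0.5
    return 1.0


def keyboard_distance(s: str, t: str) -> float:
    """Levenshtein distance with keyboard-aware substitution costs."""
    s, t = s.lower(), t.lower()
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        cur = [float(i)]
        for j, b in enumerate(t, start=1):
            cur.append(min(
                prev[j] + 1.0,                          # delete a
                cur[j - 1] + 1.0,                       # insert b
                prev[j - 1] + substitution_cost(a, b),  # substitute a -> b
            ))
        prev = cur
    return prev[-1]


candidates = ["Neck", "Nick"]
print(sorted(candidates, key=lambda c: keyboard_distance("Nock", c)))  # ['Nick', 'Neck']
```

With this weighting Nock to Nick costs 0.5 (o and i are neighbours) while Nock to Neck costs 1.0, so Nick ranks first. A real touchscreen model would probably want per-device layouts and costs tuned on actual typo data.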
I've had good luck with Caverphone for a number of speech-specific tasks [0]. There is a Python implementation directly in the PDF. I also wrote a version here [1]; no idea if it exactly matches the PDF version, but it worked for my cases.<p>Follow-ups to Soundex such as Metaphone are encumbered by license issues as far as I know, but Caverphone is free and clear AFAIK.<p>[2] is an insanely great overview of many of these algorithms, be sure to check it out if you are into this stuff.<p>[0] <a href="https://caversham.otago.ac.nz/files/working/ctp150804.pdf" rel="nofollow">https://caversham.otago.ac.nz/files/working/ctp150804.pdf</a><p>[1] <a href="https://gist.github.com/kastnerkyle/a697d4e762fa8f53c70eea7bc712eead/" rel="nofollow">https://gist.github.com/kastnerkyle/a697d4e762fa8f53c70eea7b...</a><p>[2] <a href="http://ntz-develop.blogspot.ca/2011/03/phonetic-algorithms.html" rel="nofollow">http://ntz-develop.blogspot.ca/2011/03/phonetic-algorithms.h...</a>
I used both Soundex and Metaphone to handle URLs for a Bible website: <a href="http://literature.conman.org/bible/" rel="nofollow">http://literature.conman.org/bible/</a> You could type a URL as:<p><pre><code> http://bible.conman.org/kj/genasys.1:1
</code></pre>
and it would redirect to the proper page:<p><pre><code> http://bible.conman.org/kj/Genesis.1:1
</code></pre>
You have to <i>really</i> misspell something for it to not work properly.
If you are interested in phonetic algorithms, sorting, recognizing and filtering, you will enjoy the Talking Banana Twitch ban story by Useless Duck Company: <a href="https://www.youtube.com/watch?v=bJ5ppf0po3k" rel="nofollow">https://www.youtube.com/watch?v=bJ5ppf0po3k</a>
While I like the idea of Soundex, I always had problems with its fixed length. In addition, it is limited to the English language, and other languages might need different algorithms (e.g. Cologne phonetics [1] for German). As others have mentioned, Metaphone [2] is another alternative which got some traction in recent years, but I haven't tried it in a real-world scenario yet.<p>For some use cases, n-gram [3] based string comparison might be an option too. It is in no way phonetic (and therefore works across many languages), but when it is just about finding similar words it often produces better results than the original Soundex (mostly due to Soundex's length limitation).<p>[1]: <a href="https://en.wikipedia.org/wiki/Cologne_phonetics" rel="nofollow">https://en.wikipedia.org/wiki/Cologne_phonetics</a><p>[2]: <a href="https://en.wikipedia.org/wiki/Metaphone" rel="nofollow">https://en.wikipedia.org/wiki/Metaphone</a><p>[3]: <a href="https://en.wikipedia.org/wiki/N-gram" rel="nofollow">https://en.wikipedia.org/wiki/N-gram</a>
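To make the n-gram idea concrete, here is a rough sketch using character bigrams and the Dice coefficient. The padding with spaces, the use of sets rather than counted n-grams, and the choice of Dice over other overlap measures are all assumptions on my part; there are many variants.

```python
def ngrams(word: str, n: int = 2) -> set:
    """Set of character n-grams, padded so word boundaries count too."""
    pad = " " * (n - 1)
    padded = f"{pad}{word.lower()}{pad}"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def dice_similarity(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient over n-gram sets: 1.0 means identical sets."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


# A misspelling scores much closer to the intended word than to an unrelated one.
print(dice_similarity("chiken", "chicken") > dice_similarity("chiken", "cheese"))  # True
```

The length point shows up with longer words: Soundex gives both "internet" and "international" the code I536 because it stops after three digits, whereas the bigram score for that pair comes out well below 1.0 since the whole word contributes.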
Years (decades) ago, I read about soundex, and found this little language that came with a soundex module. That was my introduction to Python, which I've used for my master's thesis and in personal and professional development.<p>But Python has since removed the soundex module.
In truth, I've never had much luck with the phonetic algorithms, and I've implemented Caverphone 2, Double Metaphone, and NYSIIS [0].<p>Totally subjective, but in my domain I've had better results either with cheaper string distance/similarity metrics (Hamming, Jaro/Winkler, etc.) or, if you're looking for some kind of resource-saving fuzzy indexing or blocking, with an application that uses or extracts n-grams. Your mileage may vary...<p>[0] <a href="https://github.com/DJMelksham/SAS-Data-Linking-Functions" rel="nofollow">https://github.com/DJMelksham/SAS-Data-Linking-Functions</a>
If you're interested in soundex, you should also check out metaphone[1]<p>[1] <a href="https://en.wikipedia.org/wiki/Metaphone" rel="nofollow">https://en.wikipedia.org/wiki/Metaphone</a>
A few years back I worked on record linkage projects that relied in part on Soundex. My experience is that Soundex is on the faster side of the phonetic algorithm speed spectrum while (Double) Metaphone is on the other. In the middle are modifications to Soundex or similar approaches like Soundex2, Phonex, and NYSIIS.<p>For those interested I'd highly recommend the work of Peter Christen [1], who does a ton of research in this space. If you want to see some code, check out the implementations of several of these algorithms I wrote a while back [2].<p>[1]: <a href="http://users.cecs.anu.edu.au/~christen/" rel="nofollow">http://users.cecs.anu.edu.au/~christen/</a><p>[2]: <a href="https://github.com/antzucaro/matchr" rel="nofollow">https://github.com/antzucaro/matchr</a>
I've still got some applications that use SQL Server's SOUNDEX() function for fuzzy name matching. It's not perfect, but it works pretty well for most names. I've used it in a student information system to look for duplicate student entries (which happen more often than you'd think).
Such a function existed back in Clipper '87, dBase IV, FoxPro and their descendants. Here's a Clipper-compatible implementation in C: <a href="https://github.com/vszakats/harbour-core/blob/master/src/rtl/soundex.c" rel="nofollow">https://github.com/vszakats/harbour-core/blob/master/src/rtl...</a><p>(Disclaimer: source code author here.)
A pretty easy way to discover how Soundex works can be playing with <a href="http://gridoc.com/fuzzy-matching/" rel="nofollow">http://gridoc.com/fuzzy-matching/</a> - a tool for fuzzy record matching that supports Soundex and Levenshtein Distance.<p>Disclaimer: I'm the author of the tool.
> <i>The algorithm mainly encodes consonants; a vowel will not be encoded unless it is the first letter.</i><p>Interesting. Isn't this similar to how Hebrew works (or at least the one used in the Bible worked)? I wonder about the rationale (in either case).
My implementation of Soundex in Python - <a href="https://gist.github.com/codemaniac/4b23ea0b324a25c580b1192cdf66327a" rel="nofollow">https://gist.github.com/codemaniac/4b23ea0b324a25c580b1192cd...</a>
I was surprised (years ago) when I learned about SOUNDEX() in Microsoft SQL Server. I always wondered why SOUNDEX was in SQL Server (not that I thought it shouldn't be, was just curious).