Can you hear the sound? A brief introduction to Soundex.

soundexOne of the common and seemingly simple problems in software development is to compare two strings. Assume that we are writing a contact list application and trying to prevent the same person from being added to the contacts more than once. Let’s say the operator tried to add a new name, new_name into the database. Consider the following Python-looking pseudo code:

for existing_name in existing_names:
  if existing_name == new_name:
    print "Error adding " + new_name + ": Person already exists"

That was really simple, right? What happens if we try to add a “JOHN SMITH” into the database, but there already exists a “John Smith”? It will get added, because the interpreter thinks that they are different names, because equality operator recognizes uppercase and lowercase versions of the same letter as different. Let’s modify our code to be smarter (actually dumber) about uppercase and lowercase:

for existing_name in existing_names:
  if existing_name.lower() == new_name.lower():
    print "Error adding " + new_name + ": Person already exists"

That instructed the computer to convert everything (both new name and the existing name) into lower case, then compare the two strings. That way, “JOHN SMITH”, “John Smith”, or even “jOhN SMiTh” all become “john smith” before comparison. Nice. It even compiles, so ship it. Wait! Here is our first customer complaint:

Dear Developer,
Your product is great, still I have a little problem. My database is full of duplicate names, some mistyped by our data entry staff. An example list is attached. Can you please send us an updated version that prevents this from happening?

Example names:
Chelsea McAlister
Chelsie McAlister
Chelsey MacAllister
Chelsee MacAlister

Your Customer

What should you do? You can reduce it a bit by educating the data entry staff, but it is impossible to prevent it from happening altogether. There must be a way that you can implement such that the application detects and warns about these collisions. What is the common thing between these names? They have many letters in common, but not all. Sometimes the length of the strings are the same, sometimes they are different. But why does a human think that they are the same person? Because although they are written differently, they are pronounced the same, or at least very similar. Here is our clue! They sound the same!

Enter Soundex

Soundex algorithms are algorithms that take a word as input, and produce a string that is a phonetic representation of it. The rules are based on how a human positions their lips and tongue while saying a word. If we compare the soundex representations of two strings, we have a better chance of catching duplicates. For example, the names above will translate to:

Chelsea McAlister = C420 M242
Chelsie McAlister = C420 M242
Chelsey MacAllister = C420 M242
Chelsee MacAlister = C420 M242

Impressive, huh? Then we can revise our code like the following:

for existing_name in existing_names:
  if soundex(existing_name) == soundex(new_name):
    print "Error adding " + new_name + ": Person already exists"

Note that we don’t need to convert to lower case anymore, as soundex result does not change with case. Also, please note that this is a pseude-code to illustrate the use, that is why we did not compare name and surname separately (that we actually should in the production code).

There are several more phonetic algorithms that improves on Soundex, like Metaphone. Although the idea is the same, it uses English-specific variances and inconsistencies. See Wikipedia entries of Soundex and Metaphone for more information.

Leave a Reply

Your email address will not be published. Required fields are marked *