Tag Archives: algorithms

Can you hear the sound? A brief introduction to Soundex.

soundexOne of the common and seemingly simple problems in software development is to compare two strings. Assume that we are writing a contact list application and trying to prevent the same person from being added to the contacts more than once. Let’s say the operator tried to add a new name, new_name into the database. Consider the following Python-looking pseudo code:

for existing_name in existing_names:
  if existing_name == new_name:
    print "Error adding " + new_name + ": Person already exists"

That was really simple, right? What happens if we try to add a “JOHN SMITH” into the database, but there already exists a “John Smith”? It will get added, because the interpreter thinks that they are different names, because equality operator recognizes uppercase and lowercase versions of the same letter as different. Let’s modify our code to be smarter (actually dumber) about uppercase and lowercase:

for existing_name in existing_names:
  if existing_name.lower() == new_name.lower():
    print "Error adding " + new_name + ": Person already exists"

That instructed the computer to convert everything (both new name and the existing name) into lower case, then compare the two strings. That way, “JOHN SMITH”, “John Smith”, or even “jOhN SMiTh” all become “john smith” before comparison. Nice. It even compiles, so ship it. Wait! Here is our first customer complaint:

Dear Developer,
Your product is great, still I have a little problem. My database is full of duplicate names, some mistyped by our data entry staff. An example list is attached. Can you please send us an updated version that prevents this from happening?

Example names:
Chelsea McAlister
Chelsie McAlister
Chelsey MacAllister
Chelsee MacAlister

Your Customer

What should you do? You can reduce it a bit by educating the data entry staff, but it is impossible to prevent it from happening altogether. There must be a way that you can implement such that the application detects and warns about these collisions. What is the common thing between these names? They have many letters in common, but not all. Sometimes the length of the strings are the same, sometimes they are different. But why does a human think that they are the same person? Because although they are written differently, they are pronounced the same, or at least very similar. Here is our clue! They sound the same!

Continue reading