Question about using SoundEx to find Client Name Duplication

  •  

    I have a Client list of about 175,000 plus records.  It is not my own.  We are sure there are repeated clients, (e.g. A & A Plumbing, A and A Plumbing, A and A Plumbing Inc., A & A Plumbing, Incorporated, etc). 

    I have written a routine that parses out each word in each record.  I then use SoundEx.  I then parse out each SoundEx return and use the ASCII to get a numeric value.  I then add the numeric values until all of the words are completed for each record. 

    Note: I have a list of REPLACE for things like Inc, Incorp, Incorporated, etc since that would only complicate this worse. 

    I am getting a return set of about 50,000 records; some are look to be similar, some are not close at all. 

    Has anyone done a SQL routine that tries to capture these kinds of occurences and if so, what approach did you take?  If not, does anyone have a suggestion as to how I could further refine these 50,000 records? 

    Thank you. 

    I wasn't born stupid - I had to study.

  • I has done a SQL routing that tries to capture Malyutin Slava and Malyutin Salava. That is my name (mSlava). But russian name has third-part, therefore it routing tries to capture Malyutin Slava Nikolaevich. In my script I can set a number of errors. If you want I can to e-mail it. 

  • SOUNDEX on SQL Server is too limiting.

    You need an equivalent that takes into account more letters so that you can tune your dedupe process.

    You also need to be able to SOUNDEX individual words in your name and count the matching words.

    Personally I use the Lawrence Philips Metaphone function within a VB app.  Metaphone blows SOUNDEX into the weeds because it is concerned with the phonetic sound of a word rather than a straigh letter substitution.

    SOUNDEX will tell you that fillip and Philip are different.  METAPHONE will tell you that they sound the same and are probably a match.

  • Thank you!  I agree!  SoundEx is a pain.  I have a long list of REPLACE statements.  I've got the code to break the words apart using looping with CharIndex, so that is not a problem.  And that is why I was trying to assign numeric values to the SoundEx returns in hopes of having some kind of match.  

    I only have a SQL Server to do this upon, so I am very limited in my use of outside applications. 

    Thanks for replying though. 

    (If I get a good solution, I will do what I can to post it)

    I wasn't born stupid - I had to study.

  • Hi Farrell,

    Did you ever find a solution to your problem? I'm asking because I'm looking for something pretty much the same as what you are looking for.

    Thanks,

    Grant

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply