Enriching duplicate records (not deleting)

  • I need to enrich records which are duplicates before I remove them.

    So no records will be deleted, only records will be enriched with each others data.

    This enriching based on the following rules:

    A record in [prospects] is a duplicate when (check in this order):

    1.If records have the same companyphone value (and companyphone is not empty)

    2.when companyphone is empty ('') then check if companyname is the same.

    3.if no match on companyname, check if cp_email is the same.

    When duplicates are found, it must be made sure that all duplicates share the same values for the columns (except "id" column).

    Records in [prospects] that share the same 'companyname' can have different data (e.g. for companyphone, companyemployees etc). So the data for these duplicate records needs to become the same. For those records the following may be done: the highest value of a certain column may be used for all duplicate prospects. e.g.: if companyname "Microsoft" occurs 3 times, and the value for companyeployees is 100.000, 90.000 and 76000, the value for all these duplicate records may become "100.000". This MAX rule applies on all columns in prospects (except "id" column).

    Normally I'd be glad to get tips on how to contruct the SQL, but time is short, so if anyone can provide me with the working script, I'd be very very thankful! 🙂

  • petervdkerk (10/18/2011)


    I need to enrich records which are duplicates before I remove them.

    So no records will be deleted, only records will be enriched with each others data.

    This enriching based on the following rules:

    A record in [prospects] is a duplicate when (check in this order):

    1.If records have the same companyphone value (and companyphone is not empty)

    2.when companyphone is empty ('') then check if companyname is the same.

    3.if no match on companyname, check if cp_email is the same.

    When duplicates are found, it must be made sure that all duplicates share the same values for the columns (except "id" column).

    Records in [prospects] that share the same 'companyname' can have different data (e.g. for companyphone, companyemployees etc). So the data for these duplicate records needs to become the same. For those records the following may be done: the highest value of a certain column may be used for all duplicate prospects. e.g.: if companyname "Microsoft" occurs 3 times, and the value for companyeployees is 100.000, 90.000 and 76000, the value for all these duplicate records may become "100.000". This MAX rule applies on all columns in prospects (except "id" column).

    Normally I'd be glad to get tips on how to contruct the SQL, but time is short, so if anyone can provide me with the working script, I'd be very very thankful! 🙂

    The database can be found here: http://www.wunderwedding.com/files/prosp_db.zip%5B/quote%5D

    How much do you pay?

  • Are you on freelancer.com?

    If so, you can bid on the project I created fot this: http://www.freelancer.com/projects/1256329.html

  • its a homework.

  • Not sure what you mean, but if you think this is a homework assignment I'd have to disappoint you, I could then easily copy it from one of my fellow students (if only I still were in college 😉

    I'm afraid I need this very fast for a client of mine and I need it done perfect this weekend...(once again: if it were homework I could settle for a few mistakes 😉

  • petervdkerk (10/18/2011)


    I need to ...

    The database can be found here: http://www.wu....ng.com/files/prosp_db.zip

    I know for a fact that my employer (JF Hillebrand) does not want their name and contact data listed in a publicly downloadable 'test database'. And I can only assume that none of the other companies included in there have granted you any rights to put their data online. Can you please take this backup offline right now!



    Posting Data Etiquette - Jeff Moden[/url]
    Posting Performance Based Questions - Gail Shaw[/url]
    Hidden RBAR - Jeff Moden[/url]
    Cross Tabs and Pivots - Jeff Moden[/url]
    Catch-all queries - Gail Shaw[/url]


    If you don't have time to do it right, when will you have time to do it over?

  • Put the wrong db there..its offline now! Thank you so much for noticing!

  • Thanks for taking it offline. Please be even more careful with other people's privacy than your own.

    One tip: since 'companyphone' is going to be your first selection criterion, I would recommend to clean the data in that column. Entries like '00000000', '', '+31', or with a town name in it need to be set to null, and you need to look at how the phone numbers were entered. You've got all different variations in there: with and without international access codes, with and without '(0)' before the area codes, with '-', ' ', ' - ' and without anything between area code and number. Help yourself and clean those numbers before trying to match them.

    Plus a question on your criteria. 1 and 2 are exclusive: test companyphone if it is non-empty and 2 and 3 if companyphone is empty. 2 and 3 however are not exclusive, i.e. some prospects can be both a match on name(2) and email(3). The "max-rule" can however not be applied for both matches at the same time. This makes it very hard to come up with a solution for your problem. Can you rephrase criteria 2 and 3 so that they are exclusive?



    Posting Data Etiquette - Jeff Moden[/url]
    Posting Performance Based Questions - Gail Shaw[/url]
    Hidden RBAR - Jeff Moden[/url]
    Cross Tabs and Pivots - Jeff Moden[/url]
    Catch-all queries - Gail Shaw[/url]


    If you don't have time to do it right, when will you have time to do it over?

  • Ofcourse, you scared the living **** out of me, but once again: thanks for noticing! Havent got around to check what the actual DB was yet, but better safe than sorry since you're absolutely right: privacy is even more important when handling other people's data! (sometimes) unfortunately I'm only human 😉

    Ok, now back to the good stuff, I altered the rules like so:

    Enriching based on the following rules:

    A record is a duplicate when (check in this order):

    1. If records have the exact same companyphone value and companyphone is not empty (‘’) and companyphone is not null

    2. when companyphone is empty ('') or companyphone is null and companyname is the exact same (case insensitive).

    3. when companyphone is empty ('') or companyphone is null and companyname is NOT the exact same (e.g. typed differently) and cp_email is the same and cp_email not empty(‘’) and cp_email not null.

    I think this does make all 3 rules mutually exclusive.

    Does this help?

  • Actually, 2 and 3 are still not mutually exclusive this way. For rule 1 you can determine whether a row needs to be included by reading only data from this one row (i.e. companyphone). So no problem here describing with simple logical operators what the data needs to match to.

    To determine whether to include a row in rule 2 or 3 however, you need data from this row plus data from another row. So one single row can still be in rule 2 and in rule 3 at the same time, it depends on the other rows contents. You need to adjust your specifications so they are set-based: you need to use operators "there is at least one" or "there is no other" and the like.



    Posting Data Etiquette - Jeff Moden[/url]
    Posting Performance Based Questions - Gail Shaw[/url]
    Hidden RBAR - Jeff Moden[/url]
    Cross Tabs and Pivots - Jeff Moden[/url]
    Catch-all queries - Gail Shaw[/url]


    If you don't have time to do it right, when will you have time to do it over?

  • I dont understand how a row can be both in rule 2 and 3 when there's an explicit rule in 2: companyname is EXACT same and rule 3 is the opposite: companyname is NOT the same.

    And are you also saying it is hard (if not impossible) to find duplicates based on more than 1 column?

    Thanks!

  • Here's an example:

    ID, phone name email

    1, null, 'companyA', 'companya@hotmail.com'

    2, null, 'companyA', 'x@companya.com'

    3, null, 'companyB', 'companya@hotmail.com'

    See how the row with ID = 1 fits both your rules 2 and 3 at the same time? Rule 2 applies when you match it to row with ID = 2, and rule 3 applies when you match it to row with ID = 3. Your specs fail to describe to which rows the match needs to be made.

    And to answer your question: No, it is not hard at all to find duplicates on multiple columns, you just need to think over all possible combinations that you do think of as a duplicate and which are not a duplicate.



    Posting Data Etiquette - Jeff Moden[/url]
    Posting Performance Based Questions - Gail Shaw[/url]
    Hidden RBAR - Jeff Moden[/url]
    Cross Tabs and Pivots - Jeff Moden[/url]
    Catch-all queries - Gail Shaw[/url]


    If you don't have time to do it right, when will you have time to do it over?

  • Ah I see now, thanks!

    Ok, so talked to the customer and the last rule with the check on email is not really necessary...so only rule 1 and rule 2 would apply.

    Now it should be possible to get the correct duplicates right?

  • Did you think of a way to "clean" those phone numbers yet? i.e. '123-4567890' won't be the same number as '1234567890' or '123 - 4567890' if you don't clean them up. Remember, this is just an example, there are many variations like these in your companyphone column that will not be seen as matches even though to the human eye they are clearly duplicates. Without cleaning there will be hardly any matches on companyphone.



    Posting Data Etiquette - Jeff Moden[/url]
    Posting Performance Based Questions - Gail Shaw[/url]
    Hidden RBAR - Jeff Moden[/url]
    Cross Tabs and Pivots - Jeff Moden[/url]
    Catch-all queries - Gail Shaw[/url]


    If you don't have time to do it right, when will you have time to do it over?

  • Yes, I know that matching exactly is going to be nearly impossible.

    So, spaces and dashes will be removed. That will have to do, anything which after that is still duplicate is just the way it is.

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply