Dedupe Within SSIS (Remove duplicates)

  • Hi,

    Can you help me with this SSIS 2008 package pls.

    I have a Library Data warehouse with a cutomer table – DimCustomer in it. A customer flat file extract comes from each of the branches of the library on a weekly basis. A customer could register in multiple branches – hence come as part of multiple files - so need to make sure only one instance of that customer ends up in DimCustomer table.

    Here are the 2 things i need to implement whist loading this customer table:

    1. 1st I need to dedupe within each of the files themselves – in case a branch sends the same customer (LibID, name & address) multiple times in the same file.

    2. All already generated existing DimCustomerKEYs and the corresponding LibIDs are stored in a customer lookup table. So, the next step is to compare the extract against that lookup table for a match to determine if they are existing customer or new. If they fail, they will also be fuzzy matched against the DimCustomer table on their name & address. Depending on the outcome, I will either insert a new row in DimCustomer or update their existing record and the lookup table with the new LibID they’ve come with.

    I am pretty confident with using SCD part of the package. Just need help with the de-duping especially (step 1 above) and fuzzy lookup.

    Infact a shell of what the package should look like would be much appreciated.

    Thank you.

  • Have you looked at MergeJoin task with Conditional Split?

    You supply a flow from two sources and check for the links. Then use a conditional split task to flow the data where you need it to go

  • Could be too simple an approach (ie your problem could be more than I read into it) but why not simply aggregate? A group by of the key fields (ID and Address) would eliminate duplicates implicitly.

    The main issue I would see using this approach is if a single ID was supplied with >1 address (ie steve_id, 123 West St and steve_id, 456 East St), however, this issue is more than a technical one, it requires business input because without that, your choice of which one is 'correct' is arbitrary.

    Steve.

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply