Look up duplicates

  • I have a table of customer names and ids, and I want to identify the duplicate names in a new table

    The customer table has records like so:

    ID -- Name

    1275 -- Customer A

    3472 -- Customer A

    2812 -- Customer A

    1245 -- Customer B

    1544 -- Customer C

    2567 -- Customer D

    3446 -- Customer D

    So, I have 3 duplicates for Customer A and 2 for Customer D. I want to take the first as the "master". The output table should look like this:

    Master ID -- Slave ID

    1275 -- 3472

    1275 -- 2812

    2567 -- 3446

    And so on. (Each duplicate will cause one record to be written to the output table).

    I'm using Excel at the moment, but I could switch to SQL. What is the easiest way to do this?

  • How's this? Not sure how it'll scale, but it runs pretty much instantly for me on a few thousand records.

    From a design point of view, an extra column in the Customer table with a self-referencing FK might be another solution, rather than an extra table.

    Cheers

    WITH CTE_Dups AS

    (

    SELECT CustomerId, CustomerName, ROW_NUMBER() OVER (PARTITION BY CustomerName ORDER BY CustomerId) AS RN

    from Customer

    )

    INSERT TargetTable (MasterId, SlaveId)

    SELECT cd.CustomerId AS MasterId, cd2.CustomerId AS SlaveId

    FROM CTE_Dups cd

    INNER JOIN CTE_Dups cd2 ON cd2.CustomerName = cd.CustomerName AND cd2.RN > 1

    WHERE cd.RN = 1

    GO

  • That looks good, however I am using SQLite (www.sqlite.org) and it doesn't seem to work with it.

    Would I need MS SQL 2008 R2 Express to run it?

  • Yes, it's using MS T-SQL specific functions. I'm not familiar with SQLite so not sure how to translate it into that dialect, sorry.

  • This could be easier to translate (if even needed)

    DECLARE @tTabletable(

    IDint,

    namevarchar( 20));

    INSERT INTO @tTable

    SELECT 1275, 'Customer A' UNION ALL

    SELECT 3472, 'Customer A' UNION ALL

    SELECT 2812, 'Customer A' UNION ALL

    SELECT 1245, 'Customer B' UNION ALL

    SELECT 1544, 'Customer C' UNION ALL

    SELECT 2567, 'Customer D' UNION ALL

    SELECT 3446, 'Customer D' ;

    SELECT m.Master, t.ID FROM @tTable t

    JOIN ( SELECT MIN(ID) AS Master,

    name

    FROM @tTable

    GROUP BY name) M ON t.name = m.name AND m.Master <> t.ID;

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Thanks Luis, that works well.

  • Luis Cazares (7/30/2012)


    SELECT m.Master, t.ID FROM @tTable t

    JOIN ( SELECT MIN(ID) AS Master,

    name

    FROM @tTable

    GROUP BY name) M ON t.name = m.name AND m.Master <> t.ID;

    [/code]

    Would you mind explaining what is going on here? It works but I'm not sure why!

  • tmccar (7/30/2012)


    Luis Cazares (7/30/2012)


    SELECT m.Master, t.ID FROM @tTable t

    JOIN ( SELECT MIN(ID) AS Master,

    name

    FROM @tTable

    GROUP BY name) M ON t.name = m.name AND m.Master <> t.ID;

    [/code]

    Would you mind explaining what is going on here? It works but I'm not sure why!

    It is creating a derived table of minimum id's when grouped by name. So you get your "master" id and the associated name in a "virtual" table. You are then joining that back to the main table to include that value with each record from the main table and excluding those rows where the master matches the actual id.

    Jared
    CE - Microsoft

  • Thanks Jared, that explains it.

  • Thanks Jared, I think I couldn't have explained that well.

    Of course, I know what I'm doing but I have problems to explain when I see some things that might look simple to me.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Luis Cazares (7/30/2012)


    Thanks Jared, I think I couldn't have explained that well.

    Of course, I know what I'm doing but I have problems to explain when I see some things that might look simple to me.

    I'm sure you could have explained it just as well 🙂

    Jared
    CE - Microsoft

  • Hi Jared, Luis

    "...It is creating a derived table of minimum id's when grouped by name."

    Instead of creating the derived table sorted by minimum ids, how could I change it so that it takes the ID that corresponds to the first occurence of "Name" in the table?

    Thanks

  • tmccar (7/30/2012)


    Hi Jared, Luis

    "...It is creating a derived table of minimum id's when grouped by name."

    Instead of creating the derived table sorted by minimum ids, how could I change it so that it takes the ID that corresponds to the first occurence of "Name" in the table?

    Thanks

    How do you define the "first occurrence?" There is no such thing in SQL Server without an ORDER BY or some other grouping. In this case it is grouping by the name and then finding the minimum id. How do you define first occurrence?

    Jared
    CE - Microsoft

  • How would you define "the first ocurrence"?

    I'm not sure if this is the best place to search for that answer, as the best solutions would be oriented on MS SQL Server.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • I imported an Excel file, so I was looking at that. It's sorted by name, then by date (newest to oldest) so that the first occurrence of each customer is the newest. (I have a date field in this file, if that would help)

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply