Removing partially duplicate records in SQL

  • What about this sample data?

    SELECT'', '', 'Y', 3 UNION ALL

    SELECT'', 'X', 'Y', 3

    Should first line be deleted since it is a subset of second line?


    N 56°04'39.16"
    E 12°55'05.25"

  • Thanks Peso,

    Yes, The first line should be deleted. Thanks for your codes. I am still trying to understand it.

  • Steve Jones - Editor (10/15/2008)


    Multiple passes would be looking for a pattern like Row 2, where Col1 and 2 match, but Col3 is a null/blank. Then a second pattern would be col1 and 3 match, but 2 has a null/blank.

    Max (10/15/2008)


    Yes, The first line should be deleted. Thanks for your codes. I am still trying to understand it.

    In that case, use this more generic one step delete algorithm

    DECLARE@Sample TABLE

    (

    recID INT IDENTITY(1, 1),

    col1 VARCHAR(1),

    col2 VARCHAR(2),

    col3 VARCHAR(3),

    userID INT

    )

    INSERT@Sample

    SELECT'A', 'B', 'C', 1 UNION ALL

    SELECT'A', 'B', '', 1 UNION ALL

    SELECT'A', '', 'C', 1 UNION ALL

    SELECT'F', '', 'C', 1 UNION ALL

    SELECT'', 'M', '', 2 UNION ALL

    SELECT'T', 'M', 'O', 2 UNION ALL

    SELECT'', 'M', 'O', 2 UNION ALL

    SELECT'X', 'M', 'O', 2 UNION ALL

    SELECT'', '', 'Y', 3 UNION ALL

    SELECT'', 'X', 'Y', 3

    SELECT*

    FROM@Sample

    DELETEs

    FROM(

    SELECTrecID,

    col1,

    col2,

    col3,

    userID,

    CASE WHEN col1 = '' THEN 0 ELSE 1 END

    + CASE WHEN col2 = '' THEN 0 ELSE 1 END

    + CASE WHEN col3 = '' THEN 0 ELSE 1 END AS theItems

    FROM@Sample

    ) AS s

    INNER JOIN(

    SELECTMIN(recID) AS minID,

    col1,

    col2,

    col3,

    userID,

    CASE WHEN col1 = '' THEN 0 ELSE 1 END

    + CASE WHEN col2 = '' THEN 0 ELSE 1 END

    + CASE WHEN col3 = '' THEN 0 ELSE 1 END AS theItems

    FROM@Sample

    GROUP BYcol1,

    col2,

    col3,

    userID,

    CASE WHEN col1 = '' THEN 0 ELSE 1 END

    + CASE WHEN col2 = '' THEN 0 ELSE 1 END

    + CASE WHEN col3 = '' THEN 0 ELSE 1 END

    ) AS w ON w.userID = s.userID

    AND w.minID <> s.recID

    AND w.theItems >= s.theItems

    WHEREw.col1 = COALESCE(NULLIF(s.col1, ''), w.col1)

    AND w.col2 = COALESCE(NULLIF(s.col2, ''), w.col2)

    AND w.col3 = COALESCE(NULLIF(s.col3, ''), w.col3)

    SELECT*

    FROM@Sample

    Also see http://weblogs.sqlteam.com/peterl/archive/2008/10/15/Delete-all-subset-records.aspx


    N 56°04'39.16"
    E 12°55'05.25"

  • Thanks Peso. You're brilliant. The codes look really great. I haven't fully tested it but I believe it will work.

    Thanks a lot all others who participated. Have a good day!!

  • It turns out you can write the algorithm even cleaner.

    I got a comment on my blog to use an algorithm like this instead.

    DELETEa

    FROM@Sample AS a

    INNER JOIN@Sample AS b ON b.userID = a.userID

    AND b.recID <> a.recID

    WHERE(a.col1 = b.col1 OR a.col1 = '')

    AND (a.col2 = b.col2 OR a.col2 = '')

    AND (a.col3 = b.col3 OR a.col3 = '')

    It's very nice but it creates several cross joins for each userID (n * m).

    I tried to create a triangular join with help of the "count of values" column ( n * m / 2).


    N 56°04'39.16"
    E 12°55'05.25"

  • Thanks Peso,

    The new algorithm looks good, but I need to make sure There are no pure duplicates. If there are, it will delete them all.

    INSERT @SAMPLE

    SELECT 'A', 'B', 'C', 1 UNION ALL <-- duplicate

    SELECT 'A', 'B', 'C', 1 UNION ALL <-- duplicate

    SELECT 'A', 'B', ' ', 1 UNION ALL

    SELECT 'A', ' ', 'C', 1 UNION ALL

    SELECT 'F', ' ', 'C', 1 UNION ALL

    SELECT ' ', 'M', ' ', 2 UNION ALL

    SELECT 'T', 'M', 'O', 2 UNION ALL

    SELECT ' ', 'M', 'O', 2 UNION ALL

    SELECT 'X', 'M', 'O', 2 UNION ALL

    SELECT ' ', ' ', 'Y', 3 UNION ALL

    SELECT ' ', 'X', 'Y', 3

    Thanks.

    p.s. Maybe I can group it by col1, col2, and col3 again and select only one of them to delete. I will run some tests. Thanks.

  • Easy fixed when you understand the code.

    DELETEa

    FROM@Sample AS a

    INNER JOIN(

    SELECTMIN(recID) AS recID,

    col1,

    col2,

    col3,

    userID

    FROM@Sample

    GROUP BYcol1,

    col2,

    col3,

    userID

    ) AS b ON b.userID = a.userID

    AND b.recID <> a.recID

    WHERE(a.col1 = b.col1 OR a.col1 = '')

    AND (a.col2 = b.col2 OR a.col2 = '')

    AND (a.col3 = b.col3 OR a.col3 = '')

    And now the code looks very similar to my first suggestion. The thing missing is the WHERE '' NOT IN part 😀


    N 56°04'39.16"
    E 12°55'05.25"

Viewing 7 posts - 16 through 21 (of 21 total)

You must be logged in to reply to this topic. Login to reply