How to delete duplicated rows and keep just one?

  • Hello

    I have a table with some duplicate rows that exist! (rows exactly similar, only the ID is different)

    Here is an example

    ID FirstName LastName Email

    -------------------------------

    1 aaa bbb aa@bb

    2 aaa bbb aa@bb

    3 aaa bbb aa@bb

    4 ccc ddd cc@dd

    5 ccc ddd cc@dd

    I need to delete duplicate rows and keep only one instance of each, to give me:

    ID FirstName LastName Email

    -------------------------------

    1 aaa bbb aa@bb

    4 ccc ddd cc@dd

    can you please help? For the moment, I did something like (forget the syntax):

    delete from PERSONS

    where ID in (select P2.ID from persons P2 group by FirstName , LastName, Email

    having count(email)>1

    and count(FirstName)>1

    and count(LastName)>1

    )

    but of course that deleltes all the rows

    Can you please help

    Tks a lot!!

  • -- Assuming ID is unique

    DELETE FROM PERSONS

    WHERE EXISTS (SELECT * FROM PERSONS t2 WHERE t2.FirstName=PERSONS.FirstName

    AND t2.LastName=PERSONS.LastName

    AND t2.Email=PERSONS.Email

    AND t2.ID>PERSONS.ID)

    ____________________________________________________

    Deja View - The strange feeling that somewhere, sometime you've optimised this query before

    How to get the best help on a forum

    http://www.sqlservercentral.com/articles/Best+Practices/61537
  • Are they completely duplicate? All fields? Be sure you keep the one you want if there are differences. Mark's solution should work, but it's not checking which row might get removed. other than the lowest ID.

  • Thank you all for your replies, next time I'll try to write more properly, sorry!

    Yes, I don't care which rows are deleted, I just want to keep one, no matter which ID.

    Thank you!

  • Terry you might want to try this:

    CREATE TABLE #T

    (ID INT

    ,FirstName VARCHAR(10)

    ,LastName VARCHAR(10)

    ,Email VARCHAR(10))

    GO

    INSERT INTO #T

    SELECT 1, 'aaa', 'bbb', 'aa@bb' union all

    SELECT 2, 'aaa', 'bbb', 'aa@bb' union all

    SELECT 3, 'aaa', 'bbb', 'aa@bb' union all

    SELECT 4, 'ccc', 'ddd', 'cc@dd' union all

    SELECT 5, 'ccc', 'ddd', 'cc@dd'

    GO

    ;with numbered as(SELECT rowno=row_number() over

    (partition by FirstName, LastName, Email order by LastName),FirstName,LastName, Email

    from #t)

    SELECT * FROM numbered -- to allow to visually check before deleting would yield:

    rownoFirstName LastName Email

    1 aaa bbb aa@bb

    2 aaa bbb aa@bb

    3 aaa bbb aa@bb

    1 ccc ddd cc@dd

    2 ccc ddd cc@dd

    DELETE from numbered WHERE rowno > 1

    SELECT * FROM #tWhich resluts in:

    IDFirstName LastName Email

    1 aaa bbb aa@bb

    4 ccc ddd cc@dd

    Incase you are not familiar with this the columns compared for duplication are:

    partition by FirstName, LastName, Email

    The Order By colum name is required

    If everything seems to be going well, you have obviously overlooked something.

    Ron

    Please help us, help you -before posting a question please read[/url]
    Before posting a performance problem please read[/url]

  • Here's how I recommend doing deduping in SQL 2005/2008:

    ;with CTE (Row) as

    (select row_number() over (partition by FirstName, LastName, Email order by ID)

    from dbo.PERSONS)

    delete from CTE

    where Row > 1;

    Here's a performance comparison between the two methods:

    set nocount on;

    create table #T1 (

    ID int identity primary key,

    Col1 varchar(100));

    insert into #T1 (Col1)

    select newid()

    from dbo.Numbers;

    insert into #T1 (Col1)

    select Col1

    from #T1

    where ID%3 = 0;

    insert into #T1 (Col1)

    select Col1

    from #T1

    where ID%5 = 0;

    insert into #T1 (Col1)

    select Col1

    from #T1

    where ID%11 = 0;

    create table #T2 (

    ID int identity primary key,

    Col1 varchar(100));

    insert into #T2 (Col1)

    select Col1

    from #T1;

    set statistics io on;

    set statistics time on;

    delete from #T1

    where exists

    (select *

    from #T1 T1Sub

    where Col1 = #T1.Col1

    and ID > #T1.ID);

    set statistics time off;

    print '======================================================'

    set statistics time on;

    ;with CTE (Row) as

    (select row_number() over (partition by Col1 order by ID)

    from #T2)

    delete from CTE

    where Row > 1;

    set statistics time off;

    set statistics io off;

    Results:

    Table '#T1'. Scan count 2, logical reads 15139, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Worktable'. Scan count 1, logical reads 15022, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 94 ms, elapsed time = 199 ms.

    ======================================================

    Table '#T2'. Scan count 1, logical reads 15262, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 62 ms, elapsed time = 95 ms.

    The row_number method is both faster and takes less I/O. Won't matter much on a small table, especially if it's all in RAM like this test, but on a larger table or where some of the data can't be manipulated just with logical reads, it'll make a bigger difference.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • For what it's worth, you can accomplish the same thing with an outer join and an aggreaget function:

    CREATE TABLE #Test(ID int IDENTITY(1,1), FirstName varchar(10), LastName varchar(10), Email varchar(10))

    INSERT INTO #Test( FirstName, LastName, Email) SELECT 'aaa', 'bb', 'aa@bb'

    INSERT INTO #Test( FirstName, LastName, Email) SELECT 'aaa', 'bb', 'aa@bb'

    INSERT INTO #Test( FirstName, LastName, Email) SELECT 'aaa', 'bb', 'aa@bb'

    INSERT INTO #Test( FirstName, LastName, Email) SELECT 'ccc', 'dd', 'cc@dd'

    INSERT INTO #Test( FirstName, LastName, Email) SELECT 'ccc', 'dd', 'cc@dd'

    DELETE FROM a

    FROM#Test a

    LEFT JOIN (SELECT MIN(ID) AS ID FROM #Test GROUP BY FirstName, LastName, Email) b

    ON a.ID = b.ID

    WHEREb.ID IS NULL

    SELECT * FROM #Test

    DROP TABLE #Test

  • dsdeming (10/5/2009)


    For what it's worth, you can accomplish the same thing with an outer join and an aggreaget function:

    CREATE TABLE #Test(ID int IDENTITY(1,1), FirstName varchar(10), LastName varchar(10), Email varchar(10))

    INSERT INTO #Test( FirstName, LastName, Email) SELECT 'aaa', 'bb', 'aa@bb'

    INSERT INTO #Test( FirstName, LastName, Email) SELECT 'aaa', 'bb', 'aa@bb'

    INSERT INTO #Test( FirstName, LastName, Email) SELECT 'aaa', 'bb', 'aa@bb'

    INSERT INTO #Test( FirstName, LastName, Email) SELECT 'ccc', 'dd', 'cc@dd'

    INSERT INTO #Test( FirstName, LastName, Email) SELECT 'ccc', 'dd', 'cc@dd'

    DELETE FROM a

    FROM#Test a

    LEFT JOIN (SELECT MIN(ID) AS ID FROM #Test GROUP BY FirstName, LastName, Email) b

    ON a.ID = b.ID

    WHEREb.ID IS NULL

    SELECT * FROM #Test

    DROP TABLE #Test

    Just tested it and that one's indistinguishable (time and I/O) from the row_number one.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply