How to delete duplicated rows and keep just one?

Question

Post reply

How to delete duplicated rows and keep just one?

Terry-380449

Default port

Points: 1458
More actions
October 5, 2009 at 8:27 am

#134279

Hello
I have a table with some duplicate rows that exist! (rows exactly similar, only the ID is different)
Here is an example
ID FirstName LastName Email
-------------------------------
1 aaa bbb aa@bb
2 aaa bbb aa@bb
3 aaa bbb aa@bb
4 ccc ddd cc@dd
5 ccc ddd cc@dd
I need to delete duplicate rows and keep only one instance of each, to give me:
ID FirstName LastName Email
-------------------------------
1 aaa bbb aa@bb
4 ccc ddd cc@dd
can you please help? For the moment, I did something like (forget the syntax):
delete from PERSONS
where ID in (select P2.ID from persons P2 group by FirstName , LastName, Email
having count(email)>1
and count(FirstName)>1
and count(LastName)>1
)
but of course that deleltes all the rows
Can you please help
Tks a lot!!

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply

Mark Cowne One Orange Chip Points: 26865 More actions · Answer 1

-- Assuming ID is unique

DELETE FROM PERSONS

WHERE EXISTS (SELECT * FROM PERSONS t2 WHERE t2.FirstName=PERSONS.FirstName

AND t2.LastName=PERSONS.LastName

AND t2.Email=PERSONS.Email

AND t2.ID>PERSONS.ID)

____________________________________________________

Deja View - The strange feeling that somewhere, sometime you've optimised this query before

How to get the best help on a forum

http://www.sqlservercentral.com/articles/Best+Practices/61537

Steve Jones - SSC Editor SSC Guru Points: 728164 More actions · Answer 2

Are they completely duplicate? All fields? Be sure you keep the one you want if there are differences. Mark's solution should work, but it's not checking which row might get removed. other than the lowest ID.

Follow me on Twitter: http://www.twitter.com/way0utwest
Forum Etiquette: How to post data/code on a forum to get the best help
My Blog: www.voiceofthedba.com

Terry-380449 Default port Points: 1458 More actions · Answer 3

Thank you all for your replies, next time I'll try to write more properly, sorry!

Yes, I don't care which rows are deleted, I just want to keep one, no matter which ID.

Thank you!

Ron McCullough SSC Guru Points: 63877 More actions · Answer 4

Terry you might want to try this:

CREATE TABLE #T

(ID INT

,FirstName VARCHAR(10)

,LastName VARCHAR(10)

,Email VARCHAR(10))

GO

INSERT INTO #T

SELECT 1, 'aaa', 'bbb', 'aa@bb' union all

SELECT 2, 'aaa', 'bbb', 'aa@bb' union all

SELECT 3, 'aaa', 'bbb', 'aa@bb' union all

SELECT 4, 'ccc', 'ddd', 'cc@dd' union all

SELECT 5, 'ccc', 'ddd', 'cc@dd'

GO

;with numbered as(SELECT rowno=row_number() over

(partition by FirstName, LastName, Email order by LastName),FirstName,LastName, Email

from #t)

SELECT * FROM numbered -- to allow to visually check before deleting would yield:

rownoFirstName LastName Email

1 aaa bbb aa@bb

2 aaa bbb aa@bb

3 aaa bbb aa@bb

1 ccc ddd cc@dd

2 ccc ddd cc@dd

DELETE from numbered WHERE rowno > 1

SELECT * FROM #tWhich resluts in:

IDFirstName LastName Email

1 aaa bbb aa@bb

4 ccc ddd cc@dd

Incase you are not familiar with this the columns compared for duplication are:

partition by FirstName, LastName, Email

The Order By colum name is required

If everything seems to be going well, you have obviously overlooked something.

Ron

Please help us, help you -before posting a question please read[/url]
Before posting a performance problem please read[/url]

GSquared SSC Guru Points: 260824 More actions · Answer 5

Here's how I recommend doing deduping in SQL 2005/2008:

;with CTE (Row) as

(select row_number() over (partition by FirstName, LastName, Email order by ID)

from dbo.PERSONS)

delete from CTE

where Row > 1;

Here's a performance comparison between the two methods:

set nocount on;

create table #T1 (

ID int identity primary key,

Col1 varchar(100));

insert into #T1 (Col1)

select newid()

from dbo.Numbers;

insert into #T1 (Col1)

select Col1

from #T1

where ID%3 = 0;

insert into #T1 (Col1)

select Col1

from #T1

where ID%5 = 0;

insert into #T1 (Col1)

select Col1

from #T1

where ID%11 = 0;

create table #T2 (

ID int identity primary key,

Col1 varchar(100));

insert into #T2 (Col1)

select Col1

from #T1;

set statistics io on;

set statistics time on;

delete from #T1

where exists

(select *

from #T1 T1Sub

where Col1 = #T1.Col1

and ID > #T1.ID);

set statistics time off;

print '======================================================'

set statistics time on;

;with CTE (Row) as

(select row_number() over (partition by Col1 order by ID)

from #T2)

delete from CTE

where Row > 1;

set statistics time off;

set statistics io off;

Results:

Table '#T1'. Scan count 2, logical reads 15139, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 1, logical reads 15022, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 94 ms, elapsed time = 199 ms.
======================================================
Table '#T2'. Scan count 1, logical reads 15262, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 62 ms, elapsed time = 95 ms.

The row_number method is both faster and takes less I/O. Won't matter much on a small table, especially if it's all in RAM like this test, but on a larger table or where some of the data can't be manipulated just with logical reads, it'll make a bigger difference.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

dsdeming Ten Centuries Points: 1236 More actions · Answer 6

For what it's worth, you can accomplish the same thing with an outer join and an aggreaget function:

CREATE TABLE #Test(ID int IDENTITY(1,1), FirstName varchar(10), LastName varchar(10), Email varchar(10))

INSERT INTO #Test( FirstName, LastName, Email) SELECT 'aaa', 'bb', 'aa@bb'

INSERT INTO #Test( FirstName, LastName, Email) SELECT 'ccc', 'dd', 'cc@dd'

DELETE FROM a

FROM#Test a

LEFT JOIN (SELECT MIN(ID) AS ID FROM #Test GROUP BY FirstName, LastName, Email) b

ON a.ID = b.ID

WHEREb.ID IS NULL

SELECT * FROM #Test

DROP TABLE #Test

GSquared SSC Guru Points: 260824 More actions · Answer 7

dsdeming (10/5/2009)
For what it's worth, you can accomplish the same thing with an outer join and an aggreaget function:
CREATE TABLE #Test(ID int IDENTITY(1,1), FirstName varchar(10), LastName varchar(10), Email varchar(10))
INSERT INTO #Test( FirstName, LastName, Email) SELECT 'aaa', 'bb', 'aa@bb'
INSERT INTO #Test( FirstName, LastName, Email) SELECT 'aaa', 'bb', 'aa@bb'
INSERT INTO #Test( FirstName, LastName, Email) SELECT 'aaa', 'bb', 'aa@bb'
INSERT INTO #Test( FirstName, LastName, Email) SELECT 'ccc', 'dd', 'cc@dd'
INSERT INTO #Test( FirstName, LastName, Email) SELECT 'ccc', 'dd', 'cc@dd'
DELETE FROM a
FROM#Test a
LEFT JOIN (SELECT MIN(ID) AS ID FROM #Test GROUP BY FirstName, LastName, Email) b
ON a.ID = b.ID
WHEREb.ID IS NULL
SELECT * FROM #Test
DROP TABLE #Test

Just tested it and that one's indistinguishable (time and I/O) from the row_number one.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon