Find duplicate rows and keep the one with the highest value in one column

Question

Find duplicate rows and keep the one with the highest value in one column

leroy-1092048

Ten Centuries

Points: 1054
More actions
September 17, 2009 at 3:40 pm

#135156

Hello all,
i have been reading over this site at several De-Duplication postings and articles. They are all good and have been helpful, but i am stuck in one spot.
Here is the scenario:
I have a table called MT_Fact_scanned
CREATE TABLE [dbo].[mt_fact_scanned](
[stats_id] [int] IDENTITY(1,1) NOT NULL,
[recipient_id] [int] NOT NULL,
[time_id] [int] NOT NULL,
[msg_type_id] [int] NOT NULL,
[sender_id] [int] NOT NULL,
[msg_count] [int] NOT NULL,
[msg_size] [bigint] NOT NULL,
[sender_ip] [int] NULL )
What this table does is keep track of messages received by the our application.
If there is a row already existing where recipient_id, time_id, sender_id and sender_IP are the same as what the application was about to insert.. it should NOT have inserted the row, but simply incremented the msg_Count column. but, it incremented the Msg_Count column AND inserted the row. (yeah.. not smart)
Now that we have found this flaw and fixed it, over 400 million duplicated rows now exist in the table.
but here is the trick.. we need to keep the row where the msg_count is the highest and drop the rest. based on the fact that columns - recipient_id, time_id, sender_id and sender_IP should make the row unique.
I can find the Duplicates with:
insert into #DeDupe
SELECT recipient_id, time_id, sender_id, sender_ip
FROM mt_fact_scanned
GROUP BY recipient_id, time_id, sender_id, sender_ip
HAVING COUNT(*) > 1
I can use one of the duplicate rows from the above result and find the one with the highest value in the msg_count Column with:
Select top(1) stats_id from mt_fact_scanned
where recipient_id = 55 and time_id =444 and sender_id = 4324 and sender_ip = 34344
order by msg_count -- (values for where clause provided by previous SQL query)
i can the drop all the duplicate rows by delete the ones without the Stats_id value from the above query like:
delete from mt_fact_scanned
where recipient_id = 55 and time_id =444 and sender_id = 4324 and sender_ip = 34344
and stats_ID != (value from above 'select top(1)' query)
but the real trick, and this is where i am stumped... is how to jam this into one query.
i wrote a VBscript that did this successfully by looping through rows from the Temp table and looping over and over... but at this rate it will take forever.
i was thinking that there was a way in T-SQL to do this task in a larger query to go through and remove the duplicates based on the highest value in the Msg_Count Column
Thanks to all that reply

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply

LutzM SSC Guru Points: 107049 More actions · Answer 1

Here's an example that might work. It's completely untested due to the lack of sample data. Therefore, I used a SELECT instead of DELETE statement.

It doesn't require the temp table...

;WITH cte AS (

SELECT max(stats_id) AS max_id, recipient_id, time_id, sender_id, sender_ip

FROM mt_fact_scanned

GROUP BY recipient_id, time_id, sender_id, sender_ip

HAVING COUNT(*) > 1)

SELECT src.*

FROM mt_fact_scanned src

INNER JOIN cte

ON src.recipient_id = cte.recipient_id

AND src.time_id = cte.time_id

AND src.sender_id = cte.sender_id

AND src.sender_ip = cte.sender_ip

WHERE src.stats_id < cte.m_id

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

HansVE Mr or Mrs. 500 Points: 524 More actions · Answer 2

DELETE t

FROM dbo.mt_fact_scanned t JOIN

(SELECT recipient_id, time_id, sender_id, sender_ip, MAX(msg_count) MaxCount

FROM dbo.mt_fact_scanned

GROUP BY recipient_id, time_id, sender_id, sender_ip)x

ON t.recipient_id = x.recipient_id AND t.time_id = x.time_id

AND t.sender_id = x.sender_id AND t.sender_ip = x.sender_ip AND t.msg_count x.MaxCount

Mark Cowne One Orange Chip Points: 26865 More actions · Answer 3

Untested

WITH CTE AS (

SELECT *,

ROW_NUMBER() OVER(PARTITION BY recipient_id, time_id, sender_id, sender_ip ORDER BY msg_Count DESC) AS rn

FROM dbo.MT_Fact_scanned)

DELETE FROM CTE

WHERE rn>1

____________________________________________________

Deja View - The strange feeling that somewhere, sometime you've optimised this query before

How to get the best help on a forum

http://www.sqlservercentral.com/articles/Best+Practices/61537

leroy-1092048 Ten Centuries Points: 1054 More actions · Answer 4

HansVE ,

thanks for the help.. actually your script pointed pointed out something to me that was a big help.

in our table it turns out that there are rows where the msg.Count column are equal to each other.

for example 3 identical rows where each have a value of 1 for the message count.

your script... which did exactly what i described in my original post, and I thank you for that, could not be used since in my testing, any rows where the msg.count value were all equal... all would be deleted.

i know i did not point this out in my original post.. because i didn't know it.

but thanks for helping me out.

Leroy

leroy-1092048 Ten Centuries Points: 1054 More actions · Answer 5

mark,

thanks for the help... your script worked perfectly!

Even though it turns out that some of the values in the msg_count column were all identical.

your script still was able to deal with this issue.

thanks

Leroy L

leroy-1092048 Ten Centuries Points: 1054 More actions · Answer 6

lmu92,

thanks you for your help.

the script your provided worked as well...

i am curious to see if there is much performance difference between your script and Mark's

i will test and post the results later.

Thanks

Leroy