Find duplicate rows and keep the one with the highest value in one column

  • Hello all,

    i have been reading over this site at several De-Duplication postings and articles. They are all good and have been helpful, but i am stuck in one spot.

    Here is the scenario:

    I have a table called MT_Fact_scanned

    CREATE TABLE [dbo].[mt_fact_scanned](

    [stats_id] [int] IDENTITY(1,1) NOT NULL,

    [recipient_id] [int] NOT NULL,

    [time_id] [int] NOT NULL,

    [msg_type_id] [int] NOT NULL,

    [sender_id] [int] NOT NULL,

    [msg_count] [int] NOT NULL,

    [msg_size] [bigint] NOT NULL,

    [sender_ip] [int] NULL )

    What this table does is keep track of messages received by the our application.

    If there is a row already existing where recipient_id, time_id, sender_id and sender_IP are the same as what the application was about to insert.. it should NOT have inserted the row, but simply incremented the msg_Count column. but, it incremented the Msg_Count column AND inserted the row. (yeah.. not smart)

    Now that we have found this flaw and fixed it, over 400 million duplicated rows now exist in the table.

    but here is the trick.. we need to keep the row where the msg_count is the highest and drop the rest. based on the fact that columns - recipient_id, time_id, sender_id and sender_IP should make the row unique.

    I can find the Duplicates with:

    insert into #DeDupe

    SELECT recipient_id, time_id, sender_id, sender_ip

    FROM mt_fact_scanned

    GROUP BY recipient_id, time_id, sender_id, sender_ip

    HAVING COUNT(*) > 1

    I can use one of the duplicate rows from the above result and find the one with the highest value in the msg_count Column with:

    Select top(1) stats_id from mt_fact_scanned

    where recipient_id = 55 and time_id =444 and sender_id = 4324 and sender_ip = 34344

    order by msg_count -- (values for where clause provided by previous SQL query)

    i can the drop all the duplicate rows by delete the ones without the Stats_id value from the above query like:

    delete from mt_fact_scanned

    where recipient_id = 55 and time_id =444 and sender_id = 4324 and sender_ip = 34344

    and stats_ID != (value from above 'select top(1)' query)

    but the real trick, and this is where i am stumped... is how to jam this into one query.

    i wrote a VBscript that did this successfully by looping through rows from the Temp table and looping over and over... but at this rate it will take forever.

    i was thinking that there was a way in T-SQL to do this task in a larger query to go through and remove the duplicates based on the highest value in the Msg_Count Column

    Thanks to all that reply

  • Here's an example that might work. It's completely untested due to the lack of sample data. Therefore, I used a SELECT instead of DELETE statement.

    It doesn't require the temp table...

    ;WITH cte AS (

    SELECT max(stats_id) AS max_id, recipient_id, time_id, sender_id, sender_ip

    FROM mt_fact_scanned

    GROUP BY recipient_id, time_id, sender_id, sender_ip

    HAVING COUNT(*) > 1)

    SELECT src.*

    FROM mt_fact_scanned src

    INNER JOIN cte

    ON src.recipient_id = cte.recipient_id

    AND src.time_id = cte.time_id

    AND src.sender_id = cte.sender_id

    AND src.sender_ip = cte.sender_ip

    WHERE src.stats_id < cte.m_id



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • DELETE t

    FROM dbo.mt_fact_scanned t JOIN

    (SELECT recipient_id, time_id, sender_id, sender_ip, MAX(msg_count) MaxCount

    FROM dbo.mt_fact_scanned

    GROUP BY recipient_id, time_id, sender_id, sender_ip)x

    ON t.recipient_id = x.recipient_id AND t.time_id = x.time_id

    AND t.sender_id = x.sender_id AND t.sender_ip = x.sender_ip AND t.msg_count x.MaxCount

  • Untested

    WITH CTE AS (

    SELECT *,

    ROW_NUMBER() OVER(PARTITION BY recipient_id, time_id, sender_id, sender_ip ORDER BY msg_Count DESC) AS rn

    FROM dbo.MT_Fact_scanned)

    DELETE FROM CTE

    WHERE rn>1

    ____________________________________________________

    Deja View - The strange feeling that somewhere, sometime you've optimised this query before

    How to get the best help on a forum

    http://www.sqlservercentral.com/articles/Best+Practices/61537
  • HansVE ,

    thanks for the help.. actually your script pointed pointed out something to me that was a big help.

    in our table it turns out that there are rows where the msg.Count column are equal to each other.

    for example 3 identical rows where each have a value of 1 for the message count.

    your script... which did exactly what i described in my original post, and I thank you for that, could not be used since in my testing, any rows where the msg.count value were all equal... all would be deleted.

    i know i did not point this out in my original post.. because i didn't know it.

    but thanks for helping me out.

    Leroy

  • mark,

    thanks for the help... your script worked perfectly!

    Even though it turns out that some of the values in the msg_count column were all identical.

    your script still was able to deal with this issue.

    thanks

    Leroy L

  • lmu92,

    thanks you for your help.

    the script your provided worked as well...

    i am curious to see if there is much performance difference between your script and Mark's

    i will test and post the results later.

    Thanks

    Leroy

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply