eliminating duplicates

  • Jeff Moden (6/28/2008)


    If you're going to spend that much time on it... write it as an article and submit it, instead. 😉

    Heh. Now you tell me!?!

    Seriously, though, I thought about it but writing in that way is very slow and difficult for me (it's a mild language processing disorder, apparently). Writing here is hard enough, but adapting to the form or an article, especially anything long, presents an almost insurmountable barrier to me...

    Well, maybe. The explanation is already a lot longer than I expected. Maybe when I finish it I will send it to you and you can suggest what changes I should make for it to become an article?

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • rbarryyoung (6/28/2008)


    Writing here is hard enough, but adapting to the form or an article, especially anything long, presents an almost insurmountable barrier to me...

    Heh... nah... you have the knack and just don't know it... pretty you're writing one of the long winded forum explanations that you and I frequently do with and without each other. Then, go back and throw in a couple of "headers" that seem to isolate subjects. Then, make sure you don't have any duplicate headers by combining areas that appear to be alike. Check it for readability and "revelation" and you have a nice article.

    Well, maybe. The explanation is already a lot longer than I expected. Maybe when I finish it I will send it to you and you can suggest what changes I should make for it to become an article?

    See what I mean? :w00t: You're already starting to realize that an article is really just a longer explanation. 😉 Sure... Email your final draft (I prefer something I can load in Word 97... I think it'll handle all the way up to 2003 if you don't go nuts with features).

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • OK, I sent you the draft. Heh, I didn't see your reply before I sent, hopefully the mess o' text I sent will be alright.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • rbarryyoung (6/28/2008)


    OK, I sent you the draft. Heh, I didn't see your reply before I sent, hopefully the mess o' text I sent will be alright.

    I gotta wonder how some of these spam filters work.... it went right straight to "bulk"... good thing you told me you sent it or I might have never seen it.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Hey, did you get a chance to look at that draft yet?

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • rbarryyoung (6/29/2008)


    Hey, did you get a chance to look at that draft yet?

    Barely... I did a quick scan and now I need to do a full read...

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Hello,

    I had read all responses, and I have to recognise that Barry's approach, is very interesting, and clever solution...

    I have test the sql-steps but i found is a little bit confusing, it's no more clear do the following thing:

    1. Insert into TAB

    Select Name

    , Age

    , CHAR( (2*Count(*)) + (CASE Sex When 'M' Then 0 Else 1 End) )

    From TAB

    Group By Name, Age, Sex

    /* That's barry's great approch... */

    Delete from TAB

    Where Sex IN('F','M')

    /* Now remove non compressed values */

    UPDATE TAB

    SET Sex = CASE WHEN Ascii(Sex) % 2 = 0 Then 'M' Else 'F' END

    /* Get sex orginal value */

    I think this is a more clear, I'm missing something....?

    Thank's in advance.

  • rbarryyoung,

    I am eagerly expecting your article.

    karthik

  • Thanks, folks. I'll try to get it done and in the queue by tonight. I'll be sure to post a link to it from here then.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Nicking Barry's excellent idea (and Janines test data), here's another way which is quite fast:

    DROP TABLE #Testing

    CREATE TABLE #Testing (

    FirstCol varChar(5),

    SecondCol INT,

    ThirdCol Char(1)

    )

    INSERT INTO #Testing (FirstCol, SecondCol, ThirdCol)

    SELECT 'ABC',24,'M' UNION ALL --

    SELECT 'ABC',24,'M' UNION ALL

    SELECT 'DEF',24,'M' UNION ALL --

    SELECT 'DEF',24,'F' UNION ALL

    SELECT 'GHI',26,'F' UNION ALL --

    SELECT 'GHI',26,'F' UNION ALL

    SELECT 'GHI',26,'F' UNION ALL

    SELECT 'GHI',26,'F' UNION ALL

    SELECT 'GHI',26,'F' UNION ALL

    SELECT 'LMN',27,'M' UNION ALL --

    SELECT 'LMN',27,'M' UNION ALL

    SELECT 'LMN',27,'M' UNION ALL

    SELECT 'PQRS',25,'F' UNION ALL --

    SELECT 'XYZ',24,'M' UNION ALL --

    SELECT 'XYZ',25,'M' --

    DECLARE @ItemID INT

    DECLARE @FirstCol varChar(5), @SecondCol INT, @ThirdCol Char(1)

    SET @ItemID = 1

    UPDATE #Testing SET

    ThirdCol = CHAR((@ItemID*2) + CASE ThirdCol WHEN 'M' THEN 1 ELSE 0 END),

    @ItemID = CASE WHEN @FirstCol = FirstCol AND @SecondCol = SecondCol AND @ThirdCol = ThirdCol THEN @ItemID+1 ELSE 1 END,

    @FirstCol = FirstCol, @SecondCol = SecondCol, @ThirdCol = ThirdCol

    DELETE FROM #Testing WHERE ASCII(ThirdCol) = 2 OR ASCII(ThirdCol) = 3

    UPDATE #Testing SET ThirdCol = CASE (ASCII(ThirdCol) & 1) When 1 Then 'M' Else 'F' End

    SELECT * FROM #Testing

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Step1: You need to create a table containing only the duplicate records by first selecting all the duplicate records into a new table i.e.

    Insert into temptable (Select * from tablename group by (Name,Age,Sex) having count (Name,Age,Sex) > 1

    Step2: Delete all the duplicate records contained in your table name.

    Step3: Insert the records back from the temptable to your tablename.

  • mlando.mngomezulu (7/1/2008)


    Step1: You need to create a table containing only the duplicate records by first selecting all the duplicate records into a new table i.e.

    Insert into temptable (Select * from tablename group by (Name,Age,Sex) having count (Name,Age,Sex) > 1

    Step2: Delete all the duplicate records contained in your table name.

    Step3: Insert the records back from the temptable to your tablename.

    This won't work.

    Firstly, the OP states no intermediate tables are to be created.

    Secondly, it's not a dedupe exercise - the aim is to eliminate all unique rows and one row from each group of dupes, leaving the rest. Interestingly, if you put the deleted rows into a new table, it would be a dedupe of the original table.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Actually the first step makes sure that the temptable only have one record per corresponding duplicate records. This is achieved by the "group by" clause. This is grouped by the combination of the troublesome fields i.e. ('Name','Age','Sex'). Hence when you insert the records back, your table will no longer have duplicate records.

  • mlando.mngomezulu (7/1/2008)


    Actually the first step makes sure that the temptable only have one record per corresponding duplicate records. This is achieved by the "group by" clause. This is grouped by the combination of the troublesome fields i.e. ('Name','Age','Sex'). Hence when you insert the records back, your table will no longer have duplicate records.

    Yes of course - but actually it's not a deduping exercise. The aim is to remove all rows except those which would be removed by a dedupe procedure.

    The finished table, after processing correctly according to the OP's criteria, will have dupes because only one row of each dupe group is to be removed. If there are three dupes of one row, then two will remain. It's the exact opposite of your suggestion.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • You can say it easier than even that...

    Remove one of every type of row that can be uniquely identified even if there is only one.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

Viewing 15 posts - 46 through 60 (of 137 total)

You must be logged in to reply to this topic. Login to reply