DELETE DUPLICATE RECORDS

  • tosscrosby (5/23/2008)


    Not a problem. I thought maybe you were calling SQL ORACLE out for a challenge! I don't have any issues here with duplicates. I just thought if you already had something, I'd like to see it. Like I said, all the different exposures help me in the long run.

    Heh... yeaup... indeed I was... not very much like me.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Jeff Moden (5/23/2008)


    I'll crank out an example tonight...

    Wow! I think I just figured out what to write about in an article!

    Terry,

    IF there's and IDENTITY column on the table and all the the columns, including the date and, of course, excluding the IDENTITY column, and in-place DELETE of all but the "first" dupe would look like this...

    DELETE t1

    FROM dbo.JBMTest t1,

    dbo.JBMTest t2

    WHERE t1.ProductID = t2.ProductID

    AND t1.CustomerID = t2.CustomerID

    AND t1.OrderAmount = t2.OrderAmount

    AND t1.OrderDate = t2.OrderDate

    AND t1.JBMTestID > t2.JBMTestID

    Yep... you see it... a Triangular Join. Like I said in the article about them, they're not all bad...

    The neat thing about that code is it's fast (Deletes ~2700 dupes from a million row table in about 19 seconds) and it keeps the transaction log from ballooning. Only the rows that actually get deleted are logged.

    Without the IDENTITY column (or some other chronological tie breaker), something like the following is a bit tough on the transaction log... especially if you do like I do and test this stuff on a million rows.

    --===== Copy distinct rows to a temp table

    SELECT DISTINCT

    ProductID, CustomerID, OrderAmount, OrderDate

    INTO #MyHead

    FROM dbo.JBMTest

    --====== Truncate the original table

    TRUNCATE TABLE dbo.JBMTest

    --===== Copy the now distinct/unique rows back to the original

    INSERT INTO dbo.JBMTest

    (ProductID, CustomerID, OrderAmount, OrderDate)

    SELECT ProductID, CustomerID, OrderAmount, OrderDate

    FROM #MyHead

    --===== Drop the temp table (not required in a proc)

    DROP TABLE #MyHead

    That only takes about 24 seconds... but imagine the load on TempDB and imagine what all those inserts back to the original table do to the transaction log, not to mention the extreme IO.

    I haven't found a faster method for either situation... but the moral of the story is that if the table isn't protected from dupes by some "natural" Primary Key or Unique Index, you really need to have an IDENTITY column to break the ties on dupes. All of my ETL tables are always built to have one just for this purpose. Sure, I could have a Unique Index with and IGNORE DUPES option, but that's slower than importing dupes and deleting them.

    By the way, the code above that deletes based on the IDENTITY column deletes the dupes having the largest ID's leaving the "first" occurance. To keep the "last" occurance, instead, simply change the > to a <.

    Oh yeah... here's the test table I used in case you want to play... it always generates about 2,700 or 2,800 dupes...

    DROP TABLE JBMTest

    GO

    --===== Create and populate a 1,000,000 row test table.

    SELECT TOP 1000000

    JBMTestID = IDENTITY(INT,1,1),

    ProductID = ABS(CHECKSUM(NEWID()))%100+1, --1 to 100

    CustomerID = ABS(CHECKSUM(NEWID()))%500+1, --1 to 500

    OrderAmount = ABS(CHECKSUM(NEWID()))%10+1, --1 to 10

    OrderDate = CAST(ABS(CHECKSUM(NEWID())%364)+36524 AS DATETIME) --Some whole date in 2000

    INTO dbo.JBMTest

    FROM Master.dbo.SysColumns t1,

    Master.dbo.SysColumns t2 --Lack of join criteria makes this a CROSS-JOIN

    GO

    --===== A table is not properly formed unless a Primary Key has been assigned

    -- Takes about 1 second to execute.

    ALTER TABLE dbo.JBMTest

    ADD PRIMARY KEY CLUSTERED (JBMTestID)

    GO

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Sorry about that... BitBucket's method for deleting dupes when an IDENTITY column is present is just as fast as what I posted. Here's his code "Modenized" for the test table I used...

    ;WITH

    Numbered AS

    (

    SELECT RowNum = ROW_NUMBER() OVER

    (PARTITION BY ProductID,CustomerID,OrderAmount,OrderDate

    ORDER BY ProductID,CustomerID,OrderAmount,OrderDate),

    JBMTestID

    FROM dbo.JBMTest

    )

    DELETE Numbered

    WHERE RowNum > 1

    It's pretty easy on the I/O, as well and protects the log file from unnecessary growth in a manner similar to mine.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Jeff,

    Wouldn't something like this work faster? I normally use this technique to delete duplicates and I've never had a performance problem:

    DELETE T1

    FROM dbo.JBMTest T1

    INNER JOIN

    (SELECT ProductID, CustomerID, OrderAmount, OrderDate

    , MIN(JBMTestID) AS JBMTestID

    FROM dbo.JBMTest

    GROUP BY ProductID, CustomerID, OrderAmount, OrderDate

    HAVING COUNT(*) > 1

    ) AS X ON

    T1.ProductID = X.ProductID

    AND T1.CustomerID = X.CustomerID

    AND T1.OrderAmount = X.OrderAmount

    AND T1.OrderDate = X.OrderDate

    AND T1.JBMTestID > X.JBMTestID

    Todd Fifield

  • Heh... TODD!!!??? Why are you having me figure it out... run it against the table generator I provided and report your findings! πŸ˜€

    The answer to your question is "No", your way is not faster... your code took 00:06:43 on my machine to dupe check/dupe delete the million row example I gave. That's a wee bit more than the 00:00:19 than the triangular join or the CTE that Bit Bucket wrote. I haven't done a full analysis, but I think the GROUP BY is what is killing the performance on yours.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Jeff,

    You got me on that one! I didn't check it against a million rows - oops!

    I'll try the triangular join next time.

    Todd

  • ;)– add temporarily identity column.

    alter table tName add tid int identity(1,1)

    –query to delete duplicate rows from table

    delete from tName where tid not in

    (select min(tid) from tName a where a.firstname = tName.firstname)

    – drop temporarily added identity column

    alter table tName drop column tid

  • this link is more than enough...

    http://qa.sqlservercentral.com/scripts/TSQL/62866/

  • vinuraj (6/13/2008)


    this link is more than enough...

    ... to bring your server right to it's knee's. It took over 10 minutes to find and delete a lousy 156 rows from a million row test table.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Other than the variable about whether it's faster to select out what you want to keep, or delete out the stuff you don't want, I'd say most of the workable solutions are already covered... (and some less workable too)

    Of course if you're defining duplicates as all of the copies of some data AFTER the original one, you might care to slightly adjust the "Modenized CTE" just a smidgen:

    ;WITH

    Numbered AS

    (

    SELECT RowNum = ROW_NUMBER() OVER

    (PARTITION BY ProductID,CustomerID,OrderAmount,OrderDate

    ORDER BY ProductID,CustomerID,OrderAmount,OrderDate,JBMTestID ),--order the dupes by ID

    JBMTestID

    FROM dbo.JBMTest

    )

    DELETE Numbered

    WHERE RowNum > 1

    Depending on the data distribution, this should edge out the triangular join (i.e if the dupe groups are large - this should win out).

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Dear Sir,

    i m new in this site.

    you can apply this simple script for deleting duplicate record from table

    Regards

    Rishi Kumar

  • Which script?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

Viewing 12 posts - 16 through 26 (of 26 total)

You must be logged in to reply to this topic. Login to reply