update duplicate rows in table with over 160 million rows

  • hey all...

     

    ok, table structure...

    Id int, ContactId bigint, FieldValue varchar(4000), TypeId int, CurrentValue char(1)

     

    there are currently over 170 million rows in this table... when I load new users in, their field values are separated into rows and inserted here (it adds up pretty quickly).  I need to be able to go through and update CurrentValue to 'N' for older values (non-current values are then removed from the table and inserted into a history table).

     

    My current statement is:

    Update cd
    set CurrentValue = 'N'
    from ContactDetails cd
    where Id not in (select max(id) from ContactDetails cd1 group by contactid, typeid)

     

    this works, but I just had to stop it as it was running for over 36 hours... I need a faster way to do this.

     

    I've tried

    Update cd
    set CurrentValue = 'N'
    from ContactDetails cd
    inner join ContactDetails cd1 on cd.ContactId = cd1.ContactId and cd.TypeId = cd1.TypeId
    where cd.Id < cd1.Id

    but this actually seems slower (for a test section of 450k rows, 12 seconds for the subselect and over 30 minutes for the inner join).

     

    Anyone have any other ideas?  it's driving me insane.

     

    Thanks a ton!

    Cheers
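
    The NOT IN above can also be recast as a join against a pre-built list of current Ids, so the max-per-group work is done once instead of inside the update. An untested sketch (the temp-table and index names here are mine):

    -- build the list of current rows once, and index it
    select max(Id) as MaxId
    into #CurrentIds
    from ContactDetails
    group by ContactId, TypeId

    create unique clustered index ix_CurrentIds on #CurrentIds (MaxId)

    -- flag every row whose Id is not in the current list
    update cd
    set CurrentValue = 'N'
    from ContactDetails cd
    left join #CurrentIds c on cd.Id = c.MaxId
    where c.MaxId is null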

  • Shooting in the dark here, but is it necessary to process the entire table, or when you "load new users" can you just run this update for those records that are relevant?

    I have actually tried this using a couple of different methods, and they all slow down the insert rather significantly... inserts of 7500 users go from 45 - 60 seconds to 15 - 20 minutes, not good when you're loading another million users (I wish it was!)

    Cheers

  • Is there a trigger on that table?

  • nope, no triggers... tried that, was REALLY slow

     

    there are a ton (7 - 8) of indexes on it... but those should only speed it up.

    Cheers

  • Have you tried forcing the use of any of the Indexes? 

    I wasn't born stupid - I had to study.

  • Have you tried doing the update in small batches (10 000 rows, wait 1 sec, next batch).
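
    On 2000 that batching pattern can be sketched with SET ROWCOUNT (untested; the batch size, delay, and the CurrentValue <> 'N' filter are illustrative):

    set rowcount 10000
    while 1 = 1
    begin
        update ContactDetails
        set CurrentValue = 'N'
        where CurrentValue <> 'N'
        and Id not in (select max(Id) from ContactDetails group by ContactId, TypeId)
        if @@rowcount = 0 break
        waitfor delay '00:00:01'
    end
    set rowcount 0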

  • do you have any syntax that I could see regarding this?  I've never had to force the use of an index...

    Cheers
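
    For reference, on 7.0/2000 it's a table hint in the FROM clause. A sketch (the index name here is made up; substitute a real one from the table):

    update cd
    set CurrentValue = 'N'
    from ContactDetails cd with (index(ix_ContactDetails_Contact_Type))
    where Id not in (select max(Id) from ContactDetails group by ContactId, TypeId)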

    I wouldn't go with that route first. I'd make sure the plan is actually bad, and I doubt that's the case here.

  • I'm guessing the syntax for this would be something like

     

    declare @Id int
    select @Id = min(Id) from ContactDetails
    while @Id < (select max(Id) from ContactDetails)
    begin
        Update cd
        set CurrentValue = 'N'
        from ContactDetails cd
        where Id not in (select max(id) from ContactDetails cd1 group by contactid, typeid)
        and Id between @Id and @Id + 10000
        set @Id = @Id + 10000
    end

     

    no, I haven't done this... it seems like the loop would cause more overhead (although it wouldn't take NEARLY as long to roll back stuff).  Am I wrong about the loop adding overhead?

    Cheers

    It adds a little overhead, but each batch holds its locks on the table only briefly, so your current tasks wouldn't be affected. So in essence that thing could run for days without anyone noticing it.

    I have a few ideas for a better looping approach; can you add a column to the table (they all need one)?

    Remi is right.  Plus, it has been a long time since I had to force an index (I think it was 6.5).  I will look, but it will take me a while to find it... (it is simple, but if I remember, hard to find in BOL). 

    I wasn't born stupid - I had to study.

  • Colin - also note that indexes don't speed everything up -- they actually can slow inserts and updates, as the content in each index needs to be brought in line with changes. Removing indexes that are not used can actually speed up changes to tables.

    Remi:  I CAN add columns, but that'll definitely slow stuff down, no?

     

    Farrell:  if I remember correctly, in 6.5 you had to explicitly call indexes into use, but 7.0 and 2000 do it automatically.

     

    Merrill:  I'm waiting for the 36 hour rollback to happen, then I've actually modded the proc to remove all indexes as step 1, then add them all back as a final step (I've read indexes can slow inserts/updates over the past couple days as well)

    Cheers
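
    That drop-and-recreate wrapper might look something like this (the index and column names are placeholders for whatever actually exists on the table):

    drop index ContactDetails.ix_ContactDetails_ContactId

    -- ... run the big update here ...

    create index ix_ContactDetails_ContactId on ContactDetails (ContactId, TypeId)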

    It would take time to add the column, but it would speed up the loop process greatly >>

    Add a new identity(1,1) column, INDEX IT

    then something like this :

    Declare @Loops as int
    Declare @BatchSize as int
    Declare @Rows as int
    set @Loops = -1
    set @BatchSize = 10000
    set @Rows = 1
    While @Rows > 0
    begin
        set @Loops = @Loops + 1
        waitfor delay '00:00:01'
        update table set ... where NewColumn >= @Loops * @BatchSize and NewColumn < (@Loops + 1) * @BatchSize
        set @Rows = @@Rowcount  -- capture immediately; the SET above would reset @@Rowcount to 1 and loop forever
    end

    I haven't tested this... try it on a small table first to be sure it works. Use the batch size that locks the resource for the least time while still updating a good chunk of data.

Viewing 15 posts - 1 through 15 (of 19 total)
