UPDATE Query taking hours to run in DTS Package???

  • I have a table Claimdata with about 19 million rows. This table is imported weekly from SAS. The import takes about an hour using the Transform Data Task, and then the following update is run using the Execute SQL Task, which takes 4-5 hours to complete.

     

    UPDATE Claimdata
    SET Claimdata.LagCategory = FinancialCategories.LagCategory
    FROM FinancialCategories
    INNER JOIN Claimdata ON FinancialCategories.ClaimType = Claimdata.ClaimType

     

    Here’s the FinancialCategories DDL

     

    CREATE TABLE [FinancialCategories] (
        ClaimType varchar(50),
        LagCategory varchar(50),
        FinancialCategory varchar(50)
    ) ON [PRIMARY]
    GO

     

    There are no indexes on any of the fields involved in the query. Does anyone have any suggestions on how I could reduce the time it takes to UPDATE this field? I really appreciate any advice on this. Thanks !

  • Also, there are only about 15 rows in FinancialCategories. Maybe I should just put it in a function?
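
    For the function idea, a minimal sketch of a SQL Server 2000 scalar UDF (the function name is made up here; note that calling a scalar function once per row across 19 million rows will usually be slower than the set-based join):

    CREATE FUNCTION dbo.fn_LagCategory (@ClaimType varchar(50))
    RETURNS varchar(50)
    AS
    BEGIN
        -- Look the category up in the 15-row reference table
        RETURN (SELECT LagCategory
                FROM FinancialCategories
                WHERE ClaimType = @ClaimType)
    END
    GO

    -- Usage: UPDATE Claimdata SET LagCategory = dbo.fn_LagCategory(ClaimType)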

  • 1. Since there is no clustered index, the table may be a highly fragmented heap, requiring more disk I/O than necessary

    2. If the database is running in full recovery mode, the UPDATE will be generating large amounts of log activity

    3. If the column Claimdata.LagCategory is initially loaded as NULL and then set to a value, this prevents an "update in place", so each row UPDATE will actually be implemented as a DELETE followed by an INSERT behind the scenes. If #2 is true, then the combination of #2 & #3 is a real performance killer. (A quick way to check #1 and #2 is sketched below.)
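
    A sketch of those checks ('YourDatabase' stands in for the real database name):

    SELECT DATABASEPROPERTYEX('YourDatabase', 'Recovery')   -- returns FULL, BULK_LOGGED or SIMPLE

    DBCC SHOWCONTIG ('Claimdata')   -- scan density / extent fragmentation of the heap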

  • Thanks for the reply PW.

    1.) Do you mean a clustered index on FinancialCategories or Claimdata?

    2.) It is running in Simple recovery mode.

    3.) Interesting, there is no transformation mapped to that field, so you're right, it probably is putting NULL into that column. How can I get around this?

    Thanks !

  • You could try putting a transformation from any field into Claimdata.LagCategory, then choosing an ActiveX script and putting a fixed literal in it, equal in length to the longest value that will be placed in this field. Or you could use a DEFAULT on the column definition.

    If you put a DEFAULT on, I would try to do it while Claimdata is empty (if possible).

    If LagCategory will not change, then you could look it up through your join on FinancialCategories.
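
    A minimal sketch of the DEFAULT idea (the constraint name and the 'Z' padding are assumptions; the point is that the placeholder should be at least as long as the longest real LagCategory value, so the later UPDATE does not have to grow the row):

    ALTER TABLE Claimdata
        ADD CONSTRAINT DF_Claimdata_LagCategory
        DEFAULT REPLICATE('Z', 50) FOR LagCategory   -- or a literal padded to the longest real value

    The default only fires when the column is left out of the insert, which appears to be the case here since no transformation maps to it.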

     

  • I would recommend starting by setting up an index on ClaimType on both tables. Depending on the data you get, it could be a clustered index. Having said that, if there is a date field on the Claimdata table, I would use that for the clustered index in combination with ClaimType, and break the update up by date range.

    The general rule when building an index is that your goal is to decrease the number of records SQL will search through, ideally jumping straight to the records it needs. Breaking the update statement up by date ranges will help too (see the sketch below).

    Hope this helps.

    Vadim.
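
    A sketch of this approach; the index names and the ClaimDate column are assumptions, since the thread does not say which date field Claimdata has:

    CREATE CLUSTERED INDEX IX_Claimdata_ClaimDate_ClaimType
        ON Claimdata (ClaimDate, ClaimType)

    CREATE INDEX IX_FinancialCategories_ClaimType
        ON FinancialCategories (ClaimType)

    -- Then run the UPDATE one date range at a time, e.g. a month per pass
    UPDATE Claimdata
    SET Claimdata.LagCategory = FinancialCategories.LagCategory
    FROM Claimdata
    INNER JOIN FinancialCategories ON FinancialCategories.ClaimType = Claimdata.ClaimType
    WHERE Claimdata.ClaimDate >= '20040101'
      AND Claimdata.ClaimDate <  '20040201'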

  • A clustered index on the large table should help the most, but as long as you use some index on the referenced columns you should see better performance.

    Also adding this to the beginning

    set rowcount 100000

    and this to the end

    where Claimdata.LagCategory IS NULL or Claimdata.LagCategory != FinancialCategories.LagCategory

    then either wrap it in a loop or cursor, or use SQL Agent to run 10 or 20 updates with no loop.

    This should keep the size of each transaction nicely small. You should change the rowcount to fit your environment.

    If you want it automated, I'd recommend the cursor, once you've sized the rowcount accordingly.
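
    A minimal sketch of the batched version (a plain WHILE loop rather than a cursor; the 100000 batch size is the figure above and should be tuned; the IS NULL test is needed because LagCategory starts out NULL):

    SET ROWCOUNT 100000

    WHILE 1 = 1
    BEGIN
        UPDATE Claimdata
        SET Claimdata.LagCategory = FinancialCategories.LagCategory
        FROM Claimdata
        INNER JOIN FinancialCategories ON FinancialCategories.ClaimType = Claimdata.ClaimType
        WHERE Claimdata.LagCategory IS NULL
           OR Claimdata.LagCategory != FinancialCategories.LagCategory

        IF @@ROWCOUNT = 0 BREAK    -- nothing left to update
    END

    SET ROWCOUNT 0    -- always reset when done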

  • What about not running the update at all?

    What is the cost of modifying the views that reference the data, so that the join above is used there instead?

    Not running your update would leave your data normalised to some extent.
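
    A sketch of that idea (the view name is an assumption, and the other Claimdata columns, which the thread does not list, would be added to the SELECT):

    CREATE VIEW vClaimdataWithLag
    AS
    SELECT c.ClaimType,
           fc.LagCategory          -- resolved at query time via the join
           -- , c.<other Claimdata columns>
    FROM Claimdata c
    INNER JOIN FinancialCategories fc ON fc.ClaimType = c.ClaimType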

  • Thanks for all the replies! The Clustered Index definitely speeds up the update, but it takes a long time to add the index to the table (all indexes are dropped before the load).

    What about using a Lookup in the Transform Data task? I've never done one before, so 1.) do you think this would be a good idea? and 2.) would this be the right way to go about doing it? Thanks!

    Using the Lookup Query below:

    SELECT FinancialCategories.LagCategory
    FROM FinancialCategories INNER JOIN CLAIMDATASET ON FinancialCategories.ClaimType = CLAIMDATASET.ClaimType
    WHERE (CLAIMDATASET.ClaimType = ?)

    Function Main()
        DTSDestination("LagCategory") = DTSLookups("lkp_LagCategory").Execute(DTSSource("ClaimType"))
        Main = DTSTransformStat_OK
    End Function

  • Have you considered leaving the Clustered index in place and just keeping it at a low fill factor/padding?

    This way your data should load almost as fast, but you won't have to rebuild any indexes, just add the NC indices that support your update. When that is done, add the rest back.

    19 million rows isn't very much. Maybe you need to look at your Disk Drive configuration and available memory too. A standard table with 19 million rows should take 10-15 minutes max to add a one- or two-column Clustered index.
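
    A sketch of the low fill factor idea in SQL Server 2000 syntax (the 70 percent figure and the key column are assumptions):

    CREATE CLUSTERED INDEX IX_Claimdata_ClaimType
        ON Claimdata (ClaimType)
        WITH PAD_INDEX, FILLFACTOR = 70

    The idea is that the free space left on each page gives the weekly load room to insert without page splits, which is why the index can stay in place.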

  • Lookups can be useful if you need to look something up on a different data source.

    With 2 tables in the same data source, a lookup will nearly always be slower. You can always benchmark it, just remember to set the cache size large enough.

    >>

    SELECT FinancialCategories.LagCategory
    FROM FinancialCategories INNER JOIN CLAIMDATASET ON FinancialCategories.ClaimType = CLAIMDATASET.ClaimType
    WHERE (CLAIMDATASET.ClaimType = ?)

    This looks like different functionality than the original UPDATE.

    Where did table CLAIMDATASET enter the picture, and what size is it?
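
    For comparison, a lookup query normally only needs the small reference table, with the parameter supplied by the transform, e.g. (a sketch):

    SELECT LagCategory
    FROM FinancialCategories
    WHERE ClaimType = ?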

  • A couple of things that we have done with similar issues are:

    1. Use WITH (TABLOCKX) to avoid the use of multiple page locks (if this is a problem you will see screenfuls of page locks building up on tempdb).

    2. If possible, split the update into several updates - for example, write a cursor to pick up each category, then for each category update the table (see the sketch below). The aim is to split the 19 million rows into 19 x 1 million. This can work because you're less likely to be over-stretching the memory.
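
    A sketch of the per-category cursor (cursor and variable names are assumptions):

    DECLARE @ClaimType varchar(50)

    DECLARE cat_cursor CURSOR FOR
        SELECT ClaimType FROM FinancialCategories

    OPEN cat_cursor
    FETCH NEXT FROM cat_cursor INTO @ClaimType

    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- One category per transaction, with the table locked up front
        UPDATE c
        SET c.LagCategory = fc.LagCategory
        FROM Claimdata c WITH (TABLOCKX)
        INNER JOIN FinancialCategories fc ON fc.ClaimType = c.ClaimType
        WHERE c.ClaimType = @ClaimType

        FETCH NEXT FROM cat_cursor INTO @ClaimType
    END

    CLOSE cat_cursor
    DEALLOCATE cat_cursor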

  • You could also try avoiding the update altogether by importing your SAS data into a staging table, then having a second transformation that inserts the joined data into Claimdata, i.e. with transformation source SQL such as:

    SELECT StagingClaimdata.<attributes>, FinancialCategories.LagCategory
    FROM StagingClaimdata
    INNER JOIN FinancialCategories
        ON FinancialCategories.ClaimType = StagingClaimdata.ClaimType

     

    If the import from SAS is taking about an hour, this should take much the same time.
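
    The same idea as plain T-SQL, if the second step were run from an Execute SQL Task instead of a Transform Data Task (a sketch; the column list is an assumption, since the full Claimdata column list isn't given here):

    INSERT INTO Claimdata (ClaimType, LagCategory)   -- plus the other Claimdata columns
    SELECT s.ClaimType, fc.LagCategory
    FROM StagingClaimdata s
    INNER JOIN FinancialCategories fc ON fc.ClaimType = s.ClaimType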


    Cheers

    Filet
