Self-Join Update

  • My company gets http logs imported to SQL for processing. Table definition (partial) is

    log_user  varchar(10)

    ip           varchar(50)

    log_date  datetime

    In about 60% of the rows, log_user column is null. I then get the log_user where the ip and log_date match. Here is the update query

     UPDATE http_log_process

     SET log_user = b.log_user

     FROM http_log_process INNER JOIN http_log_process b ON

     CONVERT(varchar(12), http_log_process.log_date, 101) = CONVERT(varchar(12), b.log_date, 101)

     AND http_log_process.ip = b.ip

     WHERE http_log_process.log_user IS NULL and b.log_user IS NOT NULL

     AND http_log_process.log_date IS NOT NULL

    This query takes about 2 minutes for the update. The table has 20K rows and 10K rows are updated. Can this query be fine-tuned ?

  • Well, I doubt it.

    I'm guessing that you're getting scans all the way, because you have functions on the dates and also negations in the wehere clause, all things that invalidates indexusage. But... since the volume updated is 50% of the data, I sincerely doubt that any indexes would be used anyway.

    My guess is that this update is happening as fast as the server can muster, which makes the bottleneck the disksystem, and not the query.

    Though it would be interesting to see what the current plan looks like.. Are you getting tablescans, or are any indexes involved? Also, are there any triggers on the table?

    /Kenneth

  • First, make sure you have indexes in ip and log_date 

    then instead of two converts:

    CONVERT(varchar(12), http_log_process.log_date, 101) = CONVERT(varchar(12), b.log_date, 101)

     you can use

    datediff(day,http_log_process.log_date,b.log_date) = 0

    hth

     


    * Noel

  • The current plan does not show any table scans (there was a Hash Match/Inner Join taking the most processing time). Also no triggers on the table.

    I also tried having a computed column to the date only (without the time) and did not have any performance gain.

    I have indexes on ip and log_date. Changing the CONVERT to Datediff brought the update to under 25 seconds.

    Thanks guys for your input.

  • You wouldn't see any boost in performance in the computed column solution unless it was indexed as well...

  • Classic issue of needing indexing to speed up selecting process, but at the same time not wanting indexing as this may slow down the update if you are changing a field with an index on. 

    Is there an index on the log_user field?

  • Yes, there is an index on log_user.

    I got an error when creating index on the computed column

    alter table http_log_process add log_date_only AS

    CONVERT(datetime,CONVERT(varchar(12), log_date, 101) )

    CREATE INDEX http_log_process_idx6 ON http_log_process ( log_date_only)

    Cannot create index because the key column 'log_date_only' is non-deterministic or imprecise.

    But I was able to create the index if the computed column was

    alter table http_log_process add log_date_only AS

    CONVERT(varchar(12), log_date, 101).

  • try:

    alter table http_log_process add log_date_only AS

    dateadd(d,0,datediff(day,0,log_date))

     


    * Noel

  • I was able to create the computed column with dateadd(d,0,datediff(day,0,log_date)) but still got the same error while creating the index on the computed column.

    And the update query was taking over 5 minutes (killed the update ) when using the computed column. When I use the  

    datediff(day,http_log_process.log_date,b.log_date) = 0

    the update was done in 25 seconds.

  • Check the books online. You need to convert the data to the ainsi complient format (112 I think). That will make the column deterministic so you can index it.

  • Actually I forgot that datediff is NOT deterministic even though dateadd is

    The convert is only deterministic with datetime data type only when style is specified. Because you are getting rid of the time, further conversion may not be necessary.

    try:

     alter table http_log_process add log_date_only AS

    CONVERT(char(8), log_date, 112) -- this gives yyyymmdd 

     


    * Noel

  • Whoa - this update was done in 1 sec.

    Thanks Joe - I made one small change though

    UPDATE http_log_process

       SET log_user

           = (SELECT MAX(log_user)

                FROM http_log_process AS B

               WHERE http_log_process.log_date = B.log_date

                 AND http_log_process.ip = B.ip AND b.log_user IS NOT NULL)

      WHERE log_user IS NULL;

     

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply