Query execution plan different between production/test - same data

  • I have a query that runs much much slower on my production server than my test server. By tweaking the query I can make it run fast in production. Why can't I make prod act like test? The difference in execution time is 15 minutes! (test - 5 seconds/prod - 15 minutes). I need to make PROD more like TEST, or at least TEST should be slower than PROD (what is the point of load testing?)

    My test server has the same SQL version/patch level and OS version/patch level as my production server. The data is a replica. The query execution plans are different. The production server has more ram, faster CPUs (dual), and faster disk. Production is under light utilization (<20%). It is reindexed each night. I have updated statistics on all tables involved. I ran a server configuration and schema configuration comparison with RedGate software and all items are indenticle (save ram/cpu/names of jobs). When the slow query runs on production it chews one of the CPU's @50% for all 15 minutes. The problem has slowly gotten worse of the past month. The database is only 120mb and is the only db so far on this server. The DB server is dedicated, there are no other uses for it. Its paired app server is under very low utilization also. As I said, by tweaking the query the query time goes from 15 minutes down to 4 seconds without changing the result set. The main gotcha is that my test server runs the fast execution plan with or without tweaking the query. I need it to run fast without tweaking so my users can run ad hoc queries without me having to get involved. I know it is possible, look at my test box!!

    I am stumped on this performance problem. Please help if you can with any thoughts or suggestions for getting MSSQL to run consistant with respect to execution plans. Our Oracle DBA is laughing...

  • Part of the problem is the execution plans are based on the load on the server. Since I've managed people with both Oracle, DB2, and SQL Server, the SQL installations by far run smoother, so I'm not sure I'd care what an Oracle DBA thinks.

    To get better load testing, part of the thing may be to use Profiler and generate some of a load on the test server. Be sure you update stats, reindex, etc. the same on the test server, don't just restore.

     

  • How many CPUs does your test server have? What are the difference in execution plans betwen production and test servers? Can you post sp_configure's results from bth server here?

  • Prod has 2 cpus, Test has 1.


    Difference in execution plans (excerpt):

    Prod (15 minutes)

    | | |--Sort(ORDER BY[Shifts].[dtShiftDate] ASC, [Shifts].[chMachineID] ASC))

    | | | |--Nested Loops(Inner Join, OUTER REFERENCES[Operations].[vchOperation]))

    | | | |--Clustered Index Scan(OBJECT[ShopFloor].[dbo].[Operations].[PK__Operations__76969D2E] AS [Operations]))

    | | | |--Nested Loops(Inner Join, WHEREIsFalseOrNull(Convert([Operations].[bitIncludeInMachineHrs])=1))OUTER REFERENCES[Shifts].[chShiftID]))

    | | | |--Nested Loops(Left Outer Join, OUTER REFERENCES[Shifts].[chShiftID]))

    | | | | |--Nested Loops(Left Outer Join, OUTER REFERENCES[Shifts].[vchOperation]))

    | | | | | |--Nested Loops(Left Outer Join, OUTER REFERENCES[Shifts].[chPartID], [Shifts].[dtShiftDate]) WITH PREFETCH)

    | | | | | | |--Hash Match(Right Outer Join, HASH[Union1021])=([Shifts].[intShiftCoordinator]), RESIDUAL[Union1021]=[Shifts].[intShiftCoordinator]))

    | | | | | | | |--Sort(DISTINCT ORDER BY[Union1018] ASC, [Union1019] ASC, [Union1020] ASC, [Union1021] ASC, [Union1022] ASC, [Union1023] ASC))

    | | | | | | | | |--Concatenation

    | | | | | | | | |--Compute Scalar(DEFINE[Expr1012]=rtrim([Employees].[FirstName])+' '+[Employees].[LastName], [Expr1013]=Convert([Employees].[Emp #])+replicate('*', 18-len(Convert([Employees].[Emp #

    | | | | | | | | | |--Clustered Index Scan(OBJECT[ShopFloor].[dbo].[Employees].[PK_Employees]), WHEREConvert([Employees].[Terminated])1))

    | | | | | | | | |--Compute Scalar(DEFINE[Expr1016]=rtrim([IntroAssociates].[FName])+' '+[IntroAssociates].[LName], [Expr1017]=Convert([IntroAssociates].[TempID])+replicate('*', 18-len(Convert([Intro

    | | | | | | | | |--Clustered Index Scan(OBJECT[ShopFloor].[dbo].[IntroAssociates].[PK_IntroAssociates]), WHERE[IntroAssociates].[Status]='Interim'))

    | | | | | | | |--Clustered Index Scan(OBJECT[ShopFloor].[dbo].[Shifts].[PK__Shifts__0425A276]), WHERE([Operations].[vchOperation]=[Shifts].[vchOperation] AND Convert([Shifts].[dtShiftDate])>dateadd(month,

    | | | | | | |--Clustered Index Seek(OBJECT[ShopFloor].[dbo].[PartMetrics].[PK__PartMetrics__1CBC4616]), SEEK[PartMetrics].[chPartID]=[Shifts].[chPartID] AND [PartMetrics].[dtMonthYear]=dateadd(day, -datepar

    | | | | | |--Clustered Index Seek(OBJECT[ShopFloor].[dbo].[Operations].[PK__Operations__76969D2E]), SEEK[Operations].[vchOperation]=[Shifts].[vchOperation]) ORDERED FORWARD)

    | | | | |--Hash Match(Cache, HASH[Shifts].[chShiftID]), RESIDUAL[Shifts].[chShiftID]=[Shifts].[chShiftID]))

    | | | | |--Compute Scalar(DEFINE[Expr1032]=If ([Expr1093]=0) then NULL else [Expr1094]))

    | | | | |--Stream Aggregate(DEFINE[Expr1093]=COUNT_BIG(datediff(minute, [OperatorLogins].[dtLogin], [OperatorLogins].[dtLogout])), [Expr1094]=SUM(datediff(minute, [OperatorLogins].[dtLogin], [OperatorLogin

    | | | | |--Clustered Index Scan(OBJECT[ShopFloor].[dbo].[OperatorLogins].[PK_OperatorLogins]), WHERE[OperatorLogins].[dtLogout]NULL AND [OperatorLogins].[chShiftID]=[Shifts].[chShiftID]))

    | | | |--Hash Match(Cache, HASH[Shifts].[chShiftID]), RESIDUAL[Shifts].[chShiftID]=[Shifts].[chShiftID]))

    | | | |--Stream Aggregate(DEFINE[Expr1037]=MAX([OperatorLogins].[dtLogout]), [Expr1038]=MIN([OperatorLogins].[dtLogin])))

    | | | |--Clustered Index Scan(OBJECT[ShopFloor].[dbo].[OperatorLogins].[PK_OperatorLogins]), WHERE[OperatorLogins].[chShiftID]=[Shifts].[chShiftID]))

    | | |--Clustered Index Seek(OBJECT[ShopFloor].[dbo].[Parts].[PK__Parts__7D78A4E7] AS [Parts]), SEEK[Parts].[chPartID]=[Shifts].[chPartID]) ORDERED FORWARD)


    Test (5 seconds)

    | | |--Sort(ORDER BY[Shifts].[dtShiftDate] ASC, [Shifts].[chMachineID] ASC))

    | | | |--Hash Match(Inner Join, HASH[Operations].[vchOperation])=([Shifts].[vchOperation]), RESIDUAL[Operations].[vchOperation]=[Shifts].[vchOperation]))

    | | | |--Clustered Index Scan(OBJECT[ShopFloor].[dbo].[Operations].[PK__Operations__76969D2E] AS [Operations]))

    | | | |--Nested Loops(Inner Join, WHEREIsFalseOrNull(Convert([Operations].[bitIncludeInMachineHrs])=1))OUTER REFERENCES[Shifts].[chShiftID]))

    | | | |--Hash Match(Right Outer Join, HASH[OperatorLogins].[chShiftID])=([Shifts].[chShiftID]), RESIDUAL[OperatorLogins].[chShiftID]=[Shifts].[chShiftID]))

    | | | | |--Compute Scalar(DEFINE[Expr1033]=If ([Expr1094]=0) then NULL else [Expr1095]))

    | | | | | |--Hash Match(Aggregate, HASH[OperatorLogins].[chShiftID]), RESIDUAL[OperatorLogins].[chShiftID]=[OperatorLogins].[chShiftID]) DEFINE[Expr1094]=COUNT_BIG(datediff(minute, [OperatorLogins].[dtLogin],

    | | | | | |--Clustered Index Scan(OBJECT[ShopFloor].[dbo].[OperatorLogins].[PK_OperatorLogins]), WHERE[OperatorLogins].[dtLogout]NULL))

    | | | | |--Hash Match(Right Outer Join, HASH[Operations].[vchOperation])=([Shifts].[vchOperation]), RESIDUAL[Operations].[vchOperation]=[Shifts].[vchOperation]))

    | | | | |--Clustered Index Scan(OBJECT[ShopFloor].[dbo].[Operations].[PK__Operations__76969D2E]))

    | | | | |--Hash Match(Right Outer Join, HASH[PartMetrics].[chPartID], [PartMetrics].[dtMonthYear])=([Shifts].[chPartID], [Expr1092]), RESIDUAL[PartMetrics].[chPartID]=[Shifts].[chPartID] AND [PartMetrics].[dt

    | | | | |--Clustered Index Scan(OBJECT[ShopFloor].[dbo].[PartMetrics].[PK__PartMetrics__1CBC4616]))

    | | | | |--Compute Scalar(DEFINE[Expr1092]=dateadd(day, -datepart(day, Convert([Shifts].[dtShiftDate]))+1, [Shifts].[dtShiftDate])))

    | | | | |--Hash Match(Right Outer Join, HASH[Union1021])=([Shifts].[intShiftCoordinator]), RESIDUAL[Union1021]=[Shifts].[intShiftCoordinator]))

    | | | | |--Sort(DISTINCT ORDER BY[Union1018] ASC, [Union1019] ASC, [Union1020] ASC, [Union1021] ASC, [Union1022] ASC, [Union1023] ASC, [Union1024] ASC))

    | | | | | |--Concatenation

    | | | | | |--Compute Scalar(DEFINE[Expr1012]=rtrim([Employees].[FirstName])+' '+[Employees].[LastName], [Expr1013]=Convert([Employees].[Emp #])+replicate('*', 18-len(Convert([Employees].[

    | | | | | | |--Clustered Index Scan(OBJECT[ShopFloor].[dbo].[Employees].[PK_Employees]), WHEREConvert([Employees].[Terminated])1))

    | | | | | |--Compute Scalar(DEFINE[Expr1016]=rtrim([IntroAssociates].[FName])+' '+[IntroAssociates].[LName], [Expr1017]=Convert([IntroAssociates].[TempID])+replicate('*', 18-len(Convert([

    | | | | | |--Clustered Index Scan(OBJECT[ShopFloor].[dbo].[IntroAssociates].[PK_IntroAssociates]), WHERE[IntroAssociates].[Status]='Interim'))

    | | | | |--Clustered Index Scan(OBJECT[ShopFloor].[dbo].[Shifts].[PK__Shifts__0425A276]), WHEREConvert([Shifts].[dtShiftDate])>dateadd(month, -5, getdate()-Convert(datepart(day, getdate()))+'Ja

    | | | |--Hash Match(Cache, HASH[Shifts].[chShiftID]), RESIDUAL[Shifts].[chShiftID]=[Shifts].[chShiftID]))

    | | | |--Stream Aggregate(DEFINE[Expr1038]=MAX([OperatorLogins].[dtLogout]), [Expr1039]=MIN([OperatorLogins].[dtLogin])))

    | | | |--Index Spool(SEEK[OperatorLogins].[chShiftID]=[Shifts].[chShiftID]))

    | | | |--Clustered Index Scan(OBJECT[ShopFloor].[dbo].[OperatorLogins].[PK_OperatorLogins]))

    | | |--Clustered Index Seek(OBJECT[ShopFloor].[dbo].[Parts].[PK__Parts__7D78A4E7] AS [Parts]), SEEK[Parts].[chPartID]=[Shifts].[chPartID]) ORDERED FORWARD)


    Production SP_Configure

    name minimum maximum config_value run_value

    ----------------------------------- ----------- ----------- ------------ -----------

    affinity mask -2147483648 2147483647 0 0

    allow updates 0 1 0 0

    awe enabled 0 1 0 0

    c2 audit mode 0 1 0 0

    cost threshold for parallelism 0 32767 5 5

    Cross DB Ownership Chaining 0 1 0 0

    cursor threshold -1 2147483647 -1 -1

    default full-text language 0 2147483647 1033 1033

    default language 0 9999 0 0

    fill factor (%) 0 100 0 0

    index create memory (KB) 704 2147483647 0 0

    lightweight pooling 0 1 0 0

    locks 5000 2147483647 0 0

    max degree of parallelism 0 32 0 0

    max server memory (MB) 4 2147483647 2147483647 2147483647

    max text repl size (B) 0 2147483647 65536 65536

    max worker threads 32 32767 255 255

    media retention 0 365 0 0

    min memory per query (KB) 512 2147483647 1024 1024

    min server memory (MB) 0 2147483647 0 0

    nested triggers 0 1 1 1

    network packet size (B) 512 65536 4096 4096

    open objects 0 2147483647 0 0

    priority boost 0 1 0 0

    query governor cost limit 0 2147483647 0 0

    query wait (s) -1 2147483647 -1 -1

    recovery interval (min) 0 32767 0 0

    remote access 0 1 1 1

    remote login timeout (s) 0 2147483647 20 20

    remote proc trans 0 1 0 0

    remote query timeout (s) 0 2147483647 600 600

    scan for startup procs 0 1 0 0

    set working set size 0 1 0 0

    show advanced options 0 1 1 1

    two digit year cutoff 1753 9999 2049 2049

    user connections 0 32767 0 0

    user options 0 32767 0 0

    TEST SP_Configure

    name minimum maximum config_value run_value

    ----------------------------------- ----------- ----------- ------------ -----------

    affinity mask -2147483648 2147483647 0 0

    allow updates 0 1 0 0

    awe enabled 0 1 0 0

    c2 audit mode 0 1 0 0

    cost threshold for parallelism 0 32767 5 5

    Cross DB Ownership Chaining 0 1 0 0

    cursor threshold -1 2147483647 -1 -1

    default full-text language 0 2147483647 1033 1033

    default language 0 9999 0 0

    fill factor (%) 0 100 0 0

    index create memory (KB) 704 2147483647 0 0

    lightweight pooling 0 1 0 0

    locks 5000 2147483647 0 0

    max degree of parallelism 0 32 0 0

    max server memory (MB) 4 2147483647 2147483647 2147483647

    max text repl size (B) 0 2147483647 65536 65536

    max worker threads 32 32767 255 255

    media retention 0 365 0 0

    min memory per query (KB) 512 2147483647 1024 1024

    min server memory (MB) 0 2147483647 0 0

    nested triggers 0 1 1 1

    network packet size (B) 512 65536 4096 4096

    open objects 0 2147483647 0 0

    priority boost 0 1 0 0

    query governor cost limit 0 2147483647 0 0

    query wait (s) -1 2147483647 -1 -1

    recovery interval (min) 0 32767 0 0

    remote access 0 1 1 1

    remote login timeout (s) 0 2147483647 20 20

    remote proc trans 0 1 0 0

    remote query timeout (s) 0 2147483647 600 600

    scan for startup procs 0 1 0 0

    set working set size 0 1 0 0

    show advanced options 0 1 1 1

    two digit year cutoff 1753 9999 2049 2049

    user connections 0 32767 0 0

    user options 0 32767 0 0

  • Try configure SQL Server production to run without parallel.

    sp_configure 'allow updates', 1

    go

    reconfigure with override

    go

    sp_configure 'max degree of parallelism', 1

    go

    reconfigure with override

    go

    sp_configure 'allow updates', 0

    go

    reconfigure with override

  • Thanks Steve, I didn't think of using profiler to simulate usage. If the query optimizer uses utilization as a statistic then it could be affecting. However, I can't the slow query to run fast on the prod server without tweaking the query (unless you count the times it used to run fast before whatever event makes it run slow happened, several weeks ago). And when I say tweaking the query, there are two things that will make this query run fast on prod.

    1) change 'where t1.field = t2.field' type joins to inner joins

    2) add a superficial where clause that doesn't restrict the result set (like 'where dtShift <= getdate()').

    As far as the Oracle DBA, you words don't bring much comfort for me at the moment haa haa.

  • I ran this and then re-ran the prod server query. I stopped it after 3 minutes, doesn't seem to have helped. I set it back to 0.

    Thanks for the idea.

  • Do you compare the execution plan after the change?

  • Can you verify the MDAC versions on test and production ?

  • For what it's worth, try running the query on the prod server, appending OPTION (MAXDOP 1) to it.  Happened to a few of my queries before on multi-CPU systems, where parallelism made the query worse.  There's an MSDN document on this too, where using MAXDOP is one of the workarounds.

    SQL BAK Explorer - read SQL Server backup file details without SQL Server.
    Supports backup files created with SQL Server 2005 up to SQL Server 2017.

  • Another possibility is that your server cached ‘slow’ execution plan and keeps reuse it.

    If you can find a query that looks like your ‘slow’ query in sql column of  

    select * from syscacheobjects   (it does not show all query).

     

    You could try to clear cache by

    DBCC FREEPROCCACHE   

     

    then ran you ‘fast’ query (with dtShiftDate <= GETDATE() ) and HOPE that SQL server will reuse this plan next time somebody will submit ‘slow’ query.

     

    Igor

  • Can you try this query with MAXDOP(1) hint.

    May be this will solve the problem

  • Thank you for the suggestion Igor! I searched that syscacheobjects table and found about 15 or 20 items similar to my query. I cleared the cache and then verified the table was empty. I then ran the tweaked query (with your previously proposed INNER JOIN syntax) and the results came back in 5 seconds. I then searched the syscacheobjects table and no new item had been cached. I then ran my original Excel query and it once again ran slowly (of course returning the same recordset sorted the same way, 5025 records).

    Also it is interesting, the 'dtShiftDate <= GETDATE()' tweak no longer runs fast, it now also chooses the slower execution plan. I have to use the INNER JOIN syntax to get the proper execution plan. This is so frustrating!!!

  • According to the MS Component Checker:

    Test is MDAC 2.8 RTM

    Prod is MDAC 2.8 RTM

    I think that is up to date, thanks for the idea!

  • Thank you for the idea, I tried OPTION(MAXDOP 1) again and it has no affect on the slow query execution plan nor execution time.

Viewing 15 posts - 1 through 15 (of 26 total)

You must be logged in to reply to this topic. Login to reply