Performance Problem that I can't pin down

  • Tx, and now these 2?

    SELECT* FROM sys.dm_exec_query_optimizer_info

    Don't do any of the recommendations, it's just very rough guidelines for us.

    SELECT

    optz.counter

    , optz.occurrence

    , CONVERT(DECIMAL(18,2), optz.value) AS AvgValue

    , conf.name

    , conf.value

    , conf.value_in_use

    , Uptime.DaysUptime AS [Days Server UPTIME & Last Stats Reset]

    , CASE WHEN Uptime.DaysUptime < 45 THEN 'You may not have very meaningful stats because of a recent restart' ELSE NULL END AS [Stats Warning]

    , CASE WHEN optz.value < conf.value THEN 'Cost threshold for parallelism might be too low' ELSE NULL END AS [Cost Threshold Warning]

    , CASE WHEN conf.value_in_use <> conf.value THEN 'Server needs to be restarted for the setting to take effect' ELSE NULL END [Restart Warning]

    FROM

    sys.dm_exec_query_optimizer_info optz

    CROSS JOIN sys.configurations conf

    OUTER APPLY(

    SELECT

    CONVERT(DECIMAL(18 , 2) , DATEDIFF(hh , create_date , GETDATE()) / 24.0) AS DaysUptime

    FROM

    sys.databases

    WHERE

    name = 'tempdb'

    ) Uptime

    WHERE

    optz.counter = 'final cost'

    AND conf.name = 'cost threshold for parallelism'

    GO

  • http://weblogs.sqlteam.com/tarad/archive/2010/11/08/max-degree-of-parallelism-server-side-setting.aspx

    FYI, for me it recommends 4, but as I said, I set to 6.

  • Bob Fazio (9/12/2011)


    http://weblogs.sqlteam.com/tarad/archive/2010/11/08/max-degree-of-parallelism-server-side-setting.aspx

    FYI, for me it recommends 4, but as I said, I set to 6.

    Ok, as I said, not even junior at tuning that big of a server. So now that you obviously took care of that... we can move on :-).

  • counteroccurrencevalue

    optimizations690881

    elapsed time690880.00212443550254748

    final cost690887.09694068392164

    trivial plan517621

    tasks17326623.287082996652

    no plan0NULL

    search 063101

    search 0 time63110.00500285216289019

    search 0 tasks6311900.215496751703

    search 1110041

    search 1 time110160.00153567538126362

    search 1 tasks11016464.412490922295

    search 2121

    search 2 time130.000307692307692308

    search 2 tasks13141.846153846154

    gain stage 0 to stage 110.00356762375769612

    gain stage 1 to stage 20NULL

    timeout271

    memory limit exceeded0NULL

    insert stmt94481

    delete stmt4931

    update stmt17041

    merge stmt0NULL

    contains subquery4151

    unnest failed5581

    tables690881.56905685502547

    hints801

    order hint11

    join hint0NULL

    view reference123571

    remote query931

    maximum DOP690880.000101320055581288

    maximum recursion level0NULL

    indexed views loaded0NULL

    indexed views matched0NULL

    indexed views used0NULL

    indexed views updated0NULL

    dynamic cursor request0NULL

    fast forward cursor request241

    counteroccurrenceAvgValuenamevaluevalue_in_useDays Server UPTIME & Last Stats ResetStats WarningCost Threshold WarningRestart Warning

    final cost690897.10cost threshold for parallelism112.83You may not have very meaningful stats because of a recent restartCost threshold for parallelism might be too lowNULL

    Just so you know, I'm pretty green with a system this big as well. Its been running on here for a year, but we hadn't seen even a hint of an issue until about a month ago.

  • Bob Fazio (9/12/2011)


    counteroccurrencevalue

    optimizations690881

    elapsed time690880.00212443550254748

    final cost690887.09694068392164

    trivial plan517621

    tasks17326623.287082996652

    no plan0NULL

    search 063101

    search 0 time63110.00500285216289019

    search 0 tasks6311900.215496751703

    search 1110041

    search 1 time110160.00153567538126362

    search 1 tasks11016464.412490922295

    search 2121

    search 2 time130.000307692307692308

    search 2 tasks13141.846153846154

    gain stage 0 to stage 110.00356762375769612

    gain stage 1 to stage 20NULL

    timeout271

    memory limit exceeded0NULL

    insert stmt94481

    delete stmt4931

    update stmt17041

    merge stmt0NULL

    contains subquery4151

    unnest failed5581

    tables690881.56905685502547

    hints801

    order hint11

    join hint0NULL

    view reference123571

    remote query931

    maximum DOP690880.000101320055581288

    maximum recursion level0NULL

    indexed views loaded0NULL

    indexed views matched0NULL

    indexed views used0NULL

    indexed views updated0NULL

    dynamic cursor request0NULL

    fast forward cursor request241

    counteroccurrenceAvgValuenamevaluevalue_in_useDays Server UPTIME & Last Stats ResetStats WarningCost Threshold WarningRestart Warning

    final cost690897.10cost threshold for parallelism112.83You may not have very meaningful stats because of a recent restartCost threshold for parallelism might be too lowNULL

    Just so you know, I'm pretty green with a system this big as well. Its been running on here for a year, but we hadn't seen even a hint of an issue until about a month ago.

    Ignore those, wrong server

  • I wouldn't take any decision based on those stats.

    The cost threshold is however at 1 with the default value being 5. This might be something I'd consider but I like to see ideally 1 full month of data. 2 days is just way to short for this.

    Is this server mostly olap or oltp?

  • counteroccurrencevalue

    optimizations197374501

    elapsed time197366460.0063553188317777

    final cost1973664745.9971011158534

    trivial plan73155451

    tasks12421103556.496919637491

    no plan0NULL

    search 014025791

    search 0 time14140600.0132889099472444

    search 0 tasks14140601250.81451847871

    search 1110162501

    search 1 time112608820.003472970856102

    search 1 tasks11260882449.037774305778

    search 222741

    search 2 time62730.155596046548701

    search 2 tasks627313871.7344173442

    gain stage 0 to stage 1112290.147846452882183

    gain stage 1 to stage 22230.00432500828511371

    timeout5672551

    memory limit exceeded0NULL

    insert stmt100578821

    delete stmt1048871

    update stmt23912821

    merge stmt32721

    contains subquery5018291

    unnest failed2621091

    tables197374501.45986132960438

    hints330341

    order hint6851

    join hint141

    view reference11182211

    remote query88060781

    maximum DOP197374500.183442187313964

    maximum recursion level0NULL

    indexed views loaded0NULL

    indexed views matched0NULL

    indexed views used0NULL

    indexed views updated0NULL

    dynamic cursor request22221

    fast forward cursor request69701

    counteroccurrenceAvgValuenamevaluevalue_in_useDays Server UPTIME & Last Stats ResetStats WarningCost Threshold WarningRestart Warning

    final cost1973666346.00cost threshold for parallelism55205.08NULLCost threshold for parallelism might be too lowNULL

  • Ninja's_RGR'us (9/12/2011)


    I wouldn't take any decision based on those stats.

    The cost threshold is however at 1 with the default value being 5. This might be something I'd consider but I like to see ideally 1 full month of data. 2 days is just way to short for this.

    Is this server mostly olap or oltp?

  • Almost exclusive OLTP.

  • Bob Fazio (9/12/2011)


    Almost exclusive OLTP.

    Unless Grant has something else in mind I'd go with Cost Treshold at 25-35. I'm going more on the low end here to get more buffer for the "olap" queries which do benefit from parallelism.

    Another thing you might play with is actually giving more cpus to those olap queries. Maybe 8 instead of 4. the peak would be higher, but shorter and still leaving 40 other cores to work on the "normal stuff".

    However at this point I'd really like to be looking into this =>

    http://www.simple-talk.com/sql/performance/finding-the-causes-of-poor-performance-in-sql-server,-part-1

    http://www.simple-talk.com/sql/performance/finding-the-causes-of-poor-performance-in-sql-server,-part-2

    You could also corelate the times you have in the perform counters to those traces. At this point it's just a matter of find and tune. I really think you have more of a targeted problem with a few queries rather than a full server setup issue (especially if it ran ok for months).

  • This can take a guess at the avg, min & max cost of the olap queries.

    this uses the cached plan so it won't be 100% accurate but for guesstimate avg it might do the trick.

    This takes 1 min to process 8K plans on my prod server and it trashes the cpus while doing so. You might want to be around if you start this during the day in case users complain or run it at night (when they CAN'T complain :-D).

    SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

    WITH XMLNAMESPACES

    (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')

    SELECT

    query_plan AS CompleteQueryPlan,

    n.value('(@StatementText)[1]', 'VARCHAR(4000)') AS StatementText,

    n.value('(@StatementOptmLevel)[1]', 'VARCHAR(25)') AS StatementOptimizationLevel,

    n.value('(@StatementOptmEarlyAbortReason)[1]', 'VARCHAR(25)') AS StatementOptmEarlyAbortReason,

    n.value('(@StatementSubTreeCost)[1]', 'DECIMAL(18, 6)') AS StatementSubTreeCost,

    n.value('(QueryPlan/@CompileCPU)[1]', 'DECIMAL(18, 0)') AS CompileCPU,

    n.value('(QueryPlan/@CompileTime)[1]', 'DECIMAL(18, 0)') AS CompileTime,

    n.value('(QueryPlan/@CompileMemory)[1]', 'DECIMAL(18, 0)') AS CompileMemory,

    --Available only in the actual plan

    --n.value('(QueryPlan/@DegreeOfParallelism)[1]', 'DECIMAL(18, 0)') AS DegreeOfParallelism,

    n.query('.').exist('//MissingIndexes') AS HasMissingIndexes,

    n.query('.') AS ParallelSubTreeXML,

    ecp.usecounts,

    ecp.size_in_bytes

    INTO #X

    FROM sys.dm_exec_cached_plans AS ecp

    CROSS APPLY sys.dm_exec_query_plan(plan_handle) AS eqp

    CROSS APPLY query_plan.nodes('/ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple') AS qn(n)

    WHERE n.query('.').exist('//RelOp[@PhysicalOp="Parallelism"]') = 1

    --WHERE LOWER(n.value('(@StatementText)[1]', 'VARCHAR(4000)')) LIKE '%rpt_equipementsetaccessoires_bureaux%'

    SELECT * FROM #X

    SELECT

    --Works only with actual plans

    --SUM(usecounts * DegreeOfParallelism) / SUM(usecounts) AS Weigthed_Avg_DOP,

    SUM(usecounts * StatementSubTreeCost) / SUM(usecounts) AS Weigthed_Avg_SubTreeCost,

    MIN(StatementSubTreeCost) AS Min_StatementSubTreeCost,

    MAX(StatementSubTreeCost) AS Max_StatementSubTreeCost

    FROM

    #X

    --DROP TABLE #X

  • I appreciate the knowledge and the time. I will state that since the changes Friday, we have only 1 query that still runs in parallel and that query most definitely should continue to run in parallel.

    I intend to do a full review of our environment and try and identify area's of improvement. I intend to let the configuration sit a while (Unless the problem arises again) before I make any more changes to the system.

  • Bob Fazio (9/12/2011)


    I appreciate the knowledge and the time. I will state that since the changes Friday, we have only 1 query that still runs in parallel and that query most definitely should continue to run in parallel.

    I intend to do a full review of our environment and try and identify area's of improvement. I intend to let the configuration sit a while (Unless the problem arises again) before I make any more changes to the system.

    Ok, here's a nice 60 minutes blitz script when taking over a server.

    Of course it takes 2.5X that to explain it :-D.

    http://www.youtube.com/watch?v=40_Ilr-Mysk

  • There is also a patch we are strongly looking at. http://support.microsoft.com/default.aspx?scid=kb;EN-US;976700

    The reason why we are leaning towards the patch, is because I think everyone agrees, one bad query shouldn't take down a server of this (or really any) size.

    It's also included in SP1, which we might also apply.

  • Bob Fazio (9/12/2011)


    There is also a patch we are strongly looking at. http://support.microsoft.com/default.aspx?scid=kb;EN-US;976700

    The reason why we are leaning towards the patch, is because I think everyone agrees, one bad query shouldn't take down a server of this (or really any) size.

    It's also included in SP1, which we might also apply.

    Have you proven this is your problem?

    I promise you I have seem a SINGLE simple query take down entire servers. At least 3 times over 8 years.

    I'd really start with Gail's article & trace before guessing with patches.

Viewing 15 posts - 16 through 30 (of 33 total)

You must be logged in to reply to this topic. Login to reply