Issue with singleton lookup sporadically taking far too long

  • Hey all, got an odd issue we've been working to track down. Here's the summary, followed by some supporting detail. Hopefully someone can help make sense out of why this is happening.

    Summary

    We've got a simple stored procedure that is passed a varchar(20) for a consumer ID, and it outputs a BIGINT with the internal id for that consumer ID. The procedure first performs a query on a unique clustered index which should be a singleton lookup to see if the consumer ID exists - if it does, we output and return. For the singleton lookup perhaps out of every 2000 executions all will perform in 5ms or less, but there will be perhaps 5 outliers that bust the SLA requirement and take as long as 700ms.

    Details

    Some things I've reviewed or steps taken to resolve that haven't panned out:

    1. Tried various variations of FORCESEEK hint

    2. Disabled parameter sniffing via local variable reassignment

    3. REORGANIZED, REBUILD, and also completely DROPPED & CREATED indexes anew

    4. Reviewed locking - I'm not seeing any locking blocking this activity

    5. Reviewed index stats and it's showing no scans having occurred on the indexes on this table

    6. From our application logs I can tie back exactly when this happens to the exact call made to the procedure, rerunning that exact same statement through SSMS or any other methodology yields acceptable results, we can't reproduce the issue we're seeing

    7. We're run perfmon and monitored activity on the server and SQL Server and cannot find any breadcrumbs of unusual activity that would explain this (although when it does happen there appears to be a tiny drop in buffer cache hit ration and a tiny increase in disk queue length, but not enough that would indicate I/O throughput problems or contention)

    8. Reviewed sys.dm_exec_query_stats thoroughly, based on the numbers I don't see where any scans are occurring

    Some interesting numbers from sys.dm_exec_query_stats (not sure how to paste this and have it be more readable, apologies):

    total_elapsed_timemax_elapsed_timeave_time

    72857586511434116

    last_logical_writestotal_logical_readsmax_logical_reads max_logical_writes

    0609630

    last_physical_readsmax_physical_reads

    09

    total_worker_timeave_worker_timemin_worker_timemax_worker_time

    15361675239834

    total_rowsmin_rowsmax_rows

    201301

    I even have the XML portion of the plan this uses, let's see if I can make this show up ok:

    <StmtSimple StatementText="------------------------------------------------------------------------------- -- GET_CONSUMER_ID -- A stored procedure used to either locate an existing CONSUMER, or if an -- existing CONSUMER is not found, create a new CONSUMER record. ------------------------------------------------------------------------------- CREATE PROCEDURE [dbo].[GET_CONSUMER_ID] ( @LOYALTYIDENTIFIER VARCHAR(20), @ID_CONSUMER BIGINT OUT ) AS BEGIN SET NOCOUNT ON; -- Disable parameter sniffing " StatementId="1" StatementCompId="2" StatementType="SET ON/OFF" />

    <StmtSimple StatementText=" DECLARE @MYLOYALTYIDENTIFIER VARCHAR(20); DECLARE @MYID_CONSUMER BIGINT; SET @MYLOYALTYIDENTIFIER = @LOYALTYIDENTIFIER; -- See if the CONSUMER already exists. " StatementId="2" StatementCompId="3" StatementType="ASSIGN" />

    <StmtSimple StatementText=" SELECT @MYID_CONSUMER = c.ID_CONSUMER FROM CONSUMERS c WITH (FORCESEEK (UNI2K_CONSUMERS (CONSUMERIDENTIFIER))) WHERE c.CONSUMERIDENTIFIER = @MYLOYALTYIDENTIFIER; -- The CONSUMER exists, so no need to define them. " StatementId="3" StatementCompId="4" StatementType="SELECT" StatementSubTreeCost="0.0032831" StatementEstRows="1" StatementOptmLevel="TRIVIAL" QueryHash="0xD0FF8631719B6238" QueryPlanHash="0xA7FA19553C9D5AC2">

    <StatementSetOptions QUOTED_IDENTIFIER="true" ARITHABORT="false" CONCAT_NULL_YIELDS_NULL="true" ANSI_NULLS="true" ANSI_PADDING="true" ANSI_WARNINGS="true" NUMERIC_ROUNDABORT="false" />

    <QueryPlan CachedPlanSize="16" CompileTime="0" CompileCPU="0" CompileMemory="112">

    <RelOp NodeId="0" PhysicalOp="Clustered Index Seek" LogicalOp="Clustered Index Seek" EstimateRows="1" EstimateIO="0.003125" EstimateCPU="0.0001581" AvgRowSize="15" EstimatedTotalSubtreeCost="0.0032831" TableCardinality="8.09488e+006" Parallel="0" EstimateRebinds="0" EstimateRewinds="0">

    <OutputList>

    <ColumnReference Database="[PROD]" Schema="[dbo]" Table="[CONSUMERS]" Alias="[c]" Column="ID_CONSUMER" />

    </OutputList>

    <IndexScan Ordered="1" ScanDirection="FORWARD" ForcedIndex="1" ForceSeek="1" ForceSeekColumnCount="1" ForceScan="0" NoExpandHint="0">

    <DefinedValues>

    <DefinedValue>

    <ColumnReference Database="[PROD]" Schema="[dbo]" Table="[CONSUMERS]" Alias="[c]" Column="ID_CONSUMER" />

    </DefinedValue>

    </DefinedValues>

    <Object Database="[PROD]" Schema="[dbo]" Table="[CONSUMERS]" Index="[UNI2K_CONSUMERS]" Alias="[c]" IndexKind="Clustered" />

    <SeekPredicates>

    <SeekPredicateNew>

    <SeekKeys>

    <Prefix ScanType="EQ">

    <RangeColumns>

    <ColumnReference Database="[PROD]" Schema="[dbo]" Table="[CONSUMERS]" Alias="[c]" Column="CONSUMERIDENTIFIER" />

    </RangeColumns>

    <RangeExpressions>

    <ScalarOperator ScalarString="[@MYLOYALTYIDENTIFIER]">

    <Identifier>

    <ColumnReference Column="@MYLOYALTYIDENTIFIER" />

    </Identifier>

    </ScalarOperator>

    </RangeExpressions>

    </Prefix>

    </SeekKeys>

    </SeekPredicateNew>

    </SeekPredicates>

    </IndexScan>

    </RelOp>

    </QueryPlan>

    </StmtSimple>

    Here's the version we're on:

    select @@VERSION

    Microsoft SQL Server 2008 R2 (SP1) - 10.50.2500.0 (X64)

    Jun 17 2011 00:54:03

    Copyright (c) Microsoft Corporation

    Enterprise Edition (64-bit) on Windows NT 6.1 <X64> (Build 7601: Service Pack 1)

    In short it simply looks like every so often worker_time skyrockets for this small singleton lookup, and we cannot explain why. I've been looking through the doc for SQL Server 2008 R2 SP1 CU's 1-4 and I don't see anything that fits this problem. However, after reviewing this from many angles with little progress it's got me looking into more obscure/fringe explanations, such as wondering if this has anything to do with linear regression vs binary search and it's impact in x64 environments. But to be honest I'm hoping there's a more simple/rational reason or rock I've not overturned that someone can help point me towards.

  • I don't have a direct cause to the issue. Btw, might I say an excellent presentation of the issue and what you've tried? 🙂

    There's a couple of places I could see an issue occurring, but as you mention, they're sporadic. First, review your SAN usage and where the LUNs are. Confirm you're not dealing with a competing software on another LUN on the same physical spindles. Next would be network hiccups, that the stream isn't being interrupted long enough to trigger alarms but is affecting you here. Finally I'd confirm that your memory isn't heading to the swapfile, which could cause occassional slowdowns for certain chunks of information.

    Now, for an immediate workaround, build yourself a non-clustered index containing just the VARCHAR(20) lookup and include the ID that you're trying to return. Even if it copies the leading edge of your clustered index, this will remove any subsequent data that you don't need to optimize this particular query. That particular index should be able to stay memory-resident and avoid the majority of round trips that might be necessary for the rest of the data.

    EDIT: I should mention that you're not talking about huge spasms of time here. Yes, the ratio is amazingly different but you've got high throughput of information from the description you've given. Small bumps are exactly what you need to be inspecting when you're looking into tiny, rare discrepencies like you've described. You're still sub-second, so you won't see anything 'big' happen in your general perfmon stats.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • Now, for an immediate workaround, build yourself a non-clustered index containing just the VARCHAR(20) lookup and include the ID that you're trying to return. Even if it copies the leading edge of your clustered index, this will remove any subsequent data that you don't need to optimize this particular query. That particular index should be able to stay memory-resident and avoid the majority of round trips that might be necessary for the rest of the data.

    That brings up a good point but to clarify - they way it's structured today, the BIGINT we're trying to pull back via querying the varchar(20) Unique Clustered Index will require a trip to disk? And by using a Unique Non-Clustered including the BIGINT will give it the chance to pull from memory instead?

    Interesting you mention this - it's one thing we've considered but haven't tried yet, but since this table changes infrequently adding one more index is doable. Another option that went up on the drawing board was removing the Unique Clustered Index from the varchar(20) field and applying it instead to the PK which is the BIGINT that is returned, and instead have a Non-Clustered Unique Index on the varchar(20) field. This why the cluster is included in the index automatically without having to include it as additional storage. But based upon the way the table is used the consensus is we're trying to keep the clustered index as it is today due to how that column is used.

    But all that aside, a singleton lookup on a Unique Clustered Index should always be lightning as long as it's not being blocked (which it's not, and FYI we're also using RCSI as well). But if changing our indexing strategy is the answer to an unexplainable outlier behavior, we can live with that. Will do some more research. In the mean time, any additional thoughts are appreciated.

  • Btw, might I say an excellent presentation of the issue and what you've tried? 🙂

    And thanks!:-D

  • Brian.cs (12/29/2011)


    That brings up a good point but to clarify - they way it's structured today, the BIGINT we're trying to pull back via querying the varchar(20) Unique Clustered Index will require a trip to disk? And by using a Unique Non-Clustered including the BIGINT will give it the chance to pull from memory instead?

    I should clarify. This will depend on the size of the table in existance as it stands today, and how much memory is allowed for SQL Server, and how 'hot' the pages remain before the lazy writer writes them down to the disk to make room for 'hotter' data.

    So, a large table with a lot of data on each record makes for more leaves which takes up more room. A thin index that's constantly battered will remain in memory easier, thus reducing the number of times you have to go to disk for any information that's fallen out of memory. It all has to do with page mechanics inside SQL Server as it tries to regain memory for newer queries and what it decides to drop back to disk.

    The only thing this can help is how 'wide' an area you need to keep this lookup viable. If the table is nothing more than your VARCHAR(20) and the BIGINT, it will do you absolutely no good. If there's a lot of trailing information, it can help avoid round trips. It won't completely avoid them, but it'll help to do so.

    But all that aside, a singleton lookup on a Unique Clustered Index should always be lightning as long as it's not being blocked (which it's not, and FYI we're also using RCSI as well).

    Lightning-speed will always depends on external factors. If you're testing in a quiet system on quiet drives with no contention anywhere and all data already loaded to cache without anything else looking to work with the memory, I'll agree.

    But if changing our indexing strategy is the answer to an unexplainable outlier behavior, we can live with that. Will do some more research. In the mean time, any additional thoughts are appreciated.

    It's not necessarily that it's the answer, what it will do, hopefully, is remove a few variables from the outlyer possibilities, by helping to keep that data in cache. You mentioned a slight bump in cache miss rates earlier, my guess is this has to do with pages going cold and needing to be recovered on occassion. This index should (hopefully) both be faster to recover from disk when needed and will be more likely to stay memory-resident because of its thin-ness and higher hit rates/page, keeping the page 'hotter' to the lazy writer.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • Some general comments:

    I agree with Craig that the current indexing arrangement is probably not optimal for these lookups. A narrow unique index on the loyalty identifier that (implicitly or explicitly) includes the consumer id may be better. The narrow index would require fewer pages than the present clustered index (which 'includes' all INROW columns at the leaf level). You might even consider page compressing the new index if that results in a useful page count reduction. This (and using the index frequently) is about the best you can do to try to keep its pages in buffer pool. The max_physical_reads value of 9 suggests that lookups do occasionally have to wait for pages to come in from disk.

    Another possible cause is having to traverse a long tempdb version chain to find the correct version of the row for RCSI if there is a long-running concurrent transaction that has generated versions for the table in question. If you can tolerate reading data from this table that has not been committed yet (assuming rollbacks are rare) I can see no reason not to perform this singleton lookup at the READUNCOMMITTED isolation level. If that is not preferable, you could also perform the lookup at a higher isolation level that again would not use the version store.

    Strictly, there is no reason to use the local variable trick to avoid sniffing, or the index hint, since the plan is currently a Trivial one. A lookup supported by a covering unique index (even if there is more than one!) is overwhelmingly likely to generate a Trivial plan (edge case exclusions including systems where the cost threshold for parallelism is set lower than the trivial plan cost). Overall, I would probably replace the local variable trick with the OPTIMIZE FOR UNKNOWN hint, and add a KEEPFIXED PLAN hint to avoid recompilations for optimality reasons (e.g. when statistics on the table are updated)...something like:

    SELECT

    @ID_CONSUMER = c.id_consumer

    FROM dbo.Consumers AS c WITH (READUNCOMMITTED /* or maybe REPEATABLEREAD*/)

    WHERE

    c.ConsumerIdentifier = @LOYALTYIDENTIFIER

    OPTION (KEEPFIXED PLAN, OPTIMIZE FOR UNKNOWN)

    Seeing as response time is more important than plan optimality in your system, you might want to double-check that asynchronous statistics updates are enabled. Last thing, SQL Server only uses linear interpolation on indexes with a numeric key type (see my blog post).

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply