Why does this WHERE condition take forever to execute?

  • This is probably basic stuff, but table1 (~50 000 rows) has an integer field (SomeIntVal) that references the identity field in table2 (~50 000 rows). Integrity is not enforced so values in table1.SomeIntVal may be NULL or non existing in table2.id.

    I want to query all the "dirty" rows in table1:

    SELECT count(*)

    FROM table1

    WHERE SomeIntVal IS NULL OR SomeIntVal NOT IN (SELECT id FROM table2)

    Why does this query take forever to execute?

    If I query either of the conditions alone it executes in less than a second.

    SELECT count(*)

    FROM table1

    WHERE SomeIntVal IS NULL

    SELECT count(*)

    FROM table1

    WHERE SomeIntVal NOT IN (SELECT id FROM table2)

    A full example below, this one took about 7 seconds for the last query

    --Create a table

    CREATE TABLE #table1 (ID INT IDENTITY,sparsekey INT)

    --Create a simple lookup table

    CREATE TABLE #table2 (ID INT IDENTITY,someval INT)

    GO

    --Populate table2

    INSERT INTO #table2

    SELECT 1

    UNION ALL

    SELECT 2

    UNION ALL

    SELECT 3

    GO

    INSERT INTO #table2

    SELECT a.someval FROM #table2 a, #table2 b, #table2 c

    ORDER BY c.someval, b.someval, a.someval

    GO

    INSERT INTO #table2

    SELECT a.someval FROM #table2 a, #table2 b, #table2 c

    ORDER BY c.someval, b.someval, a.someval

    GO

    --Populate the field sparsekey in table1 with sparse values that sometimes references #table2

    INSERT INTO #table1

    SELECT TOP 2000 CASE someval

    WHEN 1 THEN NULL

    WHEN 2 THEN 0

    ELSE ID

    END

    FROM #table2

    GO

    DECLARE @StartTime datetime

    --Get the count of sparsekey values that are NULL

    SET @StartTime = GETDATE()

    DECLARE @RetVal int

    SELECT @RetVal = count(*) FROM #table1

    WHERE sparsekey IS NULL

    SELECT @RetVal As nrows, DATEDIFF(millisecond, @StartTime, getdate()) AS ExecutionTime_Q1

    --Get the count of sparsekey values that are NOT IN #table2

    SET @StartTime = GETDATE()

    SELECT @RetVal = count(*) FROM #table1

    WHERE sparsekey NOT IN (SELECT ID FROM #table2)

    SELECT @RetVal As nrows, DATEDIFF(millisecond, @StartTime, getdate()) AS ExecutionTime_Q2

    --Get the count of sparsekey values that are NULL OR NOT IN #table2

    SET @StartTime = GETDATE()

    SELECT @RetVal = count(*) FROM #table1

    WHERE sparsekey IS NULL

    OR sparsekey NOT IN (SELECT ID FROM #table2)

    SELECT @RetVal As nrows, DATEDIFF(millisecond, @StartTime, getdate()) AS ExecutionTime_Q3

    GO

    DROP TABLE #table1

    DROP TABLE #table2

    GO

  • Try changing your where clause to use an exists

    WHERE SomeIntVal IS NULL OR SomeIntVal NOT EXISTS(SELECT 1 FROM FROM table2 WHERE id = SomeIntVal)

    hope you have SomeIntVal indexed.

  • There you go. I knew there was a simple solution to this.

    I guess I could also drop the first condition here as it will be redundant if id in table2 is NOT NULL

    WHERE SomeIntVal IS NULL OR SomeIntVal NOT EXISTS(SELECT 1 FROM FROM table2 WHERE id = SomeIntVal)

    Thanx.

  • Using a left outer join will improve your performance. I added the NOT EXISTS and OUTER JOIN methods to your code.

    --Get the count of sparsekey values that are NULL OR NOT IN #table2 using IN

    SET @StartTime = GETDATE()

    SELECT @RetVal = count(*) FROM #table1

    WHERE sparsekey IS NULL

    OR sparsekey NOT IN (SELECT ID FROM #table2)

    SELECT 'Using IN ',@RetVal As nrows, DATEDIFF(millisecond, @StartTime, getdate()) AS ExecutionTime_Q3

    --Get the count of sparsekey values that are NULL OR NOT IN #table2 using NOT EXISTS

    SET @StartTime = GETDATE()

    SELECT @RetVal = count(*) FROM #table1

    WHERE sparsekey IS NULL

    OR NOT EXISTS(SELECT 1 FROM #table2 WHERE id = sparsekey)

    SELECT 'Using NOT EXISTS ',@RetVal As nrows, DATEDIFF(millisecond, @StartTime, getdate()) AS ExecutionTime_Q3

    --Get the count of sparsekey values that are NULL OR NOT IN #table2 using left outer join

    SET @StartTime = GETDATE()

    SELECT @RetVal = count(*)

    FROM #table1 as a

    LEFT OUTER JOIN #table2 as b

    ON sparsekey = b.ID

    WHERE sparsekey IS NULL OR b.ID IS NULL

    SELECT 'Using LEFT OUTER JOIN',@RetVal As nrows, DATEDIFF(millisecond, @StartTime, getdate()) AS ExecutionTime_Q3

    results here are as follows:

    method nrows ExecutionTime_Q3

    --------------------- ----------- ----------------

    Using IN 1990 8076

    method nrows ExecutionTime_Q3

    --------------------- ----------- ----------------

    Using NOT EXISTS 1990 126

    method nrows ExecutionTime_Q3

    --------------------- ----------- ----------------

    Using LEFT OUTER JOIN 1990 13

  • There's no question that using IN is the wrong option.

    However, the choice between left outer join and the not exists clause appears a bit less 'cut and dry', IMHO. I suspect one method will be favorable under some conditions, and the other better in other conditions.

    When I run the below (slightly modified version of your) script, using real tables with the ID's set to be primary keys (and thus indexed), and clearing the caches (like I was always told one should do) between running 'benchmarks' of this nature, I found the execution time of Using IN to run about 4600ms, whilst both of the other two options ran instantaneously (i.e. either 0 or 16ms).

    Something interesting about this whole thing is ... If you turn on Display Actual Execution Plan, you'll discover that the Not In and Not Exists methods both end up using the exact same execution plan, yet our tests show that the former is orders of magnitude slower.

    Also, strangely, if you add up the percentages on Cost %'s on both of these two methods, they add up to like 200% ... in fact, just the clustered index seek ALONE shows as cost: 100%. Whereas, in the AEP of the LOJ method, the percentages add to 100%, like they normally do.

    Another strange thing is that in terms of query cost relative to the batch, the Not In and Not Exist method show up as 14% each, and the LOJ method relative cost is 72%...

    CREATE TABLE table1 (ID INT IDENTITY,sparsekey INT, primary key(id))

    --Create a simple lookup table

    CREATE TABLE table2 (ID INT IDENTITY,someval INT, primary key(id))

    --Populate table2

    INSERT INTO table2

    SELECT 1

    UNION ALL

    SELECT 2

    UNION ALL

    SELECT 3

    INSERT INTO table2

    SELECT a.someval FROM table2 a, table2 b, table2 c

    ORDER BY c.someval, b.someval, a.someval

    INSERT INTO table2

    SELECT a.someval FROM table2 a, table2 b, table2 c

    ORDER BY c.someval, b.someval, a.someval

    --Populate the field sparsekey in table1 with sparse values that sometimes references table2

    INSERT INTO table1

    SELECT TOP 2000 CASE someval

    WHEN 1 THEN NULL

    WHEN 2 THEN 0

    ELSE ID

    END

    FROM table2

    GO

    DBCC DROPCLEANBUFFERS

    DBCC FREEPROCCACHE

    --Get the count of sparsekey values that are NULL OR NOT IN table2 using IN

    DECLARE @StartTime datetime, @retval int

    SET @StartTime = GETDATE()

    SELECT @RetVal = count(*) FROM table1

    WHERE sparsekey IS NULL

    OR sparsekey NOT IN (SELECT ID FROM table2)

    SELECT 'Using IN ',@RetVal As nrows, DATEDIFF(millisecond, @StartTime, getdate()) AS ExecutionTime_Q3

    GO

    DBCC DROPCLEANBUFFERS

    DBCC FREEPROCCACHE

    --Get the count of sparsekey values that are NULL OR NOT IN table2 using left outer join

    DECLARE @StartTime datetime, @retval int

    SET @StartTime = GETDATE()

    SELECT @RetVal = count(*)

    FROM table1 as a

    LEFT OUTER JOIN table2 as b

    ON sparsekey = b.ID

    WHERE sparsekey IS NULL OR b.ID IS NULL

    SELECT 'Using LEFT OUTER JOIN',@RetVal As nrows, DATEDIFF(millisecond, @StartTime, getdate()) AS ExecutionTime_Q3

    GO

    DBCC DROPCLEANBUFFERS

    DBCC FREEPROCCACHE

    --Get the count of sparsekey values that are NULL OR NOT IN table2 using NOT EXISTS

    DECLARE @StartTime datetime, @retval int

    SET @StartTime = GETDATE()

    SELECT @RetVal = count(*) FROM table1

    WHERE sparsekey IS NULL

    OR NOT EXISTS(SELECT 1 FROM table2 WHERE id = sparsekey)

    SELECT 'Using NOT EXISTS ',@RetVal As nrows, DATEDIFF(millisecond, @StartTime, getdate()) AS ExecutionTime_Q3

    GO

    DROP TABLE table1

    DROP TABLE table2

    GO

  • [font="Verdana"]Welcome to the joys of execution plans. :w00t:

    From experience, the left outer join always tend to perform at least as well as the not exists. So I always tend to write that statement now as a left outer join.

    I also know that the SQL Server optimiser does some smart things with optimising common forms of "not in" and "not exists", and can potentially generate much the same query plan for these. However, I read up on this a couple of years ago (with the introduction of SQL Server 2005) and I'm not exactly sure what happens there any more.

    My recommendation: stick with the left outer join. It's hard to beat.

    [/font]

  • Yip, second (or third) the use of joins. The WHERE clause can also be omitted by placing the condition in the join too. May I also suggest catering for presumed NULLS:

    SELECT e.EmpID, s.nvcProdName

    FROM t_Employee e

    LEFT OUTER JOIN t_Sale s

    ON e.EmpID = s.EmpID

    AND COALESCE( s.nvcProdName, '' ) <> ''

    Max

  • [font="Verdana"]Does that actually work?

    I know in Oracle, if you tried to put the filter criteria for the right hand side being empty into the ON clause, it didn't work as expected.

    I might be able to eliminate a few where clauses!

    [/font]

  • Bruce W Cassidy (2/3/2009)


    [font="Verdana"]Does that actually work?

    I know in Oracle, if you tried to put the filter criteria for the right hand side being empty into the ON clause, it didn't work as expected.

    I might be able to eliminate a few where clauses!

    [/font]

    I hope so, frowned upon by some WHERE clause purists though >:(

    Please let me know more about the Oracle rhs problems, otherwise I may have to eliminate a few join conditions. :w00t: :hehe:

    Max

  • [font="Verdana"]Oh, who knows. I haven't played with Oracle for a couple of years now, so maybe the best thing is to do the old "suck it and see" approach. I'll try putting the filter criteria into my join conditions on SQL Server. :cool:[/font]

  • Hi Bruce,

    Gints wrote some informative tests on join criteria: http://www.gplivna.eu/papers/sql_join_types.htm#p6.4, thanks Gints, this does help clarify a lot.

    Max

  • Max (2/3/2009)


    Yip, second (or third) the use of joins. The WHERE clause can also be omitted by placing the condition in the join too. May I also suggest catering for presumed NULLS:

    SELECT e.EmpID, s.nvcProdName

    FROM t_Employee e

    LEFT OUTER JOIN t_Sale s

    ON e.EmpID = s.EmpID

    AND COALESCE( s.nvcProdName, '' ) <> ''

    The only problem with this is that now the query has to scan every row in t_Sale to apply the coalesce.

  • Jack Corbett (2/17/2009)


    The only problem with this is that now the query has to scan every row in t_Sale to apply the coalesce.

    Dinkum? Sorry about the slang - and no I'm not an aussie - it seems somehow appropriate though and will check it out. Every row, not only those being returned in the join? Do you know off-hand what the cost comparison would be vs. applying the same formatting to the WHERE clause, presuming, of course, that one is not diametrically opposed to checking for NULLS and therefore making the point moot?

    Max

  • Max, I don't think it's possible for the query engine to 'know' what 'only the rows returned by the join' are ... considering that the comparison in question is a part OF the join.

    This being said, I also don't think it'd make any difference if you moved the comparison to the WHERE clause either. I should think the execution plan would remain the same, IOW.

    I think the source of the issue Jack has mentioned is actually due to the use of a Function in one side of the comparison, which would cause indexes to not be used ...

    Just a guess on my part, though 😉

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply