NOT EXISTS vs NOT IN

  • I have just re-written a query that was using NOT IN to use NOT EXISTS and the query returns the data in 1 second.

    The NOT IN was returning the data in 10 seconds.

    Why does NOT EXISTS perfrom better than NOT IN?


    Kindest Regards,

  • Let's think this through.

    If you need a NOT EXISTS query, you are using the index (presumably) to check for the existence of a record specified by a single value. As soon as you find the value in the index lookup, you are done. Even if the field is not indexed, the table scan is aborted as soon as you find a match. Overhead: Low

    If you use a NOT IN query, a set of matching values must be built, and then the test value is searched in that set. In other words, you execute a complete SELECT statement, and then use the results of that statement to validate your data. When using a non-indexed value, instead of completing as soon as a matching value is found, all different values on file are read into a set, and then the data is matched against the set. You will be manipulating more data, and performing more reads on your data. Overhead: High

  • Brendt,

    Excellent stuff! I now understand why.

    So, obviously the query optimizer chooses a different execution plan & method between NOT EXISTS & NOT IN?


    Kindest Regards,

  • Yes, it does, but it does so primarily because the two statements should be used for differing reasons.

    -- WARNING -- PERSONAL OPINION BELOW --

    EXISTS / NOT EXISTS checks are best used when querying to see if a value has a match on file.

    IN / NOT IN checks are best used when comparing set A to set B. And since NOT IN can accept a list of constants, it is useful for excluding a limited, predefined list of items (say, specific customers that get special treatment). IN / NOT IN should not, IMHO, be used nearly as often as they are currently used.

  • Ok thanks Brendt.


    Kindest Regards,

  • nice one brendt 

    "Keep Trying"

  • Probably Brendt is right in this case.  However, I am dubious whether your explanation and assumptions can be generalized quite so easily.  It seems to me that whether IN or EXISTS performs better will depend on the particular query and the tables/indices upon which it is performed.  Consider the following:

    (A) select * from tab1 where fld1 not in (select fld2 from tab2)

    (B) select * from tab1 where not exists (select * from tab2 where fld2 = tab1.fld1)

    As I understand the processing of these statements, (A) will behave pretty much like an outer join, i.e., a temp table will be built to hold the fld2 values from tab2, which are then compared against fld1 values in tab1, tossing out result rows where a match is found.  On the other hand, (B) will need to evaluate its subquery iteratively for every row extracted from tab1.  Which alternative will work better will depend on the relative size of tab1 and tab2 and also the various indices available to assist in the matching.  For instance, if tab1 and tab2 are both large, and if fld2 isn't indexed, wouldn't EXISTS likely be a far worse alternative to IN?

    Or am I misunderstanding how all this works?

    Thanks,

  • Actually, I believe you are correct in general. As I noted above, an IN/NOT IN is usually better for comparing sets of data, while an EXISTS / NOT EXISTS is better for seeing if a specific element is in a given set of data.

    So, if you are SELECTing data from Table1 where a matching row is not found in Table2, a NOT IN is better than a NOT EXISTS, because you are working with two sets of data. (Note that I have usually found a LEFT JOIN / IS NULL construct works even better than NOT IN in this case).

    But if you add a row to a table conditionally based on the existence / nonexistence of some value in some data table, then EXISTS will be your better choice.

    Also, there are a few set-based specific conditions where EXISTS is more efficient than IN - usually where one of the two sets contains only a few elements.

  • >>So, if you are SELECTing data from Table1 where a matching row is

    >>not found in Table2, a NOT IN is better than a NOT EXISTS

    Not true.

    What if the unique identifier for matching rows is a 2 or more column composite key ?

    What are you going to do, cast all columns to varchar, concatenate them and do a NOT IN on the concatenated values ?

    Been there, seen that, fixed the performance mess & spanked the programmers who coded it for killing our server.

    What if Table2 has a perfectly good index on the column or columns being checked against ?

    NOT IN will never index seek on the table in the sub-query, you're throwing away the opportunity for a faster solution.

    The answer to which is best, is the typical SQL Server answer. "It Depends".

    Depends on data volumes, indexes, index selectivity.

     

  • Ken C., you are incorrect. When using Not In Or In with a subquery, SQL Server does not execute the subquery and store the dataset for comparison against the main query. The subquery is re-evaluated for every record.

    By the way, SQL Server 2005 is smart enough to execute both mathods the same way so that there is no performance difference.


    My blog: SQL Soldier[/url]
    SQL Server Best Practices:
    SQL Server Best Practices
    Twitter: @SQLSoldier
    My book: Pro SQL Server 2008 Mirroring[/url]
    Microsoft Certified Master: SQL Server, Data Platform MVP
    Database Engineer at BlueMountain Capital Management[/url]

  • Robert, I will check this out sometime by coding some examples and checking the query plan, but must admit I am dubious.  I can see that your assertion would be right IF the subquery referred back to field values in the main query--e.g., if your query was something like the following:

    select * from tab1 where fld1 in  (select fld2 from tab2 where fld3 = tab1.fld4)

    In this case naturally the subquery has to be evaluated on a row-by-row basis since the subquery results depend on the individual row value of fld4 in the main query.

    However, in the case I made above, where the subquery is independent of the main query, I would be really surprised if the optimizer couldn't recognize that circumstance and act accordingly.

    Interesting stuff, I must say.

  • What the optimizer recognizes is that the table(s) involved in the subquery could have changes committed during the execution of the query and therefore the result set could be different for each record in the parent query.


    My blog: SQL Soldier[/url]
    SQL Server Best Practices:
    SQL Server Best Practices
    Twitter: @SQLSoldier
    My book: Pro SQL Server 2008 Mirroring[/url]
    Microsoft Certified Master: SQL Server, Data Platform MVP
    Database Engineer at BlueMountain Capital Management[/url]

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply