SQL Decides to NOT Use My DISTINCT Clause. Why?

  • -- This demonstrates how SQL server's decision to not use your DISTINCT

    -- clause can affect your performance greatly.

    set nocount on

    create table #test (partID int, partDescription varchar(100), otherID int)

    -- Populate with a bunch of stuff for otherID = 100

    insert into #test(partID, partDescription, otherID) values(0, 'My Part', 100)

    insert into #test(partID, partDescription, otherID) values(1, 'My Part', 100)

    insert into #test(partID, partDescription, otherID) values(2, 'My Part', 100)

    insert into #test(partID, partDescription, otherID) values(3, 'My Part', 100)

    insert into #test(partID, partDescription, otherID) values(4, 'My Part', 100)

    insert into #test(partID, partDescription, otherID) values(5, 'My Part', 100)

    insert into #test(partID, partDescription, otherID) values(6, 'My Part', 100)

    insert into #test(partID, partDescription, otherID) values(7, 'My Part', 100)

    insert into #test(partID, partDescription, otherID) values(8, 'My Part', 100)

    insert into #test(partID, partDescription, otherID) values(9, 'My Part', 100)

    -- That's 10.

    insert into #test(partID, partDescription, otherID)

    select A.partID, A.partDescription, A.otherID from #test A, #test B

    insert into #test(partID, partDescription, otherID)

    select A.partID, A.partDescription, A.otherID from #test A, #test B

    -- select count(*) from #test 12210

    -- Now some more data

    -- Populate with a bunch of stuff for otherID = 300

    -- Skip partID 5

    insert into #test(partID, partDescription, otherID)

    select partID, partDescription, 300D from #test where partID <> 5

    -- Check the difference in execution plans

    -- takes 5 seconds to run

    -- Ignores inner distincts effectively joining 12K rows to 12K rows

    select distinct A.partID from

    (

    select distinct partID from #test where otherID = 100

    ) A left join

    (

    select distinct partID from #test where otherID = 300

    ) B on A.partID = B.partID

    WHERE B.partID is null

    -- Respects inner distincts effectively joining 10 rows

    -- takes 0ms to run

    select A.partID from

    (

    select distinct partID from #test where otherID = 100

    ) A left join

    (

    select distinct partID from #test where otherID = 300

    ) B on A.partID = B.partID

    WHERE B.partID is null

    Is this a documented feature? Could this creep up if I used views instead of derived queries?

    Thanks

    God Bless

    ST

  • DISTINCT is not a silver bullet.

    It's quite opposite.

    I'd suggest not to use it in production code at all. Ever.

    DISTINCT/GROUP BY is applied after all joins and WHERE clause.

    If you have 2 or more DISTINCTS the topmost only will be left by optimiser because all others are useless.

    Don't use construction LEFT JOIN + WHERE Key IS NULL.

    Use WHERE NOT EXISTS instead.

    Compare performance of this one:

    select distinct A.partID from

    (

    select distinct partID from #test where otherID = 100

    ) A

    WHERE NOT EXISTS ( select partID from #test B where otherID = 300 and A.partID = B.partID )

    _____________
    Code for TallyGenerator

  • I have become aware that my DISTINCT is ignored. My question is why? Is this by design? My real world data contains 100K+ records in each side of the search (represented by tables A and B in this example). The DISTINCT on each side was intended to reduce the amount of data the server had to search through to find the answer. Effectively it reduced it to roughly 100 records per side.

    In production I tried the following:

    WHERE NOT EXISTS (sub query)

    Multiple distincts (as demonstrated in the example)

    No distincts at all

    Inserting into 2 temp tables the results from the left and right queries, then performing a left join on the temp tables.

    I found that using distinct on the left and right queries AND the temp table strategy performed at nearly the same performance. The advantage of the temp tables was that I could use an additional distinct on the final query. This is necessary in my production environment.

    Why I found this example troubling is that it demonstrates a condition that could happen in the real-world; where 2 different developers may write views using distinct (with good reason). A third developer could come in and use those views with an additional distinct clause and the performance would go through the floor. Similarly I can imagine this could also happen when combining multiple views with aggregate functions, such as SUM and AVG, and group by statements.

    I saw nothing in books online stating that multiple distincts are ignored.

    Thanks

    ST

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply