Find Customers Who Bought "A" and "B" But Not "C" (SQL Spackle)

Question

Find Customers Who Bought "A" and "B" But Not "C" (SQL Spackle)

Viewing 15 posts - 151 through 165 (of 166 total)

You must be logged in to reply to this topic. Login to reply

The Wizard Of Oz SSC Eights! Points: 845 More actions · Answer 1

Jonathan AC Roberts (6/12/2014)
Normally when an example is given using values A, B and C they should be taken as algebraic variables who's contents can change.

Agreed, it gives a solution that can be used in many situations.

I tested Jeff's original solution and 2 other solutions on a 10-million-row random table (using the code attached in Jeff's article):

DECLARE @a CHAR(1), @b-2 CHAR(1), @C CHAR(1)

SELECT @a = 'X', @b-2 = 'F', @C = 'M'

SET NOCOUNT ON

SET STATISTICS IO ON

SET STATISTICS TIME ON

PRINT '

----- Jeff''s Original -----'

SELECT CustomerID

FROM #Purchase

WHERE ProductCode IN (@A, @b-2)

GROUP BY CustomerID

HAVING COUNT(DISTINCT ProductCode) = 2

EXCEPT

SELECT CustomerID

FROM #Purchase

WHERE ProductCode = @C

---------------------------------

PRINT '

----- NOT EXISTS -----'

SELECT CustomerID

FROM #Purchase P1

WHERE ProductCode IN (@A, @b-2)

AND NOT EXISTS (

SELECT 1

FROM #Purchase P2

WHERE ProductCode = @C

AND P1.CustomerID = P2.CustomerID

)

GROUP BY CustomerID

HAVING COUNT(DISTINCT ProductCode) = 2

---------------------------------

PRINT '

----- SUM(CASE...) -----'

SELECT CustomerID

FROM #Purchase P1

WHERE ProductCode IN (@A, @b-2, @C)

GROUP BY CustomerID

HAVING SUM(CASE ProductCode WHEN @a THEN 1 ELSE 0 END) > 0

AND SUM(CASE ProductCode WHEN @b-2 THEN 1 ELSE 0 END) > 0

AND SUM(CASE ProductCode WHEN @C THEN 1 ELSE 0 END) = 0

SET STATISTICS TIME OFF

SET STATISTICS IO OFF

SET NOCOUNT OFF

Results (cleaned for readability):

----- Jeff's Original -----

Table '#Purchase'. Scan count 49960, logical reads 151394.

SQL Server Execution Times: CPU time = 406 ms, elapsed time = 412 ms.

----- NOT EXISTS -----

Table '#Purchase'. Scan count 3, logical reads 2160.

SQL Server Execution Times: CPU time = 296 ms, elapsed time = 288 ms.

----- SUM(CASE...) -----

Table '#Purchase'. Scan count 3, logical reads 2160.

SQL Server Execution Times: CPU time = 655 ms, elapsed time = 661 ms.

SwePeso SSC-Dedicated Points: 39725 More actions · Answer 2

The Wizard Of Oz (6/13/2014)
I tested Jeff's original solution and 2 other solutions on a 10-million-row random table (using the code attached in Jeff's article):

They are not equivalent and return different result.

COUNT(DISTINCT ...) = 2

and

SUM(CASE WHEN ... THEN ... ELSE ... END) > 0

Will return different results.

N 56°04'39.16"
E 12°55'05.25"

The Wizard Of Oz SSC Eights! Points: 845 More actions · Answer 3

SwePeso (6/13/2014)
They are not equivalent and return different result.

Hmm, I'm going to be sceptical for now because I haven't found a case where any of the 3 solutions brought back different results 🙂

What settings did you use for @a, @b-2 and @C so I can replicate?

SwePeso SSC-Dedicated Points: 39725 More actions · Answer 4

You don't need to.

The COUNT(DISTINCT ...) approach will only return the groups that has exactly one A and one B.

The SUM(CASE ...) approach will return all groups having at least one A and at least one B. It will also return the groups having 20 A's and 14 B's.

N 56°04'39.16"
E 12°55'05.25"

The Wizard Of Oz SSC Eights! Points: 845 More actions · Answer 5

SwePeso (6/13/2014)
The COUNT(DISTINCT ...) approach will only return the groups that has exactly one A and one B.

Not sure about this, I thought the whole point of the DISTINCT is that it collapses any and all duplicates in the group, so it returns all groups having at least one A and at least one B, just like the SUM(CASE ...) approach.

SwePeso (6/13/2014)
The SUM(CASE ...) approach will return all groups having at least one A and at least one B. It will also return the groups having 20 A's and 14 B's.

Yes, I thought this was exactly what we needed?

SwePeso SSC-Dedicated Points: 39725 More actions · Answer 6

So, depending on the distribution of the sample data, the two different queries will return different results.

The COUNT(DISTINCT ... ) will return a fewer number of rows, than the SUM(CASE ...) will.

N 56°04'39.16"
E 12°55'05.25"

The Wizard Of Oz SSC Eights! Points: 845 More actions · Answer 7

I don't think so, because before that point we've already done our GROUP BY CustomerID, so really the 2 methods we're comparing are:

...

WHERE ProductCode IN (@A, @b-2)

--people that bought A or B or AB

AND NOT EXISTS (

SELECT 1

FROM #Purchase P2

WHERE ProductCode = @C

AND P1.CustomerID = P2.CustomerID

)

--people that bought (A or B or AB) and (not C)

GROUP BY CustomerID

HAVING COUNT(DISTINCT ProductCode) = 2

--people that bought (AB) and (not C)

...

WHERE ProductCode IN (@A, @b-2, @C)

--people that bought (A or B or C or AB or AC or BC or ABC)

GROUP BY CustomerID

HAVINGSUM(CASE ProductCode WHEN @a THEN 1 ELSE 0 END) > 0

AND SUM(CASE ProductCode WHEN @b-2 THEN 1 ELSE 0 END) > 0

AND SUM(CASE ProductCode WHEN @C THEN 1 ELSE 0 END) = 0

--people that bought (AB) and (not C)

Either I need more coffee today (99% likely), or you do Peso (1%).

I'm eager to learn more about SQL either way 🙂

SQL Simmo SSC Enthusiast Points: 132 More actions · Answer 8

I'm finding that the results are the same for all 3 queries regardless of the input parameters and, as far as I can see, logically they should be.

SSCrazy, Can you supply some example data to highlight what you are saying here?

I'm pretty sceptical as well and want some evidence that what you are saying is correct.

Ed Wagner SSC Guru Points: 287015 More actions · Answer 9

As always, an excellent article, Jeff. Then again, we've come to expect nothing less.

Tally Tables - Performance Personified
String Splitting with True Performance
Best practices on how to ask questions

SwePeso SSC-Dedicated Points: 39725 More actions · Answer 10

No. You are right.

I am the one needing more coffee. Just a mind lapse and disregarding the DISTINCT for some reason.

N 56°04'39.16"
E 12°55'05.25"

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 11

Ed Wagner (6/13/2014)
As always, an excellent article, Jeff. Then again, we've come to expect nothing less.

Thanks, Ed. Better than the article, though, look at the great discussions going on. Lotsa good people with good ideas. That's why I love this place. Ya just gotta love this community!

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

ColdCoffee SSC-Dedicated Points: 39971 More actions · Answer 12

Do we have a readymade 10 million/1million row generator for this problem?

ColdCoffee SSC-Dedicated Points: 39971 More actions · Answer 13

Now that we have new analytical functions, just wanted to jump in on the new windowed functions fun 🙂

; WITH CTE AS

(

SELECT P.CustomerID ,P.ProductCode

, CODE = CASE WHEN ( P.ProductCode = 'B'

AND LAG (P.ProductCode, 1, 0) OVER (PARTITION BY P.CustomerID ORDER BY P.ProductCode) = 'A'

AND LEAD (P.ProductCode, 1, 0) OVER (PARTITION BY P.CustomerID ORDER BY P.ProductCode) <> 'C' ) THEN 1

ELSE 0

END

FROM #Purchase P

WHERE P.ProductCode in ('A','B', 'C')

GROUP BY P.CustomerID ,P.ProductCode

)

SELECT CustomerID

FROM CTE

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 14

ColdCoffee (6/13/2014)
Now that we have new analytical functions, just wanted to jump in on the new windowed functions fun 🙂
; WITH CTE AS
(
SELECT P.CustomerID ,P.ProductCode
, CODE = CASE WHEN ( P.ProductCode = 'B'
AND LAG (P.ProductCode, 1, 0) OVER (PARTITION BY P.CustomerID ORDER BY P.ProductCode) = 'A'
AND LEAD (P.ProductCode, 1, 0) OVER (PARTITION BY P.CustomerID ORDER BY P.ProductCode) <> 'C' ) THEN 1
ELSE 0
END
FROM #Purchase P
WHERE P.ProductCode in ('A','B', 'C')
GROUP BY P.CustomerID ,P.ProductCode
)
SELECT CustomerID
FROM CTE

What's the performance look like compared to the other methods?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 15

ColdCoffee (6/13/2014)
Do we have a readymade 10 million/1million row generator for this problem?

Heh... apparently, you didn't read the article where is says... 😉

The code above isn't adequate for performance testing. For those that want to explore and compare solutions of their own, I've attached code to build a million row test table at the bottom of this article in the "Resources" link.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions