help with grouping/summarizing data

Question

Post reply

help with grouping/summarizing data

Mr Corn Man

Ten Centuries

Points: 1268
More actions
July 26, 2011 at 10:31 am

#241452

I have a summary table that I need to extract some additional summarized/grouped data out of.
The table contains the following columns being used:
YYYYMM,
P_TYPE, varchar(50)
P_NBR, varchar(30)
P_DESC, varchar(50)
P_USERS, numeric(9,0)
P_CONTACTS, numeric(9,0)
P_CONTACTS_USER, numeric(5,2) -->calculated value of P_CONTACTS/P_USERS
Within the data, there are multiple entries with the same P_DESC value but different P_NBR values. These need to be grouped in the query results by the P_DESC, not the P_NBR. The P_NBR is a unique number that may share a P_DESC with other records. For example:
YYYYMMP_TYPEP_NBRP_DESCP_USERSP_CONTACTSP_CONTACTS_USER
201106ind fruita1234apples102.20
201106ind fruita4321apples581.60
201106bag fruitb9876oranges74.57
201106ind fruitb7890apples2012.60
In the example above, all the data for the P_DESC of 'apples' would need grouped and summarized as such:
YYYYMMP_TYPEP_DESCP_USERSP_CONTACTSP_CONTACTS_USER
201106ind fruitapples3522.63
I can get it to the point the data gets summarized, but I need to pull in ONLY THE TOP 10 RECORDS PER MONTH based on the P_CONTACTS_USER value (I house 13 months of data so in the end, there will be 130 records returned) AND the summed total number of P_USERS must be at least 25,000 to be included.
When I try to add in row numbers it screws up and starts splitting the data out by P_DESC instead of keeping it grouped by P_DESC.
Here is the query that is working to get it grouped/summed:
SELECT
YYYYMM,
P_TYPE,
P_DESC,
SUM(P_USERS) AS P_USERS,
SUM(P_CONTACTS) AS P_CONTACTS,
CAST(((SUM(P_CONTACTS))/(SUM(P_USERS))) AS NUMERIC(5,2)) AS P_CONTACTS_USER
FROM
t_contacts_by_fruit
GROUP BY
YYYYMM,
P_TYPE,
P_DESC
HAVING
YYYYMM='201106'
and SUM(P_USERS) >=25000
AND P_TYPE IS NOT NULL
ORDER BY
P_CONTACTS_USER DESC
Any and all help will be greatly appreciated. I'm sure there's a better way to do it initially than what I have listed above too.
Thanks very much!

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply

Colleen M. Morrow SSCrazy Points: 2457 More actions · Answer 1

Have a look at the RANK() function. Should make it very easy to get the top 10 per month.

Lots of examples and explanation:

http://weblogs.sqlteam.com/jeffs/archive/2007/03/28/60146.aspx

Colleen M. Morrow
Cleveland DBA

GSquared SSC Guru Points: 260824 More actions · Answer 2

Something like this:

;

WITH CTE

AS (SELECT YYYYMM,

P_TYPE,

P_DESC,

SUM(P_USERS) AS P_USERS,

SUM(P_CONTACTS) AS P_CONTACTS,

CAST(((SUM(P_CONTACTS)) / (SUM(P_USERS))) AS NUMERIC(5,

2)) AS P_CONTACTS_USER

FROM t_contacts_by_fruit

GROUP BY YYYYMM,

P_TYPE,

P_DESC

HAVING YYYYMM = '201106'

AND SUM(P_USERS) >= 25000

AND P_TYPE IS NOT NULL),

CTE2

AS (SELECT YYYYMM,

P_TYPE,

P_DESC,

P_USERS,

P_CONTACTS,

P_CONTACTS_USER,

DENSE_RANK() OVER (PARTITION BY YYYYMM ORDER BY P_CONTACTS_USER DESC) AS Row

FROM CTE)

SELECT YYYYMM,

P_TYPE,

P_DESC,

P_USERS,

P_CONTACTS,

P_CONTACTS_USER

FROM CTE2

WHERE Row >= 10 ;

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

Mr Corn Man Ten Centuries Points: 1268 More actions · Answer 3

GSquared...

That code unfortunately pulls back more than the top 10. I had to change it to the row number <=10 (perhaps I put that wrong in my initial code sample, I'll have to go back and look).

However, even after changing that, the query returned 12 rows because two records had the same P_CONTACTS_USER value (2 had .50 and 2 had .33).

I think that is the problem with ranking versus sorting by the P_CONTACTS_USER value, then partitioning the results by YYYYMM. When I do this with the query I've listed below here, I get 10 records/month for a total of 130 records, they are ranked in order by the P_CONTACTS_USER, BUT...I lose my grouping on the P_DESC and thus my summary numbers based on the P_DESC:

SELECT

YYYYMM,

P_TYPE,

P_DESC,

SUM(P_USERS) AS P_USERS,

SUM(P_CONTACTS) AS P_CONTACTS,

CAST(((SUM(P_CONTACTS))/(SUM(P_USERS))) AS NUMERIC(5,2)) AS P_CONTACTS_USER

FROM

(SELECT *, row_number() OVER (PARTITION BY YYYYMM ORDER BY (CASE when P_USERS >=25000 then 1 else 99999 end),

P_CONTACTS_USER DESC) AS RowNum

FROM t_contacts_by_fruit) AS TEMP

WHERE P_USERS>=25000 AND

RowNum <=10

ORDER BY YYYYMM DESC, P_CONTACTS_USER DESC, P_CONTACTS DESC

GSquared SSC Guru Points: 260824 More actions · Answer 4

With my query, yes, I got the comparison operator backwards.

If you want to ignore ties, use Row_Number instead of Dense_Rank. Will cut off otherwise valid records if you have ties, especially in the final slots, but will just give you 10.

If you need it to be deterministic, you'll need to add more criteria to the Row_Number function, so that you get the same ones each time in case of a tie in the final slot.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

Mr Corn Man Ten Centuries Points: 1268 More actions · Answer 5

That's where my problem is. I have the ROW_NUMBER function setup to look at the qty being >=25K, and I have it listing in descending order by the P_CONTACTS_USER, so my results are giving me the top 10 rows per month which is good, they're in descending order by the P_CONTACTS_USER which is good, but they're not grouping and summing based on the P_DESC value.

I can't figure out what I'm missing that's causing that loss of grouping/summing.

I do need to use ROW_NUMBER because I don't want to ignore ties.

Colleen M. Morrow SSCrazy Points: 2457 More actions · Answer 6

Why not just use GSquared's query with RANK instead of DENSE_RANK. That should give you just the 10 rows you want.

What do you mean by you're losing the sum on p_desc?

Colleen M. Morrow
Cleveland DBA

Mr Corn Man Ten Centuries Points: 1268 More actions · Answer 7

Here's what I mean on things getting ungrouped/summed:

YYYYMMP_TYPEP_DESCP_USERSP_CONTACTSP_CONTACTS_USER

201106Bagapples26383684272.59

201106Indoranges42661228330.54

201106Indpears27083145890.54

201106Bagapples14271007552710.53

201106Indlemons3211581617610.5

201106Baggrapes26162115970.44

201106Bagapples14413055874630.41

201106Indoranges42082166600.4

201106Bagapples6465532514450.39

201106Bagapples38655146360.38

That is a sample of the results I'm getting back. See how the apples and oranges are getting split out into multiple rows? They should all be grouped/summed on one row (one row for apples, one for oranges) as shown below (with the exception that 5 additional rows would be included since I need the top 10 rows):

YYYYMMP_TYPEP_DESCP_USERSP_CONTACTSP_CONTACTS_USER

201106Indpears27083145890.54

201106Indlemons3211581617610.50

201106Indoranges4320632128510.49

201106Bagapples357999616772420.47

201106Baggrapes26162115970.44

See how all the 5 rows of apples and two rows of oranges would be combined into 2 summarized rows (1 each) which then changes the overall ranking because the combined P_CONTACTS_USER value drops in these cases.

This is what is failing for me.

Colleen M. Morrow SSCrazy Points: 2457 More actions · Answer 8

So, working with GSquared's query as a basis, shouldn't this give you what you want?

;WITH CTE

AS (SELECT YYYYMM,

P_TYPE,

P_DESC,

SUM(P_USERS) AS P_USERS,

SUM(P_CONTACTS) AS P_CONTACTS,

CAST(((SUM(P_CONTACTS)) / (SUM(P_USERS))) AS NUMERIC(5,

2)) AS P_CONTACTS_USER

FROM t_contacts_by_fruit

GROUP BY YYYYMM,

P_TYPE,

P_DESC

HAVING

SUM(P_USERS) >= 25000

AND P_TYPE IS NOT NULL),

CTE2

AS (SELECT YYYYMM,

P_TYPE,

P_DESC,

P_USERS,

P_CONTACTS,

P_CONTACTS_USER,

RANK() OVER (PARTITION BY YYYYMM ORDER BY P_CONTACTS_USER DESC) AS Row

FROM CTE)

SELECT YYYYMM,

P_TYPE,

P_DESC,

P_USERS,

P_CONTACTS,

P_CONTACTS_USER

FROM CTE2

WHERE Row <= 10

ORDER BY YYYYMM desc

Colleen M. Morrow
Cleveland DBA

Mr Corn Man Ten Centuries Points: 1268 More actions · Answer 9

Okay, now I tried using RANK with GSquared's query and removed my YYYYMM limitation of just 201106. It looked good, but I got 131 rows back...10 for every month except 201102 which returned 11 rows for some reason - I couldn't see anything obvious in the results that caused it to do that.

So I tried using GSquared's query and replaced the DENSE_RANK with ROW_NUMBER and it returned a total of 130 rows, including just 10 for 201102.

These results look promising, but I'm going to need to review them closer to make sure they're correct.

Any reason using ROW_NUMBER in place of DENSE_RANK would not work with this query?

Colleen M. Morrow SSCrazy Points: 2457 More actions · Answer 10

No. Since you definitely only want 10 rows, regardless of ties, I'd go with ROW_NUMBER.

Colleen M. Morrow
Cleveland DBA

Mr Corn Man Ten Centuries Points: 1268 More actions · Answer 11

Mr Corn Man

Ten Centuries

Points: 1268

July 27, 2011 at 9:22 am

#1359373

Cool, thank you both, this looks like it's working!

GSquared SSC Guru Points: 260824 More actions · Answer 12

You're welcome.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon