help with grouping/summarizing data

  • I have a summary table that I need to extract some additional summarized/grouped data out of.

    The table contains the following columns being used:

    YYYYMM,

    P_TYPE, varchar(50)

    P_NBR, varchar(30)

    P_DESC, varchar(50)

    P_USERS, numeric(9,0)

    P_CONTACTS, numeric(9,0)

    P_CONTACTS_USER, numeric(5,2) -->calculated value of P_CONTACTS/P_USERS

    Within the data, there are multiple entries with the same P_DESC value but different P_NBR values. These need to be grouped in the query results by the P_DESC, not the P_NBR. The P_NBR is a unique number that may share a P_DESC with other records. For example:

    YYYYMMP_TYPEP_NBRP_DESCP_USERSP_CONTACTSP_CONTACTS_USER

    201106ind fruita1234apples102.20

    201106ind fruita4321apples581.60

    201106bag fruitb9876oranges74.57

    201106ind fruitb7890apples2012.60

    In the example above, all the data for the P_DESC of 'apples' would need grouped and summarized as such:

    YYYYMMP_TYPEP_DESCP_USERSP_CONTACTSP_CONTACTS_USER

    201106ind fruitapples3522.63

    I can get it to the point the data gets summarized, but I need to pull in ONLY THE TOP 10 RECORDS PER MONTH based on the P_CONTACTS_USER value (I house 13 months of data so in the end, there will be 130 records returned) AND the summed total number of P_USERS must be at least 25,000 to be included.

    When I try to add in row numbers it screws up and starts splitting the data out by P_DESC instead of keeping it grouped by P_DESC.

    Here is the query that is working to get it grouped/summed:

    SELECT

    YYYYMM,

    P_TYPE,

    P_DESC,

    SUM(P_USERS) AS P_USERS,

    SUM(P_CONTACTS) AS P_CONTACTS,

    CAST(((SUM(P_CONTACTS))/(SUM(P_USERS))) AS NUMERIC(5,2)) AS P_CONTACTS_USER

    FROM

    t_contacts_by_fruit

    GROUP BY

    YYYYMM,

    P_TYPE,

    P_DESC

    HAVING

    YYYYMM='201106'

    and SUM(P_USERS) >=25000

    AND P_TYPE IS NOT NULL

    ORDER BY

    P_CONTACTS_USER DESC

    Any and all help will be greatly appreciated. I'm sure there's a better way to do it initially than what I have listed above too.

    Thanks very much!

  • Have a look at the RANK() function. Should make it very easy to get the top 10 per month.

    Lots of examples and explanation:

    http://weblogs.sqlteam.com/jeffs/archive/2007/03/28/60146.aspx



    Colleen M. Morrow
    Cleveland DBA

  • Something like this:

    ;

    WITH CTE

    AS (SELECT YYYYMM,

    P_TYPE,

    P_DESC,

    SUM(P_USERS) AS P_USERS,

    SUM(P_CONTACTS) AS P_CONTACTS,

    CAST(((SUM(P_CONTACTS)) / (SUM(P_USERS))) AS NUMERIC(5,

    2)) AS P_CONTACTS_USER

    FROM t_contacts_by_fruit

    GROUP BY YYYYMM,

    P_TYPE,

    P_DESC

    HAVING YYYYMM = '201106'

    AND SUM(P_USERS) >= 25000

    AND P_TYPE IS NOT NULL),

    CTE2

    AS (SELECT YYYYMM,

    P_TYPE,

    P_DESC,

    P_USERS,

    P_CONTACTS,

    P_CONTACTS_USER,

    DENSE_RANK() OVER (PARTITION BY YYYYMM ORDER BY P_CONTACTS_USER DESC) AS Row

    FROM CTE)

    SELECT YYYYMM,

    P_TYPE,

    P_DESC,

    P_USERS,

    P_CONTACTS,

    P_CONTACTS_USER

    FROM CTE2

    WHERE Row >= 10 ;

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • GSquared...

    That code unfortunately pulls back more than the top 10. I had to change it to the row number <=10 (perhaps I put that wrong in my initial code sample, I'll have to go back and look).

    However, even after changing that, the query returned 12 rows because two records had the same P_CONTACTS_USER value (2 had .50 and 2 had .33).

    I think that is the problem with ranking versus sorting by the P_CONTACTS_USER value, then partitioning the results by YYYYMM. When I do this with the query I've listed below here, I get 10 records/month for a total of 130 records, they are ranked in order by the P_CONTACTS_USER, BUT...I lose my grouping on the P_DESC and thus my summary numbers based on the P_DESC:

    SELECT

    YYYYMM,

    P_TYPE,

    P_DESC,

    SUM(P_USERS) AS P_USERS,

    SUM(P_CONTACTS) AS P_CONTACTS,

    CAST(((SUM(P_CONTACTS))/(SUM(P_USERS))) AS NUMERIC(5,2)) AS P_CONTACTS_USER

    FROM

    (SELECT *, row_number() OVER (PARTITION BY YYYYMM ORDER BY (CASE when P_USERS >=25000 then 1 else 99999 end),

    P_CONTACTS_USER DESC) AS RowNum

    FROM t_contacts_by_fruit) AS TEMP

    WHERE P_USERS>=25000 AND

    RowNum <=10

    ORDER BY YYYYMM DESC, P_CONTACTS_USER DESC, P_CONTACTS DESC

  • With my query, yes, I got the comparison operator backwards.

    If you want to ignore ties, use Row_Number instead of Dense_Rank. Will cut off otherwise valid records if you have ties, especially in the final slots, but will just give you 10.

    If you need it to be deterministic, you'll need to add more criteria to the Row_Number function, so that you get the same ones each time in case of a tie in the final slot.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • That's where my problem is. I have the ROW_NUMBER function setup to look at the qty being >=25K, and I have it listing in descending order by the P_CONTACTS_USER, so my results are giving me the top 10 rows per month which is good, they're in descending order by the P_CONTACTS_USER which is good, but they're not grouping and summing based on the P_DESC value.

    I can't figure out what I'm missing that's causing that loss of grouping/summing.

    I do need to use ROW_NUMBER because I don't want to ignore ties.

  • Why not just use GSquared's query with RANK instead of DENSE_RANK. That should give you just the 10 rows you want.

    What do you mean by you're losing the sum on p_desc?



    Colleen M. Morrow
    Cleveland DBA

  • Here's what I mean on things getting ungrouped/summed:

    YYYYMMP_TYPEP_DESCP_USERSP_CONTACTSP_CONTACTS_USER

    201106Bagapples26383684272.59

    201106Indoranges42661228330.54

    201106Indpears27083145890.54

    201106Bagapples14271007552710.53

    201106Indlemons3211581617610.5

    201106Baggrapes26162115970.44

    201106Bagapples14413055874630.41

    201106Indoranges42082166600.4

    201106Bagapples6465532514450.39

    201106Bagapples38655146360.38

    That is a sample of the results I'm getting back. See how the apples and oranges are getting split out into multiple rows? They should all be grouped/summed on one row (one row for apples, one for oranges) as shown below (with the exception that 5 additional rows would be included since I need the top 10 rows):

    YYYYMMP_TYPEP_DESCP_USERSP_CONTACTSP_CONTACTS_USER

    201106Indpears27083145890.54

    201106Indlemons3211581617610.50

    201106Indoranges4320632128510.49

    201106Bagapples357999616772420.47

    201106Baggrapes26162115970.44

    See how all the 5 rows of apples and two rows of oranges would be combined into 2 summarized rows (1 each) which then changes the overall ranking because the combined P_CONTACTS_USER value drops in these cases.

    This is what is failing for me.

  • So, working with GSquared's query as a basis, shouldn't this give you what you want?

    ;WITH CTE

    AS (SELECT YYYYMM,

    P_TYPE,

    P_DESC,

    SUM(P_USERS) AS P_USERS,

    SUM(P_CONTACTS) AS P_CONTACTS,

    CAST(((SUM(P_CONTACTS)) / (SUM(P_USERS))) AS NUMERIC(5,

    2)) AS P_CONTACTS_USER

    FROM t_contacts_by_fruit

    GROUP BY YYYYMM,

    P_TYPE,

    P_DESC

    HAVING

    SUM(P_USERS) >= 25000

    AND P_TYPE IS NOT NULL),

    CTE2

    AS (SELECT YYYYMM,

    P_TYPE,

    P_DESC,

    P_USERS,

    P_CONTACTS,

    P_CONTACTS_USER,

    RANK() OVER (PARTITION BY YYYYMM ORDER BY P_CONTACTS_USER DESC) AS Row

    FROM CTE)

    SELECT YYYYMM,

    P_TYPE,

    P_DESC,

    P_USERS,

    P_CONTACTS,

    P_CONTACTS_USER

    FROM CTE2

    WHERE Row <= 10

    ORDER BY YYYYMM desc



    Colleen M. Morrow
    Cleveland DBA

  • Okay, now I tried using RANK with GSquared's query and removed my YYYYMM limitation of just 201106. It looked good, but I got 131 rows back...10 for every month except 201102 which returned 11 rows for some reason - I couldn't see anything obvious in the results that caused it to do that.

    So I tried using GSquared's query and replaced the DENSE_RANK with ROW_NUMBER and it returned a total of 130 rows, including just 10 for 201102.

    These results look promising, but I'm going to need to review them closer to make sure they're correct.

    Any reason using ROW_NUMBER in place of DENSE_RANK would not work with this query?

  • No. Since you definitely only want 10 rows, regardless of ties, I'd go with ROW_NUMBER.



    Colleen M. Morrow
    Cleveland DBA

  • Cool, thank you both, this looks like it's working!

  • You're welcome.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply