Select top five rows for each key value

  • The attached is a brief example of a file extracted(Top 5.xls).

    The data definition is:

    1. Key - Identifier

    2. Date

    3. Status

    The records are sorted by

    1. Key

    2. Date in descending order

    I would like to select the 5 most recent records for each "Key". The second attachment (Desired.xls) represents the first one manually manipulated to represent the desired output.

    I would appreciate any suggestions on how to obtain this result using T-SQL.

    Thanks in advance.

  • Can you use the RANK() function? Something like...

    SELECT A.Key, A.Date, A.Status, A.Ranking

    FROM (SELECT Key, Date, Status

    , RANK() OVER(PARTITION BY Key ORDER BY Date DESC) AS Ranking

    FROM table) A

    WHERE A.Ranking <= 5;

  • pschwartzbauer (10/26/2011)


    Can you use the RANK() function? Something like...

    SELECT A.Key, A.Date, A.Status, A.Ranking

    FROM (SELECT Key, Date, Status

    , RANK() OVER(PARTITION BY Key ORDER BY Date DESC) AS Ranking

    FROM table) A

    WHERE A.Ranking <= 5;

    The WHERE clause is evaluated before the SELECT clause. That means that the column "Ranking" has not been defined at the point that the WHERE clause is evaluated. You can get around that by using a CTE to define the column and then specify the filter in the outer query.

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • Thanks, Drew. So you're suggesting something like this?

    WITH Rnk AS (SELECT Key, Date, Status

    , RANK() OVER(PARTITION BY Key ORDER BY Date DESC) AS Ranking

    FROM table

    )

    SELECT Key, Date, Status, Ranking

    FROM Rnk

    WHERE Ranking <= 5;

  • Actually, I didn't read your query closely enough. I didn't see that you were using a subquery.

    The two approaches are equivalent and will give the exact same execution plans. I personally find the CTE easier to read.

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • You should use row_number() instead of rank() if you intend to get the top most 5 entries. rank() will return the same number for rows with duplicate date values, i.e. you may end up returning more than 5 rows for a specific key. row_number() will return a unique number per row.

    Have a look at this example:

    select t.some_id, rank() over (partition by some_id order by date desc) rank,

    row_number() over (partition by some_id order by date desc) row_number,

    t.date

    from (

    select 1 as some_id, {d '1900-01-01'} as date

    union all select 1 as some_id, {d '1900-01-01'} as date

    union all select 1 as some_id, {d '1900-01-01'} as date

    union all select 1 as some_id, {d '1900-01-01'} as date

    union all select 1 as some_id, {d '2011-11-11'} as date

    union all select 2 as some_id, {d '1900-01-02'} as date

    union all select 1 as some_id, {d '1900-01-01'} as date

    union all select 1 as some_id, {d '1900-01-01'} as date

    ) t

    It outputs:

    some_id rank row_number date

    ----------- -------------------- -------------------- -----------------------

    1 1 1 2011-11-11 00:00:00.000

    1 2 2 1900-01-01 00:00:00.000

    1 2 3 1900-01-01 00:00:00.000

    1 2 4 1900-01-01 00:00:00.000

    1 2 5 1900-01-01 00:00:00.000

    1 2 6 1900-01-01 00:00:00.000

    1 2 7 1900-01-01 00:00:00.000

    2 1 1 1900-01-02 00:00:00.000

    (8 row(s) affected)

    As you can see, the output of rank() is not very useful for selecting up to the first 5 rows and row_number() does work.

    edit: descending sort on date added.



    Posting Data Etiquette - Jeff Moden[/url]
    Posting Performance Based Questions - Gail Shaw[/url]
    Hidden RBAR - Jeff Moden[/url]
    Cross Tabs and Pivots - Jeff Moden[/url]
    Catch-all queries - Gail Shaw[/url]


    If you don't have time to do it right, when will you have time to do it over?

  • If ROW_NUMBER is used instead of RANK, won't it produce unpredictable results in cases when there are ties? In R.P.'s data sample, rows 2 through 7 are interpreted by ROW_NUMBER as equal, so the result set may very on each execution. I suggested RANK, because it will produce gaps in the ranking sequence that will be handled properly by the <= 5. Agree?

  • pschwartzbauer (10/27/2011)


    If ROW_NUMBER is used instead of RANK, won't it produce unpredictable results in cases when there are ties? In R.P.'s data sample, rows 2 through 7 are interpreted by ROW_NUMBER as equal, so the result set may very on each execution. I suggested RANK, because it will produce gaps in the ranking sequence that will be handled properly by the <= 5. Agree?

    depends on what kind of unpredictable you want:).

    The number of rows in the row_number scenario would be predictable, but "which" of the tie records show through might unpredictable (it will hit the tie and start arbitrarily assigning RN's). the byproduct would be the number of distinct values returned from your date column may be unpredictable.

    In the rank however, the # of rows returned would be unpredictable. if your date column has a lot of ties, you could in theory get back hundreds of rows when you only wanted 5. You would get back 5 distinct data values however.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • The request was to produce 5 rows, not the top 5 values. row_number() is the way to produce at most 5 rows. rank() can be used to produce any number of rows with up to the top most 5 values. Very vague. If you would want to produce all rows having the top 5 values, you would need to use dense_rank(). This produces a list of integers in which each next value is either the same as the previous value or at most one higher than that value. i.e. it will return a sequence of 1, 2, 3, 4, 5 (without gaps) for the top most 5 values.

    row_number() will indeed, given the specified order-by clause, randomly select any of the duplicate rows if duplicates exist at the 5th and 6th positions (plus any more duplicates outside those). In that regard it behaves exactly like the "select top (5) clause". If you need to influence which rows will be shown, all you need to do is add more order-by column(s).



    Posting Data Etiquette - Jeff Moden[/url]
    Posting Performance Based Questions - Gail Shaw[/url]
    Hidden RBAR - Jeff Moden[/url]
    Cross Tabs and Pivots - Jeff Moden[/url]
    Catch-all queries - Gail Shaw[/url]


    If you don't have time to do it right, when will you have time to do it over?

  • Thanks to all for your suggestions. I have tried it and it works perfectly as I was pretty sure it would.

    Howard

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply