Select top five rows for each key value

Question

Post reply

Select top five rows for each key value

Howard C. BAchtel-438731

SSCrazy

Points: 2384
More actions
October 26, 2011 at 2:06 pm

#245803

The attached is a brief example of a file extracted(Top 5.xls).
The data definition is:
1. Key - Identifier
2. Date
3. Status
The records are sorted by
1. Key
2. Date in descending order
I would like to select the 5 most recent records for each "Key". The second attachment (Desired.xls) represents the first one manually manipulated to represent the desired output.
I would appreciate any suggestions on how to obtain this result using T-SQL.
Thanks in advance.

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply

goofbauer Mr or Mrs. 500 Points: 521 More actions · Answer 1

Can you use the RANK() function? Something like...

SELECT A.Key, A.Date, A.Status, A.Ranking

FROM (SELECT Key, Date, Status

, RANK() OVER(PARTITION BY Key ORDER BY Date DESC) AS Ranking

FROM table) A

WHERE A.Ranking <= 5;

drew.allen SSC Guru Points: 76776 More actions · Answer 2

pschwartzbauer (10/26/2011)
Can you use the RANK() function? Something like...
SELECT A.Key, A.Date, A.Status, A.Ranking
FROM (SELECT Key, Date, Status
, RANK() OVER(PARTITION BY Key ORDER BY Date DESC) AS Ranking
FROM table) A
WHERE A.Ranking <= 5;

The WHERE clause is evaluated before the SELECT clause. That means that the column "Ranking" has not been defined at the point that the WHERE clause is evaluated. You can get around that by using a CTE to define the column and then specify the filter in the outer query.

Drew

J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA

goofbauer Mr or Mrs. 500 Points: 521 More actions · Answer 3

Thanks, Drew. So you're suggesting something like this?

WITH Rnk AS (SELECT Key, Date, Status

, RANK() OVER(PARTITION BY Key ORDER BY Date DESC) AS Ranking

FROM table

)

SELECT Key, Date, Status, Ranking

FROM Rnk

WHERE Ranking <= 5;

drew.allen SSC Guru Points: 76776 More actions · Answer 4

Actually, I didn't read your query closely enough. I didn't see that you were using a subquery.

The two approaches are equivalent and will give the exact same execution plans. I personally find the CTE easier to read.

Drew

J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA

R.P.Rozema SSChampion Points: 12305 More actions · Answer 5

You should use row_number() instead of rank() if you intend to get the top most 5 entries. rank() will return the same number for rows with duplicate date values, i.e. you may end up returning more than 5 rows for a specific key. row_number() will return a unique number per row.

Have a look at this example:

select t.some_id, rank() over (partition by some_id order by date desc) rank,

row_number() over (partition by some_id order by date desc) row_number,

t.date

from (

select 1 as some_id, {d '1900-01-01'} as date

union all select 1 as some_id, {d '1900-01-01'} as date

union all select 1 as some_id, {d '2011-11-11'} as date

union all select 2 as some_id, {d '1900-01-02'} as date

union all select 1 as some_id, {d '1900-01-01'} as date

) t

It outputs:

some_id rank row_number date

----------- -------------------- -------------------- -----------------------

1 1 1 2011-11-11 00:00:00.000

1 2 2 1900-01-01 00:00:00.000

1 2 3 1900-01-01 00:00:00.000

1 2 4 1900-01-01 00:00:00.000

1 2 5 1900-01-01 00:00:00.000

1 2 6 1900-01-01 00:00:00.000

1 2 7 1900-01-01 00:00:00.000

2 1 1 1900-01-02 00:00:00.000

(8 row(s) affected)

As you can see, the output of rank() is not very useful for selecting up to the first 5 rows and row_number() does work.

edit: descending sort on date added.

Posting Data Etiquette - Jeff Moden[/url]
Posting Performance Based Questions - Gail Shaw[/url]
Hidden RBAR - Jeff Moden[/url]
Cross Tabs and Pivots - Jeff Moden[/url]
Catch-all queries - Gail Shaw[/url]

If you don't have time to do it right, when will you have time to do it over?

goofbauer Mr or Mrs. 500 Points: 521 More actions · Answer 6

If ROW_NUMBER is used instead of RANK, won't it produce unpredictable results in cases when there are ties? In R.P.'s data sample, rows 2 through 7 are interpreted by ROW_NUMBER as equal, so the result set may very on each execution. I suggested RANK, because it will produce gaps in the ranking sequence that will be handled properly by the <= 5. Agree?

Matt Miller (4) SSC Guru Points: 124210 More actions · Answer 7

pschwartzbauer (10/27/2011)
If ROW_NUMBER is used instead of RANK, won't it produce unpredictable results in cases when there are ties? In R.P.'s data sample, rows 2 through 7 are interpreted by ROW_NUMBER as equal, so the result set may very on each execution. I suggested RANK, because it will produce gaps in the ranking sequence that will be handled properly by the <= 5. Agree?

depends on what kind of unpredictable you want:).

The number of rows in the row_number scenario would be predictable, but "which" of the tie records show through might unpredictable (it will hit the tie and start arbitrarily assigning RN's). the byproduct would be the number of distinct values returned from your date column may be unpredictable.

In the rank however, the # of rows returned would be unpredictable. if your date column has a lot of ties, you could in theory get back hundreds of rows when you only wanted 5. You would get back 5 distinct data values however.

----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

R.P.Rozema SSChampion Points: 12305 More actions · Answer 8

The request was to produce 5 rows, not the top 5 values. row_number() is the way to produce at most 5 rows. rank() can be used to produce any number of rows with up to the top most 5 values. Very vague. If you would want to produce all rows having the top 5 values, you would need to use dense_rank(). This produces a list of integers in which each next value is either the same as the previous value or at most one higher than that value. i.e. it will return a sequence of 1, 2, 3, 4, 5 (without gaps) for the top most 5 values.

row_number() will indeed, given the specified order-by clause, randomly select any of the duplicate rows if duplicates exist at the 5th and 6th positions (plus any more duplicates outside those). In that regard it behaves exactly like the "select top (5) clause". If you need to influence which rows will be shown, all you need to do is add more order-by column(s).

Posting Data Etiquette - Jeff Moden[/url]
Posting Performance Based Questions - Gail Shaw[/url]
Hidden RBAR - Jeff Moden[/url]
Cross Tabs and Pivots - Jeff Moden[/url]
Catch-all queries - Gail Shaw[/url]

If you don't have time to do it right, when will you have time to do it over?

Howard C. BAchtel-438731 SSCrazy Points: 2384 More actions · Answer 9

Thanks to all for your suggestions. I have tried it and it works perfectly as I was pretty sure it would.

Howard