Grouping by min/max in a series

Question

Post reply

Grouping by min/max in a series

rmcknight

SSC Eights!

Points: 868
More actions
March 13, 2007 at 12:49 pm

#172455

Good Afternoon;
Any chance I can do the following in T-SQL. I've done something similar, and as usual, can't find the source.
Table:
Code    Status
1       S
2       T
3       T
4       T
5       S
6       S
7       S
10      A
11      T
12      T
.
.
.
Results desired:
min     max     status
1       1       S
2       4       T
5       7       S
10      10      A
11      14      T
.
.
appreciate any thoughts...
Randy.

Viewing 15 posts - 1 through 15 (of 22 total)

You must be logged in to reply to this topic. Login to reply

PW-201837 SSC-Insane Points: 20805 More actions · Answer 1

Does column Code contain unique ascending sequence of values ?

rmcknight SSC Eights! Points: 868 More actions · Answer 2

Yes. There may be skips (1,2,3,5,6,7,11,12,13), but they are unique, ascending.

Brian Kesselman SSC Enthusiast Points: 134 More actions · Answer 3

This one is a little ugly but it should work. Depending on your real purpose this might be a job for a cursor. Otherwise, if you are using SQL 2005 you can do it with a PARTITION statement (Google for "SQL count consecutive rows same value"). Otherwise, here is a shot:

SELECT DISTINCT

ISNULL(

(SELECT TOP 1 PriorTable.Code AS PriorCodeStatus

FROM MyTable AS PriorTable

WHERE PriorTable.Code <= MyTable.Code

AND PriorTable.Status = MyTable.Status

AND PriorTable.Code >=

ISNULL(

(SELECT TOP 1 PrevTable.Code FROM MyTable AS PrevTable

WHERE PrevTable.Code < MyTable.Code

AND PrevTable.Status <> MyTable.Status

ORDER BY Code Desc),0)

ORDER BY Code),0)

AS MinCode,

ISNULL(

(SELECT TOP 1 MaxTable.Code AS MaxCodeStatus

FROM MyTable AS MaxTable

WHERE MaxTable.Code >= MyTable.Code

AND MaxTable.Status = MyTable.Status

AND MaxTable.Code <=

ISNULL(

(SELECT TOP 1 NxtTable.Code FROM MyTable AS NxtTable

WHERE NxtTable.Code > MyTable.Code

AND NxtTable.Status <> MyTable.Status

ORDER BY Code),Code)

ORDER BY Code DESC),0)

AS MaxCode, Status

FROM MyTable

ORDER BY MinCode

Basically it is a series of nested TOP 1 queries which determine the first, last, previous and next Code for a given Status grouping.

Hope that helps.

Brian

rmcknight SSC Eights! Points: 868 More actions · Answer 4

Brian -

Sorry for the long pause, but thanks, that was the trick! I'm breaking it down now to 'burn -in' the logic.

Ken McKelvey SSCoach Points: 18484 More actions · Answer 5

The following is similar to Brian's code but should be more efficient. If you have more than 10000 rows you may be better off using cursors/temp tables.

SELECT MIN(D.MinCode) AS MinCode

,D.MaxCode

,D.status

FROM (

SELECT T1.code AS MinCode

,MAX(T2.code) AS MaxCode

,T1.status

FROM YourTable T1

JOIN YourTable T2

ON T1.status = T2.status

AND T1.code <= T2.code

WHERE T1.status = ALL (

SELECT T3.status

FROM YourTable T3

WHERE T3.code BETWEEN T1.code AND T2.code

)

GROUP BY T1.code, T1.status

) D

GROUP BY D.MaxCode, D.status

ORDER BY MinCode

Ken McKelvey SSCoach Points: 18484 More actions · Answer 6

The logic for working out the above is:

1. Self join the table on status where one code is less than or equal to the other code:

SELECT T1.code AS MinCode

,T2.code AS MaxCode

,T1.status

FROM YourTable T1

JOIN YourTable T2

ON T1.status = T2.status

AND T1.code <= T2.code

2. Filter out rows where intermediate statuses are not the same.

SELECT T1.code AS MinCode

,T2.code AS MaxCode

,T1.status

FROM YourTable T1

JOIN YourTable T2

ON T1.status = T2.status

AND T1.code <= T2.code

WHERE T1.status = ALL (

SELECT T3.status

FROM YourTable T3

WHERE T3.code BETWEEN T1.code AND T2.code

)

3. Then do MIN and MAX to get the start and end codes of the range.

rmcknight SSC Eights! Points: 868 More actions · Answer 7

Ken, Brian-

Thanks for the replys. Now off to do validation as both answers returned different #rows. Let me compare the two and I'll get back.

Ken McKelvey SSCoach Points: 18484 More actions · Answer 8

Thanks to David le Quesne for showing me the above approach. Here is the thread:

http://qa.sqlservercentral.com/forums/shwmessage.aspx?forumid=8&messageid=303210&p=2

rmcknight SSC Eights! Points: 868 More actions · Answer 9

After review -

Original table contained non-unique entries, in error. Obtained correct list and refreshed. Table contains 848 rows.

Both solutions now retrive identical count & valued rows (157). Only difference is time. Brians solution returns in less than 1 second , while Kens took 11 seconds.

Brian Kesselman SSC Enthusiast Points: 134 More actions · Answer 10

In case someone wanted Brian's solution broken down into its process and components I went back through and did that.

Process:

1. Select all the rows (MyTable)

2. SubQuery: Get the highest Code (PrevTable) where the Status doesn't match the current row and Code < current row (Top 1 DESC)

3. SubQuery: Get the lowest Code (PriorTable) where the Status is equal to the the current row and Code > #2 result. (Top 1 ASC) That should be the first occurrence of this Status in the current group.

4. Wrap #3 result in an IsNull->0 and that is the Minimum Code for the group

5. SubQuery: Get the lowest Code (NxtTable) where the Status doesn't match the current row and Code is greater than current row, return current code for Null (Top 1 ASC)

6. SubQuery: Get the highest Code (MaxTable) where the Status is equal to the the current row and Code <= #5 result. (Top 1 DESC) That should be the last occurrence of this Status in the current group.

7. I think there is an unnecessary IsNull->0 there.

8. Return only the DISTINCT records to eliminate the dupes since the recordset is not grouped.

#1 and 8:

SELECT DISTINCT

(#2-4) AS MinCode,

(5-7) AS MaxCode,

Status

FROM MyTable ORDER BY MinCode

#2:

ISNULL(

(SELECT TOP 1 PrevTable.Code FROM MyTable AS PrevTable

WHERE PrevTable.Code < MyTable.Code

AND PrevTable.Status <> MyTable.Status

ORDER BY Code Desc),0)

#3-4:

ISNULL(

(SELECT TOP 1 PriorTable.Code AS PriorCodeStatus

FROM MyTable AS PriorTable

WHERE PriorTable.Code <= MyTable.Code

AND PriorTable.Status = MyTable.Status

AND PriorTable.Code >=

//***** #1 Subquery Here

ORDER BY Code),0)

#5:

ISNULL(

(SELECT TOP 1 NxtTable.Code FROM MyTable AS NxtTable

WHERE NxtTable.Code > MyTable.Code

AND NxtTable.Status <> MyTable.Status

ORDER BY Code),Code)

#6-7:

ISNULL(

(SELECT TOP 1 MaxTable.Code AS MaxCodeStatus

FROM MyTable AS MaxTable

WHERE MaxTable.Code >= MyTable.Code

AND MaxTable.Status = MyTable.Status

AND MaxTable.Code <=

//***** #5 Subquery Here

ORDER BY Code DESC),0)

I hope that is helpful. I think you can eliminate the IsNull in #6-7 and also the one that wraps #3. Beyond that it is just a brute force solution that avoids the restrictions that Group By imposes on subqueries. To really do this properly I would probably attempt to do this with a cursor and with a temp table or a function that returns a table and compare the execution plans and statistics based on the actual data. Of course, like lots of us, once I have a solution it is on to the next crisis...

Ken McKelvey SSCoach Points: 18484 More actions · Answer 11

Thanks for this.

As a matter of interest, how quick is my style of solution if there is an index on status,code?

SwePeso SSC-Dedicated Points: 39725 More actions · Answer 12

Try this too!

-- Prepare sample data

DECLARE @Sample TABLE (ID INT, Status VARCHAR(9))

INSERT @Sample

SELECT 1, 'S' UNION ALL

SELECT 2, 'T' UNION ALL

SELECT 3, 'T' UNION ALL

SELECT 5, 'T' UNION ALL

SELECT 10, 'S' UNION ALL

SELECT 16, 'S' UNION ALL

SELECT 17, 'S' UNION ALL

SELECT 20, 'A' UNION ALL

SELECT 31, 'T' UNION ALL

SELECT 32, 'T'

-- Stage the data

DECLARE @Stage TABLE (RecID INT IDENTITY(1, 1), ID INT, Status VARCHAR(9), First INT, Last INT)

INSERT @Stage (ID, Status)

SELECT ID,

Status

FROM @Sample

ORDER BY ID

UPDATE s

SET s.First = (SELECT MAX(x.RecID) FROM @Stage AS x WHERE x.Status <> s.Status AND x.RecID < s.RecID) + 1,

s.Last = (SELECT MIN(x.RecID) FROM @Stage AS x WHERE x.Status <> s.Status AND x.RecID > s.RecID) - 1

FROM @Stage AS s

UPDATE @Stage

SET First = 1

WHERE First IS NULL

DECLARE @Max INT

SELECT @Max = MAX(RecID)

FROM @Stage

WHERE Last IS NULL

UPDATE @Stage

SET Last = @max-2

WHERE Last IS NULL

-- Show the expected result

SELECT DISTINCT t1.ID AS [Min],

t2.ID AS [Max],

s.Status

FROM @Stage AS s

INNER JOIN @Stage AS t1 ON t1.RecID = s.First

INNER JOIN @Stage AS t2 ON t2.RecID = s.Last

ORDER BY t1.ID

N 56°04'39.16"
E 12°55'05.25"

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 13

As some have suggested, both methods get pretty slow, even in the presence of proper indexing, simply because of the triangular joins involved. As is typical of large triangular joins, they spawn so many "internal" rows trying to come up with an answer that they become pretty much unusable for anything more than about 10,000 rows. Here's a chart of the performance of Brian's, Ken's, and a "new" query to do this… All times are in seconds...

Rows

500

884

1000

2000

4000

8000

16000

100000

1000000

Jeff

0.123

0.190

0.250

0.436

0.483

0.843

1.390

6.716

41.860

Brian

0.576

1.483

1.830

7.080

29.373

118.720

466.576

Ken

0.376

1.296

1.640

6.483

30.283

145.938

762.016

Before we get to the "Jeff" query, lemme just say that the idea of using either a CURSOR or a WHILE Loop would probably get the job done but I'm not so sure it would get it done as quickly. Yeah, a cursor would probably do it... but not as fast as what I'm going to show you... and, don't balk at the temp table... where do you think a cursor is stored? Thats right... TempDB... just like the internals for triangular joins or a temp table.

So, if you have a million row table that looks like this...

SET NOCOUNT ON

USE TempDB

--===== If the test table already exists, drop it

IF OBJECT_ID('MyTable') IS NOT NULL

DROP TABLE MyTable

--===== Create the test table

CREATE TABLE MyTable (Code INT IDENTITY(1,1),Status CHAR(1) NOT NULL)

--===== Populate the test table with required data

INSERT INTO MyTable (Status)

SELECT TOP 1000000

CAST(RAND(CAST(NEWID() AS VARBINARY))*3.0 AS INT)+1 AS Status

FROM Master.dbo.SYSCOLUMNS sc1,

Master.dbo.SYSCOLUMNS sc2

UPDATE MyTable

SET Status = CASE STATUS WHEN '1' THEN 'A' WHEN '2' THEN 'S' WHEN '3' THEN 'T' END

--===== Add some indexes just to be fair

ALTER TABLE MyTable

ADD CONSTRAINT PK_MyTable_Code_Status PRIMARY KEY CLUSTERED (Code ASC,Status ASC)

CREATE INDEX IX_MyTable2 ON MyTable (Code DESC,Status DESC)

--===== Display the first 100 rows

SELECT top 100 * FROM MyTable ORDER BY Code

... the following code will work faster than a cursor (just can't bring myself to use one ) and just about any other method...

--===== Variable for speed measurement only

DECLARE @StartTime DATETIME

SET @StartTime = GETDATE()

--===== Move the data into a temp table where we can work on it

-- and add a couple of extra columns (takes about 3 seconds_

SELECT Code,Status,CAST(NULL AS INT) AS MyMin, CAST(NULL AS INT) AS MyMax

INTO #MyHead

FROM MyTable

--===== Add some required indexing (takes about 5 seconds)

ALTER TABLE #MyHead

ADD CONSTRAINT PK_MyHead_Code_Status PRIMARY KEY CLUSTERED (Code ASC,Status ASC)

CREATE INDEX IX_MyHead2 ON #MyHead (Code DESC,Status DESC)

--===== Create an update statement using dynamic SQL so the new indexes can be used

-- in this same batch without an error (takes 0 seconds)

DECLARE @SQL VARCHAR(8000)

SET @SQL = '

DECLARE @PrevStatus CHAR(1)

DECLARE @PrevCode INT

SET @PrevStatus = '' ''

SET @PrevCode = 0

--===== "Smears" the Mins across all rows

UPDATE #MyHead

SET @PrevCode = MyMin = CASE WHEN Status = @PrevStatus THEN @PrevCode ELSE Code END,

@PrevStatus = Status

FROM #MyHead WITH (INDEX(PK_MyHead_Code_Status))

SET @PrevStatus = '' ''

SET @PrevCode = 0

--===== "Smears" the Maxs across all rows

UPDATE #MyHead

SET @PrevCode = MyMax = CASE WHEN Status = @PrevStatus THEN @PrevCode ELSE Code END,

@PrevStatus = Status

FROM #MyHead WITH (INDEX(IX_MyHead2))

'

--===== Execute the dynamic SQL to do the min/max "smears" (takes about 19 seconds)

EXEC (@SQL)

--===== Display the desired result (takes about 10 seconds)

SELECT DISTINCT

MyMin AS Min, MyMax AS Max, Status

FROM #MyHead

ORDER BY Min

DROP TABLE #MyHead

PRINT DATEDIFF(ms,@StartTime,GETDATE())

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 14

Peter, not sure what your's would do if you added a primary key to the table but it currently take about 15 seconds to do 4,000 rows, 60 seconds to do 8,000 rows, and 240 seconds to do 16,000 rows (double the rows takes 4 times longer). Like I said, those nasty little triangular joins just eat your face off

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions