parse comma-delimited list into separate columns

  • CirquedeSQLeil (1/11/2010)


    I have tested both scripts and compared results.

    I would go with the script provided by Wayne. When the result sets are small, both perform similarly. When dealing with 100,000 records or so, the script by Wayne outperforms the first script substantially.

    Would you mind sharing some rough numbers?

    It's interesting to note that a concept that actually performs a "physical" update followed by an additional select would be faster than dealing with a select only.

    I guess it's due to the group by clause when using the CrossTab solution...

    It would be interesting to see what happens if there are additional indexes that need to be updated as well when using the quirky update method, compared to the CrossTab solution being supported by proper indexing...

    It's always awesome to see the alternatives being shared on this community! By no means am I going to disrespect Wayne's solution! Not at all!! It's just interesting that there are solutions out there that perform better even when one would expect them to be "slower" from just a "basic glance" at them...

    Great stuff, Wayne!!

    @jason: You're absolutely right. I didn't provide the complete query. Just wanted to show the concept...



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question
    How to post performance related questions
    Links for Tally Table, Cross Tabs and Dynamic Cross Tabs, Delimited Split Function

  • lmu92 (1/11/2010)


    CirquedeSQLeil (1/11/2010)


    I have tested both scripts and compared results.

    I would go with the script provided by Wayne. When the result sets are small, both perform similarly. When dealing with 100,000 records or so, the script by Wayne outperforms the first script substantially.

    Would you mind sharing some rough numbers?

    It's interesting to note that a concept that actually performs a "physical" update followed by an additional select would be faster than dealing with a select only.

    I guess it's due to the group by clause when using the CrossTab solution...

    It would be interesting to see what happens if there are additional indexes that need to be updated as well when using the quirky update method, compared to the CrossTab solution being supported by proper indexing...

    It's always awesome to see the alternatives being shared on this community! By no means am I going to disrespect Wayne's solution! Not at all!! It's just interesting that there are solutions out there that perform better even when one would expect them to be "slower" from just a "basic glance" at them...

    Great stuff, Wayne!!

    @jason: You're absolutely right. I didn't provide the complete query. Just wanted to show the concept...

    I will need to run the tests again. The tests covered the complete solution (minus data population, since that skews the results).

    Wayne's solution would finish in about 6 seconds, while the combined solution from you and me was still running after a minute. Both tests were done with 100,000 records.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events

  • I would use an index on (ID, idx) on the [test] table for it...

    Running over a minute for 100k rows seems strange...

    What test scenario did you use?



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question
    How to post performance related questions
    Links for Tally Table, Cross Tabs and Dynamic Cross Tabs, Delimited Split Function

  • lmu92 (1/11/2010)


    I would use an index on (ID, idx) on the [test] table for it...

    Running over a minute for 100k rows seems strange...

    What test scenario did you use?

    I will get back to you on that.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events

  • Wayne, Cool script.

    BUT in the final Select statement, did you mean to use column names instead of the variables?

    -- now, get the data for each column
    SELECT [ID] = ROW_NUMBER() OVER (ORDER BY RowData),
           [Account] = LEFT(RowData, @Col1EndPos-1),
           [Year] = SUBSTRING(RowData, @Col1EndPos+1, @Col2EndPos-@Col1EndPos-1),
           [Month] = SUBSTRING(RowData, @Col2EndPos+1, @Col3EndPos-@Col2EndPos-1),
           [Scenario] = SUBSTRING(RowData, @Col3EndPos+1, @Col4EndPos-@Col3EndPos-1),
           [Version] = SUBSTRING(RowData, @Col4EndPos+1, @Col5EndPos-@Col4EndPos-1),
           [Entity] = SUBSTRING(RowData, @Col5EndPos+1, len(RowData)-@Col5EndPos-1)
    FROM #Test

    In the sample data, if you change "Apr" to "April", the columns become variable width.

    I get the correct result if I use column names. Variable names will only work if the columns are fixed width.

    But this is an awesome script.

    SELECT 'xyz 54050, FY11, April, Budget, Version_1, 0160117'
    UNION ALL
    SELECT 'abc 54050, FY11, May, Budget, Version_1, 0160117'
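
    To make the fixed-width vs. variable-width point concrete, here is a minimal sketch (the @Demo table and the two column aliases are made up for illustration). A scalar variable captures the comma position from just one row, while a per-row CHARINDEX expression stays correct when the widths differ:

    -- sketch only: made-up @Demo data with different-length first fields
    DECLARE @Demo TABLE (RowData varchar(75));
    INSERT INTO @Demo
    SELECT 'xyz 54050, FY11, April, Budget, Version_1, 0160117' UNION ALL
    SELECT 'abcdef 54050, FY11, May, Budget, Version_1, 0160117';

    DECLARE @Col1EndPos int;
    SELECT TOP (1) @Col1EndPos = CHARINDEX(',', RowData) FROM @Demo;       -- position taken from a single row

    SELECT RowData,
           AccountFromVariable = LEFT(RowData, @Col1EndPos - 1),            -- only right for rows shaped like the sampled one
           AccountPerRow       = LEFT(RowData, CHARINDEX(',', RowData) - 1) -- right for every row
    FROM @Demo;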

  • vstitte (1/11/2010)


    Wayne, Cool script.

    BUT in the final Select statement, did you mean to use column names instead of the variables?

    WHOOPS! You are exactly right about that! Good catch. I have modified the above post with the script to do this.

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • lmu92 (1/11/2010)


    It's always awesome to see the alternatives being shared on this community! By no means am I going to disrespect Wayne's solution! Not at all!! It's just interesting that there are solutions out there that perform better even when one would expect them to be "slower" from just a "basic glance" at them...

    Great stuff, Wayne!!

    Hey guys, I really can't take the credit for this. I just posted what I've learned here from Jeff, Gus, Lynn and others. It's just become so ingrained in me that it comes naturally when I see a problem like this.

    About the alternatives being shared on this community... that is what I love about this site so much. Every programmer knows that if you get 5 programmers in a room together to hash out a problem, they'll come out of the room with 10 solutions that all work... it's just an issue of testing to see which is the best. That is what we see so much of on this site, and why it's always such a good idea to read more than the initial post/reply... some of the best stuff will be buried a couple pages in.

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • Okay, after playing around with this for a bit, I came up with a solution that does just a single scan.

    Part 1: build the test environment. This builds a test table of 1,310,720 records.

    -- if temp table already exists (failed previous run), drop it
    if OBJECT_ID('tempdb..#test') IS NOT NULL DROP TABLE #test
    -- simulate the table with the data in it.
    -- NOTE how your sample data was put into a table
    -- to make it easier for us volunteers to work with it!
    DECLARE @Test TABLE (ID int IDENTITY, RowData varchar(75))
    INSERT INTO @Test
    SELECT 'xyz 54050, FY11, Apr, Budget, Version_1, 0160117' UNION ALL
    SELECT 'abc 54050, FY11, May, Budget, Version_1, 0160117' UNION ALL
    SELECT 'def 54050, FY11, May, Budget, Version_1, 0160117' UNION ALL
    SELECT 'ghi 54050, FY11, May, Budget, Version_1, 0160117' UNION ALL
    SELECT 'jkl 54050, FY11, May, Budget, Version_1, 0160117'
    -- get the data from the table and put it into a temporary work table
    SELECT ID, RowData
    INTO #TEST
    FROM @Test
    GO
    insert into #TEST
    SELECT RowData from #TEST
    GO 18 -- makes a test of 1,310,720 records
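
    (The GO 18 batch separator reruns the preceding INSERT 18 times, doubling the row count each pass: 5 seed rows × 2^18 = 1,310,720 rows.)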

    Part 2: the actual query. Note that I'm returning the results into a temp table to take the screen display of all these rows out of the equation.

    -- NOW, run just this portion
    SET STATISTICS IO ON
    SET STATISTICS TIME ON
    -- if temp table already exists (failed previous run), drop it
    if OBJECT_ID('tempdb..#test2') IS NOT NULL DROP TABLE #test2
    ;with CTE1 AS
    (
    select ID, RowData,
           Col1EndPos = CharIndex(',', RowData)
    from #TEST
    ), CTE2 AS
    (
    select ID, RowData, Col1EndPos,
           Col2EndPos = CharIndex(',', RowData, Col1EndPos + 1)
    from CTE1
    ), CTE3 AS
    (
    select ID, RowData, Col1EndPos, Col2EndPos,
           Col3EndPos = CharIndex(',', RowData, Col2EndPos + 1)
    from CTE2
    ), CTE4 AS
    (
    select ID, RowData, Col1EndPos, Col2EndPos, Col3EndPos,
           Col4EndPos = CharIndex(',', RowData, Col3EndPos + 1)
    from CTE3
    ), CTE5 AS
    (
    select ID, RowData, Col1EndPos, Col2EndPos, Col3EndPos, Col4EndPos,
           Col5EndPos = CharIndex(',', RowData, Col4EndPos + 1)
    from CTE4
    )
    select ID, --= ROW_NUMBER() OVER (ORDER BY RowData),
           [Account] = LEFT(RowData, Col1EndPos-1),
           [Year] = SUBSTRING(RowData, Col1EndPos+1, Col2EndPos-Col1EndPos-1),
           [Month] = SUBSTRING(RowData, Col2EndPos+1, Col3EndPos-Col2EndPos-1),
           [Scenario] = SUBSTRING(RowData, Col3EndPos+1, Col4EndPos-Col3EndPos-1),
           [Version] = SUBSTRING(RowData, Col4EndPos+1, Col5EndPos-Col4EndPos-1),
           [Entity] = SUBSTRING(RowData, Col5EndPos+1, LEN(RowData)-Col5EndPos)
    into #TEST2
    from CTE5
    SET STATISTICS TIME OFF
    SET STATISTICS IO OFF

    This returns the following results:

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 0 ms.

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 2 ms.

    Table '#TEST___000000000017'. Scan count 1, logical reads 10577, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    (1310720 row(s) affected)

    (1 row(s) affected)

    SQL Server Execution Times:

    CPU time = 18284 ms, elapsed time = 18684 ms.

    With an execution plan of (see "Execution Plan 1.jpg")

    If I comment out the "into #TEST2", I get the following results:

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 0 ms.

    (1310720 row(s) affected)

    Table '#TEST___00000000001A'. Scan count 1, logical reads 10577, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    (1 row(s) affected)

    SQL Server Execution Times:

    CPU time = 17285 ms, elapsed time = 23992 ms.

    With an execution plan of (see "Execution Plan 2.jpg")

    (As you can see, it took an additional 5.3 seconds to pump the data to the screen vs. to a temp table.)

    And, finally, running the original "quirky update" method against this test data gets the following results:

    Table '#TEST___00000000001E'. Scan count 1, logical reads 13802, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    (1310720 row(s) affected)

    (1 row(s) affected)

    SQL Server Execution Times:

    CPU time = 8549 ms, elapsed time = 24874 ms.

    (1310720 row(s) affected)

    Table '#TEST___00000000001E'. Scan count 1, logical reads 13802, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    (1 row(s) affected)

    SQL Server Execution Times:

    CPU time = 2090 ms, elapsed time = 21351 ms.

    With an execution plan of (see "Execution Plan 3.jpg")

    This takes the most time. And, it takes more logical reads - I assume because of the larger size of the #TEST table with the additional 5 columns.
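
    As a side note, the same single-scan shape can also be written by chaining the comma positions through CROSS APPLY (VALUES ...) instead of nested CTEs. This is only a minimal sketch, assuming the #TEST table from above and SQL Server 2008 or later (the #TEST3 name is made up):

    -- sketch only: chained CROSS APPLY (VALUES) instead of nested CTEs; still one scan of #TEST
    SELECT t.ID,
           [Account]  = LEFT(t.RowData, p1.Col1EndPos - 1),
           [Year]     = SUBSTRING(t.RowData, p1.Col1EndPos + 1, p2.Col2EndPos - p1.Col1EndPos - 1),
           [Month]    = SUBSTRING(t.RowData, p2.Col2EndPos + 1, p3.Col3EndPos - p2.Col2EndPos - 1),
           [Scenario] = SUBSTRING(t.RowData, p3.Col3EndPos + 1, p4.Col4EndPos - p3.Col3EndPos - 1),
           [Version]  = SUBSTRING(t.RowData, p4.Col4EndPos + 1, p5.Col5EndPos - p4.Col4EndPos - 1),
           [Entity]   = SUBSTRING(t.RowData, p5.Col5EndPos + 1, LEN(t.RowData) - p5.Col5EndPos)
    INTO #TEST3
    FROM #TEST t
    CROSS APPLY (VALUES (CHARINDEX(',', t.RowData)))                    p1 (Col1EndPos)
    CROSS APPLY (VALUES (CHARINDEX(',', t.RowData, p1.Col1EndPos + 1))) p2 (Col2EndPos)
    CROSS APPLY (VALUES (CHARINDEX(',', t.RowData, p2.Col2EndPos + 1))) p3 (Col3EndPos)
    CROSS APPLY (VALUES (CHARINDEX(',', t.RowData, p3.Col3EndPos + 1))) p4 (Col4EndPos)
    CROSS APPLY (VALUES (CHARINDEX(',', t.RowData, p4.Col4EndPos + 1))) p5 (Col5EndPos)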

    PS. Jason - just how do you go about embedding the execution plans inside the messages? I've seen you do this many times...

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • lmu92 (1/11/2010)


    CirquedeSQLeil (1/11/2010)


    ...

    Ok, I now have had some time to go back and redo the tests.

    I found one flaw in my tests - they were not equivalent. In the test for what I will call the Wayne Script, I was using a list of 6 comma separated values. For the CTE method, I was using 9. Thus the results would have been a little bloated.

    To try and make the CTE faster, I have done the following -

    Dumped the CTE results to a temp table. The temp table was created as follows:

    Create Table #tmp2 (id int identity(1,1) primary key clustered, stringid int , lft varchar(1000), rght varchar(1000), idx int)

    I changed that to

    Create Table #tmp2 (stringid int , lft varchar(1000), rght varchar(1000), idx int)

    Create clustered index IdxStringID on #tmp2 (StringID)

    That change bought me about 16 seconds in processing time.

    I then decided to try an additional NC index on the idx field. First, I tried it immediately after the table create - that slowed the query down by 16 seconds. Then I tried it after populating the temp table and it was better but still slower (by about 2 seconds). Thus I scrapped that index.

    Then I checked to see the results of creating the clustered index after the table insert. Again, that slowed it down by about 2 seconds overall.

    When I break the query apart (the CTE and table insert separate from the final select), we can see that the CTE is obviously the bottleneck. The final select finishes in 2 seconds (the select that Lutz provided). This select statement is faster by 4 seconds than Wayne's query.

    Then I went to adjust Wayne's query. First, I moved the data creation and table creation out of the query. I created a similar Test table to hold the data, with a primary key on an integer field with a clustered index.

    And now that they are a little more evenly matched - they both return in about 2 seconds (final selects) for 100,000 records. (I also removed the row_number from the query as Wayne suggested.)

    Overall, the set-based approach is faster. The final selects are comparable.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events

  • And now, while I was typing up my last post, I see that Wayne has gone and updated the script he provided.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events

  • WayneS (1/11/2010)


    Well, the concept comes from Jeff. 'nuf said. 😉

    :blush:

    Still, that's an interesting warp on the "quirky update". Nicely done, Wayne.

    It would be interesting to run a race between a nice "cross apply split" and this method.
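
    For anyone who wants to set up that race, here is a minimal sketch of one "cross apply split" shape. It assumes a tally-based splitter function, here called dbo.DelimitedSplit8K and returning (ItemNumber, Item), is already in place, and it reuses the #TEST table from the earlier posts (the #TEST4 name is made up):

    -- sketch only: split each row, then cross-tab the items back into columns
    SELECT t.ID,
           [Account]  = MAX(CASE WHEN s.ItemNumber = 1 THEN LTRIM(s.Item) END),
           [Year]     = MAX(CASE WHEN s.ItemNumber = 2 THEN LTRIM(s.Item) END),
           [Month]    = MAX(CASE WHEN s.ItemNumber = 3 THEN LTRIM(s.Item) END),
           [Scenario] = MAX(CASE WHEN s.ItemNumber = 4 THEN LTRIM(s.Item) END),
           [Version]  = MAX(CASE WHEN s.ItemNumber = 5 THEN LTRIM(s.Item) END),
           [Entity]   = MAX(CASE WHEN s.ItemNumber = 6 THEN LTRIM(s.Item) END)
    INTO #TEST4
    FROM #TEST t
    CROSS APPLY dbo.DelimitedSplit8K(t.RowData, ',') s
    GROUP BY t.ID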

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • @Wayne, Lutz, and Jason,

    When some of the Ninjas on this site did some extreme "split" testing, one of the things we found out was that if all the rows have the commas in the same place, the optimizer figured that out and made the split run as fast as a "fixed field" example.

    You might want to try your tests with randomly variable length fields in the CSV.
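
    For example, the seed data could be generated with randomly varying field widths instead of being edited by hand. A minimal sketch (the #TestRandom name, the row count, and the length ranges are arbitrary choices):

    -- sketch only: builds 100,000 rows whose first and last fields vary randomly
    -- in length, so the comma positions differ from row to row
    IF OBJECT_ID('tempdb..#TestRandom') IS NOT NULL DROP TABLE #TestRandom;
    SELECT TOP (100000)
           RowData = CAST(
                 LEFT('abcdefghij', ABS(CHECKSUM(NEWID())) % 8 + 3)           -- 3-10 char prefix
               + ' ' + CAST(ABS(CHECKSUM(NEWID())) % 100000 AS varchar(10))   -- variable-length account number
               + ', FY11, '
               + LEFT('January', ABS(CHECKSUM(NEWID())) % 5 + 3)              -- 3-7 char month-like field
               + ', Budget, Version_1, '
               + CAST(ABS(CHECKSUM(NEWID())) % 10000000 AS varchar(10))       -- variable-length entity
             AS varchar(75))
    INTO #TestRandom
    FROM sys.all_columns ac1
    CROSS JOIN sys.all_columns ac2;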

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Good feedback Jeff.

    Probably not random enough, but I changed the 5 initial rows (abc xxxxx, etc.) to be:

    1st row: abc 1,

    2nd row: zyx 100,

    3rd row: def 10000,

    4th row: ghi 1000000,

    5th row: jkl 100000000,

    This produced similar results (results being dumped into #TEST2):

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 1 ms.

    Table '#TEST___000000000021'. Scan count 1, logical reads 13802, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    (1310720 row(s) affected)

    (1 row(s) affected)

    SQL Server Execution Times:

    CPU time = 18252 ms, elapsed time = 18560 ms.

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • WayneS (1/11/2010)


    Good feedback Jeff.

    Probably not random enough, but I changed the 5 initial rows (abc xxxxx, etc.) to be:

    1st row: abc 1,

    2nd row: zyx 100,

    3rd row: def 10000,

    4th row: ghi 1000000,

    5th row: jkl 100000000,

    This produced similar results (results being dumped into #TEST2):

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 1 ms.

    Table '#TEST___000000000021'. Scan count 1, logical reads 13802, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    (1310720 row(s) affected)

    (1 row(s) affected)

    SQL Server Execution Times:

    CPU time = 18252 ms, elapsed time = 18560 ms.

    Nice work. It is nice to see why a certain method is better than another.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events

  • Jeff Moden (1/11/2010)


    WayneS (1/11/2010)


    Well, the concept comes from Jeff. 'nuf said. 😉

    :blush:

    Still, that's an interesting warp on the "quirky update". Nicely done, Wayne.

    It would be interesting to run a race between a nice "cross apply split" and this method.

    Any takers?

    Wayne, I found a really slow point in your second script (the million-record one): the table population. You will probably notice throughout the site that there are some really fast methods to do this. Not that that section really affects the split script any - just a thought, though. ;-)

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events
