parse comma-delimited list into separate columns

  • CirquedeSQLeil (1/11/2010)


    I have tested both scripts and compared results.

    I would go with the script provided by Wayne. When the result sets are small, both perform similarly. When dealing with 100,000 records or so, the script by Wayne outperforms the first script substantially.

    Would you mind sharing some rough numbers?

    It's interesting to note that a concept that actually performs a "physical" update followed by an additional select would be faster than dealing with a select only.

    I guess it's due to the group by clause when using the CrossTab solution...

    It would be interesting to see what happens if there are additional indexes that need to be updated as well when using the quirky update method, compared to the CrossTab solution being supported by proper indexing...

    It's always awesome to see the alternatives being shared on this community! By no means am I going to disrespect Wayne's solution! Not at all!! It's just interesting that there are solutions out there that perform better even when one would expect them to be "slower" from just a "basic glance" at them...

    Great stuff, Wayne!!

    @jason: You're absolutely right. I didn't provide the complete query. Just wanted to show the concept...



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question
    How to post performance related questions
    Links for Tally Table, Cross Tabs and Dynamic Cross Tabs, Delimited Split Function

  • lmu92 (1/11/2010)


    CirquedeSQLeil (1/11/2010)


    I have tested both scripts and compared results.

    I would go with the script provided by Wayne. When the result sets are small, both perform similarly. When dealing with 100,000 records or so, the script by Wayne outperforms the first script substantially.

    Would you mind sharing some rough numbers?

    It's interesting to note that a concept that actually performs a "physical" update followed by an additional select would be faster than dealing with a select only.

    I guess it's due to the group by clause when using the CrossTab solution...

    It would be interesting to see what happens if there are additional indexes that need to be updated as well when using the quirky update method, compared to the CrossTab solution being supported by proper indexing...

    It's always awesome to see the alternatives being shared on this community! By no means am I going to disrespect Wayne's solution! Not at all!! It's just interesting that there are solutions out there that perform better even when one would expect them to be "slower" from just a "basic glance" at them...

    Great stuff, Wayne!!

    @jason: You're absolutely right. I didn't provide the complete query. Just wanted to show the concept...

    I will need to run the tests again. The tests covered the complete solution (minus data population, since that skews the results).

    Wayne's solution would finish in about 6 seconds, while the combined solution from you and me was still running after a minute. Both tests were done with 100,000 records.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events

  • I would use an index on (ID, idx) on the [test] table for it...

    Running over a minute for 100k rows seems strange...

    What test scenario did you use?



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question
    How to post performance related questions
    Links for Tally Table, Cross Tabs and Dynamic Cross Tabs, Delimited Split Function

  • lmu92 (1/11/2010)


    I would use an index on (ID, idx) on the [test] table for it...

    Running over a minute for 100k rows seems strange...

    What test scenario did you use?

    I will get back to you on that.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events

  • Wayne, Cool script.

    BUT in the final Select statement, did you mean to use column names instead of the variables?

    -- now, get the data for each column
    SELECT [ID] = ROW_NUMBER() OVER (ORDER BY RowData),
           [Account] = LEFT(RowData, @Col1EndPos-1),
           [Year] = SUBSTRING(RowData, @Col1EndPos+1, @Col2EndPos-@Col1EndPos-1),
           [Month] = SUBSTRING(RowData, @Col2EndPos+1, @Col3EndPos-@Col2EndPos-1),
           [Scenario] = SUBSTRING(RowData, @Col3EndPos+1, @Col4EndPos-@Col3EndPos-1),
           [Version] = SUBSTRING(RowData, @Col4EndPos+1, @Col5EndPos-@Col4EndPos-1),
           [Entity] = SUBSTRING(RowData, @Col5EndPos+1, len(RowData)-@Col5EndPos-1)
    FROM #Test

    In the sample data, if you change "Apr" to "April", the columns become variable width.

    I get the correct result if I use column names. Variable names will only work if the columns are fixed width.

    But this is an awesome script.

    SELECT 'xyz 54050, FY11, April, Budget, Version_1, 0160117'
    UNION ALL
    SELECT 'abc 54050, FY11, May, Budget, Version_1, 0160117'
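
    To make the fixed-width vs. variable-width point concrete, here is a minimal sketch (the @Demo table and the two column aliases are made up for illustration). A scalar variable captures the comma position from just one row, while a per-row CHARINDEX expression stays correct when the widths differ:

    -- sketch only: made-up @Demo data with different-length first fields
    DECLARE @Demo TABLE (RowData varchar(75));
    INSERT INTO @Demo
    SELECT 'xyz 54050, FY11, April, Budget, Version_1, 0160117' UNION ALL
    SELECT 'abcdef 54050, FY11, May, Budget, Version_1, 0160117';

    DECLARE @Col1EndPos int;
    SELECT TOP (1) @Col1EndPos = CHARINDEX(',', RowData) FROM @Demo;       -- position taken from a single row

    SELECT RowData,
           AccountFromVariable = LEFT(RowData, @Col1EndPos - 1),            -- only right for rows shaped like the sampled one
           AccountPerRow       = LEFT(RowData, CHARINDEX(',', RowData) - 1) -- right for every row
    FROM @Demo;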

  • vstitte (1/11/2010)


    Wayne, Cool script.

    BUT in the final Select statement, did you mean to use column names instead of the variables?

    WHOOPS! You are exactly right about that! Good catch. I have modified the above post with the script to do this.

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • lmu92 (1/11/2010)


    It's always awesome to see the alternatives being shared on this community! By no means am I going to disrespect Wayne's solution! Not at all!! It's just interesting that there are solutions out there that perform better even when one would expect them to be "slower" from just a "basic glance" at them...

    Great stuff, Wayne!!

    Hey guys, I really can't take the credit for this. I just posted what I've learned here from Jeff, Gus, Lynn and others. It's just become so ingrained in me that it comes naturally when I see a problem like this.

    About the alternatives being shared on this community... that is what I love about this site so much. Every programmer knows that if you get 5 programmers in a room together to hash out a problem, they'll come out of the room with 10 solutions that all work... it's just an issue of testing to see which is the best. That is what we see so much of on this site, and why it's always such a good idea to read more than the initial post/reply... some of the best stuff will be buried a couple pages in.

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • Okay, after playing around with this for a bit, I came up with a solution that does just a single scan.

    Part 1: build the test environment. This builds a test table of 1,310,720 records.

    -- if temp table already exists (failed previous run), drop it
    if OBJECT_ID('tempdb..#test') IS NOT NULL DROP TABLE #test
    -- simulate the table with the data in it.
    -- NOTE how your sample data was put into a table
    -- to make it easier for us volunteers to work with it!
    DECLARE @Test TABLE (ID int IDENTITY, RowData varchar(75))
    INSERT INTO @Test
    SELECT 'xyz 54050, FY11, Apr, Budget, Version_1, 0160117' UNION ALL
    SELECT 'abc 54050, FY11, May, Budget, Version_1, 0160117' UNION ALL
    SELECT 'def 54050, FY11, May, Budget, Version_1, 0160117' UNION ALL
    SELECT 'ghi 54050, FY11, May, Budget, Version_1, 0160117' UNION ALL
    SELECT 'jkl 54050, FY11, May, Budget, Version_1, 0160117'
    -- get the data from the table and put it into a temporary work table
    SELECT ID, RowData
    INTO #TEST
    FROM @Test
    GO
    insert into #TEST
    SELECT RowData from #TEST
    GO 18 -- makes a test of 1,310,720 records
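
    (The GO 18 batch separator reruns the preceding INSERT 18 times, doubling the row count each pass: 5 seed rows × 2^18 = 1,310,720 rows.)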

    Part 2: the actual query. Note that I'm returning the results into a temp table to take the screen display of all these rows out of the equation.

    -- NOW, run just this portion
    SET STATISTICS IO ON
    SET STATISTICS TIME ON
    -- if temp table already exists (failed previous run), drop it
    if OBJECT_ID('tempdb..#test2') IS NOT NULL DROP TABLE #test2
    ;with CTE1 AS
    (
    select ID, RowData,
           Col1EndPos = CharIndex(',', RowData)
    from #TEST
    ), CTE2 AS
    (
    select ID, RowData, Col1EndPos,
           Col2EndPos = CharIndex(',', RowData, Col1EndPos + 1)
    from CTE1
    ), CTE3 AS
    (
    select ID, RowData, Col1EndPos, Col2EndPos,
           Col3EndPos = CharIndex(',', RowData, Col2EndPos + 1)
    from CTE2
    ), CTE4 AS
    (
    select ID, RowData, Col1EndPos, Col2EndPos, Col3EndPos,
           Col4EndPos = CharIndex(',', RowData, Col3EndPos + 1)
    from CTE3
    ), CTE5 AS
    (
    select ID, RowData, Col1EndPos, Col2EndPos, Col3EndPos, Col4EndPos,
           Col5EndPos = CharIndex(',', RowData, Col4EndPos + 1)
    from CTE4
    )
    select ID, --= ROW_NUMBER() OVER (ORDER BY RowData),
           [Account] = LEFT(RowData, Col1EndPos-1),
           [Year] = SUBSTRING(RowData, Col1EndPos+1, Col2EndPos-Col1EndPos-1),
           [Month] = SUBSTRING(RowData, Col2EndPos+1, Col3EndPos-Col2EndPos-1),
           [Scenario] = SUBSTRING(RowData, Col3EndPos+1, Col4EndPos-Col3EndPos-1),
           [Version] = SUBSTRING(RowData, Col4EndPos+1, Col5EndPos-Col4EndPos-1),
           [Entity] = SUBSTRING(RowData, Col5EndPos+1, LEN(RowData)-Col5EndPos)
    into #TEST2
    from CTE5
    SET STATISTICS TIME OFF
    SET STATISTICS IO OFF

    This returns the following results:

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 0 ms.

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 2 ms.

    Table '#TEST___000000000017'. Scan count 1, logical reads 10577, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    (1310720 row(s) affected)

    (1 row(s) affected)

    SQL Server Execution Times:

    CPU time = 18284 ms, elapsed time = 18684 ms.

    With an execution plan of (see "Execution Plan 1.jpg")

    If I comment out the "into #TEST2", I get the following results:

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 0 ms.

    (1310720 row(s) affected)

    Table '#TEST___00000000001A'. Scan count 1, logical reads 10577, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    (1 row(s) affected)

    SQL Server Execution Times:

    CPU time = 17285 ms, elapsed time = 23992 ms.

    With an execution plan of (see "Execution Plan 2.jpg")

    (As you can see, it took an additional 5.3 seconds to pump the data to the screen vs. to a temp table.)

    And, finally, running the original "quirky update" method against this test data gets the following results:

    Table '#TEST___00000000001E'. Scan count 1, logical reads 13802, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    (1310720 row(s) affected)

    (1 row(s) affected)

    SQL Server Execution Times:

    CPU time = 8549 ms, elapsed time = 24874 ms.

    (1310720 row(s) affected)

    Table '#TEST___00000000001E'. Scan count 1, logical reads 13802, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    (1 row(s) affected)

    SQL Server Execution Times:

    CPU time = 2090 ms, elapsed time = 21351 ms.

    With an execution plan of (see "Execution Plan 3.jpg")

    This takes the most time. And, it takes more logical reads - I assume because of the larger size of the #TEST table with the additional 5 columns.
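
    As a side note, the same single-scan shape can also be written by chaining the comma positions through CROSS APPLY (VALUES ...) instead of nested CTEs. This is only a minimal sketch, assuming the #TEST table from above and SQL Server 2008 or later (the #TEST3 name is made up):

    -- sketch only: chained CROSS APPLY (VALUES) instead of nested CTEs; still one scan of #TEST
    SELECT t.ID,
           [Account]  = LEFT(t.RowData, p1.Col1EndPos - 1),
           [Year]     = SUBSTRING(t.RowData, p1.Col1EndPos + 1, p2.Col2EndPos - p1.Col1EndPos - 1),
           [Month]    = SUBSTRING(t.RowData, p2.Col2EndPos + 1, p3.Col3EndPos - p2.Col2EndPos - 1),
           [Scenario] = SUBSTRING(t.RowData, p3.Col3EndPos + 1, p4.Col4EndPos - p3.Col3EndPos - 1),
           [Version]  = SUBSTRING(t.RowData, p4.Col4EndPos + 1, p5.Col5EndPos - p4.Col4EndPos - 1),
           [Entity]   = SUBSTRING(t.RowData, p5.Col5EndPos + 1, LEN(t.RowData) - p5.Col5EndPos)
    INTO #TEST3
    FROM #TEST t
    CROSS APPLY (VALUES (CHARINDEX(',', t.RowData)))                    p1 (Col1EndPos)
    CROSS APPLY (VALUES (CHARINDEX(',', t.RowData, p1.Col1EndPos + 1))) p2 (Col2EndPos)
    CROSS APPLY (VALUES (CHARINDEX(',', t.RowData, p2.Col2EndPos + 1))) p3 (Col3EndPos)
    CROSS APPLY (VALUES (CHARINDEX(',', t.RowData, p3.Col3EndPos + 1))) p4 (Col4EndPos)
    CROSS APPLY (VALUES (CHARINDEX(',', t.RowData, p4.Col4EndPos + 1))) p5 (Col5EndPos)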

    PS. Jason - just how do you go about embedding the execution plans inside the messages? I've seen you do this many times...

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • lmu92 (1/11/2010)


    CirquedeSQLeil (1/11/2010)


    ...

    Ok, I now have had some time to go back and redo the tests.

    I found one flaw in my tests - they were not equivalent. In the test for what I will call the Wayne Script, I was using a list of 6 comma separated values. For the CTE method, I was using 9. Thus the results would have been a little bloated.

    To try and make the CTE faster, I have done the following -

    Dumped the CTE results to a temp table. The temp table was created as follows:

    Create Table #tmp2 (id int identity(1,1) primary key clustered, stringid int , lft varchar(1000), rght varchar(1000), idx int)

    I changed that to

    Create Table #tmp2 (stringid int , lft varchar(1000), rght varchar(1000), idx int)

    Create clustered index IdxStringID on #tmp2 (StringID)

    That change bought me about 16 seconds in processing time.

    I then decided to try an additional NC index on the idx field. First, I tried it immediately after the table create - that slowed the query down by 16 seconds. Then I tried it after populating the temp table and it was better but still slower (by about 2 seconds). Thus I scrapped that index.

    Then I checked to see the results of creating the clustered index after the table insert. Again, that slowed it down by about 2 seconds overall.

    When I break the query apart (the CTE and table insert separate from the final select), we can see that the CTE is obviously the bottleneck. The final select finishes in 2 seconds (the select that Lutz provided). This select statement is faster by 4 seconds than Wayne's query.

    Then I went to adjust Wayne's query. First, I moved the data creation and table creation out of the query. I created a similar Test table to hold the data, with a primary key on an integer field with a clustered index.

    And now that they are a little more evenly matched - they both return in about 2 seconds (final selects) for 100,000 records. (I also removed the row_number from the query as Wayne suggested.)

    Overall, the set-based approach is faster. The final selects are comparable.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events

  • And now, while I was typing up my last post, I see that Wayne has gone and updated the script he provided.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events

  • WayneS (1/11/2010)


    Well, the concept comes from Jeff. 'nuf said. 😉

    :blush:

    Still, that's an interesting warp on the "quirky update". Nicely done, Wayne.

    It would be interesting to run a race between a nice "cross apply split" and this method.
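
    For anyone who wants to set up that race, here is a minimal sketch of one "cross apply split" shape. It assumes a tally-based splitter function, here called dbo.DelimitedSplit8K and returning (ItemNumber, Item), is already in place, and it reuses the #TEST table from the earlier posts (the #TEST4 name is made up):

    -- sketch only: split each row, then cross-tab the items back into columns
    SELECT t.ID,
           [Account]  = MAX(CASE WHEN s.ItemNumber = 1 THEN LTRIM(s.Item) END),
           [Year]     = MAX(CASE WHEN s.ItemNumber = 2 THEN LTRIM(s.Item) END),
           [Month]    = MAX(CASE WHEN s.ItemNumber = 3 THEN LTRIM(s.Item) END),
           [Scenario] = MAX(CASE WHEN s.ItemNumber = 4 THEN LTRIM(s.Item) END),
           [Version]  = MAX(CASE WHEN s.ItemNumber = 5 THEN LTRIM(s.Item) END),
           [Entity]   = MAX(CASE WHEN s.ItemNumber = 6 THEN LTRIM(s.Item) END)
    INTO #TEST4
    FROM #TEST t
    CROSS APPLY dbo.DelimitedSplit8K(t.RowData, ',') s
    GROUP BY t.ID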

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • @Wayne, Lutz, and Jason,

    When some of the Ninjas on this site did some extreme "split" testing, one of the things we found out was that if all the rows have the commas in the same place, the optimizer figured that out and made the split run as fast as a "fixed field" example.

    You might want to try your tests with randomly variable length fields in the CSV.
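
    For example, the seed data could be generated with randomly varying field widths instead of being edited by hand. A minimal sketch (the #TestRandom name, the row count, and the length ranges are arbitrary choices):

    -- sketch only: builds 100,000 rows whose first and last fields vary randomly
    -- in length, so the comma positions differ from row to row
    IF OBJECT_ID('tempdb..#TestRandom') IS NOT NULL DROP TABLE #TestRandom;
    SELECT TOP (100000)
           RowData = CAST(
                 LEFT('abcdefghij', ABS(CHECKSUM(NEWID())) % 8 + 3)           -- 3-10 char prefix
               + ' ' + CAST(ABS(CHECKSUM(NEWID())) % 100000 AS varchar(10))   -- variable-length account number
               + ', FY11, '
               + LEFT('January', ABS(CHECKSUM(NEWID())) % 5 + 3)              -- 3-7 char month-like field
               + ', Budget, Version_1, '
               + CAST(ABS(CHECKSUM(NEWID())) % 10000000 AS varchar(10))       -- variable-length entity
             AS varchar(75))
    INTO #TestRandom
    FROM sys.all_columns ac1
    CROSS JOIN sys.all_columns ac2;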

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Good feedback Jeff.

    Probably not random enough, but I changed the 5 initial rows (abc xxxxx, etc.) to be:

    1st row: abc 1,

    2nd row: zyx 100,

    3rd row: def 10000,

    4th row: ghi 1000000,

    5th row: jkl 100000000,

    This produced similar results (results being dumped into #TEST2):

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 1 ms.

    Table '#TEST___000000000021'. Scan count 1, logical reads 13802, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    (1310720 row(s) affected)

    (1 row(s) affected)

    SQL Server Execution Times:

    CPU time = 18252 ms, elapsed time = 18560 ms.

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • WayneS (1/11/2010)


    Good feedback Jeff.

    Probably not random enough, but I changed the 5 initial rows (abc xxxxx, etc.) to be:

    1st row: abc 1,

    2nd row: zyx 100,

    3rd row: def 10000,

    4th row: ghi 1000000,

    5th row: jkl 100000000,

    This produced similar results (results being dumped into #TEST2):

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 1 ms.

    Table '#TEST___000000000021'. Scan count 1, logical reads 13802, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    (1310720 row(s) affected)

    (1 row(s) affected)

    SQL Server Execution Times:

    CPU time = 18252 ms, elapsed time = 18560 ms.

    Nice work. It is nice to see why a certain method is better than another.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events

  • Jeff Moden (1/11/2010)


    WayneS (1/11/2010)


    Well, the concept comes from Jeff. 'nuf said. 😉

    :blush:

    Still, that's an interesting warp on the "quirky update". Nicely done, Wayne.

    It would be interesting to run a race between a nice "cross apply split" and this method.

    Any takers?

    Wayne, I found a really slow point in your second script (the million-record one): the table population. You will probably notice throughout the site that there are some really fast methods to do this. Not that that section really affects the split script any - just a thought, though. ;-)

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events
