Tally OH! An Improved SQL 8K “CSV Splitter” Function

  • Hi Jeff,

    After looking at it some more, I discovered an obvious tweak; I think you will like this A LOT.

    CREATE FUNCTION dbo.DelimitedSplit8K_T0
    --===== Define I/O parameters
            (@pString VARCHAR(8000), @pDelimiter CHAR(1))
    RETURNS TABLE WITH SCHEMABINDING AS
    RETURN
    --===== "Inline" CTE Driven "Tally Table" produces values from 1 up to 10,000...
         -- enough to cover VARCHAR(8000)
    WITH E1(N) AS (
                   SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
                   SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
                   SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
                  ),                                  --10E+1 or 10 rows
         E2(N) AS (SELECT 1 FROM E1 a, E1 b),         --10E+2 or 100 rows
         E4(N) AS (SELECT 1 FROM E2 a, E2 b),         --10E+4 or 10,000 rows max
    cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
                    -- for both a performance gain and prevention of accidental "overruns"
                    SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
                   ),
    cteStart(N1) AS (--==== This returns N+1 (the starting position of each "element", just once for each delimiter)
                     SELECT 1 UNION ALL -- does away with the 0-based CTE and the OR condition in one go!
                     SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter
                    )
    --===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
    SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY s.N1),
           Item       = SUBSTRING(@pString,s.N1,ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000))
      FROM cteStart s
    ;

    I disposed of the need for a zero-based tally, and also removed the OR condition you were trying to get rid of, by moving the UNION ALL one step later in the processing order, towards the end. My earlier change to the TOP operator is included as well. Unless I messed something up, this version seems to run circles (15-20%) around the article's original code.
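    As a quick sanity check, here is a minimal usage sketch of the function above; splitting a short sample string should return one row per element:

    -- Minimal usage sketch of dbo.DelimitedSplit8K_T0 on a sample string.
    -- Expected rows: (1, 'a'), (2, 'b'), (3, 'c').
    SELECT s.ItemNumber, s.Item
      FROM dbo.DelimitedSplit8K_T0('a,b,c', ',') s;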

    Here is another alternative, often slightly faster still, but I would need to see comparative graphs to be sure its behavior is consistent.

    CREATE FUNCTION dbo.DelimitedSplit8K_T1
    --===== Define I/O parameters
            (@pString VARCHAR(8000), @pDelimiter CHAR(1))
    RETURNS TABLE WITH SCHEMABINDING AS
    RETURN
    --===== "Inline" CTE Driven "Tally Table" produces values from 1 up to 10,000...
         -- enough to cover VARCHAR(8000)
    WITH E1(N) AS (
                   SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
                   SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
                   SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
                  ),                                  --10E+1 or 10 rows
         E2(N) AS (SELECT 1 FROM E1 a, E1 b),         --10E+2 or 100 rows
         E4(N) AS (SELECT 1 FROM E2 a, E2 b),         --10E+4 or 10,000 rows max
    cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
                    -- for both a performance gain and prevention of accidental "overruns"
                    SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
                   ),
    cteStart(N1) AS (--==== This returns N+1 (the starting position of each "element", just once for each delimiter)
                     SELECT 1 UNION ALL -- does away with the 0-based CTE and the OR condition in one go!
                     SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter
                    ),
    cteLen(N1,L1) AS (--==== Return the start and length of each element (for use in the SUBSTRING)
                      SELECT s.N1,
                             ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000)
                        FROM cteStart s
                     )
    --===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
    SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
           Item       = SUBSTRING(@pString, l.N1, l.L1)
      FROM cteLen l
    ;
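
    For anyone tracing the ISNULL/NULLIF trick used in both versions, here is a tiny standalone check of the length expression for the final element (the sample values are chosen only for illustration):

    -- For the last element of 'a,b,c' (start position 5, no trailing delimiter):
    -- CHARINDEX returns 0, NULLIF turns that 0 into NULL, the subtraction stays
    -- NULL, and ISNULL falls back to 8000, so SUBSTRING takes the rest of the string.
    DECLARE @pString    VARCHAR(8000) = 'a,b,c',
            @pDelimiter CHAR(1)       = ',',
            @N1         INT           = 5;
    SELECT LengthUsed = ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,@N1),0)-@N1,8000),
           Item       = SUBSTRING(@pString,@N1,ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,@N1),0)-@N1,8000));
    -- Returns LengthUsed = 8000 and Item = 'c'.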

    While experimenting I found that even the slightest modification can totally wreck performance. This makes me a little unsure how well these routines will behave as a replacement for existing code.

  • Wow, what a difference 2 years makes:

    Jeff Moden, February 2009, http://qa.sqlservercentral.com/Forums/FindPost652105.aspx:

    With the exception of RegEx, damned near all CLR's are simply not the right way to do it.

    Jeff Moden, May 2011, http://qa.sqlservercentral.com/articles/Tally+Table/72993/:

    Make no doubt about it. The current best way to split delimited strings in SQL Server is to use a CLR splitter.

    And no -- this is meant to be neither prod nor provocation. :-D (Also, I don't think I've quoted out of context.)

    My point is that if one of the best and brightest among us (Jeff!) can (1) object to indiscriminate use of a new coding approach and then (B) nevertheless keep an open mind and see the utility of that new approach, then maybe I'll pause before I condemn the next "new thing" that comes along.

    Thanks,

    Rich

    P.S. With acknowledgments to Car Talk....

    New graphs would be nice. I got the feeling that the on-the-fly generated tally is now just as fast as, if not faster than, a 0-based physical tally table. The speedup numbers are in a similar range, if not a little higher.

    Please, pretty please, Jeff 🙂

  • peter-757102 (5/6/2011)


    New graphs would be nice. I got the feeling that the on-the-fly generated tally is now just as fast as, if not faster than, a 0-based physical tally table. The speedup numbers are in a similar range, if not a little higher.

    Please, pretty please, Jeff 🙂

    Do what I did when Wayne asked me: Download the Build script from Jeff's article, drop yours in, run the tests, verify that your assumptions hold, then export to CSV, and either:

    1) Do better than I did and make the graphs

    or

    2) post the raw CSV results here in a code block for others to view.

    I did modify the test set and tested the code, of course; I just want to see comparable graphs from the same servers the others originated from. Maybe it is just better on servers configured like mine, and terrible on others ;).

  • Nadrek (5/6/2011)


    peter-757102 (5/6/2011)


    New graphs would be nice. I got the feeling that the on-the-fly generated tally is now just as fast as, if not faster than, a 0-based physical tally table. The speedup numbers are in a similar range, if not a little higher.

    Please, pretty please, Jeff 🙂

    Do what I did when Wayne asked me: Download the Build script from Jeff's article, drop yours in, run the tests, verify that your assumptions hold, then export to CSV, and either:

    1) Do better than I did and make the graphs

    or

    2) post the raw CSV results here in a code block for others to view.

    2. Do what the build script says: copy the data from your results grid into Excel (with the column headers), then post the Excel file here as an attachment.

    FYI: the graphs are a "Scatter" graph, with smooth lines and markers. They are a wee bit tedious to set up, but straightforward. Then take a screenshot of the Excel sheet with the graph, paste it into your favorite graphics editor, copy just the graph portion, and paste it into a new image. Save, and upload. Use the img tag in your post to have it display.

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • WayneS (5/6/2011)


    Nadrek (5/6/2011)


    peter-757102 (5/6/2011)


    New graphs would be nice. I got the feeling that the on-the-fly generated tally is now just as fast as, if not faster than, a 0-based physical tally table. The speedup numbers are in a similar range, if not a little higher.

    Please, pretty please, Jeff 🙂

    Do what I did when Wayne asked me: Download the Build script from Jeff's article, drop yours in, run the tests, verify that your assumptions hold, then export to CSV, and either:

    1) Do better than I did and make the graphs

    or

    2) post the raw CSV results here in a code block for others to view.

    2. Do what the build script says: copy the data from your results grid into Excel (with the column headers), then post the Excel file here as an attachment.

    FYI: the graphs are a "Scatter" graph, with smooth lines and markers. They are a wee bit tedious to set up, but straightforward. Then take a screenshot of the Excel sheet with the graph, paste it into your favorite graphics editor, copy just the graph portion, and paste it into a new image. Save, and upload. Use the img tag in your post to have it display.

    It was not really straightforward (using OpenOffice), but I managed to get some graphs; see attachments.

    But now how do I link an attachment as an image on this forum?

    I see no straightforward way to do this; the img tag only adds some plain text!

    OK, I figured it out, but it is a horrible interface 🙁

    The code was posted here: http://qa.sqlservercentral.com/Forums/FindPost1104575.aspx

    Peter: On my machine, you've definitely beaten the article's original try, and beaten my Union version, but the 0-based permanent tally table is still better.

    Wayne: thank you for the information on graphing. I don't have time to do a comprehensive job, but here are some samples, attached, split between smaller and larger numbers of elements to make the differences more obvious.

  • Nadrek (5/6/2011)


    Peter: On my machine, you've definitely beaten the article's original try, and beaten my Union version, but the 0-based permanent tally table is still better.

    Interesting... I would be interested to see Peter's versions with a permanent Tally table. (I don't have time right now, but if nobody gets to it before this weekend I'll take a stab at it.)

  • Nadrek (5/6/2011)


    Peter: On my machine, you've definitely beaten the article's original try, and beaten my Union version, but the 0-based permanent tally table is still better.

    Wayne: thank you for the information on graphing. I don't have time to do a comprehensive job, but here are some samples, attached, split between smaller and larger numbers of elements to make the differences more obvious.

    Can you provide me with a generation script for the exact tally table you used in your tests? I will then repeat my tests with your code added as well, to see how it works on my system.

    I will also run a test with the ORDER BY that you did not have in your SELECT on the tally table, and check whether it makes any speed difference at all.

    Maybe one of the tricks I applied might also work on your code, and speed it up even more.

  • UMG Developer (5/6/2011)


    Nadrek (5/6/2011)


    Peter: On my machine, you've definitely beaten the article's original try, and beaten my Union version, but the 0-based permanent tally table is still better.

    Interesting... I would be interested to see Peter's versions with a permanent Tally table. (I don't have time right now, but if nobody gets to it before this weekend I'll take a stab at it.)

    I am thinking more along the lines of a modified tally table, as was introduced a year or more ago. It has not only N, but also an N + 1 column, doing away with a calculation at the expense of some I/O (see the sketch below).

    It is hard to imagine a simple +1 making much computational difference, but removing such calculations seems to keep the optimizer from being thrown off track.

    We have to test it, though; with some code it might work wonders, with other code not at all.

    Still, the nice thing about the CTE method is that you do not need a table at all; the function stands on its own, which can be quite valuable as a universal tool.
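    To make the idea concrete, here is a minimal sketch of such an N / N+1 tally table. The names (dbo.Tally2, N, N1) are just placeholders, not something taken from the article:

    -- Sketch only: a tally table that stores both N and a precomputed N + 1,
    -- so cteStart could read t.N1 directly instead of computing t.N + 1 per row.
    CREATE TABLE dbo.Tally2
    (
        N  INT NOT NULL,
        N1 INT NOT NULL,   -- precomputed N + 1
        CONSTRAINT PK_Tally2 PRIMARY KEY CLUSTERED (N) WITH (FILLFACTOR = 100)
    );

    INSERT INTO dbo.Tally2 (N, N1)
    SELECT N, N + 1
      FROM (SELECT TOP (10000)
                   N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
              FROM sys.all_columns ac1
             CROSS JOIN sys.all_columns ac2) src;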

  • peter-757102 (5/6/2011)


    Nadrek (5/6/2011)


    Peter: On my machine, you've definitely beaten the article's original try, and beaten my Union version, but the 0-based permanent tally table is still better.

    Wayne: thank you for the information on graphing. I don't have time to do a comprehensive job, but here are some samples, attached, split between smaller and larger numbers of elements to make the differences more obvious.

    Can you provide me with a generation script for the exact tally table you used in your tests? I will then repeat my tests with your code added as well, to see how it works on my system.

    I will also run a test with the ORDER BY that you did not have in your SELECT on the tally table, and check whether it makes any speed difference at all.

    Maybe one of the tricks I applied might also work on your code, and speed it up even more.

    Can we get someone to put Peter's and Nadrek's code into a new test script, so that all can easily test it? Just modify Jeff's script to remove all but the DelimitedSplit8K, and add the four newer ones from Peter and Nadrek. If you need a tally table, it needs to cover 1-10000, with a clustered index (CI) at a 100% fill factor.
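    A minimal build sketch of such a tally table follows; the table name dbo.Tally and the use of sys.all_columns as a row source are just one common way to do it, not a requirement of the test script:

    -- Sketch: 1-10000 tally table with a unique clustered index at 100% fill factor.
    SELECT TOP (10000)
           IDENTITY(INT, 1, 1) AS N
      INTO dbo.Tally
      FROM sys.all_columns ac1
     CROSS JOIN sys.all_columns ac2;

    CREATE UNIQUE CLUSTERED INDEX CIX_Tally_N
        ON dbo.Tally (N)
      WITH (FILLFACTOR = 100);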

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes



  • WayneS (5/6/2011)


    Can we get someone to put Peter's and Nadrek's code into a new test script, so that all can easily test it? Just modify Jeff's script to remove all but the DelimitedSplit8K, and add the four newer ones from Peter and Nadrek. If you need a tally table, it needs to cover 1-10000, with a clustered index (CI) at a 100% fill factor.

    I am working on that as we speak.

    Update:

    I made the script but am still in the process of optimizing/correcting a few things. Right now, there are too many processes running on my server to get reliably clean data for the graphs, so I will resume tomorrow. Hopefully by then, someone else will have also taken a second look and we can compare findings.

  • This thread is getting more and more exciting.

    Awesome.

    -- Gianluca Sartori

  • Gianluca Sartori (5/6/2011)


    This thread is getting more and more exciting.

    Awesome.

    I know what you mean. I, for one, am looking forward to the revealing of the ultimate DelimitedSplit function. It would be ultra-sweet if it were to handle varchar(max) also!

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


