Tally OH! An Improved SQL 8K “CSV Splitter” Function

  • Hi Jeff,

    After looking at it some more, I discovered an obvious tweak; I think you will like this A LOT.

    CREATE FUNCTION dbo.DelimitedSplit8K_T0
    --===== Define I/O parameters
            (@pString VARCHAR(8000), @pDelimiter CHAR(1))
    RETURNS TABLE WITH SCHEMABINDING AS
    RETURN
    --===== "Inline" CTE Driven "Tally Table" produces values from 1 up to 10,000...
         -- enough to cover VARCHAR(8000)
    WITH E1(N) AS (
                   SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
                   SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
                   SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
                  ),                                  --10E+1 or 10 rows
         E2(N) AS (SELECT 1 FROM E1 a, E1 b),         --10E+2 or 100 rows
         E4(N) AS (SELECT 1 FROM E2 a, E2 b),         --10E+4 or 10,000 rows max
    cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
                    -- for both a performance gain and prevention of accidental "overruns"
                    SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
                   ),
    cteStart(N1) AS (--==== This returns N+1 (the starting position of each "element", just once for each delimiter)
                     SELECT 1 UNION ALL -- does away with the 0-based CTE and the OR condition in one go!
                     SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter
                    )
    --===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
    SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY s.N1),
           Item       = SUBSTRING(@pString,s.N1,ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000))
      FROM cteStart s
    ;

    I disposed of the need for a zero-based tally, and also removed the OR condition you were trying to get rid of, by moving the UNION ALL one step later in the processing order, towards the end. My earlier change to the TOP operator is included as well. Unless I messed something up, this version seems to run circles (15-20%) around the article's original code.
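    As a quick sanity check, here is a minimal usage sketch of the function above; splitting a short sample string should return one row per element:

    -- Minimal usage sketch of dbo.DelimitedSplit8K_T0 on a sample string.
    -- Expected rows: (1, 'a'), (2, 'b'), (3, 'c').
    SELECT s.ItemNumber, s.Item
      FROM dbo.DelimitedSplit8K_T0('a,b,c', ',') s;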

    Here is another alternative, often slightly faster still, but I would need to see comparative graphs to be sure its behavior is consistent.

    CREATE FUNCTION dbo.DelimitedSplit8K_T1
    --===== Define I/O parameters
            (@pString VARCHAR(8000), @pDelimiter CHAR(1))
    RETURNS TABLE WITH SCHEMABINDING AS
    RETURN
    --===== "Inline" CTE Driven "Tally Table" produces values from 1 up to 10,000...
         -- enough to cover VARCHAR(8000)
    WITH E1(N) AS (
                   SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
                   SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
                   SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
                  ),                                  --10E+1 or 10 rows
         E2(N) AS (SELECT 1 FROM E1 a, E1 b),         --10E+2 or 100 rows
         E4(N) AS (SELECT 1 FROM E2 a, E2 b),         --10E+4 or 10,000 rows max
    cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
                    -- for both a performance gain and prevention of accidental "overruns"
                    SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
                   ),
    cteStart(N1) AS (--==== This returns N+1 (the starting position of each "element", just once for each delimiter)
                     SELECT 1 UNION ALL -- does away with the 0-based CTE and the OR condition in one go!
                     SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter
                    ),
    cteLen(N1,L1) AS (--==== Return the start and length of each element (for use in the SUBSTRING)
                      SELECT s.N1,
                             ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000)
                        FROM cteStart s
                     )
    --===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
    SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
           Item       = SUBSTRING(@pString, l.N1, l.L1)
      FROM cteLen l
    ;
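
    For anyone tracing the ISNULL/NULLIF trick used in both versions, here is a tiny standalone check of the length expression for the final element (the sample values are chosen only for illustration):

    -- For the last element of 'a,b,c' (start position 5, no trailing delimiter):
    -- CHARINDEX returns 0, NULLIF turns that 0 into NULL, the subtraction stays
    -- NULL, and ISNULL falls back to 8000, so SUBSTRING takes the rest of the string.
    DECLARE @pString    VARCHAR(8000) = 'a,b,c',
            @pDelimiter CHAR(1)       = ',',
            @N1         INT           = 5;
    SELECT LengthUsed = ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,@N1),0)-@N1,8000),
           Item       = SUBSTRING(@pString,@N1,ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,@N1),0)-@N1,8000));
    -- Returns LengthUsed = 8000 and Item = 'c'.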

    While experimenting I found that even the slightest modification can totally wreck performance. This makes me a little unsure how well these routines will behave as a replacement for existing code.

  • Wow, what a difference 2 years makes:

    Jeff Moden, February 2009, http://qa.sqlservercentral.com/Forums/FindPost652105.aspx:

    With the exception of RegEx, damned near all CLR's are simply not the right way to do it.

    Jeff Moden, May 2011, http://qa.sqlservercentral.com/articles/Tally+Table/72993/:

    Make no doubt about it. The current best way to split delimited strings in SQL Server is to use a CLR splitter.

    And no -- this is meant to be neither prod nor provocation. :-D (Also, I don't think I've quoted out of context.)

    My point is that if one of the best and brightest among us (Jeff!) can (1) object to indiscriminate use of a new coding approach and then (B) nevertheless keep an open mind and see the utility of that new approach, then maybe I'll pause before I condemn the next "new thing" that comes along.

    Thanks,

    Rich

    P.S. With acknowledgments to Car Talk....

    New graphs would be nice. I got the feeling that the on-the-fly generated tally is now just as fast as, if not faster than, a 0-based physical tally table. The speedup numbers are in a similar range, if not a little higher.

    Please, pretty please, Jeff 🙂

  • peter-757102 (5/6/2011)


    New graphs would be nice. I got the feeling that the on-the-fly generated tally is now just as fast as, if not faster than, a 0-based physical tally table. The speedup numbers are in a similar range, if not a little higher.

    Please, pretty please, Jeff 🙂

    Do what I did when Wayne asked me: Download the Build script from Jeff's article, drop yours in, run the tests, verify that your assumptions hold, then export to CSV, and either:

    1) Do better than I did and make the graphs

    or

    2) post the raw CSV results here in a code block for others to view.

    I did modify the test set and tested the code, of course; I just want to see comparable graphs from the same servers the others originated from. Maybe it is just better on servers configured like mine, and terrible on others ;).

  • Nadrek (5/6/2011)


    peter-757102 (5/6/2011)


    New graphs would be nice. I got the feeling that the on-the-fly generated tally is now just as fast as, if not faster than, a 0-based physical tally table. The speedup numbers are in a similar range, if not a little higher.

    Please, pretty please, Jeff 🙂

    Do what I did when Wayne asked me: Download the Build script from Jeff's article, drop yours in, run the tests, verify that your assumptions hold, then export to CSV, and either:

    1) Do better than I did and make the graphs

    or

    2) post the raw CSV results here in a code block for others to view.

    2. Do what the build script says: copy the data from your results grid into Excel (with the column headers), then post the Excel file here as an attachment.

    FYI: the graphs are a "Scatter" graph, with smooth lines and markers. They are a wee bit tedious to set up, but straightforward. Then take a screenshot of the Excel sheet with the graph, paste it into your favorite graphics editor, copy just the graph portion, and paste it into a new image. Save, and upload. Use the img tag in your post to have it display.

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • WayneS (5/6/2011)


    Nadrek (5/6/2011)


    peter-757102 (5/6/2011)


    New graphs would be nice. I got the feeling that the on-the-fly generated tally is now just as fast as, if not faster than, a 0-based physical tally table. The speedup numbers are in a similar range, if not a little higher.

    Please, pretty please, Jeff 🙂

    Do what I did when Wayne asked me: Download the Build script from Jeff's article, drop yours in, run the tests, verify that your assumptions hold, then export to CSV, and either:

    1) Do better than I did and make the graphs

    or

    2) post the raw CSV results here in a code block for others to view.

    2. Do what the build script says: copy the data from your results grid into Excel (with the column headers), then post the Excel file here as an attachment.

    FYI: the graphs are a "Scatter" graph, with smooth lines and markers. They are a wee bit tedious to set up, but straightforward. Then take a screenshot of the Excel sheet with the graph, paste it into your favorite graphics editor, copy just the graph portion, and paste it into a new image. Save, and upload. Use the img tag in your post to have it display.

    It was not really straightforward (using OpenOffice), but I managed to get some graphs; see attachments.

    But now how do I link an attachment as an image on this forum?

    I see no straightforward way to do this; the img tag only adds some plain text!

    OK, I figured it out, but it is a horrible interface 🙁

    The code was posted here: http://qa.sqlservercentral.com/Forums/FindPost1104575.aspx

    Peter: On my machine, you've definitely beaten the article's original try, and beaten my Union version, but the 0-based permanent tally table is still better.

    Wayne: thank you for the information on graphing. I don't have time to do a comprehensive job, but here are some samples, attached, split between smaller and larger numbers of elements to make the differences more obvious.

  • Nadrek (5/6/2011)


    Peter: On my machine, you've definitely beaten the article's original try, and beaten my Union version, but the 0-based permanent tally table is still better.

    Interesting... I would be interested to see Peter's versions with a permanent Tally table. (I don't have time right now, but if nobody gets to it before this weekend I'll take a stab at it.)

  • Nadrek (5/6/2011)


    Peter: On my machine, you've definitely beaten the article's original try, and beaten my Union version, but the 0-based permanent tally table is still better.

    Wayne: thank you for the information on graphing. I don't have time to do a comprehensive job, but here are some samples, attached, split between smaller and larger numbers of elements to make the differences more obvious.

    Can you provide me with a generation script for the exact tally table you used in your tests? I will then repeat my tests with your code added as well, to see how it works on my system.

    I will also run a test with the ORDER BY that you did not have in your SELECT on the tally table, and check whether it makes any speed difference at all.

    Maybe one of the tricks I applied might also work on your code, and speed it up even more.

  • UMG Developer (5/6/2011)


    Nadrek (5/6/2011)


    Peter: On my machine, you've definitely beaten the article's original try, and beaten my Union version, but the 0-based permanent tally table is still better.

    Interesting... I would be interested to see Peter's versions with a permanent Tally table. (I don't have time right now, but if nobody gets to it before this weekend I'll take a stab at it.)

    I am thinking more along the lines of a modified tally table, as was introduced a year or more ago. It has not only N, but also an N + 1 column, doing away with a calculation at the expense of some I/O (see the sketch below).

    It is hard to imagine a simple +1 making much computational difference, but removing such calculations seems to keep the optimizer from being thrown off track.

    We have to test it, though; with some code it might work wonders, with other code not at all.

    Still, the nice thing about the CTE method is that you do not need a table at all; the function stands on its own, which can be quite valuable as a universal tool.
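    To make the idea concrete, here is a minimal sketch of such an N / N+1 tally table. The names (dbo.Tally2, N, N1) are just placeholders, not something taken from the article:

    -- Sketch only: a tally table that stores both N and a precomputed N + 1,
    -- so cteStart could read t.N1 directly instead of computing t.N + 1 per row.
    CREATE TABLE dbo.Tally2
    (
        N  INT NOT NULL,
        N1 INT NOT NULL,   -- precomputed N + 1
        CONSTRAINT PK_Tally2 PRIMARY KEY CLUSTERED (N) WITH (FILLFACTOR = 100)
    );

    INSERT INTO dbo.Tally2 (N, N1)
    SELECT N, N + 1
      FROM (SELECT TOP (10000)
                   N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
              FROM sys.all_columns ac1
             CROSS JOIN sys.all_columns ac2) src;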

  • peter-757102 (5/6/2011)


    Nadrek (5/6/2011)


    Peter: On my machine, you've definitely beaten the article's original try, and beaten my Union version, but the 0-based permanent tally table is still better.

    Wayne: thank you for the information on graphing. I don't have time to do a comprehensive job, but here are some samples, attached, split between smaller and larger numbers of elements to make the differences more obvious.

    Can you provide me with a generation script for the exact tally table you used in your tests? I will then repeat my tests with your code added as well, to see how it works on my system.

    I will also run a test with the ORDER BY that you did not have in your SELECT on the tally table, and check whether it makes any speed difference at all.

    Maybe one of the tricks I applied might also work on your code, and speed it up even more.

    Can we get someone to put Peter's and Nadrek's code into a new test script, so that all can easily test it? Just modify Jeff's script to remove all but the DelimitedSplit8K, and add the four newer ones from Peter and Nadrek. If you need a tally table, it needs to cover 1-10000, with a clustered index (CI) at a 100% fill factor.
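    A minimal build sketch of such a tally table follows; the table name dbo.Tally and the use of sys.all_columns as a row source are just one common way to do it, not a requirement of the test script:

    -- Sketch: 1-10000 tally table with a unique clustered index at 100% fill factor.
    SELECT TOP (10000)
           IDENTITY(INT, 1, 1) AS N
      INTO dbo.Tally
      FROM sys.all_columns ac1
     CROSS JOIN sys.all_columns ac2;

    CREATE UNIQUE CLUSTERED INDEX CIX_Tally_N
        ON dbo.Tally (N)
      WITH (FILLFACTOR = 100);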

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes



  • WayneS (5/6/2011)


    Can we get someone to put Peter's and Nadrek's code into a new test script, so that all can easily test it? Just modify Jeff's script to remove all but the DelimitedSplit8K, and add the four newer ones from Peter and Nadrek. If you need a tally table, it needs to cover 1-10000, with a clustered index (CI) at a 100% fill factor.

    I am working on that as we speak.

    Update:

    I made the script but am still in the process of optimizing/correcting a few things. Right now, there are too many processes running on my server to get reliably clean data for the graphs, so I will resume tomorrow. Hopefully by then, someone else will have also taken a second look and we can compare findings.

  • This thread is getting more and more exciting.

    Awesome.

    -- Gianluca Sartori

  • Gianluca Sartori (5/6/2011)


    This thread is getting more and more exciting.

    Awesome.

    I know what you mean. I, for one, am looking forward to the revealing of the ultimate DelimitedSplit function. It would be ultra-sweet if it were to handle varchar(max) also!

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


