Tally OH! An Improved SQL 8K “CSV Splitter” Function

  • SQL Kiwi (8/28/2012)


    Usman Butt (8/28/2012)


    But MTVF code speeded up things quite significantly which I shared earlier. For "not created in the same batch" part here is an example of the same batch table values split in which the appending no-length string have no effect 🙂

    I have some good news for you:

    DECLARE @STR VARCHAR(8000);

    DECLARE @table TABLE

    (

    DelimitedStr VARCHAR(8000)

    )

    SET @STR = 'a;b;c;d;e;f;g;h;i;j;k;l;m;n;o;p;q;r;s;t;u;v;w;x;y;z;';

    SET @STR = REPLICATE(@str,32);

    INSERT @table

    (

    [DelimitedStr]

    )

    VALUES (

    @STR

    )

    -- Fast

    ;WITH CTE ([DelimitedStr], xmlstring) AS (

    SELECT [DelimitedStr],xmlstring FROM @table AS T

    CROSS APPLY(

    SELECT CAST('<r>' + REPLACE([DelimitedStr],';','</r><r>')+'</r>' + LEFT(NEWID(),0) AS XML)) x (xmlstring))

    SELECT

    ItemID = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),

    Item = x.i.value('text()[1]', 'varchar(8000)')

    FROM CTE CROSS APPLY cte.xmlstring.nodes('r') x(i);

    Good news not yet. :unsure: I did not opt this while testing since I want it to be an iTVF, So we know NEWID() cannot be used :(. I know the workaround of using VIEW but that too slows down the code pretty much. I hope you can get around this limitation as well 😎

  • SQL Kiwi (8/28/2012)


    Jeff Moden (8/26/2012)


    Ok... I tried those two. Why does the LEFT(@@Dbts,0) speed things up so much?

    It's intriguing, enough to be the subject of my next blog post after the one I am writing now. The others on this thread came commendably close to explaining why (it does indeed have quite a bit to do with non-determinism, though not constant-folding). Y'all might work out the details in the meantime, but I'll post a link to my post anyway when it goes live. There's just not space here for a decent explanation.

    Ohhhhh...Seems like we have to wait for a while 🙁 But good news is yet another blog of yours is coming up on sqlblog 😎 BTW, still trying to digest your excellent posts on temporary tables 😉

  • SQL Kiwi (8/28/2012)


    Jeff Moden (8/26/2012)


    Ok... I tried those two. Why does the LEFT(@@Dbts,0) speed things up so much?

    It's intriguing, enough to be the subject of my next blog post after the one I am writing now. The others on this thread came commendably close to explaining why (it does indeed have quite a bit to do with non-determinism, though not constant-folding). Y'all might work out the details in the meantime, but I'll post a link to my post anyway when it goes live. There's just not space here for a decent explanation.

    Is it something that could change characteristics based on the next change that MS might bless us with or do you think it's stable for the good part of the future? I ask because I tried other functions and string constants in the LEFT function that don't produce the same performance. Of course, those things were deterministic in nature. Heh... and yeah... I'm one that certainly knows that MS could change even documented features. I REALLY look forward to your blog post on the subject because I love this kind of stuff. Thanks, Paul.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Jeff Moden (8/28/2012)


    Is it something that could change characteristics based on the next change that MS might bless us with or do you think it's stable for the good part of the future? I ask because I tried other functions and string constants in the LEFT function that don't produce the same performance. Of course, those things were deterministic in nature. Heh... and yeah... I'm one that certainly knows that MS could change even documented features. I REALLY look forward to your blog post on the subject because I love this kind of stuff. Thanks, Paul.

    Perhaps set up a poll and an easy to run test harness, so all of us can report back precise SQL Server versions and the results - I'd be worried that a previous MS change affected it, as well. As a community we have several versions, quite a few SP's, and many CU's in use.

  • Nadrek (8/28/2012)


    Jeff Moden (8/28/2012)


    Is it something that could change characteristics based on the next change that MS might bless us with or do you think it's stable for the good part of the future? I ask because I tried other functions and string constants in the LEFT function that don't produce the same performance. Of course, those things were deterministic in nature. Heh... and yeah... I'm one that certainly knows that MS could change even documented features. I REALLY look forward to your blog post on the subject because I love this kind of stuff. Thanks, Paul.

    Perhaps set up a poll and an easy to run test harness, so all of us can report back precise SQL Server versions and the results - I'd be worried that a previous MS change affected it, as well. As a community we have several versions, quite a few SP's, and many CU's in use.

    Excellent Idea. :Wow: But so far I have seen the same behavior on 2005, 2008, 2008 R2. But CUs, SPs etc. could change the behavior.

  • Jeff Moden (8/28/2012)


    Is it something that could change characteristics based on the next change that MS might bless us with or do you think it's stable for the good part of the future?

    It's not documented at the moment (I don't generally write about documented things) and the behaviour changed at least once (in SQL Server 2005) so yes, this is a curiosity and not something one could rely on. That's not to say there are no circumstances in which one might do so, but it comes with all the usual risks.

  • Usman Butt (8/28/2012)


    Good news not yet. :unsure: I did not opt this while testing since I want it to be an iTVF, So we know NEWID() cannot be used :(. I know the workaround of using VIEW but that too slows down the code pretty much. I hope you can get around this limitation as well 😎

    It's a bit tricky in an in-line TVF with column references; one option is to use Brad Shultz's trick (not quite as fast):

    DECLARE @STR VARCHAR(8000);

    DECLARE @table TABLE

    (

    DelimitedStr VARCHAR(8000)

    )

    SET @STR = 'a;b;c;d;e;f;g;h;i;j;k;l;m;n;o;p;q;r;s;t;u;v;w;x;y;z;';

    SET @STR = REPLICATE(@str,32);

    INSERT @table

    (

    [DelimitedStr]

    )

    VALUES (

    @STR

    )

    ;WITH CTE ([DelimitedStr], xmlstring) AS (

    SELECT [DelimitedStr],xmlstring FROM @table AS T

    CROSS APPLY(

    SELECT CAST('<r>' + REPLACE([DelimitedStr],';','</r><r>')+'</r>' AS XML).query('.')) x (xmlstring))

    SELECT

    ItemID = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),

    Item = x.i.value('text()[1]', 'varchar(8000)')

    FROM CTE CROSS APPLY cte.xmlstring.nodes('r') x(i);

    By the way, Oleg's XML function also fails when a semicolon is the delimiter and characters like '<' or '&' are present. In the version you posted, it had a comma hard-coded as the delimiter (the @Delimiter parameter in the function definition is unused).

  • SQL Kiwi (8/28/2012)


    By the way, Oleg's XML function also fails when a semicolon is the delimiter and characters like '<' or '&' are present.

    Yes, absolutely correct. This is why I said that RCTE version could be a better option unless one can lives with these caveats 😉

    In the version you posted, it had a comma hard-coded as the delimiter (the @Delimiter parameter in the function definition is unused).

    Yes, this was already highlighted. The funny thing was that the code I was using doesn't have the comma hard-coded :hehe: Should have cover it with the argument that it was deliberate in order to know someone used it? 😉 But my bad, I did not change it here then (I did it now). Thanks for pointing it out.

  • SQL Kiwi (8/28/2012)


    Usman Butt (8/28/2012)


    Good news not yet. :unsure: I did not opt this while testing since I want it to be an iTVF, So we know NEWID() cannot be used :(. I know the workaround of using VIEW but that too slows down the code pretty much. I hope you can get around this limitation as well 😎

    It's a bit tricky in an in-line TVF with column references; one option is to use Brad Shultz's trick (not quite as fast):

    DECLARE @STR VARCHAR(8000);

    DECLARE @table TABLE

    (

    DelimitedStr VARCHAR(8000)

    )

    SET @STR = 'a;b;c;d;e;f;g;h;i;j;k;l;m;n;o;p;q;r;s;t;u;v;w;x;y;z;';

    SET @STR = REPLICATE(@str,32);

    INSERT @table

    (

    [DelimitedStr]

    )

    VALUES (

    @STR

    )

    ;WITH CTE ([DelimitedStr], xmlstring) AS (

    SELECT [DelimitedStr],xmlstring FROM @table AS T

    CROSS APPLY(

    SELECT CAST('<r>' + REPLACE([DelimitedStr],';','</r><r>')+'</r>' AS XML).query('.')) x (xmlstring))

    SELECT

    ItemID = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),

    Item = x.i.value('text()[1]', 'varchar(8000)')

    FROM CTE CROSS APPLY cte.xmlstring.nodes('r') x(i);

    Ok. I think I am getting a bit more understanding of this behavior (Seems like it :hehe:). I have managed to use NEWID()’s zero length string in iTVF with the same speed as was for the inline query. Another good thing about this iTVF is that we are doing the entitization as well and not losing the execution speed.

    USE [tempdb]

    GO

    IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[XmlInlineSplitter]') AND type in (N'FN', N'IF', N'TF', N'FS', N'FT'))

    DROP FUNCTION [dbo].[XmlInlineSplitter]

    GO

    SET ANSI_NULLS ON

    GO

    SET QUOTED_IDENTIFIER ON

    GO

    CREATE FUNCTION [dbo].[XmlInlineSplitter]

    (

    @STR VARCHAR(MAX),

    @Delimiter VARCHAR(1),

    @NewId VARCHAR(50)

    )

    RETURNS TABLE WITH SCHEMABINDING AS RETURN

    WITH CTE ( xmlstring) AS (

    SELECT CAST('<r>' + REPLACE((SELECT @STR FOR XML PATH('')),@Delimiter,'</r><r>')+'</r>' + LEFT(@NewId,0) AS XML))

    SELECT

    ItemID = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),

    Item = x.i.value('text()[1]', 'varchar(8000)')

    FROM CTE CROSS APPLY cte.xmlstring.nodes('r') x(i)

    GO

    SET NOCOUNT ON;

    DECLARE @STR VARCHAR(8000), @ItemId BIGINT, @Item VARCHAR(8000);

    DECLARE @table TABLE

    (

    DelimitedStr VARCHAR(8000)

    )

    INSERT @table

    (

    [DelimitedStr]

    )

    SELECT REPLICATE('a;b;c;d;e;f;g;h;i;j;k;l;m;n;o;p;q;r;s;t;u;v;w;x;y;z;', 32)

    UNION ALL

    SELECT REPLICATE('y;a;b;c;z;d;e;c;f;g;h;i;j;k;l;m;n;o;p;q;r;s;t;u;v;w;x;y;z;', 32);

    SET STATISTICS TIME ON;

    SELECT @ItemID = ItemID,@Item = Item FROM @table AS T

    CROSS APPLY dbo.XmlInlineSplitter([DelimitedStr], ';', NEWID())

    SET STATISTICS TIME OFF;

    I do not know BWAA-HAAA!!!! is the right word to use here but using it anyhow:-D:-D But couple of the many questions comes to mind are: Is it an opening of a new dimension? Should all other code be revisited where this undocumented hack could be utilized to speed things up?

    All these questions may be answered when your blog becomes live. Cannot wait for that. I hope you would be kind enough to share the in-depth details as soon as possible 😎

    Last but not least, I do not see such use of NEWID() at runtime is a different behavior, but just in case if it is different, then please shed some light on this as well. Thanks.

  • Usman Butt (8/29/2012)


    I do not know BWAA-HAAA!!!! is the right word to use here but using it anyhow.

    Ha! Very good. You know passing NEWID as a parameter is cheating, right? I won't go on about it, because I'm sure the limitations of this idea are very apparent to you. Aside from that, I dislike implicit conversions where than can be avoided. So, hoping you don't mind, here's a slightly tidied version, with the formatting I prefer, and explicit type conversion. Anyway, well done.

    IF OBJECT_ID(N'dbo.XmlInlineSplitter', N'IF') IS NOT NULL

    DROP FUNCTION dbo.XmlInlineSplitter;

    GO

    CREATE FUNCTION dbo.XmlInlineSplitter

    (

    @STR varchar(max),

    @Delimiter char(1),

    @NewId uniqueidentifier

    )

    RETURNS TABLE

    WITH SCHEMABINDING

    AS RETURN

    SELECT

    ItemID = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),

    Item = T2.i.value('./text()[1]', 'varchar(8000)')

    FROM

    (

    SELECT

    CONVERT(xml,

    '<r>' +

    REPLACE(

    (SELECT @STR FOR XML PATH('')),

    @Delimiter,

    CONVERT(nvarchar(max), N'</r><r>')

    ) + N'</r>' +

    LEFT(CONVERT(nvarchar(40),@NewId),0)

    )

    ) AS T1 (xmlstring)

    CROSS APPLY T1.xmlstring.nodes('./r') AS T2(i);

  • SQL Kiwi (8/29/2012)


    Usman Butt (8/29/2012)


    I do not know BWAA-HAAA!!!! is the right word to use here but using it anyhow.

    Ha! Very good. You know passing NEWID as a parameter is cheating, right? I won't go on about it, because I'm sure the limitations of this idea are very apparent to you. Aside from that, I dislike implicit conversions where than can be avoided. So, hoping you don't mind, here's a slightly tidied version, with the formatting I prefer, and explicit type conversion. Anyway, well done.

    IF OBJECT_ID(N'dbo.XmlInlineSplitter', N'IF') IS NOT NULL

    DROP FUNCTION dbo.XmlInlineSplitter;

    GO

    CREATE FUNCTION dbo.XmlInlineSplitter

    (

    @STR varchar(max),

    @Delimiter char(1),

    @NewId uniqueidentifier

    )

    RETURNS TABLE

    WITH SCHEMABINDING

    AS RETURN

    SELECT

    ItemID = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),

    Item = T2.i.value('./text()[1]', 'varchar(8000)')

    FROM

    (

    SELECT

    CONVERT(xml,

    '<r>' +

    REPLACE(

    (SELECT @STR FOR XML PATH('')),

    @Delimiter,

    CONVERT(nvarchar(max), N'</r><r>')

    ) + N'</r>' +

    LEFT(CONVERT(nvarchar(40),@NewId),0)

    )

    ) AS T1 (xmlstring)

    CROSS APPLY T1.xmlstring.nodes('./r') AS T2(i);

    Thanks a lot for the compliment 😎 Yes, your version is much tidier 😀

    But I have got two questions which I forgot before

    1. Does Replacing an expression from varchar(max) takes less time than nvarchar(max)? I have seen this behavior many times. Not to forget the Replace bug you pointed out. So should not the final version have the Binary Collation conversion? 🙂

    2. My thinking is that NEWID() update SQL server internal tables? This is why this function cannot be part of a user defined function? Also what about the RAND() function? Does it change something internally as well? If yes, then why it does not have the same behavior? I guess some part of this question is related to your blog, so not expecting much.

  • Usman Butt (8/29/2012)


    1. Does Replacing an expression from varchar(max) takes less time than nvarchar(max)? I have seen this behavior many times.

    The varchar version will be half the size of the nvarchar, so yes. The explicit conversions are to remove the CONVERT_IMPLICITs from the query plan; I didn't spend any time testing what was faster.

    Not to forget the Replace bug you pointed out. So should not the final version have the Binary Collation conversion?

    I didn't want to assume anything about the comparison semantic that would be appropriate for all users.

  • SQL Kiwi (8/29/2012)


    Usman Butt (8/29/2012)


    1. Does Replacing an expression from varchar(max) takes less time than nvarchar(max)? I have seen this behavior many times.

    The varchar version will be half the size of the nvarchar, so yes. The explicit conversions are to remove the CONVERT_IMPLICITs from the query plan; I didn't spend any time testing what was faster.

    I did some testing and explicit conversion to varchar(max) after entitization process optimized it quite a bit. So with your code formatting style and with no implicit conversions (hope so), this is what I have come up with

    IF OBJECT_ID(N'dbo.XmlInlineSplitter', N'IF') IS NOT NULL

    DROP FUNCTION dbo.XmlInlineSplitter;

    GO

    CREATE FUNCTION dbo.XmlInlineSplitter

    (

    @STR varchar(max),

    @Delimiter varchar(1),

    @NewId uniqueidentifier

    )

    RETURNS TABLE

    WITH SCHEMABINDING

    AS RETURN

    SELECT

    ItemID = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),

    Item = T2.i.value('./text()[1]', 'varchar(8000)')

    FROM

    (

    SELECT

    CONVERT(xml,

    '<r>' +

    REPLACE(

    CONVERT(varchar(max),(SELECT @STR FOR XML PATH(''))),

    @Delimiter,

    CONVERT(varchar(max), '</r><r>')

    ) + '</r>' +

    LEFT(CONVERT(varchar(40),@NewId),0)

    )

    ) AS T1 (xmlstring)

    CROSS APPLY T1.xmlstring.nodes('./r') AS T2(i);

  • Usman Butt (8/29/2012)


    I did some testing and explicit conversion to varchar(max) after entitization process optimized it quite a bit. So with your code formatting style and with no implicit conversions (hope so), this is what I have come up with

    Yes that is better. Of course now someone will need to split:

    DECLARE @x nvarchar(max) = N'?????,??,???,????,??????,???????????,?????,??,?????,???';

    SELECT * FROM dbo.XmlInlineSplitter(@x, ',', NEWID()) AS xis;

  • SQL Kiwi (8/29/2012)


    Usman Butt (8/29/2012)


    I did some testing and explicit conversion to varchar(max) after entitization process optimized it quite a bit. So with your code formatting style and with no implicit conversions (hope so), this is what I have come up with

    Yes that is better. Of course now someone will need to split:

    DECLARE @x nvarchar(max) = N'?????,??,???,????,??????,???????????,?????,??,?????,???';

    SELECT * FROM dbo.XmlInlineSplitter(@x, ',', NEWID()) AS xis;

    hahahaha..but your version too started with varchar(max) as the parameter, so would the explicit conversion to nvarchar(max) would have matter :hehe:

    So for Unicode data, nvarchar(max) version could be something like

    IF OBJECT_ID(N'dbo.XmlInlineSplitter_For_Unicode_Data', N'IF') IS NOT NULL

    DROP FUNCTION dbo.XmlInlineSplitter;

    GO

    CREATE FUNCTION dbo.XmlInlineSplitter_For_Unicode_Data

    (

    @STR nvarchar(max),

    @Delimiter nvarchar(1),

    @NewId uniqueidentifier

    )

    RETURNS TABLE

    WITH SCHEMABINDING

    AS RETURN

    SELECT

    ItemID = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),

    Item = T2.i.value('./text()[1]', 'nvarchar(max)')

    FROM

    (

    SELECT

    CONVERT(xml,

    N'<r>' +

    REPLACE(

    (SELECT @STR FOR XML PATH('')),

    @Delimiter,

    CONVERT(nvarchar(max), N'</r><r>')

    ) + N'</r>' +

    LEFT(CONVERT(nvarchar(40),@NewId),0)

    )

    ) AS T1 (xmlstring)

    CROSS APPLY T1.xmlstring.nodes('./r') AS T2(i);

Viewing 15 posts - 391 through 405 (of 981 total)

You must be logged in to reply to this topic. Login to reply