Tally OH! An Improved SQL 8K “CSV Splitter” Function

  • ChrisM@Work (8/24/2012)


    A recent thread gives some indication of how democratised functionality can go horribly wrong.

    Crikey.

  • SQL Kiwi (8/24/2012)


    Usman Butt (8/24/2012)


    SQL Kiwi (8/23/2012)


    ChrisM@Work (8/23/2012)


    UNION ALL SELECT NULL -- eliminated by CROSS APPLY

    The NULL is not eliminated by the CROSS APPLY.

    I will second that CROSS APPLY will eliminate the NULL xml row as we are shredding the xml specifically for <r> node OR am I missing something?

    This is not a democracy. The NULL is on the driving side of the APPLY. Check the execution plan, two rows are produced and the joins are executed twice:

    I guess was not able to convey it properly :unsure: The only thing I was talking of that CROSS APPLY would work as INNER JOIN and hence the NULL xml row would be discarded automatically. So I assumed that CROSS APPLY is eliminating the NULL row? I hope this time I am able to convey it properly. Otherwise, I see myself in big trouble :sick:

  • Usman Butt (8/24/2012)


    I guess was not able to convey it properly :unsure: The only thing I was talking of that CROSS APPLY would work as INNER JOIN and hence the NULL xml row would be discarded automatically. So I assumed that CROSS APPLY is eliminating the NULL row? I hope this time I am able to convey it properly. Otherwise, I see myself in big trouble :sick:

    Well yes CROSS APPLY can be thought of as a correlated INNER JOIN, but the NULL is the correlated parameter of that join, not a join predicate. It sounds like you were expecting it to behave like joining on NULL equals <something>, which is not the case. The NULL is a parameter for one iteration of the 'correlated join'. I hope that makes sense. I think there's a reasonable APPLY article on SSC that might explain it better.

  • SQL Kiwi (8/24/2012)


    Usman Butt (8/24/2012)


    I guess was not able to convey it properly :unsure: The only thing I was talking of that CROSS APPLY would work as INNER JOIN and hence the NULL xml row would be discarded automatically. So I assumed that CROSS APPLY is eliminating the NULL row? I hope this time I am able to convey it properly. Otherwise, I see myself in big trouble :sick:

    Well yes CROSS APPLY can be thought of as a correlated INNER JOIN, but the NULL is the correlated parameter of that join, not a join predicate. It sounds like you were expecting it to behave like joining on NULL equals <something>, which is not the case.

    No I was not expecting it like that but I was rather short of words to convey the idea. :hehe:

    The NULL is a parameter for one iteration of the 'correlated join'. I hope that makes sense.

    This was kind of what I was trying to say. This is where your GURUism is so impressive. May be with my words not making much sense, talking of using the CROSS APPLY and OUTER APPLY for the same example could have convey my idea properly.

    I think there's a reasonable APPLY article on SSC that might explain it better.

    I guess I would be pinging the same author here again after that read 😀

  • SQL Kiwi (8/24/2012)


    Usman Butt (8/24/2012)


    I guess was not able to convey it properly :unsure: The only thing I was talking of that CROSS APPLY would work as INNER JOIN and hence the NULL xml row would be discarded automatically. So I assumed that CROSS APPLY is eliminating the NULL row? I hope this time I am able to convey it properly. Otherwise, I see myself in big trouble :sick:

    Well yes CROSS APPLY can be thought of as a correlated INNER JOIN, but the NULL is the correlated parameter of that join, not a join predicate. It sounds like you were expecting it to behave like joining on NULL equals <something>, which is not the case. The NULL is a parameter for one iteration of the 'correlated join'. I hope that makes sense.

    It does - in this case - behave in a way which might lead folks to believe that the CROSS APPLY was responsible for eliminating the UNIONed row:

    DECLARE @STR VARCHAR(8000)

    DECLARE @ItemID VARCHAR(8000)

    DECLARE @Item VARCHAR(8000)

    SET @STR = 'a;b;c;d;e;f;g;h;i;j;k;l;m;n;o;p;q;r;s;t;u;v;w;x;y;z;'

    SET @STR = REPLICATE(@str,32)

    SELECT

    ItemID = ROW_NUMBER() OVER(ORDER BY (SELECT NULL)),

    Item = x.i.value('text()[1]', 'varchar(8000)')

    FROM (

    SELECT Stringy = CAST('<r>' + REPLACE(@str,';','</r><r>')+'</r>' AS XML)

    UNION ALL SELECT NULL

    ) cte (xmlstring)

    CROSS APPLY cte.xmlstring.nodes('r') x(i)

    -- 833 rows

    SELECT

    ItemID = ROW_NUMBER() OVER(ORDER BY (SELECT NULL)),

    Item = x.i.value('text()[1]', 'varchar(8000)')

    FROM (

    SELECT Stringy = CAST('<r>' + REPLACE(@str,';','</r><r>')+'</r>' AS XML)

    --UNION ALL SELECT NULL

    ) cte (xmlstring)

    CROSS APPLY cte.xmlstring.nodes('r') x(i)

    -- 833 rows

    SELECT

    ItemID = ROW_NUMBER() OVER(ORDER BY (SELECT NULL)),

    Item = x.i.value('text()[1]', 'varchar(8000)')

    FROM (

    SELECT Stringy = CAST('<r>' + REPLACE(@str,';','</r><r>')+'</r>' AS XML)

    UNION ALL SELECT NULL

    ) cte (xmlstring)

    OUTER APPLY cte.xmlstring.nodes('r') x(i)

    -- 834 rows

    SELECT

    ItemID = ROW_NUMBER() OVER(ORDER BY (SELECT NULL)),

    Item = x.i.value('text()[1]', 'varchar(8000)')

    FROM (

    SELECT Stringy = CAST('<r>' + REPLACE(@str,';','</r><r>')+'</r>' AS XML)

    --UNION ALL SELECT NULL

    ) cte (xmlstring)

    OUTER APPLY cte.xmlstring.nodes('r') x(i)

    -- 833 rows

    It looks to me that APPLY evaluates the row where Stringy = NULL and returns NULL, which is subsequently retained by OUTER APPLY and discarded by CROSS APPLY. For most of us this test would justify the statement, but there's clearly something I'm missing - time to get the books out.

    I think there's a reasonable APPLY article on SSC that might explain it better.

    Never heard of it 😉

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • ChrisM@Work (8/24/2012)


    SQL Kiwi (8/24/2012)


    Usman Butt (8/24/2012)


    I guess was not able to convey it properly :unsure: The only thing I was talking of that CROSS APPLY would work as INNER JOIN and hence the NULL xml row would be discarded automatically. So I assumed that CROSS APPLY is eliminating the NULL row? I hope this time I am able to convey it properly. Otherwise, I see myself in big trouble :sick:

    Well yes CROSS APPLY can be thought of as a correlated INNER JOIN, but the NULL is the correlated parameter of that join, not a join predicate. It sounds like you were expecting it to behave like joining on NULL equals <something>, which is not the case. The NULL is a parameter for one iteration of the 'correlated join'. I hope that makes sense.

    It does - in this case - behave in a way which might lead folks to believe that the CROSS APPLY was responsible for eliminating the UNIONed row:

    Ahhh....It is perfectly normal behavior to me. I have used OUTER APPLY on various occasions while shredding the xml input due to the business needs. Take it like an iTVF function and the choice is yours whether you choose the CROSS APPLY OR OUTER APPLY depending upon you need all the rows returned OR only those rows with some corresponding value/s.

  • ChrisM@Work (8/24/2012)


    It looks to me that APPLY evaluates the row where Stringy = NULL and returns NULL, which is subsequently retained by OUTER APPLY and discarded by CROSS APPLY. For most of us this test would justify the statement, but there's clearly something I'm missing - time to get the books out.

    Remember APPLY takes each value in turn from the driving side and uses it as a parameter for the function on the right hand side. In this case, the driving side is the delimited string (and possibly a NULL), and the 'function' is the nodes() method call to the right of the APPLY operator. Whether the APPLY is CROSS or OUTER just determines what happens when a particular execution of the function returns no rows.

    The first iteration of the APPLY uses the delimited string as the driving input. Applying nodes() to that produces 833 rows. This is the same for all tests.

    The second iteration occurs only when the extra 'stringy' NULL is added. In that case, the nodes() function is executed on NULL and produces no rows. If the APPLY is CROSS, no extra row is added to the output. If the APPLY is OUTER, the no-row result of the function call results in a NULL-extended row being added to the overall APPLY result.

  • SQL Kiwi (8/24/2012)


    ChrisM@Work (8/24/2012)


    It looks to me that APPLY evaluates the row where Stringy = NULL and returns NULL, which is subsequently retained by OUTER APPLY and discarded by CROSS APPLY. For most of us this test would justify the statement, but there's clearly something I'm missing - time to get the books out.

    Remember APPLY takes each value in turn from the driving side and uses it as a parameter for the function on the right hand side. In this case, the driving side is the delimited string (and possibly a NULL), and the 'function' is the nodes() method call to the right of the APPLY operator. Whether the APPLY is CROSS or OUTER just determines what happens when a particular execution of the function returns no rows.

    The first iteration of the APPLY uses the delimited string as the driving input. Applying nodes() to that produces 833 rows. This is the same for all tests.

    The second iteration occurs only when the extra 'stringy' NULL is added. In that case, the nodes() function is executed on NULL and produces no rows. If the APPLY is CROSS, no extra row is added to the output. If the APPLY is OUTER, the no-row result of the function call results in a NULL-extended row being added to the overall APPLY result.

    That will do for me, for now. I've learned a useful terminology, the "driving side". Thank you, Paul.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • ChrisM@Work (8/23/2012)


    But what if you want to force the datatype conversion without the limitations of a udf, and gain ItemID to boot?

    This does just that:

    SELECT

    ItemID = ROW_NUMBER() OVER(ORDER BY (SELECT NULL)),

    Item = x.i.value('text()[1]', 'varchar(8000)')

    FROM (

    SELECT Stringy = CAST('<r>' + REPLACE(@str,';','</r><r>')+'</r>' AS XML)

    UNION ALL SELECT NULL -- eliminated by CROSS APPLY

    ) cte (xmlstring)

    CROSS APPLY cte.xmlstring.nodes('r') x(i)

    Try this too:

    DECLARE @STR VARCHAR(8000);

    SET @STR = 'a;b;c;d;e;f;g;h;i;j;k;l;m;n;o;p;q;r;s;t;u;v;w;x;y;z;';

    SET @STR = REPLICATE(@str,32);

    WITH CTE (xmlstring) AS (

    SELECT CAST('<r>' + REPLACE(@str,';','</r><r>')+'</r>' + LEFT(@@DBTS,0) AS XML))

    SELECT

    ItemID = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),

    Item = x.i.value('text()[1]', 'varchar(8000)')

    FROM CTE CROSS APPLY cte.xmlstring.nodes('r') x(i);

  • Ok... I tried those two. Why does the LEFT(@@Dbts,0) speed things up so much?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • ChrisM@Work (8/24/2012)


    SQL Kiwi (8/24/2012)


    ChrisM@Work (8/24/2012)


    It looks to me that APPLY evaluates the row where Stringy = NULL and returns NULL, which is subsequently retained by OUTER APPLY and discarded by CROSS APPLY. For most of us this test would justify the statement, but there's clearly something I'm missing - time to get the books out.

    Remember APPLY takes each value in turn from the driving side and uses it as a parameter for the function on the right hand side. In this case, the driving side is the delimited string (and possibly a NULL), and the 'function' is the nodes() method call to the right of the APPLY operator. Whether the APPLY is CROSS or OUTER just determines what happens when a particular execution of the function returns no rows.

    The first iteration of the APPLY uses the delimited string as the driving input. Applying nodes() to that produces 833 rows. This is the same for all tests.

    The second iteration occurs only when the extra 'stringy' NULL is added. In that case, the nodes() function is executed on NULL and produces no rows. If the APPLY is CROSS, no extra row is added to the output. If the APPLY is OUTER, the no-row result of the function call results in a NULL-extended row being added to the overall APPLY result.

    That will do for me, for now. I've learned a useful terminology, the "driving side". Thank you, Paul.

    I wonder if it's the other way around in the imperial countries? :-D:-D:-D

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Jeff Moden (8/26/2012)


    Ok... I tried those two. Why does the LEFT(@@Dbts,0) speed things up so much?

    I did not even knew this existed, but if it is faster, it might be a way to force a recompilation with current values. @@Dbts i would expect is a non deterministic value.

    There are problably more needless appends you can do that can give similar effect (in testing). I woud take extra care in testing this to make sure there is no cheating happening ;). Constant folding might be at work here, so make sure the data to be processed is read form a table that is not created in the same batch!

  • peter-757102 (8/28/2012)


    Jeff Moden (8/26/2012)


    Ok... I tried those two. Why does the LEFT(@@Dbts,0) speed things up so much?

    I did not even knew this existed, but if it is faster, it might be a way to force a recompilation with current values. @@Dbts i would expect is a non deterministic value.

    There are problably more needless appends you can do that can give similar effect (in testing). I woud take extra care in testing this to make sure there is no cheating happening ;). Constant folding might be at work here, so make sure the data to be processed is read form a table that is not created in the same batch!

    It seems to be the case only with those non-deterministic functions which would give (almost) always a unique value and cannot be sniffed by the optimizer. For e.g. GETDATE(), NEWID() etc.

    Constant folding might be at work here, so make sure the data to be processed is read form a table that is not created in the same batch!

    Yes, absolutely correct. I did test almost the same inline code earlier (before this new code) but was too slow when physical table values are to be processed.

    Usman Butt (8/23/2012)

    I did some testing and shared it a while ago on this same thread but with some different code :hehe:

    I also tested this new inline code then but with the check needed not to handle special characters like "<" ">" etc. by

    using XML PATH. Otherwise, it may result in error OR false results.

    At that time it seemed slower in my testing so I stuck to Oleg's XML solution.

    But MTVF code speeded up things quite significantly which I shared earlier. For "not created in the same batch" part here is an example of the same batch table values split in which the appending no-length string have no effect 🙂

    DECLARE @STR VARCHAR(8000);

    DECLARE @table TABLE

    (

    DelimitedStr VARCHAR(8000)

    )

    SET @STR = 'a;b;c;d;e;f;g;h;i;j;k;l;m;n;o;p;q;r;s;t;u;v;w;x;y;z;';

    SET @STR = REPLICATE(@str,32);

    INSERT @table

    (

    [DelimitedStr]

    )

    VALUES (

    @STR

    )

    ;WITH CTE ([DelimitedStr], xmlstring) AS (

    SELECT [DelimitedStr],xmlstring FROM @table AS T

    CROSS APPLY(

    SELECT CAST('<r>' + REPLACE([DelimitedStr],';','</r><r>')+'</r>' + LEFT(@@DBTS,0) AS XML)) x (xmlstring))

    SELECT

    ItemID = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),

    Item = x.i.value('text()[1]', 'varchar(8000)')

    FROM CTE CROSS APPLY cte.xmlstring.nodes('r') x(i);

  • Usman Butt (8/28/2012)


    But MTVF code speeded up things quite significantly which I shared earlier. For "not created in the same batch" part here is an example of the same batch table values split in which the appending no-length string have no effect 🙂

    I have some good news for you:

    DECLARE @STR VARCHAR(8000);

    DECLARE @table TABLE

    (

    DelimitedStr VARCHAR(8000)

    )

    SET @STR = 'a;b;c;d;e;f;g;h;i;j;k;l;m;n;o;p;q;r;s;t;u;v;w;x;y;z;';

    SET @STR = REPLICATE(@str,32);

    INSERT @table

    (

    [DelimitedStr]

    )

    VALUES (

    @STR

    )

    -- Fast

    ;WITH CTE ([DelimitedStr], xmlstring) AS (

    SELECT [DelimitedStr],xmlstring FROM @table AS T

    CROSS APPLY(

    SELECT CAST('<r>' + REPLACE([DelimitedStr],';','</r><r>')+'</r>' + LEFT(NEWID(),0) AS XML)) x (xmlstring))

    SELECT

    ItemID = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),

    Item = x.i.value('text()[1]', 'varchar(8000)')

    FROM CTE CROSS APPLY cte.xmlstring.nodes('r') x(i);

  • Jeff Moden (8/26/2012)


    Ok... I tried those two. Why does the LEFT(@@Dbts,0) speed things up so much?

    It's intriguing, enough to be the subject of my next blog post after the one I am writing now. The others on this thread came commendably close to explaining why (it does indeed have quite a bit to do with non-determinism, though not constant-folding). Y'all might work out the details in the meantime, but I'll post a link to my post anyway when it goes live. There's just not space here for a decent explanation.

Viewing 15 posts - 376 through 390 (of 981 total)

You must be logged in to reply to this topic. Login to reply