There Must Be 15 Ways To Lose Your Cursors... part 1, Introduction

  • I have to disagree.

    Apparently our feelings of disagreement are mutual. Is it not the case that in a 4GL you tell the system what you want, not how to get it? How is cross joining to sys.columns telling the system what I want? Perhaps you are getting thrown by the use of the word "simple." It would be more accurate to say that the cross join solution is not intuitive to most developers, which is why most would consider it a "trick" or a "hack". It is not just a matter of knowing how RANK and ROW_NUMBER work. I know how spoons work, but it is not obvious that you can use a spoon to pop a beer cap.

  • Christian Buettner (4/15/2009)


    Hi Bruce,

    thanks for your suggestion. Unfortunately it seems to be more complex than necessary.

    [font="Verdana"]Christian,

    It's certainly more complex than the approach you are taking! The difference is in the level of performance for large volumes of data (we're loading data into a fairly large data warehouse). When you need the performance, then adding the additional complexity makes sense. When you don't... why bother?

    Your way works and is simple. If you don't have a reason to change it, then why change? But for your scenario to work, you specifically require the row-by-row behaviour of something like a cursor.

    To me that seems to be an example of using a cursor because of its inherent nature, and a reasonable choice.

    Having said all of that, I'm curious as to how MERGE (mentioned earlier by Joe Celko) could be used in this scenario. I believe that as it is currently implemented in SQL Server 2008, it wouldn't help (as the MERGE only has the one destination, and doesn't have a "for anything that causes an error" clause). I'm not familiar with DB2 so I can't comment on that.

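    For reference, here is a minimal sketch of the SQL Server 2008 MERGE shape (the table and column names are hypothetical) -- note the single target, which is the limitation mentioned above:

    MERGE dbo.DestTable AS d            -- the one and only destination
    USING dbo.StagingTable AS s
        ON d.ID = s.ID
    WHEN MATCHED THEN
        UPDATE SET d.Col1 = s.Col1
    WHEN NOT MATCHED THEN
        INSERT (ID, Col1) VALUES (s.ID, s.Col1);
    -- There is no "WHEN ERROR"-style clause to divert failing rows to a second table.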

  • Andy DBA (4/15/2009)


    Has anyone benchmarked aggregated dynamic SQL against tuned RBAR, or has everyone just used up all of their spare time reading this long-winded post? 😉

    [font="Verdana"]I'm not sure it would make much of a difference. You are still calling the email stored procedure sequentially, whether it is inside a loop or as part of a large series of statements inside some dynamic SQL. Sure, in a non-cursor (or non-loop) manner, you may have eliminated a small portion of the overhead, but it will only be small.

    I think the cursor versus set for this particular approach is an empty argument.

    Moving to something like notification services where you can send out thousands of emails all at once... now that sounds like a performance gain to me.


    I apologize if this is somewhat off topic, and there are probably tons of articles on it already, but here is a word to the wise on performance testing. I noticed GSquared and others using GetDate() and DateDiff to capture execution times. I call this "wall-clock benchmarking". If you are the only user on your server and/or are taking averages of multiple tests, your comparisons may be pretty good, but any other processes running on your server (including the OS!) can throw your results way off :w00t:.

    Someone with good expertise on the guts of SQL Server please feel free to jump in here, but I highly recommend querying master.dbo.sysprocesses with the @@SPID system function to get before-and-after resource usage values and then taking the average of multiple iterations (see the code for one iteration below). Also, don't forget about execution plan caching. Depending on what you're testing, you may want to throw out your first set of results.

    Here's the SQL I suggest using to capture CPU usage and I/O. I think BOL explains exactly what these values mean, but for A/B comparisons on the same machine, the raw values are usually good enough.

    declare @start_cpu int
    declare @end_cpu int
    declare @start_io int
    declare @end_io int

    select @start_cpu = sum(cpu), @start_io = sum(physical_io)
    from master.dbo.sysprocesses where spid = @@spid

    /* Insert SQL to be performance tested here */

    select @end_cpu = sum(cpu), @end_io = sum(physical_io)
    from master.dbo.sysprocesses where spid = @@spid

    select @end_cpu - @start_cpu as cpu_used, @end_io - @start_io as io_used

    -- Note: aggregation is probably not necessary, but if you're looking at a
    -- different spid, sysprocesses can sometimes return multiple rows.
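    On the plan-caching point above: rather than throwing out the first set of results, you can clear the caches between iterations. A minimal sketch (run this only on a private test box, never on a shared or production server):

    CHECKPOINT              -- write dirty pages to disk so the next command can drop them
    DBCC DROPCLEANBUFFERS   -- empty the buffer pool, giving cold-cache I/O numbers
    DBCC FREEPROCCACHE      -- flush cached plans so each test recompiles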

  • Bruce W Cassidy (4/15/2009)


    [font="Verdana"]I'm not sure it would make much of a difference. You are still calling the email stored procedure sequentially, whether it is inside a loop or as part of a large series of statements inside some dynamic SQL. Sure, in a non-cursor (or non-loop) manner, you may have eliminated a small portion of the overhead, but it will only be small.

    I think the cursor versus set for this particular approach is an empty argument.

    Moving to something like notification services where you can send out thousands of emails all at once... now that sounds like a performance gain to me.


    I agree 100%. In the case of email, who cares? Barry's article is about optimizing T-SQL code by eliminating cursors, not about how to best send e-mail from SQL Server (or if that's even where it should be sent from). My original post was about calling any stored procedure multiple times. sp_sendmail is a poor example of why you'd want to do this because its execution time dwarfs the overhead of how it's being called.

  • Andy DBA (4/15/2009)


    Bruce W Cassidy (4/15/2009)


    [font="Verdana"]I'm not sure it would make much of a difference. You are still calling the email stored procedure sequentially, whether it is inside a loop or as part of a large series of statements inside some dynamic SQL. Sure, in a non-cursor (or non-loop) manner, you may have eliminated a small portion of the overhead, but it will only be small.

    I think the cursor versus set for this particular approach is an empty argument.

    Moving to something like notification services where you can send out thousands of emails all at once... now that sounds like a performance gain to me.


    I agree 100%. In the case of email, who cares? Barry's article is about optimizing T-SQL code by eliminating cursors, not about how to best send e-mail from SQL Server (or if that's even where it should be sent from). My original post was about calling any stored procedure multiple times. sp_sendmail is a poor example of why you'd want to do this because its execution time dwarfs the overhead of how it's being called.

    Touché!

    Just thinking about the overhead of sp_sendmail makes me want to move the process somewhere else.

    Can we please slow down here? My mailbox is getting full... But it's been some ride, I must say! Quite interesting indeed.

  • gautamsheth2000 (4/13/2009)


    Should use this code 🙂

    Select count(*) From master.sys.columns C1 CROSS JOIN master.sys.columns C2

    Try this instead; it computes the same number by squaring a single-table count:

    select power(count(*),2) from master.sys.columns

  • Thomas (4/14/2009)


    Another common use of cursors is to script a DDL change. For example, I want to write a script that will alter all columns named "Foo" in all tables to be nvarchar(10) instead of nvarchar(5). That is simply not possible using a set-based solution.

    Not sure what you are referring to here with respect to Cursors' superiority for this task. Could you post an example?

    Thanks,

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Thomas (4/14/2009)


    A simple example of where set-based operations break down is in very large data sets. I want to take a table of 10 million rows and copy it to another table in a production system. The obvious solution would seem to be something like:

    Insert DestTable
    Select ...
    From TableA

    Yet, at the number of rows we are discussing, this could take hours to execute even on the best hardware, especially if the source table and the destination table are separated by a slower link.

    This is mentioned in my article. The reason that the set-based solution is a problem is because it attempts to perform the task as quickly as possible. Because even that is still quite a long time, it effectively (or sometimes literally) locks out access to the table for a very long time. The advantage of Cursors and WHILE-loop approaches here is actually that they are so slow at it: they leave plenty of resource and access gaps for other processes to get past them. And it is easy to use "chunks" and other techniques to make them even slower. In this regard, as I said in my article, a Cursor is fine IF you actually want to make something slow.

    That said, it should be noted that there are also other ways to accomplish the same thing that do not use cursors.
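    For instance, here is a minimal sketch of a keyset-batched copy using TOP and a loop (the table, columns, and batch size are hypothetical, and it assumes an indexed integer key, ID):

    Declare @lastID int
    Set @lastID = 0

    While 1 = 1
    Begin
        Insert Into DestTable (ID, Col1)
        Select TOP (50000) ID, Col1
        From TableA
        Where ID > @lastID
        Order By ID

        If @@ROWCOUNT = 0 Break

        Select @lastID = MAX(ID) From DestTable

        Waitfor Delay '00:00:02'  -- deliberately yield so other processes can get at the table
    End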

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Bruce W Cassidy (4/14/2009)


    Thomas (4/14/2009)


    Set-based operations do not natively provide for chunking and thus we have cursors. Of course there are other ways around this problem, but at the end of day the only reason we are forced to consider alternate solutions is because the Hoyle set-based solution is insufficient. Even using SSIS (which is frankly a heinous beast) is simply having another procedural language loop through chunks of set-based statements.

    [font="Verdana"]Agreed (both on SSIS and on the necessity for "chunking", although I call this batching (same concept). I wouldn't use a cursor for that though -- I would use a combination of TOP and a loop.[/font]

    And I would use a "Batch" Job for that: a SQL Agent recurring Job that also uses TOP. It has a lot of advantages around built-in monitoring and control. That is, the Job History log, job step output files, and the ability to change both its chunk size and its recurrence non-disruptively, plus I do not have to keep a live client session open for it. It will survive crashes, power losses, and helpful janitors shutting down your workstation to save energy over the weekend. It really sucks to start a 48-hour chunking/batching process on Friday evening and find out on Monday morning that some minor glitch aborted it two hours in. A sketch of such a job step follows.
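    A minimal sketch of the job-step body (hypothetical names again): it moves at most one chunk per execution, and the Job's recurring schedule supplies the "loop", so a crash costs you only the current chunk:

    Insert Into DestTable (ID, Col1)
    Select TOP (50000) s.ID, s.Col1
    From TableA s
    Where s.ID > (Select COALESCE(MAX(ID), 0) From DestTable)
    Order By s.ID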

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Thomas (4/14/2009)


    Agreed. Here's another example. You want to build a numbers table that simply contains a series of sequential integers. There are many uses for such a table in solving problems such as finding gaps in a series of numbers. How do you populate it without using a cursor/loop? I'm sure we could piss away hours finding a solution; however, in minutes one could write the following and be done:

    Declare @i int
    Set @i = 0
    While @i < 10000
    Begin
        Insert Numbers(Value) Values(@i)
        Set @i = @i + 1
    End

    I'm all ears if there is a set-based solution (especially a SQL Standard solution) to populating said table. Assume you are starting with an empty database.

    Easy, and I don't bother with anything less than a million rows:

    Select TOP 1000000
        ROW_NUMBER() over(order by c1.object_id) as N
    into Numbers
    From master.sys.columns c1, master.sys.columns c2

    Let's see... that took me about 25 seconds to type and less than 1 second to run. Plus it's simpler, because the SELECT ... INTO creates the table for me, which the loop solution left out. The cursor method took just as long to type (longer with the CREATE TABLE) and took 14 seconds to run.

    In fact, the set-based version is so fast that most times I don't even bother with the Numbers table; I just roll it into the CTEs:

    ;WITH cteNumbers as (
        Select TOP 1000000
            ROW_NUMBER() over(order by c1.object_id) as N
        From master.sys.columns c1, master.sys.columns c2
    ), ...

    That's another thing that you cannot easily do with the Cursor version.
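    For example, here is a sketch of the gap-finding use mentioned above (dbo.Orders and its int key OrderID are hypothetical):

    ;WITH cteNumbers as (
        Select TOP 1000000
            ROW_NUMBER() over(order by c1.object_id) as N
        From master.sys.columns c1, master.sys.columns c2
    )
    Select n.N as MissingOrderID
    From cteNumbers n
    Left Join dbo.Orders o on o.OrderID = n.N
    Where o.OrderID is null
      and n.N <= (Select MAX(OrderID) From dbo.Orders)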

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Hi there,

    As an experienced DBA, I can identify with every single word in this article.

    Of course you can write your set-based application in such a way that a cursor will perform much better. :hehe:

    But I can tell you - there is no way a cursor can beat well-written set-based code in performance. You know what...? I will never say never... There are VERY few exceptions.

    And to be more productive - you can post an example and we will post recommendations or new code (if it's not too long)... Don't forget to mention the SQL version.

    Cheers!

    Set-based-fan.

  • Thomas (4/15/2009)


    A couple of very simple solutions.

    First one is SQL 2005/2008 only, won't work in 2000:

    create table dbo.Numbers (Number int primary key);
    go
    insert into dbo.Numbers (Number)
    select top 10000 row_number() over (order by t1.object_id)
    from master.sys.columns t1
    cross join master.sys.columns t2;

    That does not qualify as "simple" IMO and is a bit of a hack, but I get the idea.

    It's every bit as simple as the WHILE-loop method, especially when you count the CREATE TABLE that the loop version left out, and it is Standard SQL as you requested. Since the WHILE loop is NOT ANSI Standard, technically, it is the hack.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • GermanDBA (4/15/2009)

    I know that if the series has the same quality that RBarryYoung produces here on a daily basis, it will be a wonderful primer for quite a few database developers that I know.

    Thanks, German (?), it's great to hear compliments like that. 🙂

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Thomas (4/15/2009)


    I stand corrected. Cross joining on sys.columns does appear to be a bit faster. Perhaps this is one loop-based solution that now has a reasonable set-based replacement.

    Actually, at a million rows, the optimized cross-join solution that I posted is always at least 10x faster than the loop on every system that I have tested.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]
