Table variables vs. temp tables, and should I use subqueries to minimize # of updates in these?

  • Hi, I would like some advice on the overall approach for a long stored procedure (approx. 700 lines) that I am trying to optimize in a SQL Server 2000 database. I am not including table definitions, index definitions, execution plans, etc., because I am looking for general guidance in my overall approach, rather than specific optimization.

    I have separate batches of raw data being inserted into the database from an external source, with each batch having 50-150 records in it. The stored procedure then processes a single batch (of 50-150 records) when called, and the stored procedure goes through approximately 110 different steps. Those steps mostly can't be done in parallel, because for instance I am updating one column and then in a subsequent step I need to use the updated value of the column.

    Because the processing was so hard to get my head around, I initially wrote it as a series of 110 different update statements against the table of raw data (which has about 50 columns). Each update statement includes SQL like "WHERE BatchID = @BatchID" so that it only touches the current batch. This approach helped me break the problem down conceptually, but it still took weeks to write. As a result, the table contains not just the raw data but also the various columns used for interim calculations. There are also some places where I need to do an interim aggregation calculation, so I also created some temp tables that are used for a few steps.

    As you might expect from this approach, I am starting to see performance problems in my production environment. Sometimes it takes 15 seconds to run; 5 seconds would be much better. So I am rewriting the stored procedure now so that it doesn't do so many consecutive updates against the raw table. I am unsure whether to make extensive use of subqueries to minimize the number of updates, and I am also unsure whether to use temp tables or table variables (though those two will be relatively easy to compare). Can anyone give me a ballpark estimate as to which of the following five approaches is likely to be best/worst?

    Approach A: Many consecutive updates within raw data table

    Approach B: Many consecutive updates into one temp table, then update raw data table at end

    Approach C: Many consecutive updates into one table variable, then update raw data table at end

    Approach D: Fewer updates (because of blocks of subqueries), into temp tables, then update raw data table at end

    Approach E: Fewer updates (because of blocks of subqueries), into table variables, then update raw data table at end

    Obviously I am currently doing Approach A. Here is pseudo-code to clarify what I mean by these five approaches:

    CREATE PROCEDURE spApproachA (@BatchID int) AS

    UPDATE D SET Col1 = [whatever], Col2 = [whatever] FROM tblData D WHERE D.BatchID = @BatchID AND Col1 = [whatever]

    UPDATE D SET Col2 = [whatever], Col3 = [whatever] FROM tblData D WHERE D.BatchID = @BatchID AND Col1 = [whatever]

    UPDATE D SET Col2 = [whatever], Col4 = [whatever] FROM tblData D WHERE D.BatchID = @BatchID AND Col2 = [whatever]

    -- ...and then another 100 of these update statements

    GO

    CREATE PROCEDURE spApproachB (@BatchID int) AS

    CREATE TABLE #tmp (DataID int, Col1 int, Col2 int, Col3 int, Col4 int)

    INSERT #tmp (DataID, Col1, Col2, Col3, Col4)
    SELECT D.DataID, D.Col1, D.Col2, D.Col3, D.Col4 FROM tblData D WHERE D.BatchID = @BatchID

    UPDATE #tmp SET Col1 = [whatever], Col2 = [whatever] WHERE Col1 = [whatever]

    UPDATE #tmp SET Col2 = [whatever], Col3 = [whatever] WHERE Col1 = [whatever]

    UPDATE #tmp SET Col2 = [whatever], Col4 = [whatever] WHERE Col2 = [whatever]

    -- ...and then another 100 of these update statements

    UPDATE D SET D.Col1 = T.Col1, D.Col2 = T.Col2, D.Col3 = T.Col3, D.Col4 = T.Col4 FROM tblData D JOIN #tmp T ON D.DataID = T.DataID

    GO

    CREATE PROCEDURE spApproachC (@BatchID int) AS

    DECLARE @tvar TABLE (DataID int, Col1 int, Col2 int, Col3 int, Col4 int)

    INSERT @tvar (DataID, Col1, Col2, Col3, Col4)
    SELECT D.DataID, D.Col1, D.Col2, D.Col3, D.Col4 FROM tblData D WHERE D.BatchID = @BatchID

    UPDATE @tvar SET Col1 = [whatever], Col2 = [whatever] WHERE Col1 = [whatever]

    UPDATE @tvar SET Col2 = [whatever], Col3 = [whatever] WHERE Col1 = [whatever]

    UPDATE @tvar SET Col2 = [whatever], Col4 = [whatever] WHERE Col2 = [whatever]

    -- ...and then another 100 of these update statements

    UPDATE D SET D.Col1 = T.Col1, D.Col2 = T.Col2, D.Col3 = T.Col3, D.Col4 = T.Col4 FROM tblData D JOIN @tvar T ON D.DataID = T.DataID

    GO

    CREATE PROCEDURE spApproachD (@BatchID int) AS

    CREATE TABLE #tmp1 (DataID int, Col1 int, Col2 int, Col3 int, Col4 int)

    CREATE TABLE #tmp2 (DataID int, Col1 int, Col2 int, Col3 int, Col4 int)

    INSERT #tmp1 (DataID, Col1, Col2, Col3, Col4)
    SELECT SubQ3.DataID, SubQ3.Col1, SubQ3.Col2, SubQ3.Col3, SubQ3.Col4
    FROM ( SELECT [whatever] FROM
           ( SELECT [whatever] FROM
             ( SELECT [whatever] FROM tblData D WHERE D.BatchID = @BatchID ) SubQ1 ) SubQ2 ) SubQ3

    INSERT #tmp2 (DataID, Col1, Col2, Col3, Col4)
    SELECT SubQ3.DataID, SubQ3.Col1, SubQ3.Col2, SubQ3.Col3, SubQ3.Col4
    FROM ( SELECT [whatever] FROM
           ( SELECT [whatever] FROM
             ( SELECT [whatever] FROM #tmp1 ) SubQ1 ) SubQ2 ) SubQ3

    -- ...and then another 20 or so of these blocks of subquery/inserts

    UPDATE D SET D.Col1 = T.Col1, D.Col2 = T.Col2, D.Col3 = T.Col3, D.Col4 = T.Col4 FROM tblData D JOIN #tmp2 T ON D.DataID = T.DataID

    GO

    CREATE PROCEDURE spApproachE (@BatchID int) AS

    DECLARE @tvar1 TABLE (DataID int, Col1 int, Col2 int, Col3 int, Col4 int)

    DECLARE @tvar2 TABLE (DataID int, Col1 int, Col2 int, Col3 int, Col4 int)

    INSERT @tvar1 (DataID, Col1, Col2, Col3, Col4)
    SELECT SubQ3.DataID, SubQ3.Col1, SubQ3.Col2, SubQ3.Col3, SubQ3.Col4
    FROM ( SELECT [whatever] FROM
           ( SELECT [whatever] FROM
             ( SELECT [whatever] FROM tblData D WHERE D.BatchID = @BatchID ) SubQ1 ) SubQ2 ) SubQ3

    INSERT @tvar2 (DataID, Col1, Col2, Col3, Col4)
    SELECT SubQ3.DataID, SubQ3.Col1, SubQ3.Col2, SubQ3.Col3, SubQ3.Col4
    FROM ( SELECT [whatever] FROM
           ( SELECT [whatever] FROM
             ( SELECT [whatever] FROM @tvar1 ) SubQ1 ) SubQ2 ) SubQ3

    -- ...and then another 20 or so of these blocks of subquery/inserts

    UPDATE D SET D.Col1 = T.Col1, D.Col2 = T.Col2, D.Col3 = T.Col3, D.Col4 = T.Col4 FROM tblData D JOIN @tvar2 T ON D.DataID = T.DataID

    GO

  • Well, for what it's worth, D would probably be the most likely to perform well, but I don't think that's worth very much, because at this level of abstraction it's still really at the "it depends..." stage.

    More to the point, however, all of these choices look awful, and indeed this whole approach seems terribly suspect. I have been doing this for a while, and yet I have never had to produce something like this on the database, primarily because whenever it looked like things were going to end up this way, either I or someone else would insist that we take a step back and think about it long and hard, because down that path lies madness. In other words, I would seriously recommend that you widen the scope of your options and consider approaches that are substantially different.

    The reason I say this is that I have always felt that anytime you find you have to do this much processing, re-processing, calculation, computation and state management on the same small set of data, something is wrong, because as much as I love SQL, chewing over a single piece of data interminably like a dog worrying a bone is not its strong suit. That's what general-purpose programming languages are good at. So maybe you need to reorganize your whole approach in SQL so that you can reduce it to more straightforward data transformations, or maybe you should just write a VB or C# program to chew on this data.

    Just my experience.

    -- RBarryYoung, (302)375-0451 blog: MovingSQL.com, Twitter: @RBarryYoung
    Proactive Performance Solutions, Inc.
    "Performance is our middle name."

  • OPTION F... Single "Quirky Update"

    Problem: Post is too abstract to help with.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • I assumed (possibly incorrectly) that people would be more likely to answer a general question rather than puzzling through the details of my specific code. Is it really true that my post is so vague that there is not one of the five approaches that you would recommend, nor a sixth approach such as "don't do it in SQL" as RBarryYoung suggested? Apparently so, and because I do value expert feedback, I will try to provide more concrete details.

    This information is for importing employee punch in/out data from a handscan machine. Every morning the batches are imported from each machine, so there are about ten batches that come in at the same time, each of which has 50-150 records. They provide the raw data that gets populated into tblHandscanData. This data is "processed" (i.e., the stored procedure is called) initially, and also whenever someone comes in later and modifies the raw data. Only about 6-8 columns are real data; the rest is interim calculation data. I went with a real table, and with all these update statements, because I wasn't sure an index on a temp table would be worth the performance hit to create it. Obviously I don't know too much about performance details, which is why I am posting a question here.

    So this stored procedure is being called from a .NET web application maybe 30-60 times a day, and sometimes it takes so long that it times out in the middle of the updates, leaving me with a partially calculated table. Setting aside performance considerations, that is certainly a strong argument for at least doing it in a transaction, or doing a single update at the end. I appreciate RBarryYoung's point that he would have done this at the application level instead of just in SQL, but that's not really an option for me right now; I need to speed up the stored procedure SQL, and unfortunately I don't have much time to decide on the best way to rewrite it. I started rewriting it using Approach E and then realized that if Approach B was just as good as the others, I should do that instead, because it would be so much quicker to implement. So I thought I would ask for help.
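
    (By "doing it in a transaction" I mean something like the following sketch, with [whatever] standing in for the real column logic; since this is SQL Server 2000 there is no TRY/CATCH, so each step would check @@ERROR and roll back on failure:)

    BEGIN TRAN

    UPDATE tblHandscanData SET Col1 = [whatever] WHERE BatchID = @BatchID  -- step 1
    IF @@ERROR <> 0 BEGIN ROLLBACK TRAN RETURN END

    -- ...remaining steps, each with the same error check...

    COMMIT TRAN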

    I have attached the creation script for the underlying table being updated so heavily, as well as the stored procedure in question. There are lots of other tables being joined to at various points, which I guess was another reason to do this all in SQL.

  • Forums are typically used for simple, quick questions. That is NOT what is being requested here. People who are donating their time to help others could easily spend (many) hours trying to address your issue, which would mean they weren't helping many others with their forum efforts in the meantime. I recommend you get a professional to help you out on this one.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • I am trying to be very conscious of the fact that people who are donating their time to the forums would prefer to answer simple questions rather than lengthy ones. I appreciate their time very much, and want to hear their answers. I was hoping a few people could take five minutes to read about my problem, and perhaps they would have a ready response that was easily communicated. So far this has been the case for one responder (RBarryYoung), and not for two responders (Jeff Moden and TheSQLGuru). I really believe you can help me; I just don't know whether to ask a simple abstract question or a detailed concrete question. I will try once again, with what feel to me like two simple, quick questions that are respectful of responders' time. Thanks in advance to anyone who can provide feedback!

    (1) I have a SQL Server 2000 dataset that is about 50 columns wide and about 100 records. I need to write a stored procedure that does a lot of processing of this data. I have considered leaving the data in its original, heavily-indexed table and running lots of updates on the data, or temporarily copying it into table variables or indexed temp tables for the duration of the processing. From my looking around the web, it seems that table variables are not good for this because they don't have statistics, and that I should use temp tables with indexes. Do people agree with this conclusion? Should I expect indexes on the temp tables to help performance when joining out to other tables?

    (2) If I am running 100 consecutive UPDATE statements, within a SQL Server 2000 stored procedure, against a temp table (or table variable) of 50 columns and 100 records, should I expect a performance gain from combining these into (roughly) 20 blocks of 5 nested subqueries, so that I am only updating 20 times instead of 100? I thought that if it was all happening in memory, maybe it wouldn't really matter. As an example, here is what I mean, where one approach is two consecutive UPDATE statements and the other is just one UPDATE statement, based on results from a subquery.

    UPDATE #tmp SET B=A WHERE B < A

    UPDATE #tmp SET C=B WHERE C < B

    vs

    UPDATE #tmp
    SET #tmp.B = SubQ.B,
        #tmp.C = CASE WHEN SubQ.C < SubQ.B THEN SubQ.B ELSE SubQ.C END
    FROM (SELECT RecordID,
                 CASE WHEN B < A THEN A ELSE B END AS B,
                 C
          FROM #tmp) SubQ
    JOIN #tmp ON SubQ.RecordID = #tmp.RecordID

  • It looks like a fair amount of consolidation could be done on the proc to make it more efficient. For example, there are several update statements early on that have overlapping Where clause values, which could be combined into a single update, using Case statements for the non-overlapping Where tests.
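
    For example (hypothetical columns and predicates, and only valid where the first update's result doesn't feed the second one's Where clause), two passes like these:

    UPDATE tblData SET Col1 = 1 WHERE BatchID = @BatchID AND Flag = 'A'
    UPDATE tblData SET Col2 = 2 WHERE BatchID = @BatchID AND Flag = 'A' AND Col3 > 0

    can collapse into one pass, with a Case expression guarding the non-overlapping test:

    UPDATE tblData
    SET Col1 = 1,
        Col2 = CASE WHEN Col3 > 0 THEN 2 ELSE Col2 END
    WHERE BatchID = @BatchID AND Flag = 'A'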

    It also looks to me like inserting the data from the current batch, and the relevant rows from the prior batch, into a temp table, running the updates needed on the temp table, and then doing a single-pass update on the main table from the temp table, would also result in a potentially significant performance increase. That would primarily be of value if the main table has a lot of rows of data in it, or if its indexes are suffering from any significant fragmentation.

    Can you provide an insert statement with a dozen or so rows of sample data to populate the table? If so, then I can test some theories on it and see if I can offer more concrete advice.

    Does that help?

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • jeff (8/16/2009)


    I assumed (possibly incorrectly) that people would be more likely to answer a general question rather than puzzling through the details of my specific code.

    To elaborate on my previous post, go to the Simple-Talk site and search for "Quirky Update". It's a good general answer to the general question you've asked... seriously.
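
    For the running-calculation style steps, the pattern looks roughly like this (a minimal sketch with a hypothetical table and columns; it relies on the clustered index controlling update order, which is undocumented behavior, so read the article for the full set of rules before using it):

    CREATE TABLE #punch (RowID int PRIMARY KEY CLUSTERED, Minutes int, RunningMinutes int)

    DECLARE @Running int
    SET @Running = 0

    -- assigns the variable and the column in a single pass, row by row in clustered-index order
    UPDATE #punch
    SET @Running = RunningMinutes = @Running + Minutes
    FROM #punch WITH (TABLOCKX)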

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Yes, you should expect better (often extraordinarily so) performance when using temp tables and joins instead of table variables.
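
    To illustrate the difference (hypothetical column list): in SQL Server 2000 a table variable only gets the indexes you can declare inline as constraints, and the optimizer keeps no statistics on it, so it tends to assume very low rowcounts when you join out to other tables. A temp table can be indexed after creation and does get statistics:

    -- Temp table: can be indexed separately; the optimizer maintains statistics on it
    CREATE TABLE #work (DataID int PRIMARY KEY CLUSTERED, Col1 int, Col2 int)
    CREATE NONCLUSTERED INDEX IX_work_Col1 ON #work (Col1)

    -- Table variable: only inline PRIMARY KEY/UNIQUE constraints, and no statistics
    DECLARE @work TABLE (DataID int PRIMARY KEY, Col1 int, Col2 int)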

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Jeff Moden - Oops, sorry, I thought you were being frivolous before; it was kind of hard to tell what you meant. Thanks! I will check that out.

    GSquared - Great, thanks! I am attaching a script to create the table and then insert the data for a typical batch (138 records) although I'm not sure how far you could get without scripts to create/populate the various other tables that I join to. I am also attaching my current attempt (incomplete) to combine my updates (where possible) using CASE statements and subqueries, as you suggested. The big decision here for me is whether to simply modify my existing UPDATE statements so they hit a temp table instead (relatively easy to do) versus this big rewrite with CASE statements and subqueries, which I am probably 80% done with. I'm not so worried about the choice between temp tables and table variables because that will be easier to switch between without significant rewrites, and I can just test the performance.

  • There are about 110,000 records in the table right now; I don't know how to check for index fragmentation.

  • In cases where there are critical subtables, it's a good idea to include their definitions and sample data for those too.

    Without that, I can't test any of this, and that makes it impossible to really judge whether one solution is better than another. There's just no way to really tell without testing.
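
    As for fragmentation: in SQL Server 2000 you can get a quick read with DBCC SHOWCONTIG (a sketch only, using the table name from your earlier post):

    -- Reports Scan Density and Logical Scan Fragmentation for each index;
    -- low Scan Density / high Logical Scan Fragmentation suggest a rebuild is due
    DBCC SHOWCONTIG ('tblHandscanData') WITH ALL_INDEXES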

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Thanks GSquared! I appreciate your already taking the time to go through the first SQL. Based on the feedback I've seen, I am going to use temp tables and combine the updates using subqueries (Approach D). It's probably not worth running performance tests on the original code, because it is definitely going away. So don't worry about testing it for me, but thanks again for the offer!

  • Just a few thoughts on the issue:

    1) As Barry advised, you should think about the process and the purpose of the data again. What data are part of the scan file, and what results do you need? Maybe the "cut" between the .NET application and the SQL Server is misplaced, meaning the procedure might be simpler if some of the .NET logic were pulled into it, or some of the data manipulation were moved out.

    When there's no chance to redesign the process you might want to consider the following:

    2) You should think about replacing the WHILE loop you're using. The "quirky update" reference Jeff talked about should help a lot.

    3) You're using a join with an OR condition and an IN clause (towards the end of your proc) within a 9-table join.

    Code Snippet:

    JOIN tsysSpecificDays PREVSD ON PREVSD.SpecificDayID BETWEEN (SD.SpecificDayID - 6) AND (SD.SpecificDayID - 0)
        AND (SD.DayOfWeek = 'SAT' AND PREVSD.DayOfWeek IN ('SUN', 'MON', 'TUE', 'WED', 'THU', 'FRI', 'SAT'))
        OR (SD.DayOfWeek = 'FRI' AND PREVSD.DayOfWeek IN ('SUN', 'MON', 'TUE', 'WED', 'THU', 'FRI'))
        OR (SD.DayOfWeek = 'THU' AND PREVSD.DayOfWeek IN ('SUN', 'MON', 'TUE', 'WED', 'THU'))

    If you replace the weekday name with the weekday number, you get:

    (SD.DayOfWeek = 6 AND PREVSD.DayOfWeek IN (0,1,2,3,4,5,6))
    OR (SD.DayOfWeek = 5 AND PREVSD.DayOfWeek IN (0,1,2,3,4,5))
    OR (SD.DayOfWeek = 4 AND PREVSD.DayOfWeek IN (0,1,2,3,4))

    ...

    See the pattern?

    Wouldn't a single statement like SD.DayOfWeek >= PREVSD.DayOfWeek have the same effect, assuming you'd use the day number rather than the day name?
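
    Something like this, assuming a numeric day-of-week column (DayOfWeekNum here is hypothetical; your table currently stores the name):

    JOIN tsysSpecificDays PREVSD
        ON PREVSD.SpecificDayID BETWEEN (SD.SpecificDayID - 6) AND SD.SpecificDayID
        AND SD.DayOfWeekNum >= PREVSD.DayOfWeekNum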

    I'm by no means a performance tuning guru, but those are some of the patterns I came across while "scanning" the procedure. Please note that I'd follow the steps in the order listed, though ... 😉



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question
    How to post performance related questions
    Links for Tally Table, Cross Tabs and Dynamic Cross Tabs, Delimited Split Function

  • jeff (8/17/2009)


    I am trying to be very conscious of the fact that people who are donating their time to the forums would prefer to answer simple questions rather than lengthy ones.

    Actually, most of us probably prefer ones more in the middle. 🙂

    I appreciate their time very much, and want to hear their answers. I was hoping a few people could take five minutes to read about my problem, and perhaps they would have a ready response that was easily communicated. So far this has been the case for one responder (RBarryYoung), and not for two responders (Jeff Moden and TheSQLGuru). I really believe you can help me; I just don't know whether to ask a simple abstract question or a detailed concrete question. I will try once again, with what feel to me like two simple, quick questions that are respectful of responders' time. Thanks in advance to anyone who can provide feedback!

    Don't sweat it too much, Jeff, you're doing fine. But it's a tough problem, and it might take some time and a little back and forth for us to figure out what we need and to work out the answer that helps you best.

    -- RBarryYoung, (302)375-0451 blog: MovingSQL.com, Twitter: @RBarryYoung
    Proactive Performance Solutions, Inc.
    "Performance is our middle name."

