ADO recordsets and scaling

  • Big Foam Cluebat (see http://ars.userfriendly.org/cartoons/?id=20030311 and http://ars.userfriendly.org/cartoons/?id=20030211)

    Good for LART parties (Luser Attitude Readjustment Tool)

  • I liked that!!!!


    * Noel

  • He could just have said "Get a clue."

  • The main objection to a single temp table is that a job may run to over 20 million rows, and I don't like 20 million row updates.  If I've gone to the trouble of breaking the job into 10K chunks, I want to do 10K row updates.

    There are naming problems because multiple jobs may be active at the same time.  The process that is assigned to the first batch of a job would have to create a uniquely named temp table (probably a table in the target database rather than a # or ## table in tempdb).  All the processes would have to check whether they just finished work on the last unfinished chunk of the job, so the update can be started, and then wait for it to finish before dropping the table.

    I like the idea that when a chunk is marked as Finished in my task table, I know the updates have been committed.

  • Let me make something clear:

    #Temp tables are unique per session!  If you have several jobs, #T1 for Job1 is not the same as #T1 for Job2, even though they have the same name.

    The temp table can be populated all at once, as long as tempdb can handle it; otherwise you can populate it in several trips.

     

    insert into #T1 (fld1, fld2, ...)
               select v1, v2, ...
    union all  select v1, v2, ...
    union all  select v1, v2, ...
    ...

    The updates to the target tables are the ones that are done in chunks!

    For example:

    while @minpkey is not null and @maxpkey is not null
    begin
      run the update for this key range
      get the next @minpkey and @maxpkey
      -- a delay here if you see fit
    end

    drop table #T1
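
    Filled in with made-up names (say fld1 in #T1 is the target's key and fld2 is the new value, and the target is dbo.Target(pkey, val) -- placeholders, not your actual schema), that loop could look like this:

    declare @minpkey int, @maxpkey int, @chunk int
    set @chunk = 10000                            -- rows per committed update

    select @minpkey = min(fld1) from #T1          -- first key in the staging table

    while @minpkey is not null
    begin
        -- upper bound of this chunk: the @chunk-th key at or above @minpkey
        -- (TOP with a variable needs SQL 2005+; use SET ROWCOUNT on SQL 2000)
        select @maxpkey = max(fld1)
        from (select top (@chunk) fld1
              from #T1
              where fld1 >= @minpkey
              order by fld1) as k

        -- one set-based update per chunk, committed on its own
        update t
        set    t.val = s.fld2
        from   dbo.Target as t
        join   #T1 as s on s.fld1 = t.pkey
        where  s.fld1 between @minpkey and @maxpkey

        -- start of the next chunk (null ends the loop)
        select @minpkey = min(fld1) from #T1 where fld1 > @maxpkey

        -- waitfor delay '00:00:01'               -- optional pause between chunks
    end
    -- then drop table #T1 as above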

     

     

     


    * Noel

  • I know that #temp tables are unique per session.  I interpreted your suggestion to use a single temp table literally, so since there are multiple connections I assumed you must have meant a temporary table that all the connections inserted into.

    My control structure is very simple:

        Get the next chunk (for any job)

        If nothing to do, exit

        Process chunk and repeat

    To accumulate updates in a temp table it would have to be something like

        Get the next chunk (for any job)

        If nothing to do, exit

            Process chunk

            Get the next chunk for this job

            If nothing left for this job, start updates and wait for finish

            Drop temp table

        Repeat for next job

    This is not unreasonably complex, but I still don't like it.  Even if you put off the updates, eventually it becomes synchronous and you have to wait for the update to complete before continuing.  The longer you put off the update, the bigger (and exponentially longer) it will be.  Also, if anything happens to the client process, its temp table is orphaned and all the work is lost.

    I'm envisioning a grid computing style where the client processes are a service running on a number of desktop PCs (think SETI@home).  If one of them is shut down or aborts for some reason, the only loss is that chunk of the job.  (There will probably be a lot of reboots next week with the latest Windows updates, and it's thunderstorm season.)  A control process can periodically check for tasks that were started but not finished within a reasonable time and reassign them.  (A sketch of how a chunk gets claimed from the task table is at the end of this post.)

    I like the idea of inserting into a temp table and doing one update (and possibly using sp_executesql for the inserts, and then spooling the sp_executesql commands).  But I don't think you can talk me out of committing the updates for each chunk so it can be flagged as Finished.
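
    For illustration, "get the next chunk (for any job)" against the task table could look something like the following; dbo.JobChunk and its columns are stand-ins for my actual control table:

    declare @JobId int, @ChunkId int

    begin tran

    -- READPAST skips chunks another worker is holding the lock on;
    -- UPDLOCK keeps the same chunk from being claimed twice
    select top 1 @JobId = JobId, @ChunkId = ChunkId
    from   dbo.JobChunk with (updlock, readpast)
    where  Status = 'Pending'
    order by JobId, ChunkId

    if @ChunkId is not null
        update dbo.JobChunk
        set    Status = 'Started', AssignedTo = host_name(), StartedAt = getdate()
        where  JobId = @JobId and ChunkId = @ChunkId

    commit tran

    -- If @ChunkId is null there was nothing to do; otherwise the client
    -- processes the chunk and sets Status = 'Finished' once its updates
    -- have been committed.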

  • Scott,

    What is your problem with synchronous updates?  SET-BASED updates beat the hell out of single-row updates no matter what method you use!

    Let's make a quick comparison:

    The temp table is just a way to have everything you need already on the server (A POWERFUL Machine -- just a reminder), and then it is just a matter of sending a batch of code telling the Server to move the data to the appropriate place.

    With row by row you are going back and forth: update this, update this, update this, update this...  Do you really think that is fast?  I don't.

    >>  Also, if anything happens to the client process, its temp table is orphaned and all the work is lost. <<

    Why do you say that?  With row by row and single updates, I would like to see what transaction handling you are doing when an error occurs on the client side.  What are your recovery mechanisms?

    Whatever you are doing to recover with one can be done with the other!

    >>and possibly using sp_executesql for the inserts, and then spooling the sp_executesql commands<<

    Nope!  The fastest way to perform those inserts is using a SET-BASED method.  What you proposed will make one round trip per insert or per batch and generate massive TLog activity.  You need to think SET-BASED, like I showed you before:

    insert into #T1 (fld1, fld2, ...)
               select v1, v2, ...   -- record 1
    union all  select v1, v2, ...   -- record 2
    union all  select v1, v2, ...   -- record 3

    One insert, and all the info is there with one round trip and ACIDity.
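
    Then moving the staged data into the target is one set-based statement instead of thousands of little updates (dbo.Target and the column names here are placeholders, not your actual tables):

    -- one statement moves everything staged in #T1 to the target table
    update t
    set    t.val = s.fld2
    from   dbo.Target as t
    join   #T1 as s on s.fld1 = t.pkey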

     

    Just to give you an example:

    I had to perform a summarization job (an SQL JOB!) over 10 million rows, in which deletes of unsummarized data and inserts of the corrected data had to be handled.  The process went from 3 hrs to 16 min.

     

    With row by row you need much more luck (a longer time without anything going wrong) than with a set-based approach.

    Feel free to take my advice or not!

     

      


    * Noel

    I wish I still had that link.  But I once read about a situation where a report was fetching one row at a time (not an update, just select where pk = ?, which is faster than an update), but millions of rows.  The trace file for that report was over 10 MB (picture 10 MB of select statements).  The report ran for 24 hours.  The DBA in place changed the select to a set-based solution (which needed some aggregates to be applied).  The final time for the EXACT SAME REPORT: 3 minutes (about 500 times faster).  How do you plan to get that much speed using the row-by-row method?

    I don't care how complex this thing is.  You just won't get performance doing this row by row.  You must explore the possibility of rewriting the logic into functions and forget the row-by-row approach.  BTW, I totally agree that external DLLs should not be allowed on the server.  That's why I said move the logic into functions (deterministic functions if at all possible).  If this is something that will need to be done on a regular basis, for the whole lifetime of the program, you can't ignore this possibility.

    Here's another situation similar to this one (look at the 2nd half of the thread).  The final set-based solution took 26 seconds, while the original (semi set-based) code took 2 hours.

    http://qa.sqlservercentral.com/forums/shwmessage.aspx?forumid=8&messageid=197948

    Case closed.

  • I don't have a problem with a set-based approach, but I do have a problem with forcing the entire job to be done in one set.  If the size of an entire job can vary from hundreds of thousands to tens of millions of rows, I don't want to do that in one update.  Our budget does not support buying that kind of hardware.  Swallowing one 10-million row update makes extreme demands on the transaction logs.  We have had cases where updates of this size have tied servers up for hours, and possibly blown up if the transaction log disks filled up.  If your answer is to get bigger hardware, then I will happily wait for your check to pay for it.  But in the meantime, I've got to program for the hardware I have.

    Another point is that I'm not just copying 10,000,000 rows from a zero-latency source.  It takes a finite amount of time to prepare the next row or chunk of rows.  If the server can asynchronously perform the update on one set while the next is being prepared, then what's the problem?  I can still experiment with the size of each batch to optimize the throughput.  Certainly one 1-million row update is better than 1 million 1-row updates, but is it really THAT much better than 20 50,000-row updates or 100 10,000-row updates?  I think there has to be a happy medium somewhere.
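
    To make that concrete, the kind of capped batch I have in mind would be something like this (the staging and target table names are just placeholders for illustration):

    declare @BatchSize int
    set @BatchSize = 10000                 -- the "happy medium" to experiment with

    while 1 = 1
    begin
        -- each pass commits at most @BatchSize rows, so no single
        -- transaction has to swallow the whole job
        update top (@BatchSize) t
        set    t.val = s.newval
        from   dbo.Target as t
        join   dbo.Staging as s on s.pkey = t.pkey
        where  t.val <> s.newval           -- only rows not yet brought up to date
                                           -- (assumes val is not nullable)

        if @@rowcount < @BatchSize break   -- last batch done
    end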

    The recovery strategy is no mystery: if a client process fails, then any uncommitted updates it had must be recomputed.  This is an almost unnoticeable hit if it is one chunk, but a big deal if it represents hours of work.

    There is also a difference in user psychology.  A monitoring application that says the job is 65% done (really done, updates have been committed) and expected completion is at 3:45 this afternoon is good; one that says it is 65% of the way to swamping the server, so everyone should plan a Starbucks run at 3:45 (and who knows when it will finish), is not so good.

    I knew from the beginning that I wanted this program to work in batches of some size that is yet to be determined.  But I knew that ADO reaches a certain level of performance doing row-by-row updates.  I wanted to come up with a program that surpassed that level of performance doing row-by-row updates before adding more complex optimizations.  I want to have that as a baseline to judge the performance of other approaches.  I never said that using row-by-row updates was going to be my final version, but I posted the original question when early testing showed I would need at least 100% better performance just to get back to where I started.

    Remi, your religious insistence on moving the job to the server is misguided.  You remind me of the saying, "To a man with a hammer, the world looks like a nail."  I can top your example: I once took an accounting end-of-month procedure that (for our largest clients) would not finish after running all weekend (over 65 hrs) and turned it into one UPDATE that took three minutes.  Maybe it's because I'm that good, or maybe it's because I started with some truly horrible code, but that was a case where, once you understood the data flow, it was the poster child for row-based to set-based conversion.  This case is different.  If the CPU resources required to produce the values for each row dwarf the effort required to enter them in the database, then I would rather put the computation load on some under-utilized desktops than turn my SQL servers into application servers.

  • >>Certainly one 1-million row update is better than 1 million 1-row updates, but is it really THAT much better than 20 50,000-row updates or 100 10,000-row updates?  I think there has to be a happy medium somewhere.<<

    I think you got the point of what I was trying to get across, and that is enough for me.  Definitely you need to know what your hardware can do and how fault tolerant it is to determine the right batch size; only experimentation will give you the answer.  True, a set-based approach may not give you the 1% granularity you are looking for in giving feedback to the user, but chunks will still give you some "%" number to show.  Although this kind of job is (in my opinion) not suitable for a user interface, you know your requirements better than I do.

    I think you are on the right track.

    Good luck!

     


    * Noel

  • "Remi, your religious insistence on moving the job to the server is misguided. You remind me of the saying, "To a man with a hammer the world looks like a nail." I can top your example, I once took an accounting end-of-month procedure that (for our largest clients) would not finish after running all weekend (over 65 hrs) and turned it into one UPDATE that took three minutes. Maybe its because I'm that good, or maybe its because I started with some truly horrible code, but that was a case where once you understood the data flow it was the poster child for row-based to set-based conversion. This case is different. If the CPU resources required to produce the values for each row dwarf the effort required to enter them in the database, then I would rather put the computation load on some under-utilized desktops than turn my SQL servers into application servers."

    I also know not to use a screwdriver to force in a nail.

    Anyway, as Noeld said, you know your system better than we do, and if you think/know that it takes longer to compute than to update, I won't keep arguing.  But like Noeld, I wanted to be sure you had considered the set-based approach in depth before throwing it away.  I have often insisted like this only to find out that the poster was wrong and hadn't considered the set-based solution properly, or at all.  But since you have obviously done your homework, I'll stop bugging you about it.

    Good luck with the rest of the project. Keep us posted when you have the final solution.

    TIA.

  • Everybody knows you don't drive nails with a screwdriver, that's what monkey wrenches are for!

    There is no user interface on the processing program.  That would get confusing with multiple copies on multiple computers.  The status of each task is maintained in a control table, and there is an optional monitor program that checks every few minutes to display progress.  The monitor can be stopped & started while the processing continues in the background.

  • >>Everybody knows you don't drive nails with a screwdriver, that's what monkey wrenches are for! <<

    Well, I have to admit that I have...

     


    * Noel

  • As long as you don't use the hammer on the screws of the server...

    There have been days when I wished I had...


    * Noel
