VMWare Woes....ARGH.......

  • Greetings,

    Would appreciate any feedback on my problem. I think I've diagnosed it correctly, but I want some feedback from you guys to make sure that I'm not crazy, as my client is in denial of the issue. Here are the basics:

    - Getting SQL Server timeout errors from a .NET application. It is failing in different areas of the program, but the common denominator is that when it does fail, it is always during an intensive session with the database (using stored procedures to update ~150,000 records with very simple update commands, nothing fancy). The error is a SQL Server timeout from .NET, which indicates that SQL Server simply failed to respond in the allotted time (30 seconds).

    - SQL Server is sitting on VMware ESX. The virtual machine that the instance is running on is one of five (5). They have not provided the specs of the hardware as requested, but have disclosed that all five are using the same controller card and that there are just a couple of physical disks in the machine. They have not disclosed what the other virtual machines are running either. From what I can gather, it does not appear to be a monster performance machine.

    - The virtual machine which SQL Server 2000 is running under has 2GB allocated to it and, as would be expected, SQL Server uses every bit that it can. Originally, they'd only give me 1GB, then 1.5GB and now 2GB. There seems to be some sort of rationing on this, so I'm not sure that they have a lot of memory on the machine (see my comment about the requested spec).

    - The process which is failing is a long-running process, which gives us some good data in perfmon. When I run it, I monitor a few key items, one of them being the disk queue length. When the program gets the SQL Server timeout, this metric is up to about 60, which MSDN indicates should never be more than two times the number of spindles on the drive. This appears to be the smoking gun in my mind, as this is when it fails. When it runs fine, the queue length hovers around 1-2.

    - We never had this problem on the standalone server prior to moving to VMware, and the code is unchanged. The only change was that on the standalone, MSDE was being used instead of full-blown SQL Server. This should not matter; in fact, moving to full-blown SQL Server is a positive.

    - Of course, this is a production server.

    - Initially, the problem was intermittent on VMware. About 33% of the time it would fail, even with no data changes from the good run. Since the error is a timeout and identical runs sometimes succeed, it is clear to me that it is not the code which is causing this; otherwise it would fail all of the time.

    - To further exacerbate the problem, they have recently installed a piece of software (the name escapes me now) which hot-mirrors the server to an offsite location. This is a block-by-block disk copy. At this point I can't get a single good run; they all fail with timeouts.

    My Hypothesis: Based upon the diagnostics, SQL Server is having to wait on a hardware bottleneck relating to disk access, which is preventing it from using the pagefile as well as actually committing the data to disk. This is most likely related to some other intensive process on the other virtual machines, since they are sharing the same hardware. At some point, when the queue is too long, SQL Server simply does not respond, and this causes the application to time out as described.
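
    A minimal sketch of the rule-of-thumb check described above, in Python. The spindle count of 2 is an assumption (only "a couple of disks" was disclosed, never the actual spec), the sample values simply mirror the figures quoted in this post, and the function names are made up for illustration.

```python
def queue_limit(spindles):
    """Rule of thumb quoted above: queue length should stay at or below 2 x spindles."""
    if spindles < 1:
        raise ValueError("spindles must be at least 1")
    return 2 * spindles


def over_limit(samples, spindles):
    """Return the (label, queue_length) samples that exceed the rule-of-thumb limit."""
    limit = queue_limit(spindles)
    return [(label, queue) for label, queue in samples if queue > limit]


# Hypothetical samples: queue lengths of 1-2 on good runs, about 60 at the timeout.
samples = [("good run", 1.5), ("good run", 2.0), ("timeout", 60.0)]

ASSUMED_SPINDLES = 2  # assumption, not a disclosed figure
for label, queue in over_limit(samples, ASSUMED_SPINDLES):
    ratio = queue / queue_limit(ASSUMED_SPINDLES)
    print(f"{label}: queue length {queue} is {ratio:.0f}x the limit")
```

    With more spindles the limit rises, so the real disk count matters before drawing a firm conclusion from this check.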

    Questions:

    Anybody have any similar experience? Am I totally off base here? I've asked them to provide additional performance monitoring via the VMware diagnostic tools, but they continue to state that it is the code which is causing the problem, and they have not provided these diagnostics or any evidence to support their position. I'd be more than happy to look in any direction, but all the metrics lead down the path that I've followed.

    If the code is the culprit, I guess the sin is to ask the database to update the records? I disagree with this notion. From everything that I've seen, and given that the changes in the environment are the only changes, it seems to me that there is only one culprit, and that is the VMware/hardware configuration, which appears to me to be undersized. But how do you tell them this when you are putting your hands into someone else's rice bowl?

  • If you share hardware, performance is rubbish. In my experience VMware is OK in dev but absolutely not good for production.

    To prove it's a disk issue, monitor the I/O completion time. I never bother with disk queues. I/O completion time should be around 6 ms (ish), and it's the best indication of poor disk subsystem performance.

    The code could be at fault too - you have to monitor that yourself using query plan analysis.

    Monitor the I/O completion times, CPU usage, and disk idle time (this is a much better indicator than disk usage). SQL Server should be OK with memory on VMware; I never found that to be an issue. You might also want to check network card utilisation, as this could be swamped too. It never fails to amaze me how little the people who set up servers and hardware seem to know about RDBMS requirements.
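
    A rough sketch of how those two checks could be run over a perfmon log exported to CSV, in Python. The counter names `Avg. Disk sec/Transfer` and `% Idle Time` are the standard PhysicalDisk counters; the CSV layout, the `HOST` name, the sample rows, and the function names are assumptions for illustration only.

```python
import csv
import io

LATENCY_COUNTER = "Avg. Disk sec/Transfer"  # reported in seconds per I/O
IDLE_COUNTER = "% Idle Time"


def column_values(csv_text, counter_name):
    """Return the float samples from the first column whose header ends with counter_name."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    index = next(i for i, name in enumerate(header) if name.endswith(counter_name))
    values = []
    for row in data:
        cell = row[index].strip()
        if cell:  # skip blank cells left by missed samples
            values.append(float(cell))
    return values


def summarize(csv_text, latency_target_ms=6.0):
    """Summarise I/O completion time against the ~6 ms target, plus the lowest idle time."""
    latencies_ms = [v * 1000.0 for v in column_values(csv_text, LATENCY_COUNTER)]
    idle = column_values(csv_text, IDLE_COUNTER)
    return {
        "avg_latency_ms": sum(latencies_ms) / len(latencies_ms),
        "worst_latency_ms": max(latencies_ms),
        "samples_over_target": sum(1 for v in latencies_ms if v > latency_target_ms),
        "min_idle_pct": min(idle),
    }


# Hypothetical log: two healthy samples and one where the disk is swamped.
SAMPLE = "\n".join([
    r'"Time","\\HOST\PhysicalDisk(_Total)\Avg. Disk sec/Transfer","\\HOST\PhysicalDisk(_Total)\% Idle Time"',
    '"10:00:00","0.004","92.0"',
    '"10:00:15","0.250","3.5"',
    '"10:00:30","0.005","88.0"',
])

print(summarize(SAMPLE))
```

    A sustained run of samples over the target alongside near-zero idle time would point at the disk subsystem rather than the code.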

    The GrumpyOldDBA
    www.grumpyolddba.co.uk
    http://sqlblogcasts.com/blogs/grumpyolddba/
