Serious Performance Issues with Compaq SAN and SQL

  • I work for a large company that has an OLTP SQL Server Cluster that I am solely responsible for.

    This environment has about 150 Databases serving different divisions within the organization.

    About 9 months ago, this environment was brought online using a Compaq Storageworks Enterprise Virtual Array. It is made up of 72 Disks in a RAID-5 Configuration for Random Read/Write Data and 18 Disks in a RAID-10 Configuration for Sequential activity only (i.e. - Logs).

    I have been experiencing a SQL Server/Disk IO problem for 10 weeks now and still there is no resolution from HP/Compaq on the issue.

    Here is a description of the problem:

    Those familiar with Windows 2000 Internals know that when Disk IO's occur, they are handed off to the Disk Controller(s) to be serviced. What is happening is that whenever Writes occur to the Data Segment, and are being handed off to the HBA's(Host Bus Adapters), the outstanding write queues

    go through the roof! I have experienced Oustanding Write Queue lengths of more than 170/sec sustained at times.

    This is also represented by the disks being utilized 100% according to Performance Monitor, but not via the SAN Monitoring tools.

    The facts we know about our issue are:

    1. The Fabric Switch Ports are not over-utilized

    2. Server-side resources such as memory, cpu, network, etc. are not breaking a sweat.

    3. The Disks on the SAN are serving other customers without any issues in regards to speed.

    4. Fibre has been replaced for both HBA's in the server(s).

    5. The latest Firmware is in affect on the Compaq EVA

    6. Any locally attached disks on the same SQL Server(s) do not reproduce the problem only SAN Disks

    7. This problem is reproducable on any SQL Server that we present SAN attached storage to.

    We have gone throught just about every configuration change from the HBA to the SAN Switches themselves to try and correct the issue, but none have proven to work. I have spun my wheels trying to convince HP/Compaq that the problem is real and that users are being impacted by it as perceived "Slow Response Times" on their end.

    Any insight from experienced SQL Server DBA's/System Engineers as to what could be causing the bottlenecks on the SAN would be greatly appreciated.

  • I guess my first question is what changed 10 weeks ago? This SQL7 or 2K? Seems like a great point to start with would be the discrepancy between perfmon and the SAN tools.

    Andy

    http://qa.sqlservercentral.com/columnists/awarren/

  • Nothing (Seriously Nothing) has changed on the SQL Server itself. There was a firmware upgrade on the SAN that Compaq/HP swears would not cause any of the issues, but the issues did not occur until after the upgrade.

    As for the discrepencies between perfmon and the SAN tools, the SAN tools do not show any problems whatsoever. Performance Monitor indicates that the Outstanding Write Queues are spiking at 130 - 160/sec randomly. They also indicate that the Physical Disk Utilization is at 100 Percent quite frequently, but only with the Data Volume (RAID-5 over SAN).

  • Call MS. They are really good about working with vendors on your behalf. I had a similar issue with Compaq Fiber Channel cards and MS called HP on it.Eventually HP admitted a hardware issue.

    Steve Jones

    sjones@sqlservercentral.com

    http://qa.sqlservercentral.com/columnists/sjones

    http://www.dkranch.net

  • Thanks Steve. The one thing that has not occurred at this point is an Open call to Microsoft since we have been dealing so much on the Hardware side.

    I will discuss with my associates and see what they think.

    Thanks

  • Assuming you are talking performance counter 'Avg. Disk Write Queue Length' under physicalDisk object. According to the definition of this counter -- "Avg. Disk Write Queue Length is the average number of write requests that were queued for the selected disk during the sample interval." so I guess there might be some disks that do not function properly in 72 disks RAID5.

    Is there possible to perform diagnosis to each individual disks of 72 disks to ensure them work properly with tools from vendor?

    What is the number of 'Avg. Disk Queue Length' when you saw high number in 'Avg. Disk Write Queue Length'?

  • The disks are all working properly.

    Any other clues?

  • What is the number of 'Avg. Disk Queue Length' when you saw high number in 'Avg. Disk Write Queue Length'?

    How much memory your server has and how much for SQL Server? What about cache hit ratio?

    Edited by - Allen_Cui on 02/19/2003 1:20:21 PM

  • My Server has a total of 6 GIG of RAM with 1 for the OS and SQL Server is fixed on 5 GIG.

    The Avg. Disk Queue length was 86.

  • I assumed /3GB and AWE are enabled in your server and SQL Server configuration. Can you confirm it?

  • No I cannot confirm that. I have the /PAE switch in my boot.ini and that is it.

    I have SQL Server fixed on using Physical Memory only and set at 5 GIG.

  • Are you sure you don't have AWE enabled in SQL Server?

    Why you don't have /3gb enabled?

  • What would this do for me exactly?

  • From BOL.

    "Using AWE Memory on Windows 2000

    Microsoft® SQL Server™ 2000 Enterprise Edition uses the Microsoft Windows® 2000 Address Windowing Extensions (AWE) API to support very large amounts of physical memory. SQL Server 2000 Enterprise Edition can access amounts of memory approaching 8 GB on Windows 2000 Advanced Server and approaching 64 GB on Windows 2000 Data Center.

    Standard 32-bit addresses can map a maximum of 4 GB of memory. The standard address spaces of 32-bit Microsoft Windows NT® 4.0 and Windows 2000 processes are therefore limited to 4-GB. By default, 2 GB is reserved for the operating system, and 2 GB is made available to the application. If you specify a /3GB switch in the Boot.ini file of Windows NT Enterprise Edition or Windows 2000 Advanced Server, the operating system reserves only 1 GB of the address space, and the application can access up to 3 GB. For more information about the /3GB switch, see Windows NT Enterprise Edition or Windows 2000 Advanced Server Help.

    AWE is a set of extensions to the memory management functions of the Microsoft Win32® API that allow applications to address more memory than the 4 GB that is available through standard 32-bit addressing. AWE lets applications acquire physical memory as nonpaged memory, and then dynamically map views of the nonpaged memory to the 32-bit address space. Although the 32-bit address space is limited to 4 GB, the nonpaged memory can be much larger. This enables memory-intensive applications, such as large database systems, address more memory than can be supported in a 32-bit address space. For more information about AWE, see the MSDN® page at Microsoft Web site."

    Simply said, If you want SQL Server use memory above 4GB, you have to enabled AWE in SQL Server configuration and /PAE in boot.ini.

    And /3gb is clear explained above.

  • Thanks for the explanation, however SQL Server is currently addressing the Full 5 GIG as the AWE option is enabled on it.

    As for Windows 2000 Advanced Server, I will look into adding the /3gb switch to the boot.ini

    Just for the record however, the server is not constrained in any way with Memory currently.

    Any other suggestions?

Viewing 15 posts - 1 through 15 (of 20 total)

You must be logged in to reply to this topic. Login to reply