SQL Server was hung, need to find root cause

  • Hello All,

    Yesterday one of our SQL Server was hung and not able to RDP and connect SQL remotely, Windows team notified us TCP connections were full in resource monitor and CPU was high but was changing not constantly at 100%.

    what we have done was to cancel the commvault backup running and then we were able to RDP and Application jobs went fine

    In the error logs we noticed only I/O taking longer than 15 seconds

    SQL Server has encountered 4084 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file ... The OS file handle is 0x00000000000012F8. The offset of the latest long I/O is: 0x000018c04a0000.

    Recently we had storage migration from HP to XtremeIO.

    Also we use 118GB for Max memory of SQL Server out of 128GB and during the issue we observed that Memory usage was 97% used and generally we have 96% used always. FYI .. this is VM Server.

    Suggest us how the TCP connections in resource monitor and SQL Server are related and also whether there is any problem from storage side and how to clearly say that it is storage problem.

    ++ adding the events observed in eventvwr

    Observed below events in eventvwr till the issue was resolved, after that there were no below occurrences observed, was there anything that Windows/VM team could have done to stop this. Please help on this.

    The IO operation at logical block address 0x520de7e0 for Disk 31 was retried.

  • Hi, did this message - "SQL Server has encountered 4084 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file" occur only when the CommVault backup was running?

  • Yes.. that was during the same time we observed the I/O errors but we are observing huge I/O errors on tempdb file as well around 30000... during no backup runs

  • It sounds like there is definitely an issue with storage for that vm. I would maybe have the VM team move this machine to a different host or possibly different storage. Here is another query you could run which shows IO latency per database which may be helpful. Any value over 100 is an area to look into keeping in mind the values returned by the query are as good as the last sql restart -

    SELECT DB_NAME ([vfs].[database_id]) AS [DB],

    [BytesTransferred] = CASE WHEN ([num_of_reads] = 0 AND [num_of_writes] = 0) THEN 0

    ELSE ([num_of_bytes_read] + [num_of_bytes_written]) END,

    [BytesRead] = CASE WHEN [num_of_reads] = 0 THEN 0

    ELSE [num_of_bytes_read] END,

    [BytesWritten] = CASE WHEN [num_of_writes] = 0 THEN 0

    ELSE [num_of_bytes_written] END,

    [ReadLatency] = CASE WHEN [num_of_reads] = 0 THEN 0

    ELSE ([io_stall_read_ms] / [num_of_reads]) END,

    [WriteLatency] = CASE WHEN [num_of_writes] = 0 THEN 0

    ELSE ([io_stall_write_ms] / [num_of_writes]) END,

    [Latency] = CASE WHEN ([num_of_reads] = 0 AND [num_of_writes] = 0) THEN 0

    ELSE ([io_stall] / ([num_of_reads] + [num_of_writes])) END,

    [AvgBPerRead] = CASE WHEN [num_of_reads] = 0 THEN 0

    ELSE ([num_of_bytes_read] / [num_of_reads]) END,

    [AvgBPerWrite] = CASE WHEN [num_of_writes] = 0 THEN 0

    ELSE ([num_of_bytes_written] / [num_of_writes]) END,

    [AvgBPerTransfer] = CASE WHEN ([num_of_reads] = 0 AND [num_of_writes] = 0) THEN 0

    ELSE (([num_of_bytes_read] + [num_of_bytes_written]) /([num_of_reads] + [num_of_writes])) END,

    LEFT ([mf].[physical_name], 2) AS [Drive],

    [mf].[physical_name]

    FROM sys.dm_io_virtual_file_stats (NULL,NULL) AS [vfs]

    JOIN sys.master_files AS [mf]

    ON [vfs].[database_id] = [mf].[database_id] AND [vfs].[file_id] = [mf].[file_id]

    -- WHERE [vfs].[file_id] = 2 -- log files

    ORDER BY [Latency] DESC

    -- ORDER BY [ReadLatency] DESC

    --ORDER BY [BytesRead] DESC;

    GO

  • Assuming there is no failover clustering or AG involved right? Was the SAN online at this time?

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply