Recurring I/O Problems

  • Hello, I have a high-transaction imaging server that, as of last week, has been having intermittent I/O problems.  I am told there have been no changes to the hardware or software on the systems.  (As usual.)  When this I/O problem occurs, the imaging app crashes and has to be restarted.  The time of the crash is the exact time the error is logged in the SQL 2000 error log.  Here is an example of the error(s) I am receiving.

    SQL Server has encountered 2 occurrence(s) of IO requests taking longer than 15 seconds to complete on file [G:\Microsoft SQL Server\MSSQL$PSQL01\Data\Subscr_sec01_par01.ndf] in database [Subscr] (18).  The OS file handle is 0x0000054C.  The offset of the latest long IO is: 0x0000003b1b8000

    As far as this error is concerned, it happens on multiple drives and with different databases, including tempdb.  The number of occurrences varies, but it is mostly 1.  The log and data files for tempdb and the other database(s) are on different volumes.

    My network expert says he sees nothing abnormal in the logs of the SAN or its counterparts, other than that we have around 1500 connections at any given time.

    I have increased SQL memory, changed from a 100 Mb to a 1 Gb network connection, rebooted the server, compressed and truncated all data and log files, added an additional data file, increased the preallocated size of tempdb, ensured all files can autogrow (most of them are set to 10%), and defragmented all the drives.  (A sketch of the tempdb sizing/autogrow change is at the end of this post.)

    So far, since the weekend, I don't have too many complaints about the system being down, but the network guys say that having these intermittent I/O errors can be normal during high-load periods.  I don't totally agree with that.  So, I thought it would be best to ask the experts out there if they have any suggestions or recommendations for a solution that we have not touched yet.  Anybody else have this type of issue?  Any help would be much appreciated.  Thanks.
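
    For reference, here is roughly what the tempdb pre-sizing and autogrow change looks like in T-SQL.  This is only a hedged sketch: it assumes the default logical file names tempdev and templog (check yours with sp_helpfile in tempdb), and the sizes are illustrative only.  The idea is to pre-size the files and replace the 10% autogrow with a fixed increment so that growth events stay small and predictable.

      -- Assumed logical names (tempdev/templog) and illustrative sizes only.
      ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, SIZE = 2048MB);
      ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, FILEGROWTH = 256MB);
      ALTER DATABASE tempdb MODIFY FILE (NAME = templog, SIZE = 512MB);
      ALTER DATABASE tempdb MODIFY FILE (NAME = templog, FILEGROWTH = 128MB);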

  • I hope you mistyped/misspoke and didn't compress the log files via NTFS or any other method: (a) it's not supported, and (b) your log files absorb the majority of your I/O activity, so compression adds significant overhead.

    Uncompress your log files ASAP and move them to a physically different drive from your data files (*.mdf, *.ndf).  (A detach/attach sketch is at the end of this post.)

    If you use RAID redundancy at the physical drive level, use RAID 1 or 10 for the volume that will hold your log files.

    http://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/sqlIObasics.mspx

    http://support.microsoft.com/?kbid=897284

    Good luck
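
    A hedged sketch of one way to relocate a log file onto its own volume in SQL Server 2000 is detach/attach (it requires taking the database offline).  The database name and the .ndf path below come from the error in the first post; the .mdf and .ldf paths are placeholders, and every data file must be listed when re-attaching.

      -- Placeholder paths except the .ndf, which is quoted from the error message.
      EXEC sp_detach_db @dbname = 'Subscr';

      -- Copy the .ldf file to the dedicated log volume at the OS level, then re-attach:
      EXEC sp_attach_db @dbname = 'Subscr',
           @filename1 = 'G:\Microsoft SQL Server\MSSQL$PSQL01\Data\Subscr.mdf',
           @filename2 = 'G:\Microsoft SQL Server\MSSQL$PSQL01\Data\Subscr_sec01_par01.ndf',
           @filename3 = 'H:\SQLLogs\Subscr_log.ldf';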

     

  • Thanks, I misspoke and did not compress any of the volumes using NTFS or anything like that.  I meant that I compressed the database-related files, i.e. compressed the pages and then truncated the free space at the end.  My log files are on different drive letters than my data files, but I'm going to check with the network team to be sure they are actually on different spindles.  I'll also check what RAID level is in place under the hood.  Thanks also for the links, which I'll review.
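
    For the drive-letter side of that check, a quick query (a sketch against SQL Server 2000's master.dbo.sysaltfiles) lists every database file and the drive it lives on; whether those drives map to separate spindles still has to come from the SAN/RAID configuration.

      -- List every database file with its drive letter and physical path.
      SELECT  DB_NAME(dbid)     AS database_name,
              name              AS logical_name,
              LEFT(filename, 3) AS drive,
              RTRIM(filename)   AS physical_name
      FROM    master.dbo.sysaltfiles
      ORDER BY LEFT(filename, 3), DB_NAME(dbid), name;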

  • Interesting - we're having EXACTLY the same problem right now.

    In our case, we believe that we've traced the problem to some failing disks on our SAN array (or the SAN vendor has, at any rate), although this has yet to be confirmed. Another possibility we've seen is that all affected disks (i.e. all the ones used by the affected databases) use iSCSI connections rather than fiber or SCSI. If replacing the failing disks isn't the answer, we're going to segregate the iSCSI traffic from the rest of the network traffic (which would be good practice in any case).

    We're seeing these errors when running high-intensity batch jobs on those disks, and also when trying to write SQL Server database backup files to them.

    Our RAID setup is a bit specialised. It's supplied by a company called NetApp and is essentially RAID 4, but with dual parity, and it uses a filesystem called WAFL (Write Anywhere File Layout). This means that both the data and the parity can be written to any portion of any of the disks in the array, and the SAN keeps track of where all these blocks are being placed. The choice of location is calculated based on the proximity of a particular disk head to the most optimal portion of a disk, the number of blocks to be written, etc.

    This allows for lightning fast response times when everything's working correctly. It also has a large read/write cache so that the physical writes are written to disk asynchronously, but the application gets a "completed" signal as soon as the data has been placed in the cache.

    Given all this, there shouldn't really be much scope for ANY delayed I/O at all, let alone 15 seconds' worth, but we are getting it. Checking Perfmon and monitoring the Current Disk Queue Length counters, we regularly see values of 150 or more.

  • We have had a similar issue on a server that is SAN-connected.  However, all of our I/O errors were in tempdb, which uses the only local disks on the server.

    The problem started happening after we applied SP1 to W2003 along with HBA driver updates.  A number of servers were upgraded in the same way, but only one was affected.  After a lot of prodding and poking of the hardware, logs, etc., our infrastructure people decided to rebuild the OS.

    We use an automated build process based on Altiris, so the problem build should have been exactly the same as other builds that worked.  However, the Windows reinstall seems to have fixed the problem.

    All I can say is: look very hard for a logical cause of the problem, as it is most likely to be a hardware or driver issue.  If all else fails, consider a Windows rebuild.

    Original author: https://github.com/SQL-FineBuild/Common/wiki/ 1-click install and best practice configuration of SQL Server 2019, 2017, 2016, 2014, 2012, 2008 R2, 2008 and 2005.

    When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara

  • Is it possible to monitor the I/O against individual data files?

  • Hello and thanks for getting back to me.  My situation is becoming serious, as there are more errors and user complaints now.  I will need to find a way to figure out whether the SAN drives are degrading in performance, as recommended.  At this point, unless I come up with a super good reason, I can't reinstall the OS.  I did find out that all my drives are on different spindles and that all the spindles are on RAID 5.  Next time we will set the log files to be RAID 1 or 10.  I also have contacted Microsoft, and they had me run a utility called IOSTRESS.  It produces a file and is supposed to read and write to the file subsystem, collect performance data, and compare it to a calculated baseline based on the current hardware and software configuration.  I am waiting for a response from Microsoft as to their analysis.  I am passing the SAN information provided on to my network guys.  I will let everyone know of today's progress.  Thanks again.
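
    On monitoring individual data files: in SQL Server 2000, ::fn_virtualfilestats returns cumulative per-file I/O counts and total stall time, which can show whether particular files or drives are degrading.  A hedged sketch (the database name is taken from the error above, file ID 1 is just an example, and the dbid/fileid pairs can be read from master.dbo.sysaltfiles):

      -- Per-file cumulative I/O statistics, including average stall per I/O in ms.
      DECLARE @dbid INT, @fileid INT
      SELECT  @dbid = DB_ID('Subscr'), @fileid = 1   -- repeat per file of interest

      SELECT  DbId,
              FileId,
              NumberReads,
              NumberWrites,
              IoStallMS,
              IoStallMS / NULLIF(NumberReads + NumberWrites, 0) AS avg_stall_ms_per_io
      FROM    ::fn_virtualfilestats(@dbid, @fileid);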

  • We have an older database server that was built on RAID 5 and that is getting the same type of I/O errors during database backups.  This server isn't particularly critical, so it will not be rebuilt, and I'm stuck with the errors.  I'm pretty confident that the combination of the older hardware and the RAID configuration is the cause of the errors.

  • Monitoring I/O completion time is critical to performance; if this figure rises, typically above, say, 6 ms, then the disk subsystem cannot support the level of I/O requests. I've found SANs to be especially prone to this, as most SANs do not seem to be set up to accommodate RDBMS technology. Using RAID 5 is another sure way to generate this error for disk writes in a highly transactional environment. If your system requires high transactional throughput and/or sustained bursts of throughput, especially writes, then only RAID 10 and ample spindles will suffice.

    I've been using RAID 10 since SQL 6.0. Just read through the SQL 2000 Administrator's Companion or Inside SQL Server 2000, or check out the Microsoft KB articles on disk I/O basics.

    Only a READ ONLY database should be on RAID 5.

    The GrumpyOldDBA
    www.grumpyolddba.co.uk
    http://sqlblogcasts.com/blogs/grumpyolddba/

  • Greg

    Check out the first topic on this page:

    http://blogs.msdn.com/sqlserverstorageengine/default.aspx

    John

  • Hello, That blog article was fantastic!  I like a more technical explanation of what may be happening.  Here is the update for today.  I am still waiting for Microsoft to come back with the analysis of the IOStress utility I ran.  They said they would have it this AM, and the MS I/O performance engineers have a new plan of attack for me.  My storage expert says we have not gone RAID 10 because it is too expensive, and we have not gone RAID 1 for reliability reasons, so RAID 5 has been the compromise everywhere.  But in the meantime, while waiting for Microsoft, we made two changes and the I/O errors in the log have disappeared!  We increased the disk read-ahead cache multiplier to 10 (I don't know what it was before), and we saw some unused space on the local C: drive, so we did a quick format and created a volume out of the remaining space.  My server expert said that having unallocated space on the C: drive can cause degradation in OS performance while the OS is attempting to find out what drives it has to work with.

    Thanks for the responses and I'll post an update from Microsoft.  Much appreciated. 
