Intermittent Data Corruption Problems

  • Okay, this post is about what is happening on my other SQL Server 2000 failover cluster, the one that forced me to move 20 databases to a different cluster, which in turn started causing transaction log backup performance issues (see other thread). The cluster with the data corruption issue consists of two good-quality, dual-processor, 4 GB RAM HP servers that both connect to one external disk array.

    Since the problem has escalated to us calling Microsoft, I thought you guys might be interested, or someone might find it handy someday.

    So, symptoms: over the past few months we've had infrequent, intermittent data corruption problems in multiple user databases on this cluster. Most of them raised Sev 20 or Sev 21 alerts through SQL Server (although some errors seemed to skip the SQL alerts and were only written to the server event log). After most of the errors, I had to run DBCC CHECKDB with REPAIR_FAST (sometimes REPAIR_REBUILD) to fix the issues. None of the customers reported data loss due to the corruption... either it's something they didn't notice, or they re-entered the data, or we've just been lucky that it hasn't happened... yet!
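For anyone who wants to run the same kind of integrity check across many databases, here is a minimal sketch that only generates the T-SQL text. The helper name `checkdb_script` and the database names are hypothetical, not from our setup. The generated statements only check and report. They leave out the repair options on purpose, because a repair requires the database to be in single-user mode.

```python
def checkdb_script(database_names):
    """Build one check-only DBCC CHECKDB batch per database name."""
    batches = []
    for name in database_names:
        # Double any single quotes so the name stays a valid T-SQL string literal.
        safe_name = name.replace("'", "''")
        # NO_INFOMSGS suppresses informational output so only errors are shown.
        batches.append(f"DBCC CHECKDB ('{safe_name}') WITH NO_INFOMSGS\nGO\n")
    return "".join(batches)


if __name__ == "__main__":
    # Hypothetical database names, for illustration only.
    print(checkdb_script(["CustomerA", "CustomerB"]))
```

The output can be saved to a file and run during a maintenance window. Any repair would still be a separate, manual decision for each database.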

    I also noticed some errors reported by Compaq Insight Manager on some of the physical disks in the data drive array, so we ended up replacing two disks in the stripe set (the spare and then one active drive). However, in the weeks following that we still received data corruption errors, and some of the other drives in the stripe set have now reported a couple of new read or write errors.

    We have ~250 customer databases on the cluster, and I can't move them to the other cluster because starting to do that caused performance issues on that other cluster (see other thread).

    Well, that's the general gist of the problem. We uploaded MPS reports to Microsoft but haven't received anything useful yet.

    Thanks,

    Paula

  • Problem is getting worse! I received a Sev 24 torn page error this afternoon and a Sev 21 error last night. Luckily, none of the errors resulted in corruption.

    After being on the phone all day with Microsoft, here is our current troubleshooting to-do list:

    - Update drivers and firmware for the servers and the Compaq StorageWorks device.

    - Disable the read cache for a couple of days to see if that helps (this could cause performance issues, so I'm not sure we'll be able to do it yet).

    - Upgrade from our SP3 (build 760) version to SP4, or at minimum to the SP3a/847 hotfix level.  http://support.microsoft.com/default.aspx?scid=kb;en-us;826433  This KB sure describes 95% of the data corruption errors we've had over the past few months.

    - Run the SQLIOStress tool on the cluster during the maintenance window.

    - Have our server engineer open a case with HP to troubleshoot the hardware side, since hardware is also a strong possible cause of our problem.
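As a quick way to confirm whether an instance has reached the hotfix level from the list above, here is a minimal sketch that compares build strings. It assumes the version is available as a dotted string such as `8.00.760` (the form returned by `SELECT SERVERPROPERTY('ProductVersion')`). The function names and the default minimum of `8.00.847` are illustrative choices based on the builds mentioned in this thread.

```python
def parse_build(version):
    """Split a dotted version string such as '8.00.760' into a tuple of ints."""
    return tuple(int(part) for part in version.strip().split("."))


def meets_minimum(version, minimum="8.00.847"):
    """Return True if `version` is at or above the `minimum` build."""
    # Tuples compare element by element, so (8, 0, 2039) > (8, 0, 847).
    return parse_build(version) >= parse_build(minimum)


if __name__ == "__main__":
    # 760 is the SP3 build we are on now; 847 is the hotfix level discussed above.
    for build in ("8.00.760", "8.00.847"):
        status = "ok" if meets_minimum(build) else "needs upgrade"
        print(build, status)
```

Comparing integer tuples instead of raw strings avoids a four-digit build number sorting below a three-digit one.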

    It's going to be a long night for me Saturday night... that's for sure!

     

  • Thanks for sharing with us... too bad we can't be of more help.
