EventID 824 Revisited

  • Folks, in the past couple of days one of our SQL2005 DBs on a SAN disk has been generating Event ID 824 every few hours, requiring a restore of the DB each time. We have been having the problem on and off for a few months now, but over the last couple of days it has been worse.

    We have sent logs to MS and they have not found anything and recommended we have the HBAs swapped out.

    Our SAN guys have not seen any errors, and there are no errors reported by the OS.

    My question is: if I have other DBs on the same SAN drive and they have not had any problems whatsoever, how can it be that this one keeps becoming corrupted? Of course, it is the largest database on the disk and has the most updates against it.

    I am at my wit's end with this DB, and don't know what the next thing to check would be (short of changing the HBAs, of course).

    Has anyone run into anything like this? It seems to me that SQL2005 DBs are a lot more susceptible to database corruption than SQL2000.

    Has anyone out there seen anything similar? Thanks.

  • Event ID 824 is similar to the 823 error message, which relates to a problem in the I/O subsystem such as a failing disk drive, disk firmware problems, a faulty device driver, and so on. The 824 error indicates that the page was read from disk successfully, but SQL Server has discovered something wrong with the page. This usually indicates a problem in the I/O subsystem. Please check your hardware and, if possible, run the SQLIOTrace utility, which will help.

    Use the PAGE_VERIFY CHECKSUM option for that particular database.
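
    As a rough sketch of what that looks like in T-SQL (the database name below is a placeholder, not one taken from this thread):

    ```sql
    -- Assumption: [YourDatabase] is a hypothetical name; substitute the affected database.
    ALTER DATABASE [YourDatabase] SET PAGE_VERIFY CHECKSUM;

    -- Confirm which page verification option is now in effect.
    SELECT name, page_verify_option_desc
    FROM sys.databases
    WHERE name = N'YourDatabase';
    ```

    Keep in mind that a checksum is only added to a page the next time that page is written, so pages that have not been modified since the option was enabled are not yet protected by it.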

     

    Minaz Amin

    "More Green More Oxygen !! Plant a tree today"

  • Can you run SQLIOSim on those disks?

    This can help you prove that it is a hardware (or driver) problem.

  • SQLIOSim was run and nothing was discovered. Our DBA was then asked by some friends at MS to run a different DBCC command option (sorry, I don't recall it) and was able to discover a corrupted index and table that weren't showing up during a regular DBCC CHECKDB after performing restores. Going through a few backups, it seems that the backups were backing up an already corrupted database.

    It seems to me that this Eventid 824 is not always what it seems to be.
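
    For anyone wanting to catch damaged pages at backup time, one possible approach is sketched below. This is not the DBCC option mentioned above (that one is not recalled here); it only uses the CHECKSUM backup options and the suspect_pages table that ship with SQL Server 2005. The database name and backup path are placeholders.

    ```sql
    -- Assumption: [YourDatabase] and the backup path are hypothetical placeholders.
    -- WITH CHECKSUM makes the backup verify any existing page checksums as it reads
    -- the pages, and by default the backup fails if a bad page is found.
    BACKUP DATABASE [YourDatabase]
    TO DISK = N'X:\Backups\YourDatabase.bak'
    WITH CHECKSUM;

    -- Re-check the backup file itself without restoring it.
    RESTORE VERIFYONLY
    FROM DISK = N'X:\Backups\YourDatabase.bak'
    WITH CHECKSUM;

    -- Pages that SQL Server has flagged with 823/824-type errors are logged here.
    SELECT database_id, file_id, page_id, event_type, error_count, last_update_date
    FROM msdb.dbo.suspect_pages;
    ```

    This only detects pages whose checksum or torn-page bits are wrong, so it would not necessarily have found the index and table corruption described above.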

  • Hi.

    I had a similar problem years ago.  This was eventually fixed by upgrading the SAN firmware.  It was an MSA1000.

    In our case, the database was not corrupted, but the in-memory pages were. 

    DBCC wouldn't always report errors, but we had problems running a very large DTS package.  The interim workaround was to issue a DBCC DROPCLEANBUFFERS right before running the package.

    Of course the DBCC DROPCLEANBUFFERS is a pretty nasty thing to have to do to a production database, but it did solve our problem until we could get the firmware upgraded.
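
    A minimal sketch of that kind of workaround step is below. The CHECKPOINT beforehand is an addition that is commonly paired with this command, not something stated above; without it, dirty pages stay in the buffer pool because only clean buffers are dropped.

    ```sql
    -- Write dirty pages to disk first so that they become clean buffers.
    CHECKPOINT;

    -- Remove all clean pages from the buffer pool, forcing later reads to go to disk.
    DBCC DROPCLEANBUFFERS;
    ```

    Every query that runs afterwards has to reread its data from disk, which is why this hurts on a production server.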

     

  • That's interesting, thanks for that, Jeff. Unfortunately it is not as simple as this. We run dozens of servers off this particular SAN that are running Oracle under Unix (among other things) without issue. If the SAN were dedicated to SQL servers only, then I would take this into consideration, but it seems that the corruption we experienced has been there for a while (I don't know if it was caused by the SAN in the first place). It is not very reassuring that the errors are not being reported correctly. Since I am not a DBA (I come here to learn enough SQL Server to be a better sysadmin 🙂 ), I can only comment on what I see from my perspective.
