need help with consistent database corruption

  • We’ve had some database corruption occur on 3 of our databases (out of around 250) 2 nights in a row and we’re not sure exactly what is causing it. The corruption and errors are eerily consistent in time and nature across both nights. Any help would be appreciated. After the first night, we restored all 3 databases to a good backup, but we experienced corruption again the following night on the same 3 dbs. The first indication on both nights was with DatabaseA during a maintenance job.

    All 3 databases reside on a SAN and all the checks have come out good according to our SAN guy. Memory has checked out good on the db server (there is a memory error in one of the logs). We’re running SQL 2005 SP3 on Windows Server 2003 SP2 with 32 GB of RAM. There are 2 instances on the db server with all 3 problematic databases on one instance.

    Here are the errors for the databases:

    *** DatabaseA ***

    We have a maintenance job that runs at ~ 12:55 am that performs a number of business related cleanup stuff but also records file space usage with the following commands:

    exec sp_spaceused @updateusage = TRUE

    dbcc sqlperf (logspace)

    DBCC SHOWFILESTATS

    During the running of that job the last 2 nights, SQL does a stack dump. Here is the abbreviated version of that dump:

    BEGIN DatabaseA DUMP

    A system assertion check has failed. Check the SQL Server error log for details. Typically, an assertion failure is caused by a software bug or data corruption. To check for database corruption, consider running DBCC CHECKDB. If you agreed to send dumps to Microsoft during setup, a mini dump will be sent to Microsoft. An update might be available from Microsoft in the latest Service Pack or in a QFE from Technical Support.

    Error: 3624, Severity: 20, State: 1.

    SQL Server Assertion: File: , line=1493 Failed Assertion = 'ret'. This error may be timing-related. If the error persists after rerunning the statement, use DBCC CHECKDB to check the database for structural integrity, or restart the server to ensure in-memory data structures are not corrupted.

    Error: 17066, Severity: 16, State: 1.

    * Expression: ret

    * Location: dbccusag.cpp:1493

    Unable to read and latch page (13638:1308622848) with latch type SH. UtilDbccVerifyPageId failed.

    Error: 8966, Severity: 16, State: 4.

    END DatabaseA DUMP

    At 2:23 am, the following error is thrown on DatabaseA:

    SQL Server detected a logical consistency-based I/O error: incorrect pageid (expected 1:4721; actual 30009:571206162). It occurred during a read of page (1:4721) in database ID 564 at offset 0x000000024e2000 in file ' .mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

    At 3:30 am during DBCC CHECKDB, another STACK DUMP is thrown with the following errors:

    DBCC encountered a page with an LSN greater than the current end of log LSN (3492:0:1) for its internal database snapshot. Could not read page (43521:2144598655), database ' ’ (database ID 17), LSN = (-2136952720:-1560195711:59267), type = 187, isInSparseFile = 1. Please re-run this DBCC command.

    DoMiniDump () encountered error (0x800703E6) - Invalid access to memory location.

    At 8 am, a manual CHECKDB creates another dump with the same LSN error above, as well as the following:

    SQL Server detected a logical consistency-based I/O error: torn page (expected signature: 0x0; actual signature: 0x443c57a0). It occurred during a read of page (1:4721) in database ID 564 at offset 0x000000024e2000 in file ' .mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

    One more note on DatabaseA... On night one, a full backup was made at 10 pm. That backup contained the corruption so we had to go back to tape to get a good full then restore a diff and some trans. I guess that would indicate that the corruption happened prior to any maintenance that occurred that night, but there is nothing in the logs indicating problems prior to 12:55 am. After restoring in the morning, we did a full backup before putting it back into production. A diff backup was performed that night at 10 pm. The next morning we successfully restored off of the full and diff from the previous night.

    *** DatabaseB ***

    A DBCC CHECKDB returns the following:

    DBCC encountered a page with an LSN greater than the current end of log LSN (1292:0:1) for its internal database snapshot. Could not read page (58492:317719243), database (database ID 17), LSN = (-336530415:-450460774:48410), type = 59, isInSparseFile = 1. Please re-run this DBCC command.

    *** DatabaseC ***

    A DBCC CHECKDB returns the following:

    DBCC encountered a page with an LSN greater than the current end of log LSN (2021:0:1) for its internal database snapshot. Could not read page (17377:-946470072), database (database ID 20), LSN = (-1278015064:1416871827:17038), type = 171, isInSparseFile = 1. Please re-run this DBCC command.

    That was a long post. If you're still reading, and you have some ideas, let me know. We're going to do diagnostics tonight on the local drives as well as a reboot. Then we'll probably be doing checkdb's throughout the evening and morning to determine the exact time of corruption.

    Regards,
    Rubes

  • In 99% of cases, corruption (especially recurring corruption) is a result of a problem with the IO subsystem. It may be faulty drivers, it may be faulty SAN controllers, it may be faulty disks, it may be a dodgy write cache.

    From the looks of the errors, especially the very wrong-looking page IDs (43521:2144598655), it looks like something is writing garbage over pieces of the SQL data file.

    I notice the reference to torn pages. This DB was upgraded from SQL 2000? Can you enable page checksums on all of your databases? It will help pinpoint whether the problem is memory or disk. If you can, once you've enabled page checksums, rebuild all your indexes. That will force the checksum to be calculated and written to every page, as it's not done when the option is enabled, rather the next time the page is written.

    If I was in your place, I'd be looking to move those databases onto alternate storage immediately.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • maybe stupid, but did someone use any form of compression ? (drive/LUN level)

    Johan

    Learn to play, play to learn !

    Dont drive faster than your guardian angel can fly ...
    but keeping both feet on the ground wont get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help[/url]

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • Thanks for the input Gail. Checksums are enabled and have been for quite a while. It is possible that the databases were originally SQL 2000 but I'm not sure.

    One thing that we've found since the original post is that our 3rd party backup software (Tivoli) has been throwing errors trying to access an ldf file. It's supposed to be configured to not backup that directory (only the backup directory holding the .bak files), but apparently it's hitting the db files in some way. Although, our corrupt databases are not in that directory, nor even on that drive, but all of our system databases including tempdb are. We've made some adjustments to the configuration of the backup software (although I don't trust it) and hopefully that is the root of the issue.

    BTW, we don't use any compression.

    Thanks for the help and I will keep you informed.

    Regards,
    Rubes

  • rubes (4/3/2009)


    Thanks for the input Gail. Checksums are enabled and have been for quite a while.

    Odd, because there's references to torn pages. It means that not all the pages have checksums on them, which they should if it's been enabled for a long time.

    We've made some adjustments to the configuration of the backup software (although I don't trust it) and hopefully that is the root of the issue.

    Highly, highly unlikely. If the backup software managed to get a lock on the file you would get a very different error. (Error 823 severity 23 with an operating system error listed as well)

    The errors you are getting are logical corruption messages. They mean that the OS completed the IO as requested, but the page/data returned was not as expected. It indicate that something somewhere is overwriting portions of the SQL data file with random values. A backup product won't do that.

    What can cause that is the following:

    Problems with the IO subsystem (vast majority)

    Misbehaving IO filter driver, like an antivirus (uncommon)

    A bug within SQL (unlikely, I haven't heard of anything like this on 2005 SP3)

    Faulty memory (very, very rare and won't be able to cause torn pages or checksum errors)

    see: http://www.sqlskills.com/BLOGS/PAUL/post/Search-Engine-QA-26-Myths-around-causing-corruption.aspx

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Gail (and others) - once a database is upgraded and checksums are enabled, torn-page protection is still in place until a page is subsequently changed - see Inside The Storage Engine: Does turning on page checksums discard any torn-page protection?

    This is definitely your I/O subsystem...

    Paul Randal
    CEO, SQLskills.com: Check out SQLskills online training!
    Blog:www.SQLskills.com/blogs/paul Twitter: @PaulRandal
    SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
    Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005

  • Paul Randal (4/3/2009)


    Gail (and others) - once a database is upgraded and checksums are enabled, torn-page protection is still in place until a page is subsequently changed

    I'm just surprised that there are any remaining pages with only torn page if the checksums have been enabled for a long time. Unless there are static tables that never get modified and never get an index rebuild, pages should have been modified at least once if the check sums have been enabled for quite a while.

    Though I suppose it depends on the definition of 'quite a while'

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Well... I rebooted the server and integrity checks passed after failing 2 nights in a row.

    One thing I didn't mention was that we had a bad SAN drive 2 days earlier. I didn't mention it because it was replaced and everything checked out fine including integrity checks. Corruption didn't show up in SQL until 2 days later and by that time the SAN guys were reporting nothing but health. Is it possible that we had corruption at that time but didn't realize it because all of the pages did not have checksums?

    One last question, is there any DBCC command that can identify a database page that does not have a checksum on it?

    Regards,
    Rubes

  • It's possible for a page to be corrupted, by a few bits being flipped here and there, that doesn't affect the structural integrity of the records, page, or index/table the page is part of (e.g. in the middle of a varchar column) - but that would be detected if a page checksum was present.

    There's no tool to either check which pages don't have checksums, or to put checksums on all pages - unfortunately.

    I'm working on something that will do that - but it's not ready for prime-time quite yet.

    Thanks

    Paul Randal
    CEO, SQLskills.com: Check out SQLskills online training!
    Blog:www.SQLskills.com/blogs/paul Twitter: @PaulRandal
    SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
    Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005

  • There's no tool to either check which pages don't have checksums, or to put checksums on all pages - unfortunately.

    I'm working on something that will do that - but it's not ready for prime-time quite yet.

    Any luck on this? I just recovered a DB for a customer that was in torn_page not checksum. From what I can tell it it was a power outage issue. But it took about 18 hours of work for what should have been about 10 minutes for SQL to get the DB back to a consistent state.



    ----------------
    Jim P.

    A little bit of this and a little byte of that can cause bloatware.

  • I am clearly a latecomer to this thread, but there is one vital thing that needs to be considered as well. Does the disk subsystem use disk mirroring? (aka RAID-1 or RAID-10)

    If mirroring is being used, then the absolute non-negotiable first step in dealing with suspected database corruption is to perform a verify on the disk volume. Otherwise, there exists the possibility of problems stemming from sectors on one side of the mirror differing from those on the copy. A verify will at least make sure that each sector on the two sides of the mirror contain the same thing.

    When a mirrored disk volume services reads, it can service them from either drive. The RAID hardware will perform the read from whichever drive it considers more available - which at any given time, is as good as random. If each drive contains a different version of one or more pages in a database, then it is likely impossible to repair the database without resolving this first. The repair will make the file consistent to whichever version happened to be read when the repair process ran, but as soon as the alternate version is fetched in response to a read request, the database will again appear corrupted.

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply