DBCC CHECKDB ends with false positives

  • We have numerous instances of false positives on our DBCC CHECKDB runs on our production SAP servers with SQL Server 2005 SP3. That is the CHECKDB founds consistency errors in one or more tables, but when a DBCC CHECKTABLE is run on that table or another DBCC CHECKDB is run it shows no errors. We run DBCC CHECKDB daily on our 2 production systems and get a false positive about every one or two weeks.

    The DBCC CHECKDB runs from 5 to 6 hours.

    Whenever we get a false positive usually one of these two errors has occured during the run:

    spid189 DBCC CHECKDB is performing an exhaustive search of indexes for possible inconsistencies.

    spid2s SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [f:\pecdata2\pecdata2.ndf] in database [PEC]

    Here is an example of the error log:

    2010-03-11 22:47:17.49 spid135 DBCC CHECKDB (PEC) WITH no_infomsgs executed by RPU\SAPServicePEC found 4 errors and repaired 0 errors. Elapsed time: 5 hours 47 minutes 16 seconds.

    2010-03-11 22:47:17.51 spid135 Using 'dbghelp.dll' version '4.0.5'

    2010-03-11 22:47:17.60 spid135 **Dump thread - spid = 135, PSS = 0x00000001B875DBD0, EC = 0x00000001B875DBE0

    2010-03-11 22:47:17.60 spid135 ***Stack Dump being sent to F:\Microsoft SQL Server\MSSQL.1\MSSQL\LOG\SQLDump0018.txt

    2010-03-11 22:47:17.60 spid135 * *******************************************************************************

    2010-03-11 22:47:17.60 spid135 *

    2010-03-11 22:47:17.60 spid135 * BEGIN STACK DUMP:

    2010-03-11 22:47:17.60 spid135 * 03/11/10 22:47:17 spid 135

    2010-03-11 22:47:17.60 spid135 *

    2010-03-11 22:47:17.60 spid135 * DBCC database corruption

    2010-03-11 22:47:17.60 spid135 *

    2010-03-11 22:47:17.60 spid135 * Input Buffer 86 bytes -

    2010-03-11 22:47:17.60 spid135 * DBCC CHECKDB WITH NO_INFOMSGS;

    2010-03-11 22:47:17.60 spid135 *

    2010-03-11 22:47:17.60 spid135 * *******************************************************************************

    2010-03-11 22:47:17.60 spid135 * -------------------------------------------------------------------------------

    2010-03-11 22:47:17.60 spid135 * Short Stack Dump

    2010-03-11 22:47:17.91 spid135 Stack Signature for the dump is 0x00000000000002CC

    2010-03-11 22:47:20.96 spid135 External dump process return code 0x20000001.

    And what job log looks like:

    Job 'CheckDB' : Step 1, 'Check Database' : Began Executing 2010-03-11 17:00:00

    Msg 8928, Sev 16, State 1, Line 1 : Object ID 1789795471, index ID 1, partition ID 72057602467495936, alloc unit ID 72057602585395200 (type In-row data): Page (1:2387521) could not be processed. See other errors for details. [SQLSTATE 42000]

    Msg 8939, Sev 16, State 1, Line 1 : Table error: Object ID 1789795471, index ID 1, partition ID 72057602467495936, alloc unit ID 72057602585395200 (type In-row data), page (1:2387521). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 180488201 and -4. [SQLSTATE 42000]

    Msg 8976, Sev 16, State 1, Line 1 : Table error: Object ID 1789795471, index ID 1, partition ID 72057602467495936, alloc unit ID 72057602585395200 (type In-row data). Page (1:2387521) was not seen in the scan although its parent (1:2387779) and previous (1:2387520) refer to it. Check any previous errors. [SQLSTATE 42000]

    Msg 8978, Sev 16, State 1, Line 1 : Table error: Object ID 1789795471, index ID 1, partition ID 72057602467495936, alloc unit ID 72057602585395200 (type In-row data). Page (1:2387522) is missing a reference from previous page (1:2387521). Possible chain linkage problem. [SQLSTATE 42000]

    Msg 8990, Sev 16, State 1, Line 1 : CHECKDB found 0 allocation errors and 4 consistency errors in table 'pec.DSYS_PHCONT_E_CD' (object ID 1789795471). [SQLSTATE 01000]

    Msg 8989, Sev 16, State 1, Line 1 : CHECKDB found 0 allocation errors and 4 consistency errors in database 'PEC'. [SQLSTATE 01000]

    Msg 8958, Sev 16, State 1, Line 1 : repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (PEC). [SQLSTATE 01000]

    What is the reason for so many false positives?

    Is DBCC CHECKDB an unreliable way to check the databases?

    Any suggestions?

    Thanks!

  • As I posted over at SQLTeam...

    If checkDB reports errors it means that it found errors. They may be transient IO errors.

    Are you seeing any 825 errors in your error log? They're severity 10, so the standard alerts won't pick them up.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • No - they're not false positives. False positives were possible in 2000, but when I rewrote CHECKDB for 2005, some of the underlying mechanisms were changed to remove that possibility.

    You either have transient I/O problems (*very* possible and I see this frequently on customer systems), but more likely is that you have other daily maintenance (or just the regular workload) results in the corrupt pages being deallocated between the first and second consistency checks.

    I wrote this up in a bit more detail in my blog post: ]Misconceptions around corruptions: can they disappear?[/url] - that's the best explanation you're going to get.

    Yes, CHECKDB is a reliable (and the only) way to consistency check the database.

    Thanks

    Paul Randal
    CEO, SQLskills.com: Check out SQLskills online training!
    Blog:www.SQLskills.com/blogs/paul Twitter: @PaulRandal
    SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
    Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005

  • I run the second CHECKDB immediately after the first CHECKDB ends with errors without doing anything else in between and it runs clean.

  • Then you have transient IO problems. Are you seeing any 825 errors in your error log?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Chuck Olufson (3/31/2010)


    I run the second CHECKDB immediately after the first CHECKDB ends with errors without doing anything else in between and it runs clean.

    That may be so, but it means nothing. Your CHECKDB run time is 5h 47m, which means that if you run the CHECKDBs back-to-back, there's almost 6 hours between the times that each individual page in the database is being read by CHECKDB. That's a lot of time for transient I/O issues to disappear.

    Do you have the last few sets of errors to post? I'd like to see if there's a pattern in the page IDs.

    Also, the error about the I/O taking more than 15 seconds is a classic sign of an underpowered I/O subsystem, or a system that's struggling to work around faults in the I/O subsystem. The fact that you see these at the same time as the corruptions seems more than coincidental to me.

    You're not seeing false positives.

    Paul Randal
    CEO, SQLskills.com: Check out SQLskills online training!
    Blog:www.SQLskills.com/blogs/paul Twitter: @PaulRandal
    SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
    Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005

  • Do you have the last few sets of errors to post? I'd like to see if there's a pattern in the page IDs.

    Here are three more sets of errors:

    2010-03-18 22:55:33.08 spid188 DBCC CHECKDB (PEC) WITH no_infomsgs executed by RPU\SAPServicePEC found 4 errors and repaired 0 errors. Elapsed time: 5 hours 55 minutes 32 seconds.

    2010-03-18 22:55:33.09 spid188 Using 'dbghelp.dll' version '4.0.5'

    2010-03-18 22:55:33.09 spid188 **Dump thread - spid = 188, PSS = 0x000000008AB73BD0, EC = 0x000000008AB73BE0

    2010-03-18 22:55:33.09 spid188 ***Stack Dump being sent to F:\Microsoft SQL Server\MSSQL.1\MSSQL\LOG\SQLDump0019.txt

    2010-03-18 22:55:33.09 spid188 * *******************************************************************************

    2010-03-18 22:55:33.09 spid188 *

    2010-03-18 22:55:33.09 spid188 * BEGIN STACK DUMP:

    2010-03-18 22:55:33.09 spid188 * 03/18/10 22:55:33 spid 188

    2010-03-18 22:55:33.09 spid188 *

    2010-03-18 22:55:33.09 spid188 * DBCC database corruption

    2010-03-18 22:55:33.09 spid188 *

    2010-03-18 22:55:33.09 spid188 * Input Buffer 86 bytes -

    2010-03-18 22:55:33.09 spid188 * DBCC CHECKDB WITH NO_INFOMSGS;

    2010-03-18 22:55:33.09 spid188 *

    2010-03-18 22:55:33.09 spid188 * *******************************************************************************

    2010-03-18 22:55:33.09 spid188 * -------------------------------------------------------------------------------

    2010-03-18 22:55:33.09 spid188 * Short Stack Dump

    2010-03-18 22:55:33.11 spid188 Stack Signature for the dump is 0x00000000000002CC

    2010-03-18 22:55:35.51 spid188 External dump process return code 0x20000001.

    Job 'CheckDB' : Step 1, 'Check Database' : Began Executing 2010-03-18 17:00:01

    Msg 8928, Sev 16, State 1, Line 1 : Object ID 1747354581, index ID 1, partition ID 72057599211405312, alloc unit ID 72057599267438592 (type In-row data): Page (4:12148319) could not be processed. See other errors for details. [SQLSTATE 42000]

    Msg 8939, Sev 16, State 1, Line 1 : Table error: Object ID 1747354581, index ID 1, partition ID 72057599211405312, alloc unit ID 72057599267438592 (type In-row data), page (4:12148319). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 12716041 and -4. [SQLSTATE 42000]

    Msg 8976, Sev 16, State 1, Line 1 : Table error: Object ID 1747354581, index ID 1, partition ID 72057599211405312, alloc unit ID 72057599267438592 (type In-row data). Page (4:12148319) was not seen in the scan although its parent (3:2675110) and previous (4:12148318) refer to it. Check any previous errors. [SQLSTATE 42000]

    Msg 8978, Sev 16, State 1, Line 1 : Table error: Object ID 1747354581, index ID 1, partition ID 72057599211405312, alloc unit ID 72057599267438592 (type In-row data). Page (4:12148440) is missing a reference from previous page (4:12148319). Possible chain linkage problem. [SQLSTATE 42000]

    Msg 8990, Sev 16, State 1, Line 1 : CHECKDB found 0 allocation errors and 4 consistency errors in table 'pec.SWWCNTP0' (object ID 1747354581). [SQLSTATE 01000]

    Msg 8989, Sev 16, State 1, Line 1 : CHECKDB found 0 allocation errors and 4 consistency errors in database 'PEC'. [SQLSTATE 01000]

    Msg 8958, Sev 16, State 1, Line 1 : repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (PEC). [SQLSTATE 01000]

    2010-03-19 23:07:32.85 spid188 DBCC CHECKDB (PEC) WITH no_infomsgs executed by RPU\SAPServicePEC found 3 errors and repaired 0 errors. Elapsed time: 6 hours 7 minutes 32 seconds.

    2010-03-19 23:07:32.85 spid188 Using 'dbghelp.dll' version '4.0.5'

    2010-03-19 23:07:32.86 spid188 **Dump thread - spid = 188, PSS = 0x00000002316CFBD0, EC = 0x00000002316CFBE0

    2010-03-19 23:07:32.86 spid188 ***Stack Dump being sent to F:\Microsoft SQL Server\MSSQL.1\MSSQL\LOG\SQLDump0020.txt

    2010-03-19 23:07:32.86 spid188 * *******************************************************************************

    2010-03-19 23:07:32.86 spid188 *

    2010-03-19 23:07:32.86 spid188 * BEGIN STACK DUMP:

    2010-03-19 23:07:32.86 spid188 * 03/19/10 23:07:32 spid 188

    2010-03-19 23:07:32.86 spid188 *

    2010-03-19 23:07:32.86 spid188 * DBCC database corruption

    2010-03-19 23:07:32.86 spid188 *

    2010-03-19 23:07:32.86 spid188 * Input Buffer 86 bytes -

    2010-03-19 23:07:32.86 spid188 * DBCC CHECKDB WITH NO_INFOMSGS;

    2010-03-19 23:07:32.86 spid188 *

    2010-03-19 23:07:32.86 spid188 * *******************************************************************************

    2010-03-19 23:07:32.86 spid188 * -------------------------------------------------------------------------------

    2010-03-19 23:07:32.86 spid188 * Short Stack Dump

    2010-03-19 23:07:32.88 spid188 Stack Signature for the dump is 0x00000000000002CC

    2010-03-19 23:07:35.35 spid188 External dump process return code 0x20000001.

    Job 'CheckDB' : Step 1, 'Check Database' : Began Executing 2010-03-19 17:00:00

    Msg 8928, Sev 16, State 1, Line 1 : Object ID 1754175928, index ID 1, partition ID 72057603887792128, alloc unit ID 72057604035837952 (type In-row data): Page (3:12640636) could not be processed. See other errors for details. [SQLSTATE 42000]

    Msg 8976, Sev 16, State 1, Line 1 : Table error: Object ID 1754175928, index ID 1, partition ID 72057603887792128, alloc unit ID 72057604035837952 (type In-row data). Page (3:12640636) was not seen in the scan although its parent (4:12726851) and previous (3:12640635) refer to it. Check any previous errors. [SQLSTATE 42000]

    Msg 8978, Sev 16, State 1, Line 1 : Table error: Object ID 1754175928, index ID 1, partition ID 72057603887792128, alloc unit ID 72057604035837952 (type In-row data). Page (3:12640637) is missing a reference from previous page (3:12640636). Possible chain linkage problem. [SQLSTATE 42000]

    Msg 8990, Sev 16, State 1, Line 1 : CHECKDB found 0 allocation errors and 3 consistency errors in table 'pec.DPAYP' (object ID 1754175928). [SQLSTATE 01000]

    Msg 8989, Sev 16, State 1, Line 1 : CHECKDB found 0 allocation errors and 3 consistency errors in database 'PEC'. [SQLSTATE 01000]

    Msg 8958, Sev 16, State 1, Line 1 : repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (PEC). [SQLSTATE 01000]

    2010-03-24 23:00:07.72 spid189 DBCC CHECKDB (PEC) WITH no_infomsgs executed by RPU\SAPServicePEC found 4 errors and repaired 0 errors. Elapsed time: 6 hours 0 minutes 7 seconds.

    2010-03-24 23:00:07.73 spid189 Using 'dbghelp.dll' version '4.0.5'

    2010-03-24 23:00:07.73 spid189 **Dump thread - spid = 189, PSS = 0x000000028B3E9BD0, EC = 0x000000028B3E9BE0

    2010-03-24 23:00:07.73 spid189 ***Stack Dump being sent to F:\Microsoft SQL Server\MSSQL.1\MSSQL\LOG\SQLDump0021.txt

    2010-03-24 23:00:07.73 spid189 * *******************************************************************************

    2010-03-24 23:00:07.73 spid189 *

    2010-03-24 23:00:07.73 spid189 * BEGIN STACK DUMP:

    2010-03-24 23:00:07.73 spid189 * 03/24/10 23:00:07 spid 189

    2010-03-24 23:00:07.73 spid189 *

    2010-03-24 23:00:07.73 spid189 * DBCC database corruption

    2010-03-24 23:00:07.73 spid189 *

    2010-03-24 23:00:07.73 spid189 * Input Buffer 86 bytes -

    2010-03-24 23:00:07.73 spid189 * DBCC CHECKDB WITH NO_INFOMSGS;

    2010-03-24 23:00:07.73 spid189 *

    2010-03-24 23:00:07.73 spid189 * *******************************************************************************

    2010-03-24 23:00:07.73 spid189 * -------------------------------------------------------------------------------

    2010-03-24 23:00:07.73 spid189 * Short Stack Dump

    2010-03-24 23:00:07.75 spid189 Stack Signature for the dump is 0x00000000000002CC

    2010-03-24 23:00:10.16 spid189 External dump process return code 0x20000001.

    Job 'CheckDB' : Step 1, 'Check Database' : Began Executing 2010-03-24 17:00:00

    Msg 8978, Sev 16, State 1, Line 1 : Table error: Object ID 1007120616, index ID 1, partition ID 72057602155347968, alloc unit ID 72057602268594176 (type In-row data). Page (1:4769367) is missing a reference from previous page (3:4697626). Possible chain linkage problem. [SQLSTATE 42000]

    Msg 8928, Sev 16, State 1, Line 1 : Object ID 1007120616, index ID 1, partition ID 72057602155347968, alloc unit ID 72057602268594176 (type In-row data): Page (3:4697626) could not be processed. See other errors for details. [SQLSTATE 42000]

    Msg 8939, Sev 16, State 1, Line 1 : Table error: Object ID 1007120616, index ID 1, partition ID 72057602155347968, alloc unit ID 72057602268594176 (type In-row data), page (3:4697626). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 197265417 and -4. [SQLSTATE 42000]

    Msg 8976, Sev 16, State 1, Line 1 : Table error: Object ID 1007120616, index ID 1, partition ID 72057602155347968, alloc unit ID 72057602268594176 (type In-row data). Page (3:4697626) was not seen in the scan although its parent (4:4598691) and previous (3:4697628) refer to it. Check any previous errors. [SQLSTATE 42000]

    Msg 8990, Sev 16, State 1, Line 1 : CHECKDB found 0 allocation errors and 4 consistency errors in table 'pec.DFKKSUM' (object ID 1007120616). [SQLSTATE 01000]

    Msg 8989, Sev 16, State 1, Line 1 : CHECKDB found 0 allocation errors and 4 consistency errors in database 'PEC'. [SQLSTATE 01000]

    Msg 8958, Sev 16, State 1, Line 1 : repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (PEC). [SQLSTATE 01000]

  • GilaMonster (3/31/2010)


    Then you have transient IO problems. Are you seeing any 825 errors in your error log?

    No we are not having any 825 errors.

  • How often do you take full backups? It's worth taking your backups to another machine and restoring some of them and running DBCC CHECKDB on them to see if the corruption is being picked up by backups too.

    You're going to need to run I/O subsystem diagnostics (which likely won't show anything) and then SQLIOSim to expose the I/O subsystem issues. CHECKDB is one of the most I/O intensive things that SQL Server does, so ironically, it may only be while CHECKDB is running that the I/O subsystem is exhibiting these obvious problems.

    That's the diagnosis, there's no other explanation apart from a random corruption bug in SQL Server, and something corrupting index structures like that would be popping up all other the place. It's your I/O subsystem.

    One way to convince yourself is to do a DBCC PAGE on the pages in the error output and check to see that they really do link to each other - that would instantly confirm that the I/O subsystem is given a bad page image when CHECKDB asks for it.

    Thanks

    Paul Randal
    CEO, SQLskills.com: Check out SQLskills online training!
    Blog:www.SQLskills.com/blogs/paul Twitter: @PaulRandal
    SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
    Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005

  • I wrote this up in a bit more detail in my blog post: ]Misconceptions around corruptions: can they disappear?[/url] - that's the best explanation you're going to get.

    In this article it states "

    A corruption can seem to 'disappear' if a corrupt page is allocated at the time a DBCC CHECKDB runs, but is then deallocated by the time the next DBCC CHECKDB runs. The first time it will be reported as corrupt, but the second time it's not allocated, so it isn't consistency checked and won't be reported as corrupt. The corruption looks like it's mysteriously vanished. But it hasn't - it's just that the corrupt page is no longer allocated.

    "

    If the corrupt page is no longer allocated, does that mean that the corruption is still there and CHECKDB is unable to report it?

    What are we supposed to do about the error if a second CHECKDB will not show it?

    On one of our production servers we have had over 30 instances in the past year of errors reported on the first CHECKDB and then not showing on the second run. The last error was over a month ago and daily CHECHDB's since have showed no errors. Do we need to be concerned that corruption still exists?

    The other production server has had 4 instances in the past month of errors reported on the first CHECKDB and then not showing on the second run. The error logs for that server are posted here. The last error was a week ago and daily CHECHDB's since have showed no errors. Do we need to be concerned that corruption still exists?

  • Paul Randal (4/1/2010)


    How often do you take full backups? It's worth taking your backups to another machine and restoring some of them and running DBCC CHECKDB on them to see if the corruption is being picked up by backups too.

    We do full backups once a week, but do not do a backup if CHECKDB has reported errors.

    We only do a full backup immediately after a CHECKDB with no errors.

  • How often do you take full backups? It's worth taking your backups to another machine and restoring some of them and running DBCC CHECKDB on them to see if the corruption is being picked up by backups too.

    We do full backups once a week, but do not do a full backup if CHECKDB has reported errors.

    We only do a full backup immediately after a CHECKDB with no errors.

  • Hey Chuck,

    To answer your questions a few posts up:

    1) If the corrupt page is no longer allocated, does that mean that the corruption is still there and CHECKDB is unable to report it?

    Maybe. The page is no longer part of the allocated portion of the database and so CHECKDB can't read it - it doesn't know whether that page has ever been used and so what should be on it. That portion of the disk may still exhibit problems that prevent it being read though. SQL Server won't find out until it tries to reuse that portion of the data file.

    2) What are we supposed to do about the error if a second CHECKDB will not show it?

    Accept that your I/O subsystem has corruption problems, and that it's not CHECKDB not showing the corruption because something's wrong with CHECKDB.

    3) On one of our production servers we have had over 30 instances in the past year of errors reported on the first CHECKDB and then not showing on the second run. The last error was over a month ago and daily CHECHDB's since have showed no errors. Do we need to be concerned that corruption still exists?

    The other production server has had 4 instances in the past month of errors reported on the first CHECKDB and then not showing on the second run. The error logs for that server are posted here. The last error was a week ago and daily CHECHDB's since have showed no errors. Do we need to be concerned that corruption still exists?

    Any time any corruption shows up, it's cause for concern. 99.999% of the time (in the thousands of cases of corruption I've seen and worked on both inside and outside Microsoft) it's the I/O subsystem. 0.0005% is bad memory. 0.0005% is SQL Server corruption bugs.

    Your case has the signature of transient I/O subsystem corruption. I realize its hard to accept - almost a leap of faith - but that's what all the symptoms are pointing to.

    Thanks

    Paul Randal
    CEO, SQLskills.com: Check out SQLskills online training!
    Blog:www.SQLskills.com/blogs/paul Twitter: @PaulRandal
    SQL MVP, Microsoft RD, Contributing Editor of TechNet Magazine
    Author of DBCC CHECKDB/repair (and other Storage Engine) code of SQL Server 2005

  • Good Day,

    we are running SQL Server 2005 SP3 on our production server and the DBCC checkdb failed with a Severity 10 messageand even created stackdumps . Surely this cannot be just informational . I could not find any errors in the event log that could indicate disk errors . Should we run nnow dbcc checks on all the tables and drop the WA_sys Stats ? Can Double Take cause corruption like this ?

  • Please post new questions in a new thread and include the errors.

    Thanks

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply