Alerts not firing?!?

  • We recently had a situation where we lost our connection to the SAN storage for about 1.25 minutes.  We're running W2K3 Ent, 8 instances of SQL2K Ent, about 120 dbs across them all.  Each database has alerts set up for errors with severity of 19 and higher.  SQLAgent was running for each instance, and all the alerts were enabled.  Only 4 alerts fired! They were all from the same database on one of the instances.  After going through all the error logs, I found that all of the instances had encountered write errors, and one of the instances even shut itself down.  All the alerts are created the same way, since we have a script we run for new databases to set everything up around them.  The only alerts that showed they had ever fired were the previously mentioned four.  Has anybody encountered a similar problem?

     

    Ian Dundas
    Senior IT Analyst - Database
    Manitoba Public Insurance Corp.

  • Yes, I have run into this situation many times, especially when the root cause is loss of connectivity to a SAN and, therefore, loss of the ability to access the database files.

    When an event occurs, SQL Server writes the event to the Windows Application event log. The SQL Agent monitors the event log, and for each SQL Server message, reads the msdb table sysalerts to determine what action, if any, is to be taken.

    If SQL Server cannot write to the Windows Application event log, or the SQL Agent cannot read either that event log or the msdb sysalerts table, then no alert actions can occur.

    With loss of connectivity to the SAN, since msdb is probably on the SAN, the SQL Agent probably cannot read the sysalerts table, and therefore no notification is performed.

    Hardware events need to be monitored using other tools such as Compaq's Insight Manager, HP's OpenView, IBM's Tivoli, BMC's Patrol or CA's Unicenter.

    SQL = Scarcely Qualifies as a Language

  • That makes sense.  I imagine the four that fired happened before the msdb data files for that instance went offline.  I guess it's time to get MOM moved from testing to production.  That, or write a simple service to monitor the event logs for these.  Hmm, might just do that anyways.
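
    A rough sketch of the filtering such a service might do, in Python. It assumes the monitor already has the log text as lines (reading the Windows event log or the SQL Server error log is not shown) and that each error carries the usual "Error: N, Severity: N, State: N" header. The names find_severe_errors and SqlError are made up for illustration, and the sample lines are invented, not taken from a real log. To be useful in a SAN outage, something like this would presumably have to run from storage that does not depend on the SAN.

```python
import re
from typing import Iterable, List, NamedTuple

# Header that SQL Server puts in front of an error message,
# e.g. "Error: 823, Severity: 24, State: 2".
ERROR_HEADER = re.compile(r"Error:\s*(\d+),\s*Severity:\s*(\d+),\s*State:\s*(\d+)")


class SqlError(NamedTuple):
    """One parsed error header plus the log line it came from (hypothetical type)."""
    error: int
    severity: int
    state: int
    line: str


def find_severe_errors(lines: Iterable[str], min_severity: int = 19) -> List[SqlError]:
    """Return every logged error whose severity is at least min_severity."""
    hits = []
    for line in lines:
        match = ERROR_HEADER.search(line)
        if match is None:
            continue  # not an error header line
        error, severity, state = (int(group) for group in match.groups())
        if severity >= min_severity:
            hits.append(SqlError(error, severity, state, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Invented sample lines, for illustration only.
    sample = [
        "2005-01-01 10:00:00.00 spid51    Error: 823, Severity: 24, State: 2",
        "2005-01-01 10:00:01.00 spid52    Error: 1205, Severity: 13, State: 50",
    ]
    for hit in find_severe_errors(sample):
        print(f"severity {hit.severity}: error {hit.error} (state {hit.state})")
```

    The threshold of 19 matches the severity the alerts in this thread were set up for; what to do with a hit (page someone, send mail) is left out on purpose.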

    Thanks,

    Ian Dundas
    Senior IT Analyst - Database
    Manitoba Public Insurance Corp.
