Etiquette

  • Comments posted to this topic are about the item Etiquette

  • Interesting article...

    True, we do get too much junk emailed to us in the form of 'web server A has been down for 5 minutes!' when in reality the server has simply taken more than 150 ms to respond to a ping due to heavy network traffic at lunchtime.

    This IS a pain.

    I'm not sure I agree on the subject of 'don't alert me that everything is fine', though. We have a system here that emails out alerts if certain things happen, like a database file being about to grow beyond the 2 GB SCO HTFS limit; clearly that would require urgent and immediate attention. In times past the email server has, for one reason or another, begun to reject messages from this machine, or the cron job that sends the email has failed, and so on, and issues have occurred without our ever having received an email.

    Now the server sends us emails every single day to report the all clear. That way we have peace of mind that if anything DOES go wrong, the alerts will get to us in time.
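
    A rough sketch of that kind of daily all-clear might look like the following, assuming a scheduled task on the monitored box; the SMTP host, addresses, and the single stubbed check are made-up placeholders rather than the actual setup described above:

        # Daily heartbeat: send an "all clear" so that silence itself becomes an alert.
        # smtp.example.com, monitor@example.com and dba-team@example.com are placeholders.
        $checksPassed = $true   # replace with the real checks (e.g. a data file size nearing its limit)

        $subject = if ($checksPassed) { "dbserver01: all clear" } else { "dbserver01: ATTENTION NEEDED" }

        Send-MailMessage -SmtpServer "smtp.example.com" `
            -From "monitor@example.com" `
            -To "dba-team@example.com" `
            -Subject $subject `
            -Body ("Heartbeat sent {0}. If this mail stops arriving, check the sender host and its mail path." -f (Get-Date))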

    Ben

    ^ That's me!

    ----------------------------------------
    01010111011010000110000101110100 01100001 0110001101101111011011010111000001101100011001010111010001100101 01110100011010010110110101100101 011101110110000101110011011101000110010101110010
    ----------------------------------------

  • The company that I work for provides remote monitoring of various types of engines. Invariably, whenever we get a new customer, they want to be notified of every event, such as the engine starting or stopping. For some applications, an engine starting is a regular occurrence and we recommend only getting notifications when something goes wrong.

    It doesn't take long for the customers who insisted on everything to come back and ask to receive notices only on failures. For those who don't make the change, the notices become "noise", and when a true problem comes up they are more likely to dismiss it as noise.

    In general, I limit my own personal exposure to noise. Even though I often set up notifications for some new automated job, as soon as I am comfortable that the job is working as expected, I get rid of the notifications. I had Twitter set up in my Outlook, but I found it distracting and removed it. I don't check my work e-mail on the weekends unless there is something critical going on that I need to stay on top of.

    While the change in the way we communicate is great in some ways, you need to control how you interact with it or you will become less productive.

  • I totally agree with Ben. For some processes I WANT to know both when they fail and when they succeed. In particular, I have my database backups email me on success or failure, and every day I look over those emails to make sure the backups ran. Then I periodically go out and check the actual backup location as well, just in case the email is lying to me, and on top of that I periodically restore one of these backups to a testing box.

    I was trained that paranoia in terms of backups was one of the traits of a good DBA.
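
    A minimal sketch of that kind of daily verification against the backup history in msdb, assuming the SqlServer module's Invoke-Sqlcmd and a hypothetical instance name; it lists any database with no full backup in the last 24 hours:

        # List databases whose most recent full backup is missing or older than 24 hours.
        # "SQLPROD01" is a placeholder instance name; requires the SqlServer module.
        $query = "SELECT d.name, MAX(b.backup_finish_date) AS last_full_backup " +
                 "FROM sys.databases d " +
                 "LEFT JOIN msdb.dbo.backupset b ON b.database_name = d.name AND b.type = 'D' " +
                 "WHERE d.name <> 'tempdb' " +
                 "GROUP BY d.name " +
                 "HAVING MAX(b.backup_finish_date) IS NULL " +
                 "    OR MAX(b.backup_finish_date) < DATEADD(HOUR, -24, GETDATE());"

        $stale = Invoke-Sqlcmd -ServerInstance "SQLPROD01" -Query $query
        if ($stale) { $stale | Format-Table -AutoSize }   # anything listed has no recent full backup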

  • It's a question of scale. At one point I used to get every message, but when you start to manage hundreds of instances and thousands of databases, the success messages become noise. In fact, the one problem is that you get used to the emails arriving, and it's easy to miss something.

    I do understand wanting to know that things worked. At scale, one thing I've done is build a report that consolidates all backup messages. It includes all successes, but we would just verify that the email arrived (one email per day) as a note that things worked correctly. At the top of the email, however, we printed, in red, anything that failed or was notable. That way we would look for red at the top and check the failures, but not read the entire report. A rough sketch of how such a report could be assembled follows the list below.

    Our list of failures was:

    - Backup not run

    - Report not received from a particular server (at the central reporting server)

    - Backup grew > 20% from the previous day

    - Any job failure
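
    Here is the sketch of how that kind of consolidated report might be assembled, assuming a $results collection of per-server check objects has already been gathered; the property names, SMTP host, and addresses are placeholders rather than the actual system described above:

        # Failures in red at the top, successes below, one mail per day.
        # Assumes $results holds objects with Server, Check, Status and Detail properties.
        $failures  = $results | Where-Object { $_.Status -ne 'OK' }
        $successes = $results | Where-Object { $_.Status -eq 'OK' }

        $html = "<html><body>"
        if ($failures) {
            $html += "<p style='color:red;font-weight:bold'>FAILURES</p>"
            $html += ($failures | ConvertTo-Html -Fragment) -join "`n"
        }
        $html += "<p>Successes</p>"
        $html += ($successes | ConvertTo-Html -Fragment) -join "`n"
        $html += "</body></html>"

        Send-MailMessage -SmtpServer "smtp.example.com" -From "monitor@example.com" `
            -To "dba-team@example.com" -Subject "Daily backup report" -Body $html -BodyAsHtml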

  • Steve,

    Does your report go out and check the backup location to make sure the backups are really there? This would be so cool.

    How often do you do a test restore on a backup for each database on average?

    Our shop is quite small, so this works for us now, but the number of databases is growing, and it would be nice if I could automate a report instead of getting all these emails.

    What did you use to write the report?

  • The way I used to do it was to have each machine check itself. It was a combination of VBScript and stored procs that checked the actual backup file overnight for existence, modified date, and size. The results were stored on each machine, so even if the central machine was down, we could get a "report" from each server.

    Then a central machine would query each of the other machines for its data, consolidate it into one report, and send an email. A left join would tell us if we were missing a report from a machine.

    We also included new databases as a "red" item.

    In terms of restores, we would randomly do a few a week from the list, just because of space. We couldn't afford to do daily restores on all databases. Now, if I had a tool like Virtual Restore from Red Gate, or the one from Idera, I'd automate a "mount" of the backup to be sure it worked without error on each server.

    I'd also do this in PowerShell these days.
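
    Short of mounting the backup with a tool like Virtual Restore, a cheaper (and weaker) smoke test is RESTORE VERIFYONLY, which confirms the backup media is readable but does not prove the database would restore cleanly. A minimal sketch, with a hypothetical path and instance name:

        # Verify the backup file is readable; this is NOT the same as proving a restore would succeed.
        # The path and instance name are placeholders.
        $backupFile = "D:\Backups\SalesDB_full.bak"
        Invoke-Sqlcmd -ServerInstance "SQLPROD01" `
            -Query "RESTORE VERIFYONLY FROM DISK = N'$backupFile';" `
            -QueryTimeout 600   # large files can take a while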

    The key for me was to have each machine check itself. That way if we had any network issue, we still had the data on that machine. Build a process that you can deploy to any machine, and let it gather data (and trim it over time). Then you can easily automate a centralization of this data. We used SSIS, but you could use linked servers if you want.
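
    A rough PowerShell sketch of that per-machine check and of the central "left join" comparison; the paths, size threshold, and server names are placeholders, and a real process might write to a local table rather than a flat file:

        # --- Runs on each machine: does last night's backup file exist, is it recent, is it a sane size? ---
        $backupPath = "D:\Backups\SalesDB.bak"      # placeholder path
        $minSizeMB  = 10                            # placeholder sanity threshold

        if (-not (Test-Path $backupPath)) {
            $status = "MISSING"
        } else {
            $file     = Get-Item $backupPath
            $ageHours = ((Get-Date) - $file.LastWriteTime).TotalHours
            $sizeMB   = [math]::Round($file.Length / 1MB, 1)
            $status   = if ($ageHours -gt 24 -or $sizeMB -lt $minSizeMB) { "SUSPECT" } else { "OK" }
        }

        # Keep the result locally so each machine still has its own history if the network is down.
        "$((Get-Date).ToString('s')),$env:COMPUTERNAME,$status" |
            Add-Content -Path "D:\Backups\backup_check_log.csv"

        # --- Runs on the central machine: the "left join" idea, expected servers minus servers that reported. ---
        $expected = @("SQL01", "SQL02", "SQL03")    # placeholder inventory
        $reported = @("SQL01", "SQL03")             # servers whose data actually arrived overnight
        $missing  = $expected | Where-Object { $_ -notin $reported }
        if ($missing) { Write-Output "No report received from: $($missing -join ', ')" }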

  • Ben:

    BenWard (6/14/2011)


    ...issues have occurred without our ever having received an email.

    I'm going to take both sides on this one. I agree with you about not wanting to be blindsided by a problem because an email wasn't sent.

    I'm with Steve about not paging people unnecessarily, and not sending a large amount of data (the recipient may be reading the message on a phone). For example, maybe a test system only needs to be checked hourly, instead of every five minutes.

    And nobody needs to get the kind of pages we get when we're on call:

    2:00 AM CRITICAL: Test database login failed!

    2:02 AM CLEAR: Test database login succeeded.


    Peter Maloof
    Serving Data

  • Regarding the success vs. fail messages, SQL Server has long had known issues with reliably sending email, and it can't tell you when email is failing because... it can't email.

    One shop I used to work at had all success emails routed to a particular Exchange account. Some bright whiz kid on the team had interfaced with it via .NET and dumped the success/fail information down to a single server that the DBAs would constantly monitor.

    THAT server would detect if we got all our successes, and if so, send out our success email. If it didn't, it would tell us failure, what server(s) were missing, and when. We'd then go get our hands on that machine and find out why it was being whiny.

    If the app got failure messages, it would re-route the message (forward it in Exchange, basically) to the DBA email groups.

    One heck of an app, I really wish I knew how they built it. It covered both sides of the problem nicely.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • It can definitely be tricky to get the right balance when it comes to alerting. I think the biggest issue I've seen, in addition to simple noise, is reducing the false positives, which can have the same effect.

    For instance, when monitoring that server X is alive and well, you obviously want to be informed as quickly as possible if it stops responding; but what if the real issue is the link between the monitoring system and server X? On top of that, if you monitor a few specific services on server X, you only want to be alerted when a service stops responding, not when the entire server, or the connection to it, is down. You need to get the balance right, so your connection alerts fire more rapidly than your server alerts, which fire more rapidly than your service alerts.

    In addition, it can be easy to think that you need to know everything, but at times you need to be strict with yourself and decide what is and isn't critical in different circumstances, and which alerting methods fit best. Taking mail servers as an example, we've got redundancy in place, so if one goes offline that's not a major issue. The team receives an e-mail alert 24x7 so we can easily see what has happened, but a paged alert would be overkill. During the hours when the engineers can be expected to be awake we'll page them if there's a failure, but if a single mail server goes offline at 3 a.m. I sure as hell don't want to be woken up to be told about it. We also monitor the mail servers as a whole, and if none of them respond then we'll get alerted regardless of what time it is.
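
    A rough sketch of that paging rule (page for a single failure only during waking hours; always page if the whole pool is down); the hours, pool size, and addresses are placeholders:

        # Page the on-call engineer for a single mail server only during waking hours,
        # but always page if the entire pool is unresponsive. Everything here is a placeholder.
        $hour        = (Get-Date).Hour
        $wakingHours = ($hour -ge 7 -and $hour -lt 23)

        $downServers = @("mail01")                  # whatever the checks reported as down
        $poolSize    = 3

        if ($downServers.Count -eq $poolSize -or ($downServers.Count -gt 0 -and $wakingHours)) {
            Send-MailMessage -SmtpServer "smtp.example.com" -From "monitor@example.com" `
                -To "oncall-pager@example.com" `
                -Subject "Mail pool alert: $($downServers -join ', ') down" `
                -Body "Down: $($downServers -join ', ') of $poolSize mail servers."
        }
        # The 24x7 e-mail alert to the team would still be sent in every case.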

    The final thing is making allowances for blips, for instance where one server reports if it's unable to connect to another. If the comms drop briefly, or a VPN connection resets, I really don't care; invariably the connection will come back within a few minutes, long before I've read the e-mail and can try doing anything about it. In all our monitoring scripts and systems I try to include an element of historical tracking, e.g. don't alert me on the first failure, but if the check is still failing after x many tests then it's unlikely to recover on its own and I need to know about it.
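
    A minimal sketch of that "don't alert on the first failure" idea, keeping a consecutive-failure count in a small state file between scheduled runs; the target, threshold, paths, and addresses are placeholders:

        # Only alert after several consecutive failures, so brief blips never page anyone.
        $threshold = 3
        $stateFile = "C:\Monitoring\targetX.failcount"

        $ok = Test-Connection -ComputerName "targetX.example.com" -Count 1 -Quiet

        if ($ok) {
            Set-Content -Path $stateFile -Value 0
        } else {
            $failCount = 0
            if (Test-Path $stateFile) { $failCount = [int](Get-Content $stateFile) }
            $failCount++
            Set-Content -Path $stateFile -Value $failCount

            if ($failCount -eq $threshold) {
                Send-MailMessage -SmtpServer "smtp.example.com" -From "monitor@example.com" `
                    -To "oncall@example.com" `
                    -Subject "targetX has failed $failCount consecutive checks" `
                    -Body "The check has failed $failCount times in a row and has not recovered on its own."
            }
        }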
