Dude, The Server's Down

  • My buddy, Dean, recently sent me an IM with a KB article, asking if I'd experienced this. He has 10 servers, supposedly all built the same and set up exactly alike, but 2 are acting differently. Every 30 days those 2 shut down around midnight for no reason. No logs, no errors, nothing.

    So they looked around and found the KB article. It made sense to them, so they are planning on upgrading to SP4 and hopefully solving the issue. Being in the online e-commerce business, downtime is a bit of a problem for them, and even though he is not a DBA, he and his colleagues are under pressure to solve the problem.

    So I said I'd look at the servers, dig in, and see if I could find anything. I started with the KB article, which is interesting in and of itself. The symptoms it describes are exactly what is happening: the server shuts down with no logging. The resolution is given as an upgrade to SP4 (or the latest SP), and then a hotfix is listed. It confirms that this problem is fixed in SP4. The cause is given as ...

    Oh wait. There's no cause given.

    Is this a mistake or do they have no idea? And if they are not sure what the root cause is, then what exactly does the hotfix do? Did they just deploy some changes and it stopped happening on their test servers? Isn't there something like WinDbg for SQL Server that they can stick on a test server and find out the issue?

    Perhaps the cause was inadvertently left out. It's likely that this is just a data entry error by whoever loads the KB articles. I hope so.

    I'd hate to think that the engineers in Redmond have no idea and threw something over the wall.

    Steve Jones

    PS, congrats to Syracuse on winning the Big East. An exciting few days in sports.

  • 'No cause' - hmmm - I've run into this less than a handful of times over the last decade. Not too shabby a record in my book. However, if it is not an oversight, then it is very troubling. The only thing more troublesome is when you call PSS and find out that you've run into an 'internally documented' bug. 'Internally documented' means at least one or two people have run into it, but not enough to hit the 'magical' (undocumented) threshold to actually publish a public KB article with a fix or workaround!

    Regards
    Rudy Komacsar
    Senior Database Administrator
    "Ave Caesar! - Morituri te salutamus."

  • It's one of three things:

    1) Sounds like the time gets out of sync between the servers.

    2) SQL Server's memory is not fixed; it is set to use a variable amount. My guess is that some process on those boxes lets the OS run out of memory, which causes a reboot.

    3) Chuck Norris flexed his arms and the sonic boom rebooted the servers.

  • Maybe it's a feature to keep the job interesting.

    I have a similarly exciting and apparently inexplicable feature in Visual Studio 2005. When I open the Help system and it crashes (for a known reason that is presently irreparable), the entire IDE magically disappears without comment, error or bothering to save open files. I used to consider it a bug until I realized it was really AI in action. As I had the help file open, the IDE deemed I was probably on the wrong track and tossed my last several minutes of work. I know it's a feature now because immediately after such events (about once a week), I restart and for some reason manage to solve the problem for the coding team without needing to read any references.

  • A little more info: They've re-IP'd the machines. One has been rebuilt completely (New OS install, SQL, etc) and the two machines perform the same function; one is a backup of another. Also they are behind a load balancer with only one active.

    The timing between boots is around 30 days, but strangely enough, if they reboot due to patches, the time sometimes shortens. There's no pattern that is discernible from their end.

    Also, I did see their reboot does not match the KB article in that the logs appear to show a "stop" message being sent to the server. My thought is that there is some scheduled process somewhere that sends a reboot to the machines. Some QA, SMS, or maybe MOM process. Or that someone wrote code with "shutdown" inside it that is rarely executed or maybe "miscommented" and runs periodically.
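
    One rough way to test the "scheduled process" idea is to line up the shutdown times and look at the gaps between them. Here's a minimal Python sketch, assuming the timestamps have already been copied out of the event logs by hand. The dates in it are made-up placeholders for illustration, not the actual data from these servers, and `gaps_in_days` is just a name I picked.

```python
from datetime import datetime

def gaps_in_days(timestamps):
    """Return the gap, in days, between each consecutive pair of shutdown times."""
    # Parse and sort so the input order does not matter.
    ordered = sorted(datetime.strptime(t, "%Y-%m-%d %H:%M") for t in timestamps)
    return [
        round((later - earlier).total_seconds() / 86400, 1)
        for earlier, later in zip(ordered, ordered[1:])
    ]

# Hypothetical shutdown times, for illustration only.
shutdowns = ["2006-01-05 23:58", "2006-02-04 23:59", "2006-02-20 00:03"]
print(gaps_in_days(shutdowns))
```

    If the gaps cluster tightly around one number, a scheduler somewhere is a good suspect; if they wander, as the patch reboots seem to suggest, something that counts from uptime rather than from the calendar may be more likely.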

  • One reason for a scheduled reboot might be to clear the logs. The internal ones can get quite large, especially if the logging has been turned up. Or the logs are being written too fast, and that can cause a reboot as well.

    You were not clear on whether they were DCs or whether any other software was installed. You know, the typical things like AV and Exchange could be complicating the matter.

    Now, are all the drivers the same? I know that the Gigabit cards are known to have some flaws, though I think they only show up on obscure settings. Are there multiple NICs on the boxes? That can be awkward as well.

    Are there trust relationships involved? Are there scheduled defrags? Some of that software can force a reboot.

    But now I'm just grasping at Chuck Norris. And you know no one grasps at Chuck Norris and lives except Chuck Norris.

    Some event logs and further topology might help. 

    I guess I'm starting from the outside and trying to discount those before blaming the mighty SQL SERVER...

  • We had the same thing happen to four of our production servers as well. However, this was caused by the unauthorized installation of a program called StyleList API that the development team was using to fix some uppercase and lowercase issues during hygiene processes. After uncovering the cause and sternly admonishing the responsible parties, I made contact with the vendor to see if they had a more current version or a fix. Unfortunately they did not, so the program was uninstalled. No more problem.

  • Three thoughts:

     

    First, for me at least, your friend’s problem about servers stopping mysteriously in the middle of the night doesn’t seem to be closely related to this KB article. This KB article talks about the MSSQL service shutting down, not the server shutting down. Could be related of course, but I would look elsewhere first. Or am I misreading your post? Did your friend see these entries in the SQL logs about the MSSQL service shutting down?

     

    Second, it doesn’t seem unusual to me for KB articles to list a symptom without a detailed explanation of the underlying cause, then follow up with a statement that the problem is fixed by an upgrade to the latest service pack or a hotfix.  My guess is that any attempt at a detailed explanation would lead to many more questions and phone calls about the interpretation. And for many people, all they want to know is that the problem is fixed with the hotfix or service pack.

     

    Third, it seems to me that this KB article states that the problem with MSSQL service stopping IS fixed by an upgrade to SP4.  The confusing part of this article is that it also references a hotfix, not the SP4 service pack.  I interpreted this to mean that a PRE SP4 hotfix is available for those SP3 users that don’t want to move to SP4. To confirm this, I did a quick file version check. All of my servers are at SP4.  I checked my version of a couple of the files; life’s much too short to check all the files. I found that some of my SP4 files were NEWER than those listed in the hotfix. So I assume my interpretation is correct.
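
    For anyone who wants to repeat that file version check, it boils down to comparing dotted version strings field by field as numbers, not as text. A minimal Python sketch follows; the version strings are hypothetical placeholders, not the ones listed in the KB article, and the function names are made up for illustration.

```python
def parse_version(text):
    """Turn a dotted version string such as '8.0.1000.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))

def is_newer(installed, hotfix):
    """True if the installed file version is strictly newer than the hotfix version."""
    return parse_version(installed) > parse_version(hotfix)

# Hypothetical versions, for illustration only.
print(is_newer("8.0.1000.0", "8.0.900.0"))  # numeric comparison: 1000 > 900
print("8.0.1000.0" > "8.0.900.0")           # plain string comparison gets this wrong
```

    Comparing tuples of integers avoids the trap where a plain string comparison ranks 900 above 1000.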

     

  • The only thing that is "better" than the "Internally Documented Errors" is "This Feature is By Design" (Microsoft) and "Type of Defect: Defect-Customer" (Third-Party Vendor)

    Don't let me go into examples....

     

    Regards,
    Yelena Varsha

  • Has anyone ever used the blackbox trace?  I was reading about it in the SQL Server magazine, March 2006.  The article is about SQL Server 2005, but that feature is also available in SQL 2000.  Wonder if that would yield any clues?

    http://support.microsoft.com/default.aspx?scid=kb;en-us;281671

  • I can't believe that no one has suggested this, but are there any maintenance (vacuuming, trash removal) people scheduled to do anything during those times?

    It's possible that a maid is unplugging the server to use the outlet for her vacuum cleaner and then plugging the server back in when she is finished.

    If there is no UPS connected to the machine or there is a problem with the shutdown software, this would explain the no-log problem.

    Sometimes the problem is a simple one.


    Live to Throw
    Throw to Live
    Will Summers

  • If it's a DC, other servers would show its loss on the network in their logs, services would show restarts in the logs, and Win2k3 would be asking for the reason for the reboot. I doubt they keep at least 6 servers on a plug that maintenance personnel can pull.

    One thing that could be related, though, is a faulty UPS, where a brownout or a bad battery system might force a shutdown because the power isn't consistent enough.

  • Thanks for the suggestions. An SP4 application is set for this week to see if that helps.

    The servers are in a colocation facility, actually a few racks down from the SSC servers. No maintenance personnel with vacuums are allowed.

    The shutdown is weird. It happened one time to the passive, unloaded server a few minutes after a patch reboot. No users on the system.

    Very strange.

  • I'd run SFC.

  • Does the system have a 'nice' shutdown or is it simply turning itself off?

    I would look into the internal temperature of the system or another non-software problem.  Maybe you have a fan that's stuck on an internal wire or it's spinning too slowly.   If the fan isn't hooked up to any kind of monitoring sensor, then you would never know.

    The system may stay on the edge of shutting down during the day, and then at night the building warms up (to save money on AC) and puts the server over the edge.  We had a problem with a system that was unstable and the room was warm.  We installed an air conditioner in the room and now the server runs fine.

    Lots of things can lead to instability.  I know everyone is pointing to a software problem, but it may be hardware or environment related.  I have seen weirder things.

    With 10 servers built the same and with all the same patches, and only 2 of them are acting up, I would look for a non-software problem.

    If the same software is installed on two machines and runs differently on one of them, then it's a hardware problem.  Why isn't it happening on all of the servers if it's a software problem?  Doesn't make sense.


    Live to Throw
    Throw to Live
    Will Summers
