Dude, The Server's Down

Steve Jones - SSC Editor

SSC Guru

Points: 728164
March 12, 2006 at 10:35 am

#189716

My buddy, Dean, recently sent me an IM with a KB article, asking me if I'd experienced this. He has 10 servers, supposedly all built the same and set up identically, but 2 are acting differently. Every 30 days, 2 of them shut down around midnight for no reason. No logs, no errors, nothing.
So they are looking around and find the KB article. Makes sense to them, so they are planning on upgrading to SP4 and hopefully solving the issue. Being in the online, e-commerce business, downtime is a bit of a problem for them and even though he is not a DBA, he and his colleagues are under pressure to solve the problem.
So I said I'd look at the servers and dig in and see if I could find anything. I started with the KB article, which is interesting in and of itself. It states in the symptoms exactly what is happening: the server shuts down with no logging. The resolution is given as an upgrade to SP4 (or the latest SP), and then a hotfix is listed. It confirms that this problem is fixed in SP4. The cause is given as ...
Oh wait. There's no cause given.
Is this a mistake or do they have no idea? And if they are not sure what the root cause is, then what exactly does the hotfix do? Did they just deploy some changes and it stopped happening on their test servers? Isn't there something like WinDbg for SQL Server that they can stick on a test server and find out the issue?
Perhaps the cause was inadvertently left out. It's likely that this is just a data entry error by whoever loads the KB articles. I hope so.
I'd hate to think that the engineers in Redmond have no idea and threw something over the wall.
Steve Jones
PS, congrats to Syracuse on winning the Big East. An exciting few days in sports.
Follow me on Twitter: http://www.twitter.com/way0utwest
Forum Etiquette: How to post data/code on a forum to get the best help
My Blog: www.voiceofthedba.com

Rudyx - the Doctor SSC-Forever Points: 43695 · Answer 1

'No cause' - hmmm - I've run into this less than a handful of times over the last decade. Not too shabby a record in my book. However, if it is not an oversight, then it is very troubling. The only thing more troublesome is when you call PSS and find out that you've run into an 'internally documented' bug. 'Internally documented' means at least one or two people have run into it, but not enough to hit the 'magical' (undocumented) threshold to actually publish a public KB article with a fix or workaround!

Regards
Rudy Komacsar
Senior Database Administrator
"Ave Caesar! - Morituri te salutamus."

Edward W. Stanley SSCertifiable Points: 6462 · Answer 2

It's one of three things:

1) Sounds like the time gets out of sync between the servers.

2) SQL Server's memory is not fixed, so it uses a variable amount. I'm putting forward that some process on those boxes might let the OS run out of memory and cause a reboot.

3) Chuck Norris flexed his arms and the sonic boom rebooted the servers.

Frank Buchan Ten Centuries Points: 1395 · Answer 3

Maybe it's a feature to keep the job interesting.

I have a similarly exciting and apparently inexplicable feature in Visual Studio 2005. When I open the Help system and it crashes (for a known reason that is presently irreparable), the entire IDE magically disappears without comment, error or bothering to save open files. I used to consider it a bug until I realized it was really AI in action. As I had the help file open, the IDE deemed I was probably on the wrong track and tossed my last several minutes of work. I know it's a feature now because immediately after such events (about once a week), I restart and for some reason manage to solve the problem for the coding team without needing to read any references.

Steve Jones - SSC Editor SSC Guru Points: 728164 · Answer 4

A little more info: They've re-IP'd the machines. One has been rebuilt completely (New OS install, SQL, etc) and the two machines perform the same function; one is a backup of another. Also they are behind a load balancer with only one active.

The timing between boots is around 30 days, but strangely enough, if they reboot due to patches, the time sometimes shortens. No pattern that is discernible from their end.
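One way to check whether the gaps really cluster around 30 days is to list the boot times and compute the intervals between them. A minimal Python sketch, assuming the boot timestamps have already been pulled from the event logs by hand; `boot_intervals` is a hypothetical helper name, and the dates below are made-up placeholders, not the actual servers' data.

```python
from datetime import datetime

def boot_intervals(boot_times):
    """Return the gaps, in days, between consecutive boot timestamps."""
    ordered = sorted(boot_times)
    return [(later - earlier).total_seconds() / 86400
            for earlier, later in zip(ordered, ordered[1:])]

# Placeholder timestamps for illustration only (not real log data).
sample = [
    datetime(2006, 1, 1, 0, 5),
    datetime(2006, 1, 31, 0, 5),
    datetime(2006, 2, 20, 0, 5),
]
for gap in boot_intervals(sample):
    print(f"{gap:.1f} days")
```

If a patch reboot resets some internal timer, the shortened gaps should show up next to the patch dates in a list like this.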

Also, I did see their reboot does not match the KB article in that the logs appear to show a "stop" message being sent to the server. My thought is that there is some scheduled process somewhere that sends a reboot to the machines. Some QA, SMS, or maybe MOM process. Or that someone wrote code with "shutdown" inside it that is rarely executed or maybe "miscommented" and runs periodically.
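To chase the "someone wrote code with shutdown inside it" theory, a quick scan of the script folders is one place to start. A rough Python sketch, assuming the scripts live as plain text files on disk; the command list and file extensions are my guesses and are not exhaustive, and `find_shutdown_calls` is a hypothetical name.

```python
import os
import re

# Commands that can stop or restart a Windows machine or SQL Server.
# This list is an assumption and is not exhaustive.
PATTERN = re.compile(r"\b(shutdown|psshutdown|tsshutdn)\b", re.IGNORECASE)

def find_shutdown_calls(root, extensions=(".bat", ".cmd", ".vbs", ".sql")):
    """Yield (path, line_number, line) for each line mentioning a shutdown command."""
    for folder, _dirs, files in os.walk(root):
        for name in files:
            if not name.lower().endswith(extensions):
                continue
            path = os.path.join(folder, name)
            # errors="replace" keeps the scan going on oddly encoded files.
            with open(path, errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if PATTERN.search(line):
                        yield path, number, line.strip()
```

Calling `list(find_shutdown_calls(some_folder))` on each script directory gives a list of hits to review by eye. It deliberately reports commented-out lines too, since a "miscommented" line is one of the suspects.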

Edward W. Stanley SSCertifiable Points: 6462 · Answer 5

One reason for a scheduled reboot might be to clear the logs; the internal ones can get quite large, especially if the logging level has been turned up. Or the logs are being written too fast, and that can cause a reboot as well.

You were not clear on whether they were DCs or whether any other software was installed. You know, the typical things like AV and Exchange could be complicating the matter.

Now, are all the drivers the same? I know that the Gigabit cards are known to have some flaws, though I think they only show up on obscure settings. Are there multiple NICs on the boxes? That can be awkward as well.

Are there trust relationships involved? Are there scheduled defrags? Some of that software can force a reboot.

But now I'm just grasping at Chuck Norris. And you know no one grasps at Chuck Norris and lives except Chuck Norris.

Some event logs and further topology might help.

I guess I'm starting from the outside and trying to discount those before blaming the mighty SQL SERVER...

Zixxer2Go SSC Rookie Points: 40 · Answer 6

We had the same thing happen to four of our production servers as well. However, this was caused by the unauthorized installation of a program called StyleList API that the development team was using to fix some uppercase and lowercase issues during hygiene processes. After uncovering the cause and sternly admonishing the responsible parties, I made contact with the vendor to see if they had a more current version or a fix. Unfortunately they did not, so the program was uninstalled. No more problem.

Carl Kepford SSC-Addicted Points: 423 · Answer 7

Three thoughts:

First, for me at least, your friend’s problem about servers stopping mysteriously in the middle of the night doesn’t seem to be closely related to this KB article. This KB article talks about the MSSQL service shutting down, not the server shutting down. Could be related of course, but I would look elsewhere first. Or am I misreading your post? Did your friend see these entries in the SQL logs about the MSSQL service shutting down?

Second, it doesn’t seem unusual to me for KB articles to list a symptom without a detailed explanation of the underlying cause, then follow up with a statement that the problem is fixed by an upgrade to the latest service pack or a hotfix. My guess is that any attempt at a detailed explanation would lead to many more questions and phone calls about the interpretation. And for many people, all they want to know is that the problem is fixed with the hotfix or service pack.

Third, it seems to me that this KB article states that the problem with MSSQL service stopping IS fixed by an upgrade to SP4. The confusing part of this article is that it also references a hotfix, not the SP4 service pack. I interpreted this to mean that a PRE SP4 hotfix is available for those SP3 users that don’t want to move to SP4. To confirm this, I did a quick file version check. All of my servers are at SP4. I checked my version of a couple of the files; life’s much too short to check all the files. I found that some of my SP4 files were NEWER than those listed in the hotfix. So I assume my interpretation is correct.
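The file version check can be scripted instead of done by eye. A minimal Python sketch, assuming the installed versions and the hotfix's listed versions have been typed into two dictionaries; the file names and version numbers below are placeholders for illustration, not the values from the KB article, and both function names are hypothetical.

```python
def parse_version(text):
    """Turn a dotted version string such as '2000.80.2039.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))

def files_older_than_hotfix(installed, hotfix):
    """Return names of files whose installed version is lower than the hotfix's."""
    return sorted(
        name for name, version in hotfix.items()
        if name in installed
        and parse_version(installed[name]) < parse_version(version)
    )

# Placeholder file names and version numbers, for illustration only.
installed = {"example_a.dll": "2000.80.2039.0", "example_b.dll": "2000.80.1000.0"}
hotfix = {"example_a.dll": "2000.80.1025.0", "example_b.dll": "2000.80.1025.0"}
print(files_older_than_hotfix(installed, hotfix))
```

Comparing the parts as integers rather than as strings matters here, because a plain string comparison would rank "2000.80.999.0" above "2000.80.1025.0". An empty result would support the reading that the hotfix is a pre-SP4 fix.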

Yelena Varshal SSC-Dedicated Points: 34286 · Answer 8

The only thing that is "better" than the "Internally Documented Errors" is "This Feature is By Design" (Microsoft) and "Type of Defect: Defect-Customer" (Third-Party Vendor)

Don't let me go into examples....

Regards,
Yelena Varshal

Donna Hawley Newbie Points: 6 · Answer 9

Has anyone ever used the blackbox trace? I was reading about it in the SQL Server magazine, March 2006. The article is about SQL Server 2005, but that feature is also available in SQL 2000. Wonder if that would yield any clues?

http://support.microsoft.com/default.aspx?scid=kb;en-us;281671

Will1922 SSCertifiable Points: 7189 · Answer 10

I can't believe that no one has suggested this, but are there any maintenance (vacuuming, trash removal) people scheduled to do anything during those times?

It's possible that a maid is unplugging the server to use the outlet for her vacuum cleaner and then plugging the server back in when she is finished.

If there is no UPS connected to the machine or there is a problem with the shutdown software, this would explain the no-log problem.

Sometimes the problem is a simple one.

Live to Throw
Throw to Live
Will Summers

Edward W. Stanley SSCertifiable Points: 6462 · Answer 11

If it's a DC, other servers would show its loss on the network in their logs, services would show restarts in the logs, and Win2k3 would be asking for a reason why it rebooted. I doubt they keep at least 6 servers on a plug removable by their maintenance personnel.

One thing that could be related, though, is a faulty UPS, where a brownout or a bad battery system might force a shutdown because the power is not consistent enough.

Steve Jones - SSC Editor SSC Guru Points: 728164 · Answer 12

Thanks for the suggestions. An SP4 application is set for this week to see if that helps.

The servers are in a colocation facility, actually a few racks down from the SSC servers. No maintenance personnel allowed with vacuums.

The shutdown is weird. It happened one time to the passive, unloaded server a few minutes after a patch reboot. No users on the system.

Very strange.

Edward W. Stanley SSCertifiable Points: 6462 · Answer 13

March 14, 2006 at 7:18 am

#626494

I'd run SFC.

Will1922 SSCertifiable Points: 7189 · Answer 14

Does the system have a 'nice' shutdown or is it simply turning itself off?

I would look into the internal temperature of the system or another nonsoftware problem. Maybe you have a fan that's stuck on an internal wire or it's spinning too slow. If the fan isn't hooked up to any kind of monitoring sensor, then you would never know.

The system may stay on the edge of shutting down during the day, and then at night the building warms up (to save money on AC) and pushes the server over the edge. We had a problem with a system that was unstable, and the room was warm. We installed an air conditioner in the room and now the server runs fine.

Lots of things can lead to instability. I know everyone is pointing to a software problem, but it may be hardware or environment related. I have seen weirder things.

With 10 servers built the same and with all the same patches, and only 2 of them acting up, I would look for a non-software problem.

If the same software is installed on two machines and runs differently on one, then it's a hardware problem. Why isn't it happening on all of the servers if it's a software problem? Doesn't make sense.

Live to Throw
Throw to Live
Will Summers