SQL Server Agent Job Running, but not running

  • Here's one I can't figure out and have been unable to find anything like it by doing internet searches:

    At our customer site, there is a SQL Server 2000 multi-step SQL Server Agent job that controls the ETL process for a data warehouse by starting a couple of DTS packages. The job has been running well for a number of years (around 8 years, actually) and has not changed. Over the past year or so, the job began exhibiting the behavior I am about to describe on rare ocassions... maybe averaging 2 or 3 times per month on any one of the four ETL servers in the system. Recently, the problem has become much more regular. Although it is still seemingly random, the frequency has jumped to nearly one server being affected per day.

    Since our customer has to manually intervene every time this happens, and the job schedule includes a midnight start (which fails to start in this scenario), the customer is becoming quite tired of the behavior. Since they don't employ a SQL Server DBA, they turn to me to solve these types of problems. Of course, my DBA experience is limited to OJT and experience, and this problem exceeds my troubleshooting capabilities because, likely, I don't have the the in-depth training to understand what's going on behind the scenes. Here's hoping you all can help.

    The behavior:

    A "master" SQL Server Agent Job runs a DTS package that checks to make sure the ETL servers are ready to run. The package checks a parameter table to make the determinition.

    If the parameters are properly set, as they would be at the end of a previously successful ETL process, then the "master" job starts up the main ETL job. This main job which we call Houston, resides on the local SQL Server and, in some setups, on a remote/linked server as well (when more than one ETL server feed a single Data Warehouse).

    The Houston job runs to completion, and here's where the problems begin. The system tables now show the job is not running, as appropriate. However, in EM, the job still shows "executing". If this state is seen prior to the next scheduled run, the procedure is to right-click on the job and choose to stop it. When that is done, an error message box comes back saying the job cannot be stopped because it is not running. After clicking OK on the message box, the status of the job has gone from "executing" to "cancelled".

    At this point, the scheduled job can kick off as normal. If this limbo-like status is not caught by observation, though, the scheduled "master" job will attempt to start the Houston job and then fail, reporting back that the Houston job is running. The fix at this point is the same as described above.

    The only thing that I know of which has changed aroung the time the acceleration of this problem from ocassional to nightly (but still random) was refreshing the DTS packages on the servers.

    Has anyone see something like this before or have an idea as to what might be causing the issue?

    I've copied the job create script below in case seeing it might help (user and password changed to ???).

    Thank you in advance for your help!!

    -- Script generated on 4/21/2010 11:58 AM

    BEGIN TRANSACTION

    DECLARE @JobID BINARY(16)

    DECLARE @ReturnCode INT

    SELECT @ReturnCode = 0

    IF (SELECT COUNT(*) FROM msdb.dbo.syscategories WHERE name = N'[Uncategorized (Local)]') < 1

    EXECUTE msdb.dbo.sp_add_category @name = N'[Uncategorized (Local)]'

    -- Delete the job with the same name (if it exists)

    SELECT @JobID = job_id

    FROM msdb.dbo.sysjobs

    WHERE (name = N'Houston')

    IF (@JobID IS NOT NULL)

    BEGIN

    -- Check if the job is a multi-server job

    IF (EXISTS (SELECT *

    FROM msdb.dbo.sysjobservers

    WHERE (job_id = @JobID) AND (server_id <> 0)))

    BEGIN

    -- There is, so abort the script

    RAISERROR (N'Unable to import job ''Houston'' since there is already a multi-server job with this name.', 16, 1)

    GOTO QuitWithRollback

    END

    ELSE

    -- Delete the [local] job

    EXECUTE msdb.dbo.sp_delete_job @job_name = N'Houston'

    SELECT @JobID = NULL

    END

    BEGIN

    -- Add the job

    EXECUTE @ReturnCode = msdb.dbo.sp_add_job @job_id = @JobID OUTPUT , @job_name = N'Houston', @owner_login_name = N'sa', @description = N'Execute package: Houston', @category_name = N'[Uncategorized (Local)]', @enabled = 0, @notify_level_email = 0, @notify_level_page = 0, @notify_level_netsend = 0, @notify_level_eventlog = 2, @delete_level= 0

    IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback

    -- Add the job steps

    EXECUTE @ReturnCode = msdb.dbo.sp_add_jobstep @job_id = @JobID, @step_id = 1, @step_name = N'Archive Status', @command = N'dtsrun /S(local) /U ??? /P??? /N "Stargate Truncate Stargate_Status Conditional"', @database_name = N'', @server = N'', @database_user_name = N'', @subsystem = N'CmdExec', @cmdexec_success_code = 0, @flags = 0, @retry_attempts = 0, @retry_interval = 1, @output_file_name = N'', @on_success_step_id = 2, @on_success_action = 4, @on_fail_step_id = 2, @on_fail_action = 4

    IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback

    EXECUTE @ReturnCode = msdb.dbo.sp_add_jobstep @job_id = @JobID, @step_id = 2, @step_name = N'Houston', @command = N'dtsrun /S(local) /U ??? /P??? /N "Houston"', @database_name = N'', @server = N'', @database_user_name = N'', @subsystem = N'CmdExec', @cmdexec_success_code = 0, @flags = 0, @retry_attempts = 0, @retry_interval = 0, @output_file_name = N'', @on_success_step_id = 0, @on_success_action = 1, @on_fail_step_id = 3, @on_fail_action = 4

    IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback

    EXECUTE @ReturnCode = msdb.dbo.sp_add_jobstep @job_id = @JobID, @step_id = 3, @step_name = N'create fail file', @command = N' Set fso = CreateObject("Scripting.FileSystemObject")

    set a = fso.CreateTextFile("D:\Stargate\DW_LoadOutput\status\load_fail")

    set fso = nothing

    a.close

    set a = nothing', @database_name = N'VBScript', @server = N'', @database_user_name = N'', @subsystem = N'ActiveScripting', @cmdexec_success_code = 0, @flags = 0, @retry_attempts = 0, @retry_interval = 1, @output_file_name = N'', @on_success_step_id = 4, @on_success_action = 4, @on_fail_step_id = 0, @on_fail_action = 2

    IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback

    EXECUTE @ReturnCode = msdb.dbo.sp_add_jobstep @job_id = @JobID, @step_id = 4, @step_name = N'eti log search', @command = N'dtsrun /S(local) /U ??? /P??? /N "Stargate v3.0 ETI Log Search"', @database_name = N'', @server = N'', @database_user_name = N'', @subsystem = N'CmdExec', @cmdexec_success_code = 0, @flags = 0, @retry_attempts = 0, @retry_interval = 1, @output_file_name = N'', @on_success_step_id = 5, @on_success_action = 4, @on_fail_step_id = 5, @on_fail_action = 4

    IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback

    EXECUTE @ReturnCode = msdb.dbo.sp_add_jobstep @job_id = @JobID, @step_id = 5, @step_name = N'log houston failure', @command = N'D:\Stargate\DW_Load\bin\general_log.cmd Houston Fail', @database_name = N'', @server = N'', @database_user_name = N'', @subsystem = N'CmdExec', @cmdexec_success_code = 0, @flags = 0, @retry_attempts = 0, @retry_interval = 1, @output_file_name = N'', @on_success_step_id = 0, @on_success_action = 2, @on_fail_step_id = 0, @on_fail_action = 2

    IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback

    EXECUTE @ReturnCode = msdb.dbo.sp_update_job @job_id = @JobID, @start_step_id = 1

    IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback

    -- Add the Target Servers

    EXECUTE @ReturnCode = msdb.dbo.sp_add_jobserver @job_id = @JobID, @server_name = N'(local)'

    IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback

    END

    COMMIT TRANSACTION

    GOTO EndSave

    QuitWithRollback:

    IF (@@TRANCOUNT > 0) ROLLBACK TRANSACTION

    EndSave:

  • *bump*

    This one's not going away for me, and I still haven't found the solution. Anyone have an idea?

    Thanks,

    Jason

  • I have seen almost this exact behavior. In my case, the job "thought" it completed, but the DTS process never ended.

    This was quite a while ago for me, so my memory is a little vauge, but I remember that I found that even though the job was not running, DTSRun.exe was still executing. Once I killed DTSRun.exe, things went back to normal. From there I was able to find there was a problem with the package. I'm sorry that I can't be more specific, but I would suggest that you check to see if DTSRun.exe is running when your job has supposedly completed- if so, have a look at the job and see if some unusual data condition is the culprit.

  • That makes sense! Thank you. I'll pass the info on to the customer and they can test the theory next time it happens (which will probably be tonight).

    I'll let you know how that goes.

    I guess the automated workaround would be to script something that looks for DTSrun.exe at the end of job and kills any processes it finds. Kinda brute force, but better than having to manually check for the condition on multiple servers twice a day, including at midnight.

    🙂

  • Well, the customer hasn't told me the problem has gone away, but they won't give me an update on it, either. Not sure what's up with that, but I will post with the update when I can get it...

  • Update:

    The original diagnosis was incorrect. The job *does* report as running in msdb. The DTSRun.exe process is still running after the DTS package seems to have completed.

    I used DTS package logging to try to capture what, if any, step was still running in a package when the DTSRun.exe process hung. There were no DTS steps running at all... all completed.

    The problem is still happening, but not as frequently. The random nature of the issue, coupled with the research showing that there are now DTS packages or steps running when the job/process is hung, leads me to believe this is a system/server issue caused by some as yet unknown resource or process conflict, DLL corruption, etc.

    Jason

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply