Thursday, April 18, 2013

Job Control Table - Part 4 - Preventing Simultaneous Job Execution - When DS lets you down......

So now we have a great method for preventing simultaneous job execution, but sometimes DS let you down and fails badly; it doesn't execute the try catch block and so doesn't doesn't write back to the SYS_AUDIT_JOB table to say that the job has stopped with errors. Instead it just leaves a line in the table with a start time, but no end time.

This could be caused by a server failure, or one of those fantastically helpful <ACCESS VIOLATION> errors.

When we come to run our job, the IsAnotherJobRunning function will look in the table to see if another job is running, and because DS didn't close off the last entry in there, it will think that a job is running, even though it isn't.

To overcome this problem there are 2, well maybe 3, solutions:

1. Manually insert an end time in the affected row in SYS_AUDIT_JOB. The problem with this method is that it requires you to open the database and manually run some SQL, or to edit the row with SQL Management Studio or some other DB front end tool. If your ETL support staff don't have that access then you need something a little easier to use.

2. ClearFailedJob is a job I've created that runs a little piece of SQL that adds a cleared row with an end time. This method is preferred as it keeps everything within DS, and allows support staff who don't have direct access to the database to correct the issue.

Here is how it works:

Create a job called ClearFailedJob with a string variable to insert the name of the job that has failed.

The only thing in you job will be a script that will create a "Clear" row.








The script inside the object is:


#Insert a new record into the SYS_AUDIT_JOB table
SQL('DATAMART_SQL_TARGET_DS','INSERT INTO SYS_AUDIT_JOB (AUDIT_ID, JOB_NAME, JOB_START_DATETIME, JOB_END_DATETIME, STATUS, ETL_VERSION) SELECT MAX(AUDIT_ID) + 1, {$G_JobName}, GetDate(), GetDate(), \'CLEARED\', 0 FROM SYS_AUDIT_JOB');

print('****************************************************');
print('** ');
print('** Inserted a CLEAR record into SYS_AUDIT_JOB.');
print('** ');
print('****************************************************');

So prior to executing the job we have an orphaned row in SYS_AUDIT_JOB that looks like this:









We then execute the job like this:







And after we run ClearFailedJobs we have a row like this:










Now you can run you job without IsAnotherJobRunning thinking you already have a job running. Also, this allows you to keep track of how often DS is failing badly and just dying on you without executing the try/catch blocks.

3. Adding a timeout to your IsAnotherJob function so that it only looks back a certain amount of time. So lets say your job only ever runs for 30 mins, so you set a timeout for 1 hour. So if you are trying to run a job 1 hour after the last run, and the last run failed without updating the SYS_AUDIT_JOB table, then the next run will start regardless. This method can be risky though, as your job may legitimately be taking longer than 1 hr to run and you end up clashing with it by mistake. So only consider this option if you are sure your job will never take longer than a certain amount of time to run.

No comments:

Post a Comment