Thursday, April 18, 2013

Job Control Table - Part 4 - Preventing Simultaneous Job Execution - When DS lets you down......

So now we have a great method for preventing simultaneous job execution, but sometimes DS let you down and fails badly; it doesn't execute the try catch block and so doesn't doesn't write back to the SYS_AUDIT_JOB table to say that the job has stopped with errors. Instead it just leaves a line in the table with a start time, but no end time.

This could be caused by a server failure, or one of those fantastically helpful <ACCESS VIOLATION> errors.

When we come to run our job, the IsAnotherJobRunning function will look in the table to see if another job is running, and because DS didn't close off the last entry in there, it will think that a job is running, even though it isn't.

To overcome this problem there are 2, well maybe 3, solutions:

1. Manually insert an end time in the affected row in SYS_AUDIT_JOB. The problem with this method is that it requires you to open the database and manually run some SQL, or to edit the row with SQL Management Studio or some other DB front end tool. If your ETL support staff don't have that access then you need something a little easier to use.

2. ClearFailedJob is a job I've created that runs a little piece of SQL that adds a cleared row with an end time. This method is preferred as it keeps everything within DS, and allows support staff who don't have direct access to the database to correct the issue.

Here is how it works:

Create a job called ClearFailedJob with a string variable to insert the name of the job that has failed.

The only thing in you job will be a script that will create a "Clear" row.








The script inside the object is:


#Insert a new record into the SYS_AUDIT_JOB table
SQL('DATAMART_SQL_TARGET_DS','INSERT INTO SYS_AUDIT_JOB (AUDIT_ID, JOB_NAME, JOB_START_DATETIME, JOB_END_DATETIME, STATUS, ETL_VERSION) SELECT MAX(AUDIT_ID) + 1, {$G_JobName}, GetDate(), GetDate(), \'CLEARED\', 0 FROM SYS_AUDIT_JOB');

print('****************************************************');
print('** ');
print('** Inserted a CLEAR record into SYS_AUDIT_JOB.');
print('** ');
print('****************************************************');

So prior to executing the job we have an orphaned row in SYS_AUDIT_JOB that looks like this:









We then execute the job like this:







And after we run ClearFailedJobs we have a row like this:










Now you can run you job without IsAnotherJobRunning thinking you already have a job running. Also, this allows you to keep track of how often DS is failing badly and just dying on you without executing the try/catch blocks.

3. Adding a timeout to your IsAnotherJob function so that it only looks back a certain amount of time. So lets say your job only ever runs for 30 mins, so you set a timeout for 1 hour. So if you are trying to run a job 1 hour after the last run, and the last run failed without updating the SYS_AUDIT_JOB table, then the next run will start regardless. This method can be risky though, as your job may legitimately be taking longer than 1 hr to run and you end up clashing with it by mistake. So only consider this option if you are sure your job will never take longer than a certain amount of time to run.

Tuesday, April 16, 2013

Job Control Table - Part 3 - Preventing Simultaneous Job Execution

So you're in the management console and click execute on a job, look away for a few seconds and then can't remember if you actually clicked execute or not. So you click it again, only to realize you now have two instances of the same job running simultaneously.

As the one instance is loading data the other is dropping data, and before you know it you've corrupted your data warehouse and end up doing a full reload.

Or perhaps you have a client who wants near real time loads of their data warehouse  On average the job takes 8 minutes to run so you set it to run every 10 minutes. Which is fine except for those occasions where more data than usual needs to be processed and the job takes 11 minutes to run. The next job starts before the previous one is finished and ends up again corrupting your data.

So how can we stop data services from running more than one instance of a job at a time?

The easy way is to check the SYS_AUDIT table as a first step in any ETL load. If there is a row for the job in the tablewithout an end time, then there is an instance of that job still running, and we'll want to prevent any further executions of the job until the current execution is complete.

So to do this I have created a function called IsAnotherJobRunning. And the code for it looks like this:


print('****************************************************');
print('** ');
print('** Checking for existing jobs.');
print('** ');

#First, find the status of the last job.
$L_LastJobStatus = SQL('DATAMART_SQL_TARGET_DS','SELECT STATUS FROM SYS_AUDIT_JOB WHERE AUDIT_ID = (SELECT MAX(AUDIT_ID) FROM SYS_AUDIT_JOB where JOB_NAME = {job_name()})');
print('** Last job status: {$L_LastJobStatus}.');

$L_LastEndTime = SQL('DATAMART_SQL_TARGET_DS','SELECT JOB_END_DATETIME FROM SYS_AUDIT_JOB WHERE AUDIT_ID = (SELECT MAX(AUDIT_ID) FROM SYS_AUDIT_JOB where JOB_NAME = {job_name()})');
print('** Last end time: {$L_LastEndTime}.');

#We assume that if there is an end time then the job has finished.
if($L_LastEndTime IS NULL)
begin
print('** Running job found. ');
print('** ');
print('****************************************************');
return 1; 
end
else
begin
print('** Last job ended: {$L_LastEndTime}.');
print('** No running jobs found. ');
print('** ');
print('****************************************************');
return 0;
end

I then wrap all my jobs in a conditional which calls this function.








This is how the function is called:







If the function returns 0 there are no other jobs running and the execution can proceed.

If the function finds another job running it returns a 1 and the Conditional executes the Else. In my jobs I create a simple script that just returns some basic information to the trace file.

Eg.


print('****************************************************');
print('** ');
print('** Another ' || Job_Name() || ' job appears to already be running - this job must be aborted to prevent corruption or inconsistencies.');
print('** ');
print('****************************************************');


I put this in as standard in all jobs I create as it has saved me a few times. It's also handy when the client decides that nightly loads aren't enough and they want multiple loads during the day; you can just increase the load frequency and not worry about the loads crashing into one another.

Monday, April 15, 2013

Job Control Table - Part 2 - Auditing

So you may have noticed I called my job control table SYS_AUDIT in the previous past, as opposed to JOB_RUN or JOB_NUMBER. 

The other names are fine and might make more sense today, but the name relates back to the history of why I first created this kind of table.


There was a requirement to know which rows were loaded within each load so that we could trace back any data issues to a particular job run. Perhaps there was a data issue in the source system at load time, or there was an error in a particular load and we needed to clear things out.


In order to do this, I created the SYS_AUDIT table, and then I stamped the AUDIT_ID against every row loaded by the job. 


In my previous post I created a new line in the SYS_AUDIT table by assigning the audit id to a global variable called $G_AuditID like this:


$G_AuditID = NewAuditRecord($G_JobName, $G_EffectiveDateAndTime);


I then use that variable in my dataflows to assign an audit id to every row. To do that I create a field on every table called AUDIT_ID in which I stored $G_Audit_ID.


Now if I need to get just the rows that were updated for a particular load I can get them with some pretty simple SQL.


Select * from TABLE where AUDIT_ID = 32


Also, if a load just inserted a whole bunch of dodgy data I can get rid of it.


Delete from TABLE where AUDIT_ID = 32

Friday, April 12, 2013

Job Control Table - Part 1 - Setup

In this series of posts I am going to take you through how to setup a job control table, and the many benefits of having a job control table.

At the most basic level, the point of the job control table is to keep a record of every job run, when it started, when it finished, and whether it was successful.

Here is a sample of the type of data you might have in a control table:







So looking at the above data we can see a few things.
  1. The job normally takes around 40 mins
  2. It took a lot longer on the 3rd
  3. It failed on the 4th
  4. It was started again on the 4th and is still running
We are now able to build up some statistics about our job using this simple table.

Here is a simple SQL Server script that will create the above table:

CREATE TABLE [dbo].[SYS_AUDIT_JOB](
    [AUDIT_ID] [int] NOT NULL,
    [JOB_NAME] [varchar](100) NULL,
    [JOB_START_DATETIME] [datetime] NULL,
    [JOB_END_DATETIME] [datetime] NULL,
    [STATUS] [varchar](15) NULL,
 CONSTRAINT [PK_SYS_AUDIT_JOB] PRIMARY KEY CLUSTERED
(
    [AUDIT_ID] ASC
)
) ON [PRIMARY]


The reason I put the prefix SYS on this table is to differentiate from the rest of the warehouse tables that would normally have DIM or FACT as their prefix. SYS means its a table used by the ETL, and is not intended for reporting. At least not by end users.

In order to load this table with the correct information I have created some Data Services functions.

The first function is called NewAuditRecord. As the name implies, its purpose is to create a new record in the audit table when you start the job. Here is the code for the function:

#################################################################
#This function will log the start of the job and its details into
#the SYS_AUDIT_JOB table and return an unique Audit_ID for the job
#################################################################


#Check if the input job name is not null. If not then get the next value from the sequance
#number and return as Audit_ID. And insert a new record into the SYS_AUDIT_JOB table  marking
#the start of the job with current date and time


if($P_JobName is not null)
begin

    #Get the unique Audit_id
    $L_AuditID = SQL('DATAMART_SQL_TARGET_DS','SELECT max(AUDIT_ID) from SYS_AUDIT_JOB') + 1;
   
    #Insert a new record into the SYS_AUDIT_JOB table
    $L_Ret = SQL('DATAMART_SQL_TARGET_DS','INSERT INTO SYS_AUDIT_JOB (AUDIT_ID,JOB_NAME,JOB_START_DATETIME, JOB_END_DATETIME,STATUS, ETL_VERSION) VALUES ([$L_AuditID],{$P_JobName},{$P_EffectiveDateAndTime},NULL,\'STARTED\',{$G_ETLVersion})');
   

    print('****************************************************');
    print('** ');
    print('** Inserted new record into SYS_AUDIT_JOB.');
    print('** ');
    print('** Return value = {$L_Ret}');   
    print('** Audit ID = {$L_AuditID}');   
    print('** ');
    print('****************************************************');

    #return the unique Audit_Id for this Job
    Return $L_AuditID ;

end

Return 0;


As you can see its a pretty straight forward function in that you just pass it you job name, and the current datetime, and it inserts a row in the SYS_AUDIT_JOB table. Finally it prints out some information to the trace table. It also returns the audit_id to a variable called $G_AuditID - this will come in handy in part 2.

You may also notice I set the status to STARTED.

Here is how I call the function in the initialise script of my job:

$G_AuditID = NewAuditRecord($G_JobName, $G_EffectiveDateAndTime);

$G_JobName is a global variable and I populate like this:

$G_JobName = job_name();

$G_EffectiveDatAndTime is also a global variable that I populate like this:

$G_EffectiveDateAndTime = sysdate();

Once the job is complete I have another function called EndJob which bascially has this piece of code in it:

SQL('DATAMART_SQL_TARGET_DS','UPDATE SYS_AUDIT_JOB SET JOB_END_DATETIME = getdate(), STATUS = \'COMPLETE\' WHERE AUDIT_ID = {$P_AuditID} AND JOB_NAME = {$P_JobName}');

You'll notice that not only did I update the end time, but also set the status to COMPLETE.

One more thing...........

What if the job fails?

Well, in my try\catch block I have a function called HandleException. This sends out an email on failure, but it also closes off the line in the order table with the fail time and a status of FAILED.

SQL('DATAMART_SQL_TARGET_DS','UPDATE SYS_AUDIT_JOB SET JOB_END_DATETIME = getdate(), STATUS = \'FAILED\' WHERE AUDIT_ID = {$G_AuditID} AND JOB_NAME = {$G_JobName}');

So now that we have setup the job control table, my next few posts will take you through the benefits of having it beyond just simple job statistics.