Wednesday, May 29, 2013

Tip of the day - How to tell if your job has completed

If you are still staring at your trace file trying to figure out if the job you ran in Designer has finished yet, then you may find today's post very helpful.

Go to Tools --> Options...

Then break out Designer and click on General.

Then click the box that says: Show Dialog when job is completed.


Now whenever your job completes, you'll get a little dialog box popping up to let you know.

Friday, May 24, 2013

Tip of the day - Seeing the full name of your object

One of the annoying defaults in Data Services is that all the names in workflows or dataflows are cut off after 17 characters.



So to fix this, go to Tools --> Options...

Then break out Designer and click on General.

Where it says Number of characters in workspace name, change the number from 17 to 100. Click OK when it asks if you want to overwrite the job server parameters.



Now you'll be able to see the full names of objects in your workspace.

Tuesday, May 7, 2013

Creating a schedule with a file dependency

Last post I talked about how to create schedules that depend on other jobs completing first.

In this post I will show you how to make your job dependent on a file. This is really useful when your ETL needs to wait for an external process to finish before it can run.

To do this I created a function called WaitForFileDependency which you can take a look at below:

#First check if the override parameter has been set. If it hasn't been set then assume it should be false.
$P_Override = nvl($P_Override, 'FALSE');

#Print some details about the file we are looking for in the trace file
print('File Exists: ' || file_exists($P_PreRequisiteFile));
print('File Create Date: ' || get_file_create_date($P_PreRequisiteFile, '1900.01.01 00:00:0000'));
print('Last Job Run Date: ' || $P_LastRunDate);
print('Time Out Time: ' || $P_TimeOut);

#If the override variable was false, then start checking to see if the file is present.
if ($P_Override = 'FALSE')
begin
    print('Dependency file ' || $P_PreRequisiteFile);

    while (
            (
                #The file is not yet present
                (file_exists($P_PreRequisiteFile) = 0)
                or
                #Or the file is present, but it is left over from a previous run and so is an old file
                (get_file_create_date($P_PreRequisiteFile, '1900.01.01 00:00:0000') < $P_LastRunDate)
            )
            and
            #And we have not yet reached our timeout time
            systime() < $P_TimeOut
          )
    begin
        #The file hasn't arrived yet, so wait a while and then check again.
        print('Waiting for file ' || $P_PreRequisiteFile || ' to arrive.');
        sleep(300000);
    end

    #If we reach the timeout time and the file has not appeared yet then fail the job
    if (systime() > $P_TimeOut)
    begin
        print('Current Time: ' || systime() || ' > Timeout time: ' || $P_TimeOut);
        raise_exception('File not found before timeout reached. Job will now fail.');
    end

    #If the file exists, and it was created since the last time we ran the job, then you can start running the current job
    if ((file_exists($P_PreRequisiteFile) = 1)
        and
        (get_file_create_date($P_PreRequisiteFile, '1900.01.01 00:00:0000') > $P_LastRunDate))
    begin
        print('File ' || $P_PreRequisiteFile || ' found.');
        print('Starting current job.');
    end
end
else
begin
    #If the override variable was set to true, then just start running the job without waiting for the file to appear
    print('Dependency Override in effect. Will begin job immediately.');
end

Return 1;

The call to the function looks like this:

WaitForFileDependency($G_PreRequisiteFile, $G_Override_Dependency, $L_LastCompleteDate, to_date('1900.01.01 ' || $G_TimeOut, 'YYYY.MM.DD HH24:MI'));


  • $G_PreRequisiteFile - This is the file, including its path, that you are waiting for.
  • $G_Override_Dependency - If it's TRUE the job starts immediately, but if it's FALSE the job waits for the file to appear.
  • $L_LastCompleteDate - This is the last time the job was executed (see the sketch after this list for one way to populate it).
  • $G_TimeOut - This is the time of day you want to stop looking for the file and fail the job. If you normally expect the file to arrive by 5:00am, and it's still not present by 6:00pm, then you probably want to stop waiting for the file to arrive and alert people to this fact.
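
If you are using the SYS_AUDIT_JOB table from my job control series, one way to populate $L_LastCompleteDate (just a sketch; adapt it to your own setup) is:

#Last successful end time for this job; the nvl default covers the very first run
$L_LastCompleteDate = nvl(sql('DATAMART_SQL_TARGET_DS', 'SELECT MAX(JOB_END_DATETIME) FROM SYS_AUDIT_JOB
    WHERE JOB_NAME = {$G_JobName} AND STATUS = \'COMPLETE\''), to_date('1900.01.01', 'YYYY.MM.DD'));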

Friday, May 3, 2013

Job Control Table - Part 5 - Sequencing Jobs

One of the things missing from Data Services is the ability to chain jobs together so that as one job finishes the next one starts. While you could put the functionality of each job into a workflow and then create a large job that executes the whole thing, what if you don't want that?

Often it's safer to have a few little jobs; that way, if one little job fails, the whole process is not brought down, just that one job. Also, we might have one job that loads all the dimensions, and many jobs for the various fact tables. Another scenario could be if you need to load an ODS first, and then build data marts on that ODS data.

Well as the DS scheduler doesn't give us the option of creating job dependencies, I have created my own method, using the SYS_AUDIT_JOB table as my guide.

As seen in the previous posts on the job control table, we can always tell by looking at the SYS_AUDIT_JOB table whether or not a job has run on a particular day. We have a start time, an end time, and a job status of STARTED, COMPLETE or FAILED. So we can use this information to create functions that allow us to make a job wait until a previous job has completed.

The main function I have created is called WaitForJobDependency. This stops a job from executing until its prior jobs have completed. I call it in my Initialise script like this:


#Wait for prerequisite jobs to complete
WaitForJobDependency('JOB_ODS',$G_Override_Dependency );
WaitForJobDependency('JOB_DIMENSIONS',$G_Override_Dependency );

So the above tells my job to wait for two jobs to finish before it runs. The global variable is there in case I want to override the wait and make my job run immediately.

The function looks like this:

#First check if the override parameter has been set. If it's been left off then assume it's false. If it's true, jump to the else part of the if statement.
$P_Override = nvl($P_Override, 'FALSE');

if ($P_Override = 'FALSE')
begin
    #If the status of the previous job is not COMPLETE, FAILED or ERRORS then wait another 5 minutes and check again.
    #The job will keep checking indefinitely until one of those job statuses is met.
    while (PreRequisiteJobStatus($P_PreRequisiteJob) != 'COMPLETE'
           and PreRequisiteJobStatus($P_PreRequisiteJob) != 'FAILED'
           and PreRequisiteJobStatus($P_PreRequisiteJob) != 'ERRORS')
    begin
        print('Dependency job ' || $P_PreRequisiteJob || ' status: ' || PreRequisiteJobStatus($P_PreRequisiteJob));
        print('Waiting for job ' || $P_PreRequisiteJob || ' to complete.');
        sleep(300000);
    end

    #If the status of the previous job was FAILED or ERRORS then it failed, and this job will fail too,
    #to prevent you corrupting the data further by running your job off incomplete data.
    if (PreRequisiteJobStatus($P_PreRequisiteJob) = 'FAILED' or PreRequisiteJobStatus($P_PreRequisiteJob) = 'ERRORS')
    begin
        print('Job dependency ' || $P_PreRequisiteJob || ' failed or contained errors.');
        print('Aborting current job.');
        raise_exception('Job dependency failed');
    end

    #If the previous job came back as complete, then you are good to start your current job.
    if (PreRequisiteJobStatus($P_PreRequisiteJob) = 'COMPLETE')
    begin
        print('Dependency job ' || $P_PreRequisiteJob || ' status: ' || PreRequisiteJobStatus($P_PreRequisiteJob));
        print('Starting current job.');
    end
end
else
begin
    #If the override variable was set to true, let the user know the override is in effect
    #and start the job immediately without checking for any previous jobs.
    print('Dependency Override in effect. Will begin job immediately.');
end

Return 1;



You may have noticed I referenced a function called PreRequisiteJobStatus above. This is a fairly simple function that just returns the status of the prerequisite job, like this:


$L_Status = nvl(sql('DATAMART_SQL_TARGET_DS', '
    SELECT STATUS FROM dbo.SYS_AUDIT_JOB
    WHERE JOB_NAME = {$P_PreRequisiteJob}
      AND CAST(JOB_END_DATETIME AS DATE) = CAST(GETDATE() AS DATE)
      AND AUDIT_ID = (SELECT MAX(AUDIT_ID) FROM dbo.SYS_AUDIT_JOB WHERE JOB_NAME = {$P_PreRequisiteJob})
'), 'WAITING');
Return $L_Status;
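
You can test it quickly from any script; for example, this prints the current status of the JOB_ODS dependency:

print('JOB_ODS status: ' || PreRequisiteJobStatus('JOB_ODS'));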


Now all you need to do is schedule all your jobs to start around the same time, and these functions will make sure that they run in the correct order.

Thursday, April 18, 2013

Job Control Table - Part 4 - Preventing Simultaneous Job Execution - When DS lets you down...

So now we have a great method for preventing simultaneous job execution, but sometimes DS lets you down and fails badly: it doesn't execute the try/catch block, and so doesn't write back to the SYS_AUDIT_JOB table to say that the job has stopped with errors. Instead it just leaves a row in the table with a start time, but no end time.

This could be caused by a server failure, or one of those fantastically helpful <ACCESS VIOLATION> errors.

When we come to run our job, the IsAnotherJobRunning function will look in the table to see if another job is running, and because DS didn't close off the last entry in there, it will think that a job is running, even though it isn't.

To overcome this problem there are 2, well maybe 3, solutions:

1. Manually insert an end time in the affected row in SYS_AUDIT_JOB (see the example statement below). The problem with this method is that it requires you to open the database and manually run some SQL, or to edit the row with SQL Management Studio or some other DB front-end tool. If your ETL support staff don't have that access then you need something a little easier to use.
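
For option 1, the fix is just a one-line update (an illustrative SQL Server statement; substitute the orphaned row's audit id):

UPDATE dbo.SYS_AUDIT_JOB
SET JOB_END_DATETIME = GETDATE(), STATUS = 'FAILED'
WHERE AUDIT_ID = 32;   -- 32 being the orphaned row's audit id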

2. ClearFailedJob is a job I've created that runs a little piece of SQL that adds a cleared row with an end time. This method is preferred as it keeps everything within DS, and allows support staff who don't have direct access to the database to correct the issue.

Here is how it works:

Create a job called ClearFailedJob with a global string variable to hold the name of the job that has failed.

The only thing in your job will be a script that creates a "Clear" row.

[Screenshot: the ClearFailedJob job, containing a single script object]

The script inside the object is:


#Insert a new record into the SYS_AUDIT_JOB table
SQL('DATAMART_SQL_TARGET_DS','INSERT INTO SYS_AUDIT_JOB (AUDIT_ID, JOB_NAME, JOB_START_DATETIME, JOB_END_DATETIME, STATUS, ETL_VERSION) SELECT MAX(AUDIT_ID) + 1, {$G_JobName}, GetDate(), GetDate(), \'CLEARED\', 0 FROM SYS_AUDIT_JOB');

print('****************************************************');
print('** ');
print('** Inserted a CLEAR record into SYS_AUDIT_JOB.');
print('** ');
print('****************************************************');

So prior to executing the job we have an orphaned row in SYS_AUDIT_JOB that looks like this:
[Screenshot: SYS_AUDIT_JOB showing an orphaned row with a start time but no end time]

We then execute the job like this:

[Screenshot: executing ClearFailedJob with $G_JobName set to the name of the failed job]

And after we run ClearFailedJob we have a row like this:

[Screenshot: SYS_AUDIT_JOB with a new row with a status of CLEARED]

Now you can run your job without IsAnotherJobRunning thinking you already have a job running. Also, this allows you to keep track of how often DS is failing badly and just dying on you without executing the try/catch blocks.

3. Adding a timeout to your IsAnotherJobRunning function so that it only looks back a certain amount of time. Let's say your job only ever runs for 30 minutes, so you set a timeout of 1 hour. If you then try to run the job 1 hour after the last run, and the last run failed without updating the SYS_AUDIT_JOB table, the next run will start regardless. This method can be risky though, as your job may legitimately be taking longer than 1 hour to run and you end up clashing with it by mistake. So only consider this option if you are sure your job will never take longer than a certain amount of time to run. A sketch of this variant is below.
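
Here is a minimal sketch of what that look-back check might look like (not production code - the 1-hour window and the count-based check are illustrative, and it assumes SQL Server like the rest of these posts):

#Only treat the last run as "still running" if it has no end time
#AND it started within the past hour.
$L_RunningCount = SQL('DATAMART_SQL_TARGET_DS', 'SELECT COUNT(*) FROM SYS_AUDIT_JOB
    WHERE JOB_NAME = {job_name()}
      AND JOB_END_DATETIME IS NULL
      AND JOB_START_DATETIME > DATEADD(hour, -1, GETDATE())');

if ($L_RunningCount > 0)
begin
    #A recent run has no end time, so assume it really is still going
    Return 1;
end

#Either there is no open row, or the open row is older than the timeout
Return 0;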

Tuesday, April 16, 2013

Job Control Table - Part 3 - Preventing Simultaneous Job Execution

So you're in the management console and click execute on a job, look away for a few seconds and then can't remember if you actually clicked execute or not. So you click it again, only to realize you now have two instances of the same job running simultaneously.

As one instance is loading data the other is dropping data, and before you know it you've corrupted your data warehouse and end up doing a full reload.

Or perhaps you have a client who wants near real time loads of their data warehouse. On average the job takes 8 minutes to run, so you set it to run every 10 minutes. Which is fine, except for those occasions when more data than usual needs to be processed and the job takes 11 minutes to run. The next job starts before the previous one is finished and again ends up corrupting your data.

So how can we stop data services from running more than one instance of a job at a time?

The easy way is to check the SYS_AUDIT_JOB table as a first step in any ETL load. If there is a row for the job in the table without an end time, then there is an instance of that job still running, and we'll want to prevent any further executions of the job until the current execution is complete.

So to do this I have created a function called IsAnotherJobRunning. And the code for it looks like this:


print('****************************************************');
print('** ');
print('** Checking for existing jobs.');
print('** ');

#First, find the status of the last job.
$L_LastJobStatus = SQL('DATAMART_SQL_TARGET_DS', 'SELECT STATUS FROM SYS_AUDIT_JOB WHERE AUDIT_ID = (SELECT MAX(AUDIT_ID) FROM SYS_AUDIT_JOB WHERE JOB_NAME = {job_name()})');
print('** Last job status: {$L_LastJobStatus}.');

$L_LastEndTime = SQL('DATAMART_SQL_TARGET_DS', 'SELECT JOB_END_DATETIME FROM SYS_AUDIT_JOB WHERE AUDIT_ID = (SELECT MAX(AUDIT_ID) FROM SYS_AUDIT_JOB WHERE JOB_NAME = {job_name()})');
print('** Last end time: {$L_LastEndTime}.');

#We assume that if there is an end time then the job has finished.
if ($L_LastEndTime IS NULL)
begin
    print('** Running job found. ');
    print('** ');
    print('****************************************************');
    return 1;
end
else
begin
    print('** Last job ended: {$L_LastEndTime}.');
    print('** No running jobs found. ');
    print('** ');
    print('****************************************************');
    return 0;
end

I then wrap all my jobs in a conditional which calls this function.

[Screenshot: a job wrapped in a Conditional object]

This is how the function is called:

[Screenshot: the Conditional's If expression]

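Based on the return values described below, the If expression in the Conditional is essentially just a comparison like this:

IsAnotherJobRunning() = 0
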
If the function returns 0 there are no other jobs running and the execution can proceed.

If the function finds another job running it returns a 1 and the Conditional executes the Else. In my jobs I create a simple script that just returns some basic information to the trace file.

E.g.


print('****************************************************');
print('** ');
print('** Another ' || Job_Name() || ' job appears to already be running - this job must be aborted to prevent corruption or inconsistencies.');
print('** ');
print('****************************************************');


I put this in as standard in all jobs I create as it has saved me a few times. It's also handy when the client decides that nightly loads aren't enough and they want multiple loads during the day; you can just increase the load frequency and not worry about the loads crashing into one another.

Monday, April 15, 2013

Job Control Table - Part 2 - Auditing

So you may have noticed I called my job control table SYS_AUDIT_JOB in the previous post, as opposed to something like JOB_RUN or JOB_NUMBER.

The other names are fine and might make more sense today, but the name relates back to the history of why I first created this kind of table.


There was a requirement to know which rows were loaded within each load so that we could trace back any data issues to a particular job run. Perhaps there was a data issue in the source system at load time, or there was an error in a particular load and we needed to clear things out.


In order to do this, I created the SYS_AUDIT table, and then I stamped the AUDIT_ID against every row loaded by the job. 


In my previous post I created a new line in the SYS_AUDIT table by assigning the audit id to a global variable called $G_AuditID like this:


$G_AuditID = NewAuditRecord($G_JobName, $G_EffectiveDateAndTime);


I then use that variable in my dataflows to assign an audit id to every row. To do that I create a field on every table called AUDIT_ID in which I store $G_AuditID.
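
In practice this just means adding an AUDIT_ID column to the output schema of the final Query transform in each dataflow, with its mapping set to the global variable (the exact layout will vary with your dataflows):

AUDIT_ID = $G_AuditID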


Now if I need to get just the rows that were updated for a particular load I can get them with some pretty simple SQL.


Select * from TABLE where AUDIT_ID = 32


Also, if a load just inserted a whole bunch of dodgy data I can get rid of it.


Delete from TABLE where AUDIT_ID = 32

Friday, April 12, 2013

Job Control Table - Part 1 - Setup

In this series of posts I am going to take you through how to set up a job control table, and the many benefits of having one.

At the most basic level, the point of the job control table is to keep a record of every job run, when it started, when it finished, and whether it was successful.

Here is a sample of the type of data you might have in a control table:

[Screenshot: sample rows from the SYS_AUDIT_JOB control table]
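
An illustrative set of rows (made-up values, but matching the points below) would look something like this:

AUDIT_ID  JOB_NAME      JOB_START_DATETIME   JOB_END_DATETIME     STATUS
31        JOB_DATAMART  2013-04-01 01:00:00  2013-04-01 01:41:12  COMPLETE
32        JOB_DATAMART  2013-04-02 01:00:00  2013-04-02 01:38:47  COMPLETE
33        JOB_DATAMART  2013-04-03 01:00:00  2013-04-03 03:15:02  COMPLETE
34        JOB_DATAMART  2013-04-04 01:00:00  2013-04-04 01:03:15  FAILED
35        JOB_DATAMART  2013-04-04 09:00:00  NULL                 STARTED
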
So looking at the above data we can see a few things.
  1. The job normally takes around 40 mins
  2. It took a lot longer on the 3rd
  3. It failed on the 4th
  4. It was started again on the 4th and is still running
We are now able to build up some statistics about our job using this simple table.
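
For example, here is the sort of statistic you can now pull out with a simple SQL Server query (a sketch; column names as per the table script below):

SELECT JOB_NAME,
       AVG(DATEDIFF(minute, JOB_START_DATETIME, JOB_END_DATETIME)) AS AVG_RUN_MINUTES,
       SUM(CASE WHEN STATUS = 'FAILED' THEN 1 ELSE 0 END) AS FAILED_RUNS
FROM dbo.SYS_AUDIT_JOB
GROUP BY JOB_NAME;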

Here is a simple SQL Server script that will create the above table (including the ETL_VERSION column that the insert functions below expect):

CREATE TABLE [dbo].[SYS_AUDIT_JOB](
    [AUDIT_ID] [int] NOT NULL,
    [JOB_NAME] [varchar](100) NULL,
    [JOB_START_DATETIME] [datetime] NULL,
    [JOB_END_DATETIME] [datetime] NULL,
    [STATUS] [varchar](15) NULL,
    [ETL_VERSION] [varchar](25) NULL,
 CONSTRAINT [PK_SYS_AUDIT_JOB] PRIMARY KEY CLUSTERED
(
    [AUDIT_ID] ASC
)
) ON [PRIMARY]


The reason I put the prefix SYS on this table is to differentiate it from the rest of the warehouse tables, which would normally have DIM or FACT as their prefix. SYS means it's a table used by the ETL, and is not intended for reporting. At least not by end users.

In order to load this table with the correct information I have created some Data Services functions.

The first function is called NewAuditRecord. As the name implies, its purpose is to create a new record in the audit table when you start the job. Here is the code for the function:

#################################################################
#This function will log the start of the job and its details into
#the SYS_AUDIT_JOB table and return a unique Audit_ID for the job
#################################################################


#Check that the input job name is not null. If it isn't, get the next Audit_ID,
#insert a new record into the SYS_AUDIT_JOB table marking the start of the job
#with the current date and time, and return the Audit_ID.


if ($P_JobName is not null)
begin

    #Get the unique Audit_ID (the nvl covers the very first run, when the table is empty)
    $L_AuditID = nvl(SQL('DATAMART_SQL_TARGET_DS','SELECT max(AUDIT_ID) from SYS_AUDIT_JOB'), 0) + 1;

    #Insert a new record into the SYS_AUDIT_JOB table
    $L_Ret = SQL('DATAMART_SQL_TARGET_DS','INSERT INTO SYS_AUDIT_JOB (AUDIT_ID, JOB_NAME, JOB_START_DATETIME, JOB_END_DATETIME, STATUS, ETL_VERSION) VALUES ([$L_AuditID], {$P_JobName}, {$P_EffectiveDateAndTime}, NULL, \'STARTED\', {$G_ETLVersion})');


    print('****************************************************');
    print('** ');
    print('** Inserted new record into SYS_AUDIT_JOB.');
    print('** ');
    print('** Return value = {$L_Ret}');
    print('** Audit ID = {$L_AuditID}');
    print('** ');
    print('****************************************************');

    #Return the unique Audit_ID for this job
    Return $L_AuditID;

end

Return 0;


As you can see, it's a pretty straightforward function: you just pass it your job name and the current datetime, and it inserts a row into the SYS_AUDIT_JOB table. Finally, it prints out some information to the trace file. It also returns the audit id, which I assign to a global variable called $G_AuditID - this will come in handy in part 2.

You may also notice I set the status to STARTED.

Here is how I call the function in the initialise script of my job:

$G_AuditID = NewAuditRecord($G_JobName, $G_EffectiveDateAndTime);

$G_JobName is a global variable and I populate like this:

$G_JobName = job_name();

$G_EffectiveDateAndTime is also a global variable that I populate like this:

$G_EffectiveDateAndTime = sysdate();

Once the job is complete I have another function called EndJob, which basically has this piece of code in it:

SQL('DATAMART_SQL_TARGET_DS','UPDATE SYS_AUDIT_JOB SET JOB_END_DATETIME = getdate(), STATUS = \'COMPLETE\' WHERE AUDIT_ID = {$P_AuditID} AND JOB_NAME = {$P_JobName}');

You'll notice that not only did I update the end time, but also set the status to COMPLETE.

One more thing...

What if the job fails?

Well, in my try/catch block I have a function called HandleException. This sends out an email on failure, but it also closes off the row in the audit table with the fail time and a status of FAILED.

SQL('DATAMART_SQL_TARGET_DS','UPDATE SYS_AUDIT_JOB SET JOB_END_DATETIME = getdate(), STATUS = \'FAILED\' WHERE AUDIT_ID = {$G_AuditID} AND JOB_NAME = {$G_JobName}');
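
The email side of HandleException isn't shown here, but a minimal sketch using the built-in smtp_to function would be something along these lines (the recipient address and log line counts are illustrative, and error_message() is only populated inside a catch block):

#Send a failure notification. The last two arguments are the number of lines
#of the error log and trace log to include in the email.
smtp_to('etl.support@example.com', 'Job ' || $G_JobName || ' FAILED',
        'Job failed with error: ' || error_message(), 20, 0);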

So now that we have set up the job control table, my next few posts will take you through the benefits of having it beyond just simple job statistics.