Wednesday, May 29, 2013

Tip of the day - How to tell if your job has completed

If you are still staring at your trace file trying to figure out whether the job you ran in Designer has finished yet, then you may find today's post very helpful.

Go to Tools --> Options...

Then expand the Designer node and click on General.

Then check the box that says: Show Dialog when job is completed.


Now whenever your job completes, you'll get a little dialog box popping up to let you know.

Friday, May 24, 2013

Tip of the day - Seeing the full name of your object

One of the annoying defaults in Data Services is that all the names in workflows or dataflows are cut off after 17 characters.

To fix this, go to Tools --> Options...

Then expand the Designer node and click on General.

Where it says: Number of characters in workspace name, change the number 17 to 100. Click OK when it asks you if you want to overwrite the job server parameters.

Now you'll be able to see the full names of objects in your workspace.

Tuesday, May 7, 2013

Creating a schedule with a file dependency

In my last post I talked about how to create schedules that were dependent upon other jobs completing first.

In this post I will show you how to make your job dependent on a file. This is really useful when your ETL needs to wait for an external process to complete before it can run.

To do this I created a function called WaitForFileDependency which you can take a look at below:

#First check if the override parameter has been set. If it hasn't been set, then assume it should be false.
$P_Override = nvl($P_Override, 'FALSE');

#Print some details about the file we are looking for in the trace file
print('File Exists: ' || file_exists($P_PreRequisiteFile));
print('File Create Date: ' || get_file_create_date($P_PreRequisiteFile, '1900.01.01 00:00:0000'));
print('Last Job Run Date: ' || $P_LastRunDate);
print('Time Out Time: ' || $P_TimeOut);

#If the override variable was false, then start checking to see if the file is present.
if ($P_Override = 'FALSE')
begin
	print('Dependency file ' || $P_PreRequisiteFile);

	while (
		(
			#The file is not yet present
			(file_exists($P_PreRequisiteFile) = 0)
			or
			#Or the file is present but is left over from a previous run and so is an old file
			(get_file_create_date($P_PreRequisiteFile, '1900.01.01 00:00:0000') < $P_LastRunDate)
		)
		and
		#And we have not yet reached our timeout time
		systime() < $P_TimeOut
	      )
	begin
		print('Waiting for file ' || $P_PreRequisiteFile || ' to arrive.');
		#The file hasn't arrived yet, so wait a while and then check again.
		sleep(300000);
	end

	#If we reach the timeout time and the file has not appeared yet, then fail the job
	if (systime() > $P_TimeOut)
	begin
		print('Current Time: ' || systime() || ' > Timeout time: ' || $P_TimeOut);
		raise_exception('File not found before timeout reached. Job will now fail.');
	end

	#If the file exists, and it was created since the last time we ran the job, then you can start running the current job
	if ((file_exists($P_PreRequisiteFile) = 1)
	    and
	    (get_file_create_date($P_PreRequisiteFile, '1900.01.01 00:00:0000') > $P_LastRunDate))
	begin
		print('File ' || $P_PreRequisiteFile || ' found.');
		print('Starting current job.');
	end
end
else
begin
	#If the override variable was set to true, then just start running the job without waiting for the file to appear
	print('Dependency Override in effect. Will begin job immediately.');
end

Return 1;

The call to the function looks like this:

WaitForFileDependency($G_PreRequisiteFile, $G_Override_Dependency, $L_LastCompleteDate, to_date('1900.01.01 ' || $G_TimeOut, 'YYYY.MM.DD HH24:MI'));


  • $G_PreRequisiteFile - This is the file, with its path, that you are waiting to appear
  • $G_Override_Dependency - If it's TRUE then start the job immediately, but if it's FALSE then wait for the file to appear
  • $L_LastCompleteDate - This is the last time the job was executed
  • $G_TimeOut - This is the time of day you want to stop looking for the file and fail the job. If you normally expect the file to arrive by 5:00am, and it's still not present by 6:00pm, then you probably want to stop waiting for the file to arrive and alert people to this fact.
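
As an aside, $L_LastCompleteDate pairs naturally with the SYS_AUDIT_JOB table from the job control posts below. Here is a minimal sketch of how it might be populated, assuming that table and the DATAMART_SQL_TARGET_DS datastore used later in these posts:

#Hypothetical sketch: fetch the end time of this job's last successful run from SYS_AUDIT_JOB
$L_LastCompleteDate = sql('DATAMART_SQL_TARGET_DS', 'SELECT MAX(JOB_END_DATETIME) FROM SYS_AUDIT_JOB WHERE JOB_NAME = {job_name()} AND STATUS = \'COMPLETE\'');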

Friday, May 3, 2013

Job Control Table - Part 5 - Sequencing Jobs

One of the things missing from Data Services is the ability to chain jobs together so that as one job finishes the next one starts. While you could put the functionality of each job into a workflow and then create a large job that executes the whole thing, what if you don't want that?

Often it's safer to have a few little jobs; that way, if one little job fails, the whole process is not brought down, just that one job. Also, we might have a job that loads all the dimensions, and many jobs for the various fact tables. Another scenario could be if you need to load an ODS first, and then build data marts on that ODS data.

Well as the DS scheduler doesn't give us the option of creating job dependencies, I have created my own method, using the SYS_AUDIT_JOB table as my guide.

As seen in the previous posts on the job control table, we can always tell by looking at the SYS_AUDIT_JOB table whether or not a job has run on a particular day. We have a start time, an end time, and a job status of STARTED, COMPLETE or FAILED. So we can use this information to create functions that will allow us to make a job wait until a previous job has completed.
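
For reference, here is a rough sketch of what that table could look like in SQL Server; the column names come from the scripts in these posts, but the types and sizes are my own assumptions:

CREATE TABLE dbo.SYS_AUDIT_JOB (
    AUDIT_ID           int          NOT NULL, -- incrementing run identifier
    JOB_NAME           varchar(128) NOT NULL, -- Data Services job name
    JOB_START_DATETIME datetime     NOT NULL,
    JOB_END_DATETIME   datetime     NULL,     -- NULL while the job is still running
    STATUS             varchar(20)  NOT NULL, -- STARTED, COMPLETE, FAILED, ERRORS or CLEARED
    ETL_VERSION        int          NULL
);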

The main function I have created is called WaitForJobDependency. This stops a job from executing until its prerequisite jobs have completed. I call it in my Initialise script like this:


#Wait for prerequisite jobs to complete
WaitForJobDependency('JOB_ODS',$G_Override_Dependency );
WaitForJobDependency('JOB_DIMENSIONS',$G_Override_Dependency );

So the above tells my job to wait for 2 jobs to finish before it runs. The global variable is there in case I want to override the wait and make my job run immediately.

The function looks like this:

#First check if the override value is set; if it's been left unset, then assume it's false. If it's true, jump to the else part of the if statement.
$P_Override = nvl($P_Override, 'FALSE');

if ($P_Override = 'FALSE')
begin
	#If the status of the previous job is not COMPLETE, FAILED or ERRORS, then wait another 5 minutes and check again. The job will continue checking indefinitely until one of the job statuses is met.
	while (PreRequisiteJobStatus($P_PreRequisiteJob) != 'COMPLETE'
	   and PreRequisiteJobStatus($P_PreRequisiteJob) != 'FAILED'
	   and PreRequisiteJobStatus($P_PreRequisiteJob) != 'ERRORS')
	begin
		print('Dependency job ' || $P_PreRequisiteJob || ' status: ' || PreRequisiteJobStatus($P_PreRequisiteJob));
		print('Waiting for job ' || $P_PreRequisiteJob || ' to complete.');
		sleep(300000);
	end

	#If the status of the previous job was FAILED or ERRORS, then it failed, and this job will fail too, to prevent you corrupting your data further by running your job off of incomplete data.
	if (PreRequisiteJobStatus($P_PreRequisiteJob) = 'FAILED' or PreRequisiteJobStatus($P_PreRequisiteJob) = 'ERRORS')
	begin
		print('Job dependency ' || $P_PreRequisiteJob || ' failed or contained errors.');
		print('Aborting current job.');
		raise_exception('Job dependency failed');
	end

	#If the previous job came back as complete, then you are good to start your current job.
	if (PreRequisiteJobStatus($P_PreRequisiteJob) = 'COMPLETE')
	begin
		print('Dependency job ' || $P_PreRequisiteJob || ' status: ' || PreRequisiteJobStatus($P_PreRequisiteJob));
		print('Starting current job.');
	end
end
else
begin
	#This just lets the user know that they put the dependency override into effect and the job will start immediately without checking for any previous jobs.
	print('Dependency Override in effect. Will begin job immediately.');
end

Return 1;

You may have noticed I referenced a function called PreRequisiteJobStatus above. This is a fairly simple function that just returns the status of the prerequisite job, like this:


$L_Status = nvl(sql('DATAMART_SQL_TARGET_DS', '
	SELECT STATUS
	  FROM dbo.SYS_AUDIT_JOB
	 WHERE JOB_NAME = {$P_PreRequisiteJob}
	   AND CAST(JOB_END_DATETIME AS DATE) = CAST(GETDATE() AS DATE)
	   AND AUDIT_ID = (SELECT MAX(AUDIT_ID) FROM dbo.SYS_AUDIT_JOB WHERE JOB_NAME = {$P_PreRequisiteJob})
'), 'WAITING');
Return $L_Status;


Now all you need to do is schedule all your jobs to start around the same time, and these functions will make sure that they run in the correct order.

Thursday, April 18, 2013

Job Control Table - Part 4 - Preventing Simultaneous Job Execution - When DS lets you down...

So now we have a great method for preventing simultaneous job execution, but sometimes DS lets you down and fails badly; it doesn't execute the try/catch block and so doesn't write back to the SYS_AUDIT_JOB table to say that the job has stopped with errors. Instead it just leaves a row in the table with a start time, but no end time.

This could be caused by a server failure, or one of those fantastically helpful <ACCESS VIOLATION> errors.

When we come to run our job, the IsAnotherJobRunning function will look in the table to see if another job is running, and because DS didn't close off the last entry in there, it will think that a job is running, even though it isn't.

To overcome this problem there are 2, well maybe 3, solutions:

1. Manually insert an end time in the affected row in SYS_AUDIT_JOB. The problem with this method is that it requires you to open the database and manually run some SQL, or to edit the row with SQL Management Studio or some other DB front end tool. If your ETL support staff don't have that access then you need something a little easier to use.

2. ClearFailedJob is a job I've created that runs a little piece of SQL that adds a cleared row with an end time. This method is preferred as it keeps everything within DS, and allows support staff who don't have direct access to the database to correct the issue.

Here is how it works:

Create a job called ClearFailedJob with a string global variable ($G_JobName) for the name of the job that has failed.

The only thing in your job will be a script that creates a "Clear" row.

The script inside the object is:


#Insert a new record into the SYS_AUDIT_JOB table
SQL('DATAMART_SQL_TARGET_DS','INSERT INTO SYS_AUDIT_JOB (AUDIT_ID, JOB_NAME, JOB_START_DATETIME, JOB_END_DATETIME, STATUS, ETL_VERSION) SELECT MAX(AUDIT_ID) + 1, {$G_JobName}, GetDate(), GetDate(), \'CLEARED\', 0 FROM SYS_AUDIT_JOB');

print('****************************************************');
print('** ');
print('** Inserted a CLEAR record into SYS_AUDIT_JOB.');
print('** ');
print('****************************************************');

So prior to executing the job we have an orphaned row in SYS_AUDIT_JOB, with a start time but no end time.

We then execute the job, entering the failed job's name in the $G_JobName global variable.

And after we run ClearFailedJob, we have a new row with an end time and a status of CLEARED.

Now you can run your job without IsAnotherJobRunning thinking you already have a job running. Also, this allows you to keep track of how often DS is failing badly and just dying on you without executing the try/catch blocks.

3. Adding a timeout to your IsAnotherJobRunning function so that it only looks back a certain amount of time. Let's say your job only ever runs for 30 minutes, so you set a timeout of 1 hour. If you then try to run the job 1 hour after the last run, and the last run failed without updating the SYS_AUDIT_JOB table, the next run will start regardless. This method can be risky though, as your job may legitimately be taking longer than 1 hour to run and you end up clashing with it by mistake. So only consider this option if you are sure your job will never take longer than a certain amount of time to run. A sketch of this check appears below.
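
To illustrate option 3, here is a minimal sketch of a time-boxed check, assuming the same SYS_AUDIT_JOB table and DATAMART_SQL_TARGET_DS datastore as the rest of these posts; the one-hour window and the variable name are my own choices:

#Hypothetical variant of IsAnotherJobRunning: only treat an open-ended row as "running" if it started within the last hour
$L_OpenJobs = sql('DATAMART_SQL_TARGET_DS', 'SELECT COUNT(*) FROM SYS_AUDIT_JOB WHERE JOB_NAME = {job_name()} AND JOB_END_DATETIME IS NULL AND JOB_START_DATETIME > DATEADD(hour, -1, GETDATE())');
if ($L_OpenJobs > 0)
begin
	print('** Running job found within the last hour.');
	Return 1;
end
else
begin
	print('** No running jobs found, or the last open-ended row is older than the timeout.');
	Return 0;
end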

Tuesday, April 16, 2013

Job Control Table - Part 3 - Preventing Simultaneous Job Execution

So you're in the management console and click execute on a job, look away for a few seconds and then can't remember if you actually clicked execute or not. So you click it again, only to realize you now have two instances of the same job running simultaneously.

As one instance is loading data, the other is dropping data, and before you know it you've corrupted your data warehouse and end up doing a full reload.

Or perhaps you have a client who wants near real-time loads of their data warehouse. On average the job takes 8 minutes to run, so you set it to run every 10 minutes. Which is fine, except for those occasions where more data than usual needs to be processed and the job takes 11 minutes to run. The next job starts before the previous one has finished and again ends up corrupting your data.

So how can we stop Data Services from running more than one instance of a job at a time?

The easy way is to check the SYS_AUDIT_JOB table as a first step in any ETL load. If there is a row for the job in the table without an end time, then there is an instance of that job still running, and we'll want to prevent any further executions of the job until the current execution is complete.

So to do this I have created a function called IsAnotherJobRunning. And the code for it looks like this:


print('****************************************************');
print('** ');
print('** Checking for existing jobs.');
print('** ');

#First, find the status of the last job.
$L_LastJobStatus = SQL('DATAMART_SQL_TARGET_DS', 'SELECT STATUS FROM SYS_AUDIT_JOB WHERE AUDIT_ID = (SELECT MAX(AUDIT_ID) FROM SYS_AUDIT_JOB WHERE JOB_NAME = {job_name()})');
print('** Last job status: {$L_LastJobStatus}.');

$L_LastEndTime = SQL('DATAMART_SQL_TARGET_DS', 'SELECT JOB_END_DATETIME FROM SYS_AUDIT_JOB WHERE AUDIT_ID = (SELECT MAX(AUDIT_ID) FROM SYS_AUDIT_JOB WHERE JOB_NAME = {job_name()})');
print('** Last end time: {$L_LastEndTime}.');

#We assume that if there is an end time then the job has finished.
if ($L_LastEndTime IS NULL)
begin
	print('** Running job found.');
	print('** ');
	print('****************************************************');
	Return 1;
end
else
begin
	print('** Last job ended: {$L_LastEndTime}.');
	print('** No running jobs found.');
	print('** ');
	print('****************************************************');
	Return 0;
end

I then wrap all my jobs in a conditional which calls this function.

This is how the function is called:


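A minimal sketch of the conditional's If condition, based on the behaviour described below (a return of 0 means no other jobs are running):

IsAnotherJobRunning() = 0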
If the function returns 0 there are no other jobs running and the execution can proceed.

If the function finds another job running, it returns 1 and the conditional executes the Else branch. In my jobs I create a simple script there that just writes some basic information to the trace file.

E.g.:


print('****************************************************');
print('** ');
print('** Another ' || Job_Name() || ' job appears to already be running - this job must be aborted to prevent corruption or inconsistencies.');
print('** ');
print('****************************************************');


I put this in as standard in all jobs I create as it has saved me a few times. It's also handy when the client decides that nightly loads aren't enough and they want multiple loads during the day; you can just increase the load frequency and not worry about the loads crashing into one another.

Monday, April 15, 2013

Job Control Table - Part 2 - Auditing

So you may have noticed I called my job control table SYS_AUDIT in the previous post, as opposed to JOB_RUN or JOB_NUMBER.

The other names are fine and might make more sense today, but the name relates back to the history of why I first created this kind of table.


There was a requirement to know which rows were loaded within each load so that we could trace back any data issues to a particular job run. Perhaps there was a data issue in the source system at load time, or there was an error in a particular load and we needed to clear things out.


In order to do this, I created the SYS_AUDIT table, and then I stamped the AUDIT_ID against every row loaded by the job. 


In my previous post I created a new line in the SYS_AUDIT table by assigning the audit id to a global variable called $G_AuditID like this:


$G_AuditID = NewAuditRecord($G_JobName, $G_EffectiveDateAndTime);
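
NewAuditRecord itself was covered in the earlier post, but as a rough sketch, assuming the SYS_AUDIT_JOB columns shown in the posts above (the parameter names here are hypothetical, and there is no concurrency guard on the MAX + 1):

#Hypothetical sketch of NewAuditRecord: insert a STARTED row and hand back its AUDIT_ID
$L_AuditID = sql('DATAMART_SQL_TARGET_DS', 'SELECT ISNULL(MAX(AUDIT_ID), 0) + 1 FROM SYS_AUDIT_JOB');
sql('DATAMART_SQL_TARGET_DS', 'INSERT INTO SYS_AUDIT_JOB (AUDIT_ID, JOB_NAME, JOB_START_DATETIME, STATUS) VALUES ([$L_AuditID], {$P_JobName}, {$P_EffectiveDateAndTime}, \'STARTED\')');
Return $L_AuditID;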


I then use that variable in my dataflows to assign an audit id to every row. To do that I create a field on every table called AUDIT_ID in which I store $G_AuditID.


Now if I need to get just the rows that were updated for a particular load I can get them with some pretty simple SQL.


Select * from TABLE where AUDIT_ID = 32


Also, if a load just inserted a whole bunch of dodgy data I can get rid of it.


Delete from TABLE where AUDIT_ID = 32