Monday, November 24, 2014

Files as data sources - Part 5 - Missing Files

To handle files that don't arrive, I check the SYS_FILE table to see if there is an entry for that file for the day. If there is no entry, I add a row to the SYS_FILE table to record the fact that no file arrived, and then I send an email to the person responsible for that file to let them know it did not arrive.

Here is the basic script:


$L_EmailBody = 'The following files did not arrive today: ';
$L_Counter = 0;

#Check if file $G_ActivityEmfe has arrived
if (db_type('Datawarehouse') = 'Oracle')
begin
   $L_FileName = replace_substr( $G_ActivityEmfe ,'*','%');
end

if (0 = sql('Datawarehouse','SELECT COUNT(*) FROM SYS_FILE WHERE FILE_LOCATION = {$G_FileLocation} AND FILE_NAME like {$L_FileName} AND trunc(FILE_PROCESS_DATETIME) = trunc(SYSDATE)'))
begin
   $L_EmailBody = $L_EmailBody || chr(13) || chr(10) || $G_ActivityEmfe;
   $L_Counter = $L_Counter + 1;
end

#Check if file $G_ActivityEmms has arrived
if (db_type('Datawarehouse') = 'Oracle')
begin
   $L_FileName = replace_substr( $G_ActivityEmms ,'*','%');
end

if (0 = sql('Datawarehouse','SELECT COUNT(*) FROM SYS_FILE WHERE FILE_LOCATION = {$G_FileLocation} AND FILE_NAME like {$L_FileName} AND trunc(FILE_PROCESS_DATETIME) = trunc(SYSDATE)'))
begin
   $L_EmailBody = $L_EmailBody || chr(13) || chr(10) || $G_ActivityEmms;
   $L_Counter = $L_Counter + 1;
end

#If any files did not arrive, ie counter > 0, then send the email
if ($L_Counter > 0)
begin
   print($L_EmailBody);
   smtp_to($G_AbsentReportingAddresses, 'Absent Files Report', $L_EmailBody, 0, 0);
end
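The basic script above only builds the email; the post also mentions recording the absence in SYS_FILE. A minimal sketch of that step, placed inside each begin...end block, assuming the same FILE_ID approach as the FileNew function from Part 2 and a hypothetical 'ABSENT' status value (for the second block, $G_ActivityEmfe becomes $G_ActivityEmms):

#Sketch only: record in SYS_FILE that no file arrived today ('ABSENT' is an illustrative status)
$L_File_ID = nvl(sql('Datawarehouse','Select max(FILE_ID) from SYS_FILE'), 0) + 1;
sql('Datawarehouse','INSERT INTO SYS_FILE (FILE_ID, FILE_LOCATION, FILE_NAME, FILE_PROCESS_DATETIME, STATUS, AUDIT_ID)
VALUES ([$L_File_ID], {$G_FileLocation}, {$G_ActivityEmfe}, sysdate, \'ABSENT\', [$G_AuditID])');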

Monday, November 17, 2014

Files as data sources - Part 4 - File Errors

I always place the dataflow that is reading the file within a try...catch block to specifically respond to any file errors that may occur.


The two error types you want to look out for are:

  1. Flat file processing errors (1004)
  2. File access errors (1005)

If an error does occur, I run the following in the script:


#Record file failure in SYS_FILE table
FileFailed($G_FileLocation,$G_ActivityEmms);

#Archive file
FileMove($G_ActivityEmms, $G_FileLocation, $G_FileErrorLocation);

#Truncate staging table to prevent error data from getting into datamart
sql( 'Datawarehouse','TRUNCATE TABLE STG_ACTIVITY_EMMS_EM');

The FileFailed function has nearly the same code as the FileComplete function. You could actually combine them into a single function that updates the file status to either COMPLETE or FAILED, or anything else you might need; a sketch of that follows the FileFailed code below.


#Convert any * wildcards to % for Oracle's LIKE
If (db_type('Datawarehouse') = 'Oracle')
BEGIN
    $L_FileName = replace_substr( $P_FileName,'*','%');
END

#Mark the matching SYS_FILE row(s) for this job run as FAILED
sql('Datawarehouse','UPDATE SYS_FILE SET STATUS = \'FAILED\' WHERE FILE_LOCATION = {$P_Directory} AND FILE_NAME like {$L_FileName} AND AUDIT_ID = [$G_AuditID]');
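As a sketch of that combined version (the FileSetStatus name and the $P_Status parameter are mine, not something defined in these posts), the body could look like this:

#Sketch of a combined FileSetStatus($P_Directory, $P_FileName, $P_Status) function
If (db_type('Datawarehouse') = 'Oracle')
BEGIN
    $L_FileName = replace_substr( $P_FileName,'*','%');
END

#$P_Status is passed in as 'COMPLETE', 'FAILED' or anything else you need
sql('Datawarehouse','UPDATE SYS_FILE SET STATUS = {$P_Status} WHERE FILE_LOCATION = {$P_Directory} AND FILE_NAME like {$L_FileName} AND AUDIT_ID = [$G_AuditID]');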

We've already reviewed how the FileMove function works.

Lastly I clear out the staging table so that no partially loaded data goes through the rest of the load.

It would also be a good idea to send out an email alerting the owner of the file that there was something wrong with it, so that they can take any remedial action needed before resending the file for processing.
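A minimal sketch of that alert using the smtp_to function ($G_FileOwnerAddresses is an illustrative variable rather than one defined in these posts):

#Sketch only: alert the file owner that their file failed processing
smtp_to($G_FileOwnerAddresses, 'File Processing Error',
        'The file ' || $G_ActivityEmms || ' failed processing and has been moved to ' || $G_FileErrorLocation, 0, 0);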

Monday, November 10, 2014

Files as data sources - Part 3 - Moving files around

So now that the file has been processed, let's look at what we should do next.

I like to have an archive folder on the file system to keep any files that have been processed. This is so that I can go back to them for auditing and/or to deal with any queries about the data that may come up.

I also update the STATUS field in the SYS_FILE table that references the file to COMPLETE.

Let's start with the update function:

If (db_type('Datawarehouse') = 'Oracle')
BEGIN
    $L_FileName = replace_substr( $P_FileName,'*','%');
END

sql('Datawarehouse','UPDATE SYS_FILE SET STATUS = \'COMPLETE\' WHERE FILE_LOCATION = {$P_Directory} AND FILE_NAME like {$L_FileName} AND AUDIT_ID = [$G_AuditID]');

For this function I pass in the directory and file name, look up the relevant row(s) in the SYS_FILE table and then update them to COMPLETE.

The first thing I do is check the database type. If I am using Oracle, I need to convert any *s to %, as % is the wildcard character for Oracle's LIKE. You can add additional database types here as required, as sketched below.
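For example, the check could be extended like this; the ELSE branch is an assumption on my part (these posts only ever run against Oracle), and is simply where branches for your own database types would go:

#Sketch: handling additional database types. Oracle uses % as its LIKE wildcard, so any * is converted
If (db_type('Datawarehouse') = 'Oracle')
BEGIN
    $L_FileName = replace_substr( $P_FileName,'*','%');
END
ELSE
BEGIN
    #Most other databases also use % with LIKE, so the same conversion is a reasonable default;
    #add specific branches here if one of your databases behaves differently
    $L_FileName = replace_substr( $P_FileName,'*','%');
END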

I look for the rows in the SYS_FILE table that match the file name. The reason I update more than one row at a time is that when multiple files, such as File1, File2 and File3, are processed through a file object whose file name is set to File*, they are collated into one batch for processing. So if the dataflow completed successfully, then all of those files have been loaded.

Once I've updated the rows in the SYS_FILE table, I move the files to the archive directory with a little FileMove function:

#Strip the extension so the shell wildcard still matches, e.g. 'File*.csv' becomes 'File*'
$L_FileName = word_ext( $P_FileName ,1,'.');

#Move the file(s) from the source directory to the target (archive) directory
exec('sh',' -c "mv ' || $P_SourceDirectory || $L_FileName  || ' ' || $P_TargetDirectory || '"',8);

In this example my job server is running on a UNIX OS, but you could rewrite this for a Microsoft OS by replacing 'sh' with 'cmd' and using 'move' instead of 'mv'.
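A sketch of that Windows version (untested on my side; cmd.exe needs the /c switch to run a single command, and the directory variables are assumed to contain backslash paths):

#Sketch only: Windows job server version of the move, using cmd.exe instead of sh
exec('cmd', '/c move "' || $P_SourceDirectory || $L_FileName || '" "' || $P_TargetDirectory || '"', 8);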

I use the word_ext function at the top to avoid any issues with file extensions and wildcards; for example, word_ext('File*.csv', 1, '.') returns 'File*', which still matches File1.csv, File2.csv and so on through the shell wildcard. You may not have this issue, so you will not necessarily have to include the first line of the function.




Monday, November 3, 2014

Files as data sources - Part 2 - Keeping track

In the last post I spoke about how to time your job with the file's arrival. In this post I will talk about how to keep track of your files by recording some basic metadata, and how to relate the data in your target destination back to the original file.

Here again is the IsFileAvailable function I am using to wait for the files:

$L_FileFound = wait_for_file($P_FileDirectory || $P_FileName, $P_TimeOut,$P_Interval , -1, $L_FileNames, $L_FileListSize, ',');

if ($L_FileFound = 1)
begin
   $L_Counter = 1;
   WHILE ($L_Counter <= $L_FileListSize)
   begin
      FileNew($P_FileDirectory, word_ext($L_FileNames,$L_Counter,','));
      $L_Counter = $L_Counter + 1;
   end
end

Return $L_FileFound;

Once I have found a file, or files, I then run a routine to step through each of the file names. I use the word_ext function to choose which file name in the list I want to access, and then a counter to step through the list.

For each file name I then call a function I wrote called FileNew:

#Get the time the file was created
$L_FileDateTime = get_file_attribute($P_FullFilePathAndName, 'date_created');

#Get the bare file name from the full path (assumes '/' separators on the UNIX job server)
$L_FileName = word_ext($P_FullFilePathAndName, -1, '/');

#Get the file type eg .txt .csv .pipe
$L_Type = word_ext($L_FileName,2,'.');

#Get a new file id (nvl covers the very first file, when SYS_FILE is still empty)
$L_File_ID = nvl(sql('Datawarehouse','Select max(FILE_ID) from SYS_FILE'), 0) + 1;

#Insert a new record into the SYS_FILE table
sql('Datawarehouse','INSERT INTO SYS_FILE (FILE_ID, DI_FILENAME, FILE_DATETIME, FILE_PROCESS_DATETIME, TYPE, STATUS, AUDIT_ID)
VALUES ([$L_File_ID], {$P_FullFilePathAndName}, to_date({$L_FileDateTime}, \'YYYY/MM/DD hh24:mi:ss\'), sysdate, {$L_Type}, \'STARTED\', [$G_AuditID])');

Return $L_File_ID;

What I am doing here is firstly gathering some metadata about the file, such as the date it was created and the type of file it is, and then inserting that data into a table called SYS_FILE, which is where I keep track of all the files processed. I also assign a FILE_ID by adding 1 to the max FILE_ID currently in the table.

I set the status of the file to STARTED, to signify that we have started processing this file. Later on I will set the status to either COMPLETE or FAILED.

Here is what the data in the SYS_FILE table looks like:


You could always expand the number of columns in the SYS_FILE table to store all sorts of additional metadata about the file such as its source, how it was delivered or who is responsible for it.
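For reference, here is a rough sketch of SYS_FILE pieced together from the columns referenced across this series; the Oracle data types and lengths are my own assumptions rather than a definitive design:

-- Sketch only: SYS_FILE as implied by the columns used in these posts (Oracle types assumed)
CREATE TABLE SYS_FILE
(
    FILE_ID                NUMBER,          -- assigned as max(FILE_ID) + 1 by FileNew
    FILE_LOCATION          VARCHAR2(255),   -- directory the file was picked up from
    FILE_NAME              VARCHAR2(255),   -- file name, possibly containing wildcards
    DI_FILENAME            VARCHAR2(512),   -- full path and name, as appended by DS to each row
    FILE_DATETIME          DATE,            -- when the file was created
    FILE_PROCESS_DATETIME  DATE,            -- when we started processing it
    TYPE                   VARCHAR2(10),    -- file extension, eg csv, txt, pipe
    STATUS                 VARCHAR2(20),    -- STARTED, COMPLETE or FAILED
    AUDIT_ID               NUMBER           -- ties the row to the job run
);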

Now that I have created a record in the SYS_FILE table for a file, how do I link that to the records that are coming through from the file?

First of all you need to go into your file object and set the source information on the file as follows:


There are similar options on XML and Excel files too.

This adds an extra column to the end of your file schema that stores the file name along with its path. It is particularly important when you are receiving and processing multiple files at once, because DS batches all the files through at the same time. So let's say you are looking for File* and the files received are File1, File2 and File3: DS will process all three files together, and the DI_FILENAME column will record which file each row of data came from. We will use that information to tie the data back to the specific SYS_FILE row we created for that file.

In order to tie up the row of data with the correct SYS_FILE entry, we use the lookup_ext function as follows:
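Conceptually, that lookup does the equivalent of the following join; this is a sketch of the logic only, not the lookup_ext syntax, and STG_ACTIVITY_EMMS_EM is just one of the staging tables used earlier in the series:

-- Sketch of the logic only: match each staging row's DI_FILENAME back to SYS_FILE
-- to pick up the FILE_ID that was created for that file
SELECT s.*, f.FILE_ID
FROM   STG_ACTIVITY_EMMS_EM s
JOIN   SYS_FILE f ON f.DI_FILENAME = s.DI_FILENAME;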


So now that we have our file and have processed it, we need to decide what to do with the file. I'll be covering archiving and updating the SYS_FILE table to show the completed files in the next post.