Monday, October 27, 2014

Files as data sources - Part 1 - Waiting for the files to arrive


In this post I am going to take a look at how to handle file arrival.

You have a few options for dealing with file arrival.
  1. Run the job at a set time each day. If the file is there it gets processed, and if it isn't, you just try again tomorrow.
  2. Run the job frequently, say every couple of minutes. Each time the job runs it attempts to process the file. If the file is there is gets processed, and if it isn't Data Services just tries again on the next run. This method is useful if the file could arrive multiple times a day, and you can't be sure exactly when the file will arrive.
  3. Start the job, and have the job wait for the file to arrive. This method is best if you're processing just one instance of the file everyday, but can't be exactly sure when the file will arrive, and you want the ETL run as soon as it arrives.
For dealing with all three options above I have written a function called IsFileAvailable that makes use of the built in DS function called wait_for_file. The reason I don't use the wait_for_file function alone is because I have built in additional functionality that will enable me to deal with some of the other issues I mentioned in the introduction. I will be going through those in subsequent posts.

Here is what the functions looks like:

$L_FileFound = wait_for_file($P_FileDirectory || $P_FileName, $P_TimeOut,$P_Interval , -1, $L_FileNames, $L_FileListSize, ',');

if ($L_FileFound = 1)
begin
   $L_Counter = 1;
   WHILE ($L_Counter <= $L_FileListSize)
   begin
      FileNew($P_FileDirectory, word_ext($L_FileNames,$L_Counter,','));
      $L_Counter = $L_Counter + 1;
   end
end

Return $L_FileFound;

The above function first waits for the file\s to arrive, and then writes a record in to a file handling table using the FileNew function.

I am using the wait_for_file function to determine when the file arrives.

The return values from the this function are:
    0 - No file matched
    1 - At least one file was matched
   -1 - The function timed out with no files found
   -2 - One of the input values is illegal

I'm generally only interested if a file has been found ie if the value 1 is returned.

The first few parameters are fairly straight forward.

Firstly it needs to know the name and location of the file you are waiting for. This can contain a wildcard, so if you are waiting for any file that starts with the letters file, you can set that value to be file*.txt. If you are not certain of the extension you can also have it be file*.*, and if you don't even care what the file name is, as long as any file arrives you can set the value as *.* .

The next parameter is how long you would like Data Services to wait for the file to arrive, the timeout parameter. This is set in milliseconds, so if you want Data Services to wait 30 minutes for the file to arrive, then that value should be 30 * 60 (seconds) * 1,000 (milliseconds) to get the value 1,800,000. If the timeout duration expires, 30 minutes in this example, then the wait_for_file function will return the value -1. This means that it looks for the file for 30 minutes, but no file arrived.

The 3rd parameter is how often you want Data Services to check whether the file has arrived. Again its the same formula for setting the value. If you want it to have a look ever 5 minutes then its 5 * 60 * 1,000 to get 300,000.

The next 4 parameters are all about returning the names of the files that Data Services finds.

In this example I have -1 set for the max match parameter. This means that I want DS to return the names of all the matched files that it finds. You could set this to 0 if you don't want any of them, or any other positive number if you only want a specific number of file names returned.

The next parameter is an output parameter that will store the list of file names returned. So lets say you set the 1st parameter in the function to file*.txt, and then there are 3 files in the directory: file1.txt, file2.txt and file3.txt. This variable will hold all three of those file names.

The next parameter will return the number of files found that match the search pattern. So again if you're looking for file*.txt, and 3 files are found that match file*.txt, then this output parameter will return the value 3.

The final parameter in the function allows you to set the list separator for the list of file. In this example I set it to be a comma. So the variable I have above called $L_FileNames, will end up with the values File1.txt, File2.txt, File3.txt.

The next part of the IsFileAvailable function loops through the list of file names and calls another function I have written called FileNew for each of the file name values in the list. The purpose of the FileNew function is to write a record into my SYS_FILE table for each file found.

I'll be going through the purpose of the SYS_FILE table and how you can use it to tie up data in the target table to the source files in my next post.

No comments:

Post a Comment