Monday, October 27, 2014

Files as data sources - Part 1 - Waiting for the files to arrive


In this post I am going to take a look at how to handle file arrival.

You have a few options for dealing with file arrival.
  1. Run the job at a set time each day. If the file is there it gets processed, and if it isn't, you just try again tomorrow.
  2. Run the job frequently, say every couple of minutes. Each time the job runs it attempts to process the file. If the file is there it gets processed, and if it isn't, Data Services just tries again on the next run. This method is useful if the file could arrive multiple times a day, and you can't be sure exactly when the file will arrive.
  3. Start the job, and have the job wait for the file to arrive. This method is best if you're processing just one instance of the file every day, but can't be exactly sure when the file will arrive, and you want the ETL to run as soon as it arrives.
To deal with all three options above I have written a function called IsFileAvailable that makes use of the built-in DS function wait_for_file. The reason I don't use wait_for_file on its own is that I've built in additional functionality to deal with some of the other issues I mentioned in the introduction. I will be going through those in subsequent posts.

Here is what the function looks like:

# Wait for one or more files matching $P_FileName to appear in $P_FileDirectory
$L_FileFound = wait_for_file($P_FileDirectory || $P_FileName, $P_TimeOut, $P_Interval, -1, $L_FileNames, $L_FileListSize, ',');

if ($L_FileFound = 1)
begin
   # Register each matched file in the file handling table
   $L_Counter = 1;
   while ($L_Counter <= $L_FileListSize)
   begin
      FileNew($P_FileDirectory, word_ext($L_FileNames, $L_Counter, ','));
      $L_Counter = $L_Counter + 1;
   end
end

Return $L_FileFound;

The above function first waits for the file(s) to arrive, and then writes a record into a file handling table using the FileNew function.

I am using the wait_for_file function to determine when the file arrives.

The return values from this function are:
    0 - No file matched
    1 - At least one file was matched
   -1 - The function timed out with no files found
   -2 - One of the input values is illegal

I'm generally only interested in whether a file has been found, i.e. if the value 1 is returned.

The first few parameters are fairly straightforward.

Firstly, it needs to know the name and location of the file you are waiting for. This can contain a wildcard, so if you are waiting for any file that starts with the letters file, you can set the value to file*.txt. If you are not certain of the extension you can use file*.*, and if you don't care what the file name is at all, as long as any file arrives, you can set the value to *.*.

The next parameter, the timeout, is how long you would like Data Services to wait for the file to arrive. This is set in milliseconds, so if you want Data Services to wait 30 minutes for the file to arrive, the value should be 30 * 60 (seconds) * 1,000 (milliseconds), which gives 1,800,000. If the timeout duration expires (30 minutes in this example), then the wait_for_file function will return the value -1, meaning it looked for the file for 30 minutes but no file arrived.

The 3rd parameter is how often you want Data Services to check whether the file has arrived. It's the same formula for setting the value: if you want it to have a look every 5 minutes, then it's 5 * 60 * 1,000, which gives 300,000.
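To make the arithmetic explicit, here is how those two values could be set in a script before the function is called (the variable names simply mirror the function parameters shown above):

# 30 minute timeout: 30 minutes * 60 seconds * 1,000 milliseconds = 1,800,000
$P_TimeOut = 30 * 60 * 1000;

# Check for the file every 5 minutes = 300,000 milliseconds
$P_Interval = 5 * 60 * 1000;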

The next 4 parameters are all about returning the names of the files that Data Services finds.

In this example I have -1 set for the max match parameter. This means that I want DS to return the names of all the matched files that it finds. You could set this to 0 if you don't want any of them, or any other positive number if you only want a specific number of file names returned.

The next parameter is an output parameter that will store the list of file names returned. So let's say you set the 1st parameter in the function to file*.txt, and there are 3 files in the directory: file1.txt, file2.txt and file3.txt. This variable will hold all three of those file names.

The next parameter will return the number of files found that match the search pattern. So again if you're looking for file*.txt, and 3 files are found that match file*.txt, then this output parameter will return the value 3.

The final parameter in the function lets you set the separator for the list of files. In this example I set it to a comma, so the variable I have above, $L_FileNames, will end up with the value file1.txt,file2.txt,file3.txt.
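As an aside, the word_ext function used in the while loop of IsFileAvailable above is what pulls the individual names back out of that delimited list:

$L_FileName = word_ext('file1.txt,file2.txt,file3.txt', 2, ',');
# $L_FileName now holds 'file2.txt'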

The next part of the IsFileAvailable function loops through the list of file names and calls another function I have written called FileNew for each of the file name values in the list. The purpose of the FileNew function is to write a record into my SYS_FILE table for each file found.
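As a rough idea of the shape of FileNew (the detail is coming in the next post), a minimal version could be little more than an insert through the sql() function. The datastore name, the SYS_FILE columns and the parameter names below are assumptions for illustration only:

# Hypothetical minimal version of FileNew: record the file in SYS_FILE.
# Curly braces substitute the variable's value and wrap it in quotes.
sql('DS_ETL', 'INSERT INTO SYS_FILE (FILE_DIRECTORY, FILE_NAME) VALUES ({$P_FileDirectory}, {$P_FileName})');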

I'll be going through the purpose of the SYS_FILE table and how you can use it to tie up data in the target table to the source files in my next post.

Monday, October 20, 2014

Files as Data Sources - Introduction

When files are the source of your data there are a few things to take into consideration:

  • When will the files arrive?
  • How will you keep track of the files processed?
  • How do you relate data in your target tables with the original file?
  • What if multiple files arrive for the same data set?
  • What should you do with files once they've been processed?
  • What happens if there is an error with the file?
  • What should you do if a file doesn't arrive?
For this blog series I am going to go through all the points above. Most of what I discuss will be relevant for standard text files as well as XML and Excel files.

Here is the standard structure I use when processing a file through Data Services:




So firstly I use a conditional to check whether a file has arrived or not. I've created my own function, IsFileAvailable, to watch for the file's arrival. Yes, I could have used the built-in function wait_for_file, but there is additional functionality that you might find useful that I have built into IsFileAvailable.

If the file arrives I print that out to the trace file. You don't have to do this, but I just find it easier to see what's going on that way.

Then I place the data flow that will process the file within a try-catch block. This is so that I can handle any file errors without bringing down the entire job. Within that error handler I can report the file errors to the owner(s) of the file and move the file to an error location.

In the else section of the conditional I place some code to handle the file not arriving.
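To make that a little more concrete, here is a rough sketch of the script pieces involved. The global variable names, the parameter order of IsFileAvailable, and the else-branch handling shown are placeholder assumptions (the file-not-arrived handling gets its own post later):

# Conditional "If" expression: did at least one file arrive?
IsFileAvailable($G_FileDirectory, $G_FileName, $G_TimeOut, $G_Interval) = 1

# Script at the start of the "Then" branch: note the arrival in the trace log
print('File matching ' || $G_FileName || ' has arrived - starting processing.');

# Script in the "Else" branch: one option is simply to raise an error
raise_exception('No file matching ' || $G_FileName || ' arrived within the timeout period.');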

Over the next few posts I'll break out the detail of how each of the above pieces works.


Monday, May 12, 2014

History preservation and using the filter in the table comparison transform

I just found another really handy use for the new filter field in the table comparison transform in the latest versions of Data Services.

If you are loading a dimension, and you are using history preservation, add a line in the filter field to only bring through current records.
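For example, assuming your dimension carries a current-row flag called CURR_IND (as in the asset team example later in this post), the filter is just a one-line predicate on the comparison table; the table name here is a placeholder:

DIM_ASSET_TEAM.CURR_IND = 'Y'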


This helps you with two things, one obvious, one not so obvious.

The first thing it helps you with is a little bit of a performance boost. If you think about it, you only need to compare against the current record of a given dimension value. Setting the generated key column makes sure the comparison only takes place against the most recent row, but the table comparison transform still has to go through all the records to get there. If you filter for only the current records, then you can compare against a smaller record set and get through it just a little bit quicker.

Imagine you have roughly ten rows of history per dimension value; filtering for current records gives you a comparison set one tenth the size of what you had previously. It may not get you that much of a performance gain, but when you are trying to get your jobs down to near real time, every second counts.

The second thing this helped with is dealing with expired historical data that becomes current again. Let me give you an example.

The company in question had an asset team, Team A, that existed between 1-JAN-2013 and 1-JUL-2013. So I ended up with a record in the asset team table that looked something like:


Team A became active again on 1-JAN-2014.

The problem I had was that before I put the filter in the table comparison to only bring through current records, the table comparison transform was ignoring Team A. It would look in the target table for a record with a natural key of 'A', find the old record there, and, because the effective dates are set by the history preservation transform rather than compared, treat the incoming row as unchanged since everything else was the same. I needed it to be processed as an insert so that the history preservation transform would handle it correctly, with the new effective date and CURR_IND equal to 'Y'.

By putting the filter in, I was filtering out the old Team A record from my comparison data set. That way the table comparison saw this as a new record and put it through as an insert. This is a fairly unusual case, where you have historical records for a dimension value but none of them are current.

I hadn't actually encountered this scenario while working in previous versions of Data Services, but if I had I think the best workaround would have been to create a view where you filter for current records, and then use that as the comparison data set.

Monday, May 5, 2014

SQL Transform - When should you use it?

Never!

You were expecting a longer post about this? Well alright then, never ever!

Whenever I see tons of SQL transforms scattered around a Data Services job, it is normally a sign of a lazy developer: someone who wrote out the queries in SQL and, instead of building proper data flows using the Query transform and other built-in Data Services functionality, just copied and pasted the SQL into the SQL transform.

But what's wrong with that? Well plenty!

DS is NOT a glorified scheduler

For starters, you just bought a not-inexpensive ETL tool that you are now using as nothing more than a query scheduler. If all you want to do is automate your queries, then write them into a stored procedure and schedule it on the database. Don't waste your money on a world-class ETL tool.

You've just broken Impact and Lineage analysis

One of the biggest selling points of a tool like DS is that you can visually see where a field is used in your ETL and where its data ultimately ends up. Likewise, you can see where a particular piece of data came from by tracing it all the way back to the source system.

This is a fantastic governance tool for managing your warehouse. You can see if, and where, making a change to a source system will affect your data warehouse.

The second you put a SQL transform in the way, you've broken your impact and lineage analysis. DS treats it as a black box, so whether you are using lineage in the Management Console, the Designer's View Where Used function, or Information Steward, you have just rendered it useless.

In the Datastore below it appears as if none of the tables are being used in any jobs.


But then after a little searching I find an SQL transform.


I look inside it and what do I find?


And this query is hardly complex. It wouldn't have taken very long to develop as a proper data flow with the Query transform, and now some future developer could make changes to the job thinking they have everything covered, without realizing that 3 extra tables are hidden inside a SQL transform.

I can't tell you how many times I've needed to make a change, right-clicked on the table in the datastore and chosen View Where Used, made all my changes, only to later discover I've missed one that a developer hid in a SQL transform.

Squinting at code

One of the great things about using a GUI-based ETL tool is that you can open a data flow and immediately get an idea of what it is doing. You can see the tables on the screen, see which transforms they flow through, and understand what is happening to the data and where it is going. With a SQL transform you have to open it up and squint at the code to try to figure out what it is up to.

For simple SQL that's not a big deal, but a complicated query with plenty of tables and joins.... well now you're wasting my time, and my client's money too!

Should you really never use the SQL transform?

I worked at a company that had a ban on using the SQL transform. Turn in your code with one in it, and it was returned to you to be rewritten without the SQL transform. No exceptions.

I will admit, there are times when you will just have to use the SQL transform, but these should be rare exceptions. Sometimes you need to do something very complex that is only available on the database, so you have to use the SQL transform to take advantage of it.

Before you do, though, think really hard about whether you couldn't achieve the same thing using Data Services' built-in transforms. Even if you have to split it out over multiple data flows, it will still be better than using the SQL transform.

Monday, April 28, 2014

Generated Key in the Table Comparison Transform

When I first learnt how to use Data Services, I knew that I was supposed to set the Generated Key Column in the table comparison transform, but I never really thought about why I was supposed to do that.

So let's take a look at the example above. In this example I have a dimension for Customer with a data warehouse key of CUSTOMER_KEY. This is my generated key column. I also have a field for the natural key that comes from the source system, and this is CUSTOMER_NK.

I have been maintaining history on this table, so individual CUSTOMER_NK values can appear in there multiple times, once for each row of history maintained. So if a customer has lived in more than one city, there will be a row in the customer table for each city the customer has lived in.

So data in the table might look something like this:


If a new row for John Smith arrives, let's say he moves to San Antonio, how will Data Services know which of the 3 rows with a CUSTOMER_NK of 'A' to compare against? This is where the Generated Key Column comes in. Data Services looks for the row with the maximum generated key column value and compares against that row only. So for John Smith, DS will only do the compare against the row with CUSTOMER_KEY = 3.

If you didn't set the generated key column value, then DS would compare against all 3 rows for John Smith instead of just the one with the maximum CUSTOMER_KEY.

The history preservation transform grabs that one update row from the table comparison transform, generates an update to close off the old row, and issues an insert for the new row with John's new address in San Antonio.

If you are not using history preservation then the Generated Key Column is not that important, but if you are doing type 2 history, then it's essential that you set it for the history preservation to work correctly.



Monday, April 21, 2014

Improving Performance - Part 7 - What's left?

If you have implemented all the things I've written about in parts 1 through 6, then your job should now be significantly faster. In most cases you will have covered 99% of the things that will improve your job's performance, but if you want that little bit extra, then here are a few more things to consider:

Join ranking

You can define the order in which Data Services will join tables. Tables with higher join ranks are joined before tables with lower join ranks. Typically you want your largest table to have a higher rank than your smaller tables.

You set the join rank in the source table.




This typically only has an effect if the join is being done on the Data Services job server, in other words when pushdown SQL is not happening. For a full and detailed explanation of how this works, consult the Performance Optimization Guide for DS.

Array fetch size

If you have a powerful set of hardware in place, try raising this value and see what effect it has. Interestingly, even the DS Performance Optimization Guide just recommends increasing and decreasing the value and going with whatever seems to give you the best performance.

Array fetch size is also set on the source table.


Rows per commit

SAP recommends that you set this at between 500 and 2,000. Again, I just recommend changing the values up and down to see which seems to have the best effect.

Two things to consider:
  1. If you are loading a Long datatype, then the commit size will be 1 (this does not apply to Oracle).
  2. If you are using an overflow file, and an error occurs resulting in a rejected row, then the commit size will also revert to 1. I'll be doing a post soon on the pros and cons of using the overflow file vs using the validation transform.
Committing one row at a time will be slower than committing 1,000. So consider whether you can load the Long differently (perhaps in a large Varchar field) and decide whether you really need an overflow file.

Chat to your DBA

Seriously, some of them are actually nice people :-) Now that you've fully tuned what you can on DS, the DBA can look at making improvements in the source and target databases. 

Database performance tuning can be different depending on which database you are using, and being an expert in performance tuning on all the databases that DS can connect to would be quite a feat!

So, if you have a data flow that is still running too slowly, then just go to Validation - Show Optimized SQL, and send the output over to your DBA to see if she can work some of that DBA magic.

Anything else?

If you've found something that has significantly improved the performance of one of your Data Services jobs that I did not mention in this series, then please let everyone know in the comments.


Monday, April 14, 2014

Improving Performance - Part 6 - Using the lookup_ext function

The lookup_ext function is often used to look up foreign keys for your fact table, and to return a default if no value can be found.

One setting that is often overlooked when using the function is the caching method, labeled as Cache spec.


There are 3 options you can choose from:
  1. DEMAND_LOAD_CACHE
  2. NO_CACHE
  3. PRE_LOAD_CACHE
Most people just leave the default (PRE_LOAD_CACHE) in place. But what is the difference between the 3 options?

NO_CACHE - This option reads the values from the lookup table for each individual row, but does not cache the value. 

PRE_LOAD_CACHE - This option will load the entire lookup table into memory first before looking up the values you need.

DEMAND_LOAD_CACHE - This option will load the individual rows into memory as they are identified. So the first time DS needs a row it looks into the database, finds it and places it in memory. The second and subsequent times it looks into its own memory for the value.
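For reference, here is roughly what the call looks like in its function form, as it might appear in a Query transform mapping; the datastore, table and column names are made up for illustration. The cache spec is the second item in the first set of brackets:

lookup_ext([DS_DW.DBO.DIM_CUSTOMER, 'PRE_LOAD_CACHE', 'MAX'],
           [CUSTOMER_KEY],
           [-1],
           [CUSTOMER_NK, '=', Query.CUSTOMER_NK])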

So which option should you use, and what impact on performance will this have?

DEMAND_LOAD_CACHE - Use this option when you have a small number of input rows, but the table you are looking up values in is large. In this instance, using DEMAND_LOAD_CACHE over PRE_LOAD_CACHE can have a huge impact on performance.

Here is a great real life example. I recently had to run a data flow where the source table had 42 records, and the lookup table had 3,390,904 records.


I ran the job once with PRE_LOAD_CACHE to see how long it would take:


I then ran the job with DEMAND_LOAD_CACHE to see the difference:


As you can see, the performance improvement in this instance is dramatic: 197 seconds vs just 5 seconds.

PRE_LOAD_CACHE - You should use this option when you anticipate accessing a large number of rows in the lookup table.

I needed to use the same 3 million row table as a lookup table again, but this time the source table had 161,280 rows.


This time I ran it with DEMAND_LOAD_CACHE first:




Then I ran it with PRE_LOAD_CACHE:

In this case using PRE_LOAD_CACHE was the better choice, as my source table had a lot of rows relative to the lookup table, so in effect I was accessing a far greater number of rows in the lookup table than in the previous example.

NO_CACHE - The only reason I can think of to use NO_CACHE is if you are running short on memory and are only accessing very few rows in the lookup table.

So as you can see from the details above, choosing the right cache option can have a huge impact on your job's performance.

In summary, the rule is: if you are going to access most of the rows in your lookup table, use PRE_LOAD_CACHE; if you are only going to access a few rows, use DEMAND_LOAD_CACHE.

And if you can't figure out which one to use? Well just run the job once with DEMAND_LOAD_CACHE set, and once with PRE_LOAD_CACHE set and see which one runs the fastest for you :-)