Monday, December 1, 2014

Dynamic Where Clause

Special shout out to Michael Johnson for working on this solution with me.

We recently came across a requirement that could be solved by doing 1 of 2 things.
  1. Create a huge case statement
  2. Use a dynamic where clause
The case statement idea seemed the easiest, but would have been around 100 cases long, and if any changes needed to be made, it would require changing the ETL and re-releasing the code.

Instead we decided to store the where clauses in a spreadsheet, and then use that to drive output rather than the case statement. We could have gone with a table, and then built a form on top of it, but a spreadsheet was the simplest and quickest solution.

Here is an example of what we were dealing with:








Our group dimension was defined by the component column, and went on for an additional 94 groups. Because each value definition in this dimension was not based on the same set of columns we couldn't use the standard lookup_ext function, or a join. 

We did consider a case statement, but it would have been huge, and since we anticipated that these definitions would change, we didn't want to have to re-release the ETL every time a change was made.

Getting dynamic sql into your Data Services query is done by using the pushdown_sql function, and feeding in whatever sql you want it to use.

The challenge for us was to make sure that we went through each of the rows in our table, and applied the where clauses one by one.

The first thing we did was copy all the dimension data into a working table, and added an extra column called processed and defaulted all the values to N. We did this so that we'd have a list to step through and mark off each time we'd processed a record.
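
In practice, a minimal sketch of that setup step could be a script like the one below, assuming the working table is the WK_GROUP table used in the scripts that follow and that the group definitions live in a dimension table called DIM_GROUP (that last name is an assumption):

#Rebuild the working table at the start of each run and flag every group as unprocessed
sql('Datawarehouse','TRUNCATE TABLE WK_GROUP');
sql('Datawarehouse','INSERT INTO WK_GROUP (GROUP_ID, COMPONENT, PROCESSED) SELECT GROUP_ID, COMPONENT, \'N\' FROM DIM_GROUP');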







We then set up a loop to go through each of our rows and process them.


The script Processed_Count initialised the counter variable for our loop.

$G_ProcessedCount = sql('Datawarehouse','select count(*) from WK_GROUP');

The inside of our loop looks like this:


($G_ProcessedCount >= 1) is there as our loop control. Once that variable is 0 we have processed all the dynamic where clauses in our list.

The Set_Variables script is where we setup the dynamic where clause, and the value that will be assigned to the rows that filter through the where clause.

$G_GroupID = sql('Datawarehouse','select min(GROUP_ID) as GROUP_ID from WK_GROUP where PROCESSED = \'N\'');

$G_GroupLogic  = sql('Datawarehouse','select a.component from WK_GROUP a where a.GROUP_ID = [$G_GroupID]');

$G_GroupID will hold the key that we want to assign to the rows. We select the min because we want the first row that has not been processed yet. As we step through the rows and update the processed rows to Y, we will end up moving to the subsequent N rows until all of them have been processed and updated to Y.

$G_GroupLogic will hold the where clause you need to filter down the rows.

Now all we need to do is apply that in a query.

In the above picture that is done in the data flow called DF_WK_GROUP_ID_Load.

pushdown_sql('Datawarehouse', '{$G_GroupLogic}' ) becomes the where clause for our query. 

$G_GroupID is mapped to the column that holds the dimension table foreign key.

Now you just need to update the row you have processed in the working table from N to Y. We do that in the DF_WK_GROUP_Update data flow.
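
We do that update in a data flow, but if you prefer to keep it in a script step, the equivalent is a one-line sketch using the same working table and variable as above:

#Mark the group we have just processed so the next loop iteration picks up the next unprocessed row
sql('Datawarehouse','UPDATE WK_GROUP SET PROCESSED = \'Y\' WHERE GROUP_ID = [$G_GroupID]');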

Finally, update the $G_ProcessedCount variable in the Update_CountVariable script:
$G_ProcessedCount = $G_ProcessedCount -1;

We'll keep on going through this process until there are no more rows to process.

Monday, November 24, 2014

Files as data sources - Part 5 - Missing Files

To handle files that don't arrive, I check the SYS_FILE table to see if there is an entry for that file for the day. If there is no entry, I add a row to the SYS_FILE table to record the fact that no file arrived, and then I send an email to the person responsible for that file to let them know it did not arrive.

Here is the basic script:


$L_EmailBody  = 'The following files did not arrive today: ';
$L_Counter = 0;

#Check if file $G_ActivityEmfe has arrived
$L_FileName = $G_ActivityEmfe;

If (db_type('Datawarehouse') = 'Oracle')
BEGIN
   #Oracle uses % as its wildcard character, so convert any *s
   $L_FileName = replace_substr( $G_ActivityEmfe ,'*','%');
END

IF (0 = sql('Datawarehouse','SELECT COUNT(*) FROM SYS_FILE WHERE FILE_LOCATION = {$G_FileLocation} AND FILE_NAME like {$L_FileName} AND trunc(FILE_PROCESS_DATETIME) = trunc(SYSDATE)'))
begin
   $L_EmailBody = $L_EmailBody || chr(13) || chr(10) || $G_ActivityEmfe;
   $L_Counter = $L_Counter + 1;
end


#Check if file $G_ActivityEmms has arrived
$L_FileName = $G_ActivityEmms;

If (db_type('Datawarehouse') = 'Oracle')
BEGIN
   #Oracle uses % as its wildcard character, so convert any *s
   $L_FileName = replace_substr( $G_ActivityEmms ,'*','%');
END

IF (0 = sql('Datawarehouse','SELECT COUNT(*) FROM SYS_FILE WHERE FILE_LOCATION = {$G_FileLocation} AND FILE_NAME like {$L_FileName} AND trunc(FILE_PROCESS_DATETIME) = trunc(SYSDATE)'))
begin
   $L_EmailBody = $L_EmailBody || chr(13) || chr(10) || $G_ActivityEmms;
   $L_Counter = $L_Counter + 1;
end

#If any files did not arrive, ie counter > 0, then send email
if ($L_Counter > 0)
begin
   print($L_EmailBody);
   smtp_to($G_AbsentReportingAddresses, 'Absent Files Report', $L_EmailBody , 0, 0);
end
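
Since the same check is repeated for every file, you could wrap it in a small custom function to keep the script short. Here is a minimal sketch, assuming a hypothetical function called CheckFileArrived that takes the file location and file name as parameters and returns the number of SYS_FILE entries recorded for that file today:

#Hypothetical custom function CheckFileArrived($P_FileLocation, $P_FileName)
$L_FileName = $P_FileName;

If (db_type('Datawarehouse') = 'Oracle')
BEGIN
   #Oracle uses % as its wildcard character, so convert any *s
   $L_FileName = replace_substr( $P_FileName ,'*','%');
END

Return sql('Datawarehouse','SELECT COUNT(*) FROM SYS_FILE WHERE FILE_LOCATION = {$P_FileLocation} AND FILE_NAME like {$L_FileName} AND trunc(FILE_PROCESS_DATETIME) = trunc(SYSDATE)');

Each per-file block in the script above then reduces to a call like if (CheckFileArrived($G_FileLocation, $G_ActivityEmfe) = 0) begin ... end.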

Monday, November 17, 2014

Files as data sources - Part 4 - File Errors

I always place the dataflow that is reading the file within a try...catch block to specifically respond to any file errors that may occur.


The two error types you want to look out for are:

  1. Flat file processing errors (1004)
  2. File access errors (1005)
If an error does occur I run the following in the script:


#Record file failure in SYS_FILE table
FileFailed($G_FileLocation,$G_ActivityEmms);

#Archive file
FileMove($G_ActivityEmms, $G_FileLocation, $G_FileErrorLocation);

#Truncate staging table to prevent error data from getting into datamart
sql( 'Datawarehouse','TRUNCATE TABLE STG_ACTIVITY_EMMS_EM');

The FileFailed function has nearly the same code as the FileComplete function. You could actually combine them into a single function that updates the file status to either COMPLETE or FAILED, or anything else you might need.


$L_FileName = $P_FileName;

If (db_type('Datawarehouse') = 'Oracle')
BEGIN
    #Oracle uses % as its wildcard character, so convert any *s
    $L_FileName = replace_substr( $P_FileName,'*','%');
END

sql('Datawarehouse','UPDATE SYS_FILE SET STATUS = \'FAILED\' WHERE FILE_LOCATION = {$P_Directory} AND FILE_NAME like {$L_FileName} AND AUDIT_ID = [$G_AuditID]');
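
The combined version mentioned above could look something like this; a minimal sketch, assuming a hypothetical function called FileSetStatus that takes the status value as a third parameter:

#Hypothetical custom function FileSetStatus($P_Directory, $P_FileName, $P_Status)
$L_FileName = $P_FileName;

If (db_type('Datawarehouse') = 'Oracle')
BEGIN
    #Oracle uses % as its wildcard character, so convert any *s
    $L_FileName = replace_substr( $P_FileName,'*','%');
END

sql('Datawarehouse','UPDATE SYS_FILE SET STATUS = {$P_Status} WHERE FILE_LOCATION = {$P_Directory} AND FILE_NAME like {$L_FileName} AND AUDIT_ID = [$G_AuditID]');

Return 1;

FileComplete and FileFailed then just become calls like FileSetStatus($P_Directory, $P_FileName, 'COMPLETE') and FileSetStatus($P_Directory, $P_FileName, 'FAILED').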

We've already reviewed how the FileMove function works.

Lastly I clear out the staging table so that no partially loaded data goes through the rest of the load.

It would also be a good idea to send an email alerting the owner of the file that something was wrong with it, so that they can take remedial action before resending the file for processing.

Monday, November 10, 2014

Files as data sources - Part 3 - Moving files around

So now that the file has been processed, let's look at what we should do next.

I like to have an archive folder on the file system to keep any files that have been processed. This is so that I can go back to them for auditing and/or to deal with any queries about the data that may come up.

I also update the STATUS field in the SYS_FILE table that references the file to COMPLETE.

Let's start with the update function:

$L_FileName = $P_FileName;

If (db_type('Datawarehouse') = 'Oracle')
BEGIN
    #Oracle uses % as its wildcard character, so convert any *s
    $L_FileName = replace_substr( $P_FileName,'*','%');
END

sql('Datawarehouse','UPDATE SYS_FILE SET STATUS = \'COMPLETE\' WHERE FILE_LOCATION = {$P_Directory} AND FILE_NAME like {$L_FileName} AND AUDIT_ID = [$G_AuditID]');

For this function I pass in the directory and file name, look up the relevant row\s in the SYS_FILE table and then update them to COMPLETE.

The first thing I do is check the database type. If I am using Oracle, I need to convert any *s to %, as that is the wildcard character in Oracle. You can add additional database types here as required.

I look for the rows in the SYS_FILE table that match the file name. The reason I update more than one row at a time is that when multiple files, such as File1, File2 and File3, are processed through a file object that refers to the file name as File*, they are collated into one batch for processing. So if the data flow completed successfully, then all of those files have been loaded.

Once I've updated the rows in the SYS_FILE table, I move the files to the archive directory with a little FileMove function:

#Strip the extension so the wildcard covers all matching files, eg File*.txt becomes File*
$L_FileName = word_ext( $P_FileName ,1,'.');

#Move the file\s from the source directory to the target directory
exec('sh',' -c "mv ' || $P_SourceDirectory || $L_FileName  || ' ' || $P_TargetDirectory || '"',8);

In this example my job server is running on a UNIX OS, but you could re-write this for a Microsoft OS by replacing the 'sh' with 'cmd', and using 'move' instead of 'mv', as sketched below.
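
A minimal sketch of that Windows variant, untested, and assuming the same parameters with directory paths that end in a separator:

$L_FileName = word_ext( $P_FileName ,1,'.');

#cmd /c runs the command and then exits; move is the Windows equivalent of mv
exec('cmd','/c "move ' || $P_SourceDirectory || $L_FileName || ' ' || $P_TargetDirectory || '"',8);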

I use the word_ext function at the top to avoid any issues with file extensions and wildcards. You may not have this issue so will not necessarily have to include the first line of the function.




Monday, November 3, 2014

Files as data sources - Part 2 - Keeping track

In the last post I spoke about how to time your job with the file's arrival. In this post I will talk about how to keep track of your files by keeping some basic metadata, and also how to relate the data in your target destination back to the original file.

Here again is the IsFileAvailable function I am using to wait for the files:

$L_FileFound = wait_for_file($P_FileDirectory || $P_FileName, $P_TimeOut,$P_Interval , -1, $L_FileNames, $L_FileListSize, ',');

if ($L_FileFound = 1)
begin
   $L_Counter = 1;
   WHILE ($L_Counter <= $L_FileListSize)
   begin
      FileNew($P_FileDirectory, word_ext($L_FileNames,$L_Counter,','));
      $L_Counter = $L_Counter + 1;
   end
end

Return $L_FileFound;

Once I have found a file, or files, I then run a routine to step through each of the file names. I use the word_ext function to choose which file name in the list I want to access, and then a counter to step through the list.

For each file name I then call a function I wrote called FileNew:

#Get the time the file was created
$L_FileDateTime = get_file_attribute($P_FullFilePathAndName, 'date_created');

#Get the file type eg .txt .csv .pipe
$L_Type = word_ext($P_FullFilePathAndName,2,'.');

#Get a new file id
$L_File_ID = sql('Datawarehouse','Select max(FILE_ID) from SYS_FILE') + 1;

#Insert a new record into the SYS_FILE table
SQL('Datawarehouse','INSERT INTO SYS_FILE (FILE_ID, DI_FILENAME, FILE_DATETIME, FILE_PROCESS_DATETIME, TYPE, STATUS, AUDIT_ID)
VALUES ([$L_File_ID], {$P_FullFilePathAndName}, to_date({$L_FileDateTime}, \'YYYY/MM/DD hh24:mi:ss\'), sysdate, {$L_Type}, \'STARTED\', [$G_AuditID])');

Return $L_File_ID;

What I am doing here is first gathering some metadata about the file, such as the date it was created and the type of file it is, and then inserting that data into a table called SYS_FILE, which is where I keep track of all the files processed. I also assign a file_id by adding 1 to the max file_id currently in the table.

I set the status of the file to STARTED, to signify that we have started processing this file. Later on I will either set the file to COMPLETE or ERROR.

Here is what the data in the file looks like:


You could always expand the number of columns in the SYS_FILE table to store all sorts of additional metadata about the file such as its source, how it was delivered or who is responsible for it.

Now that I have created a record in the SYS_FILE table for a file, how do I link that to the records that are coming through from the file?

First of all you need to go into your file object and set the source information on the file as follows:


There are similar options on XML and Excel files too.

What this does is add an extra column at the end of your file that will store the file name along with its path. This is particularly important if you are receiving and processing multiple files at once, because DS will batch process all the files at the same time. So let's say you are looking for File* and the files received are File1, File2, and File3. DS will process all 3 files together at once. It will include the file name that a row of data came from in the DI_FILENAME column, and we will use that information to tie the data back to the specific row we have created for that file.

In order to tie up the row of data with the correct SYS_FILE entry, we use the lookup_ext function as follows:
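
The mapping on the FILE_ID column in the staging table ends up as a lookup_ext expression along these lines; a sketch only, where the table owner DBO and the input schema name ACTIVITY_EMMS are assumptions:

lookup_ext([Datawarehouse.DBO.SYS_FILE, 'PRE_LOAD_CACHE', 'MAX'],
           [FILE_ID],
           [NULL],
           [DI_FILENAME, '=', ACTIVITY_EMMS.DI_FILENAME])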


So now that we have our file and have processed it, we need to decide what to do with the file. I'll be covering archiving and updating the SYS_FILE table to show the completed files in the next post.


Monday, October 27, 2014

Files as data sources - Part 1 - Waiting for the files to arrive


In this post I am going to take a look at how to handle file arrival.

You have a few options for dealing with file arrival.
  1. Run the job at a set time each day. If the file is there it gets processed, and if it isn't, you just try again tomorrow.
  2. Run the job frequently, say every couple of minutes. Each time the job runs it attempts to process the file. If the file is there it gets processed, and if it isn't, Data Services just tries again on the next run. This method is useful if the file could arrive multiple times a day and you can't be sure exactly when it will arrive.
  3. Start the job, and have the job wait for the file to arrive. This method is best if you're processing just one instance of the file every day but can't be exactly sure when the file will arrive, and you want the ETL to run as soon as it arrives.
For dealing with all three options above I have written a function called IsFileAvailable that makes use of the built in DS function called wait_for_file. The reason I don't use the wait_for_file function alone is because I have built in additional functionality that will enable me to deal with some of the other issues I mentioned in the introduction. I will be going through those in subsequent posts.

Here is what the function looks like:

$L_FileFound = wait_for_file($P_FileDirectory || $P_FileName, $P_TimeOut,$P_Interval , -1, $L_FileNames, $L_FileListSize, ',');

if ($L_FileFound = 1)
begin
   $L_Counter = 1;
   WHILE ($L_Counter <= $L_FileListSize)
   begin
      FileNew($P_FileDirectory, word_ext($L_FileNames,$L_Counter,','));
      $L_Counter = $L_Counter + 1;
   end
end

Return $L_FileFound;

The above function first waits for the file\s to arrive, and then writes a record into a file handling table using the FileNew function.

I am using the wait_for_file function to determine when the file arrives.

The return values from this function are:
    0 - No file matched
    1 - At least one file was matched
   -1 - The function timed out with no files found
   -2 - One of the input values is illegal

I'm generally only interested in whether a file has been found, ie if the value 1 is returned.

The first few parameters are fairly straightforward.

Firstly it needs to know the name and location of the file you are waiting for. This can contain a wildcard, so if you are waiting for any file that starts with the letters file, you can set that value to be file*.txt. If you are not certain of the extension you can also have it be file*.*, and if you don't even care what the file name is, as long as any file arrives you can set the value as *.* .

The next parameter is how long you would like Data Services to wait for the file to arrive, the timeout parameter. This is set in milliseconds, so if you want Data Services to wait 30 minutes for the file to arrive, then that value should be 30 * 60 (seconds) * 1,000 (milliseconds) to get the value 1,800,000. If the timeout duration expires, 30 minutes in this example, then the wait_for_file function will return the value -1. This means that it looks for the file for 30 minutes, but no file arrived.

The 3rd parameter is how often you want Data Services to check whether the file has arrived. Again, it's the same formula for setting the value. If you want it to have a look every 5 minutes, then it's 5 * 60 * 1,000 to get 300,000.
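
Putting the first three parameters together, a call that waits up to 30 minutes and checks every 5 minutes would look something like this (the directory and file pattern are just placeholders):

$L_FileFound = wait_for_file('/data/inbound/file*.txt', 1800000, 300000, -1, $L_FileNames, $L_FileListSize, ',');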

The next 4 parameters are all about returning the names of the files that Data Services finds.

In this example I have -1 set for the max match parameter. This means that I want DS to return the names of all the matched files that it finds. You could set this to 0 if you don't want any of them, or any other positive number if you only want a specific number of file names returned.

The next parameter is an output parameter that will store the list of file names returned. So let's say you set the 1st parameter in the function to file*.txt, and then there are 3 files in the directory: file1.txt, file2.txt and file3.txt. This variable will hold all three of those file names.

The next parameter will return the number of files found that match the search pattern. So again if you're looking for file*.txt, and 3 files are found that match file*.txt, then this output parameter will return the value 3.

The final parameter in the function allows you to set the list separator for the list of files. In this example I set it to be a comma. So the variable I have above, called $L_FileNames, will end up with the values file1.txt, file2.txt, file3.txt.

The next part of the IsFileAvailable function loops through the list of file names and calls another function I have written called FileNew for each of the file name values in the list. The purpose of the FileNew function is to write a record into my SYS_FILE table for each file found.

I'll be going through the purpose of the SYS_FILE table and how you can use it to tie up data in the target table to the source files in my next post.

Monday, October 20, 2014

Files as Data Sources - Introduction

When files are the source of your data there are a few things to take into consideration:

  • When will the files arrive?
  • How will you keep track of the files processed?
  • How do you relate data in your target tables with the original file?
  • What if multiple files arrive for the same data set?
  • What should you do with files once they've been processed?
  • What happens if there is an error with the file?
  • What should you do if a file doesn't arrive?
For this blog series I am going to go through all the points above. Most of what I discuss will be relevant for both standard text files as well as xml and Excel files.

Here is the standard structure I use when processing a file through Data Services:




So firstly I use a conditional to check whether a file has arrived or not. I've created my own function, IsFileAvailable, to look out for the file's arrival. Yes, I could have used the built-in function wait_for_file, but I have built additional functionality into IsFileAvailable that you might find useful.

If the file arrives I print that out to the trace file. You don't have to do this, but I just find it easier to see what's going on that way.

Then I place the data flow that will process the file within a try catch block. This is so that I can handle any file errors without bringing down the entire job. Within that error handler I can report the file errors to the owner\s of the file and move the file to an error file location.

In the else section of the conditional I place some code to handle the file not arriving.

Over the next few posts I'll break out the detail of how each of the above pieces works.


Monday, May 12, 2014

History preservation and using the filter in the table comparison transform

I just found another really handy use for the new filter field in the table comparison transform in the latest versions of Data Services.

If you are loading a dimension, and you are using history preservation, add a line in the filter field to only bring through current records.
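
For example, if your dimension flags current rows with a CURR_IND column, as the asset team example later in this post does, the filter is as simple as this (the table name is just a placeholder):

DIM_ASSET_TEAM.CURR_IND = 'Y'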


This helps you with two things, one obvious, one not so obvious.

The first thing it helps you with is a little bit of a performance boost. If you think about it, you only need to compare against the current record of a given dimension value. Setting the generated key column makes sure the comparison only takes place against the most recent row, but the table comparison transform still has to go through all the records to get there. If you filter only for the current records, then you can compare against a smaller record set and get through it just a little bit quicker.

Imagine you have roughly ten rows of history per dimension value; this can result in a comparison record set one tenth the size of what you had previously. It may not get you that much of a performance gain, but when you are trying to get your jobs down to near real time, every second counts.

The second thing this helped with is dealing with expired historical data that becomes current again. Let me give you an example.

This company had an asset team that existed between 1-JAN-2013 and 1-JUL-2013. So I ended up with a record in the asset team table that looked something like:


Team A became active again on 1-JAN-2014.

The problem I had was that before I put the filter in the table comparison to only bring through current records, the table comparison transform was ignoring Team A. It would look in the target table for a record with a natural key of A, find the old record there, and, because the effective dates are set by the history preserving transform, everything else was the same, so it treated the row as unchanged. I needed it to process this record as an insert so that the history preservation transform would process it correctly with the new effective date and CURR_IND equal to 'Y'.

By putting the filter in, I was filtering out the old Team A record from my comparison data set. That way the table comparison saw this as a new record and put it through as an insert. This is a fairly unique case where you have historical records for a dimension value but none of them are current. 

I hadn't actually encountered this scenario while working in previous versions of Data Services, but if I had I think the best workaround would have been to create a view where you filter for current records, and then use that as the comparison data set.

Monday, May 5, 2014

SQL Transform - When should you use it?

Never!

You were expecting a longer post about this? Well alright then, never ever!

Whenever I see tons of SQL Transforms around a Data Services job, it is normally a sign of a lazy developer. Someone who wrote out the queries in SQL and, instead of building proper data flows using the Query transform and other built-in Data Services functionality, just copied and pasted the SQL into the SQL transform.

But what's wrong with that? Well plenty!

DS is NOT a glorified scheduler

For starters, you just bought a not-inexpensive ETL tool that you are now using as a query scheduler. If all you want to do is automate your queries, then write them into a stored procedure and schedule it on the database. Don't waste your money on a world class ETL tool.

You've just broken Impact and Lineage analysis

One of the biggest selling points of using a tool like DS, is you can visually see where a field is used in your ETL and where its data ultimately ends up. Likewise, you can see where a particular piece of data came from by tracing it all the way back to the source system.

This is a fantastic governance tool for managing your warehouse. You can see if, and where, making a change to a source system will affect your data warehouse.

The second you put a SQL transform in the way, you've just broken your impact and lineage analysis. DS treats it as a black box, so whether you are using the lineage in the management console, Data Services Designer View Where Used function, or in Information Steward, you have just rendered it useless.

In the Datastore below it appears as if none of the tables are being used in any jobs.


But then after a little searching I find an SQL transform.


I look inside it and what do I find?


And this query is hardly complex. It wouldn't have taken very long to develop as a proper data flow with the Query transform, and now some future developer could make changes to the job thinking that they have everything covered, but not realize that 3 extra tables were hidden inside an SQL transform.

I can't tell you how many times I've needed to make a change, right-clicked on the table in the datastore and chosen View Where Used, made all my changes, only to later discover I've missed one that a developer hid in a SQL transform.

Squinting at code

One of the great things about using a GUI based ETL tool is that you can open a data flow and immediately get an idea of what it is doing. You can see the tables on the screen, see which transforms they flow through and understand what is happening to the data and where it is going. With a SQL transform you have to open it up, squint at the code to try and figure out what it is up to.

For simple SQL that's not a big deal, but a complicated query with plenty of tables and joins.... well now you're wasting my time, and my client's money too!

Should you really never use the SQL transform?

I worked in a company where they had a ban on using the SQL transform. Turn in your code with one in it, and it got returned to you to have it rewritten without the SQL transform. No exceptions.

I will admit, there are times when you will just have to use the SQL transform, but these should be rare exceptions. Sometimes you need to do something very complex, that's maybe only available on the database, so you have to use the SQL transform to take advantage of it.

Before you do it though, think really hard about whether you couldn't achieve the same thing using Data Services built in transforms. Even if you have to split it out over multiple data flows, it will still be better than using the SQL transform.

Monday, April 28, 2014

Generated Key in the Table Comparison Transform

When I first learnt how to use Data Services, I knew that I was supposed to set the Generated Key Column in the table comparison transform, but I never really thought about why I was supposed to do that.

So let's take a look at an example. In this example I have a dimension for Customer with a data warehouse key of CUSTOMER_KEY. This is my generated key column. I also have a field for the natural key that comes from the source system and this is CUSTOMER_NK.

I have been maintaining history on this table, so individual CUSTOMER_NK values can appear in there multiple times, once for each row of history maintained. So if a customer has lived in more than one city, there will be a row in the customer table for each city the customer has lived in.

So data in the table might look something like this:


If a new row for John Smith arrives, let's say he moves to San Antonio, how will Data Services know which of the 3 rows with a CUSTOMER_NK of 'A' to compare against? This is where the Generated Key Column comes in. Data Services looks for the row with the maximum generated key column value and just compares against that row. So for John Smith, DS will only do the compare against the row with CUSTOMER_KEY = 3.

If you didn't set the generated key column value, then DS would compare against all 3 rows for John Smith instead of just the one with the maximum CUSTOMER_KEY.

The history preservation transform grabs the update row from the table comparison transform, generates an update to close off the old row, and issues an insert for the new row for John's new address in San Antonio.

If you are not using history preservation then the Generated Key Column is not that important, but if you are doing type 2 history, then it's essential that you set it for the history preservation to work correctly.



Monday, April 21, 2014

Improving Performance - Part 7 - What's left?

If you have implemented all the things I've written about in parts 1 through 6, then your job should now be significantly faster. In most cases you will have covered 99% of the things that will improve your job's performance, but if you want that little bit extra, here are a few more things to consider:

Join ranking

You can define the order in which Data Services will join tables. Tables with higher join ranks are joined before tables with lower join ranks. Typically you want your largest tables to have higher join ranks than your smaller tables.

You set the join rank in the source table.




This typically only has an effect if the join is being done on the Data Services job server, in other words, when pushdown SQL is not happening. For a full and detailed explanation of how this works, consult the Performance Optimization Guide for DS.

Array fetch size

If you have a powerful set of hardware in place, try raising this value and see what effect it has. Interestingly, even the DS Performance Optimisation Guide just recommends increasing and decreasing the value to see what effect it has, and just going with the value that seems to get you the best performance.

Array fetch size is also set on the source table.


Rows per commit

SAP recommends that you set this at between 500 and 2,000. Again, I just recommend changing the values up and down to see which seems to have the best effect.

Two things to consider:
  1. If you are loading a Long datatype, then the commit size will be 1 (this does not apply to Oracle)
  2. If you are using an overflow file, and an error occurs resulting in a rejected row, then the commit size will also revert back to 1. I'll be doing a post soon on the pros and cons of using the overflow file vs using the validation transform.
Committing one row at a time will be slower than committing 1,000. So consider whether you can load the Long differently (perhaps in a large Varchar field) and decide whether you really need an overflow file.

Chat to your DBA

Seriously, some of them are actually nice people :-) Now that you've fully tuned what you can on DS, the DBA can look at making improvements in the source and target databases. 

Database performance tuning can be different depending on which database you are using, and being an expert in performance tuning on all the databases that DS can connect to would be quite a feat!

So, if you have a Data Flow that is still running too slow, just go to Validation - Display Optimized SQL, and send the result over to your DBA to see if she can work some of that DBA magic.

Anything else?

If you've found something that has significantly improved the performance of one of your Data Services jobs that I did not mention in this series, then please let everyone know in the comments.


Monday, April 14, 2014

Improving Performance - Part 6 - Using the lookup_ext function

The lookup_ext function is often used to lookup foreign keys for your fact table, and to return a default if no value can be found.

One setting that is often overlooked when using the function is the caching method, labeled as Cache spec.


There are 3 options you can choose from:
  1. DEMAND_LOAD_CACHE
  2. NO_CACHE
  3. PRE_LOAD_CACHE
Most people just leave the default (PRE_LOAD_CACHE) in place. But what is the difference between the 3 options?

NO_CACHE - This option reads the values from the lookup table for each individual row, but does not cache the value. 

PRE_LOAD_CACHE - This option will load the entire lookup table into memory first before looking up the values you need.

DEMAND_LOAD_CACHE - This option will load the individual rows into memory as they are identified. So the first time DS needs a row it looks into the database, finds it and places it in memory. The second and subsequent times it looks into its own memory for the value.
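
If you write the lookup out as an expression rather than setting it through the editor, the cache spec is the second element of the lookup table specification. Here is a sketch using placeholder dimension and input schema names:

lookup_ext([Datawarehouse.DBO.DIM_CUSTOMER, 'DEMAND_LOAD_CACHE', 'MAX'],
           [CUSTOMER_KEY],
           [-1],
           [CUSTOMER_NK, '=', STG_SALES.CUSTOMER_NK])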

So which option should you use, and what impact on performance will this have?

DEMAND_LOAD_CACHE - Use this option when you have a small number of input rows, but the table you are looking up values in is large. In this instance, using DEMAND_LOAD_CACHE over PRE_LOAD_CACHE can have a huge impact on performance.

Here is a great real life example. I recently had to run a data flow where the source table had 42 records, and the lookup table had 3,390,904 records.


I ran the job once with PRE_LOAD_CACHE to see how long it would take:


I then ran the job with DEMAND_LOAD_CACHE to see the difference:


As you can see, the performance improvement in this instance is dramatic: 197 seconds vs just 5 seconds.

PRE_LOAD_CACHE - You should use this option when you anticipate accessing a large number of rows in the lookup table.

I needed to use the same 3 million row table as a lookup table again, but this time the source table had 161,280 rows.


This time I ran it with DEMAND_LOAD_CACHE first:




Then I ran it with PRE_LOAD_CACHE:

In this case using PRE_LOAD_CACHE was the better choice, as my source table had far more rows this time, so in effect I was looking up a much greater number of rows in the lookup table than in the previous example.

NO_CACHE - The only reason I can think of to use NO_CACHE is if you are running short on memory and are only accessing very few rows in the lookup table.

So as you can see from the details above, choosing the right cache option can have a huge impact on your job's performance.

In summary the rule is, if you are going to access most of the rows in your lookup table, then use PRE_LOAD_CACHE, if you are only going to access a few rows, then use DEMAND_LOAD_CACHE. 

And if you can't figure out which one to use? Well just run the job once with DEMAND_LOAD_CACHE set, and once with PRE_LOAD_CACHE set and see which one runs the fastest for you :-)

Monday, April 7, 2014

Improving Performance - Part 5 - Table comparison transform

How you use the table comparison transform can have a huge impact on the performance of your job.

There are 3 options available for the comparison method used by the transform:

  • Row-by-row select
  • Cached comparison table
  • Sorted input
Use row-by-row select when you have very few rows to compare in relation to the size of your target table. Doing the individual comparisons for a few dozen rows will be significantly quicker than a full table comparison against a large table.

But, if you have a lot of rows to compare in relation to the size of the table, then don't use this method as it will take significantly longer.

If you have a large number of rows to compare with the target table, then use either cached comparison table or sorted input. In general I find that sorted input is preferable because it seems to be quicker, and it also doesn't require as much memory to work. When it is not possible to sort the table for comparison, that leaves you with cached comparison table as your only option.

You may have noticed that the table comparison transform now has a place where you can add a filter to the rows from the target table. Let's say you know that you are only getting sales figures for 2014 coming through, and therefore you only need to compare against data in the table for 2014; add that condition to the filter box on the transform, as in the example below.
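
Assuming the fact table carries the calendar year in a SALES_YEAR column (a placeholder name), the filter would simply be:

FACT_SALES.SALES_YEAR = 2014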


Now you'll only be comparing against the 2014 subset of the data. If you have many years of data in your target table this can also result in a massive increase in performance.

A word of caution though, be very sure that you are applying the filter correctly, because if you filter too much out of the target table, you may get false inserts.

Sometimes, even when you have chosen the appropriate comparison method, the time to process all the rows is still too slow. In this situation I put an extra step in the process to reduce the number of rows I need to compare against, and this can often give me as much as a 90% improvement in performance.

To do this method you need to have 3 data flows, as opposed to just the one you'd normally need, but even though you'll now have 3 data flows, the total process will still be much quicker.

Let's take the example where you are trying to load a sales fact table. The sales fact table has over one hundred million rows in it, and you are trying to add another ten thousand. The table comparison takes ages as DS is comparing ten thousand rows to one hundred million rows in the fact table.

Step 1 - Create a data flow where you do all the processing you need to do to load your target table, but instead of loading your ultimate target table, just load the data to a new table. I often prefix these tables with DT (for Data Transfer). This is the temporary holding place for the data you will ultimately load into the FACT_SALES table. 


Step 2 - Create a comparison data set. To do this, join the DT table to the actual target table and load the common rows into another table we will call the comparison table. So here I will join FACT_SALES to DT_FACT_SALES and load all the matching rows from FACT_SALES into the COMPARE_FACT_SALES table. 
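
The Step 2 data flow is doing the equivalent of the following, shown here as a script sketch purely to illustrate the logic (the join column SALES_ID stands in for whatever key columns your fact table uses):

#Rebuild the comparison set with only the FACT_SALES rows that share a key with the incoming batch
sql('Datawarehouse','TRUNCATE TABLE COMPARE_FACT_SALES');
sql('Datawarehouse','INSERT INTO COMPARE_FACT_SALES SELECT f.* FROM FACT_SALES f INNER JOIN DT_FACT_SALES d ON f.SALES_ID = d.SALES_ID');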


Step 3 - Now use the COMPARE_FACT_SALES table in your table comparison as the compare table instead of the FACT_SALES table. Instead of comparing against all one hundred million rows, the table comparison will now only need to compare against no more than ten thousand rows. When it finds a row it will do an update, and where it doesn't, well that will be an insert. Then load what it finds into your fact table.


So your final data flow looks like this.



This method is best used when the number of incoming rows is significantly smaller than the size of the target table. The smaller the number of input rows in comparison to the size of the target table, the greater the performance improvement.

On the other hand, if you actually need to compare your input with most of the rows in the target table, then it could actually be slower to use this method, as you have to build the time it takes to create the DT and COMPARE tables into the process.

When used appropriately, this method can make a massive difference to the overall performance of your job.






Tuesday, April 1, 2014

Improving Performance - Part 4 - Bulk loading

Bulk loading is a great way to quickly load a large amount of data into your warehouse. The extra speed is achieved because the database will not log each individual row or apply row locks while the data is being loaded.

I find bulk loading especially useful when loading staging tables. As I typically truncate and load my staging tables I am just interested in getting my data in there as fast as possible, with the more complicated data processing taking place in subsequent data flows. So as I am just dumping the data into the staging tables I don't care about logging or row locking.

There is one situation where I tend to avoid bulk loading, and that is where the source and target tables are in the same database and Data Services is able to push down the entire INSERT..SELECT statement to the database. In that case it tends to be much quicker to let the database run the statement internally, because when you turn on bulk loading all the source data is brought onto the job server and then pushed over to the target, regardless of whether the source and target are from the same datastore or not.

To use bulk loading double click on the target table in your data flow, and then click on the Bulk Loader Options tab.


Under the Bulk load option, change it from None, and then set the rest of the settings to achieve the best performance for the database platform you are using. In this case I was using Oracle, so the options on this tab will differ depending on which database you are using.

You may need to work with your DBA to make sure that your database user has sufficient privileges to run bulk loading.

I have found that bulk loading can give up to a 50% improvement in speed, which is especially noticeable when you are loading large tables.

Monday, March 31, 2014

Improving Performance - Part 3 - Parallel processing

Running at the same time instead of in sequence can sometimes be one of the biggest gains in performance you can make to your job.

Imagine you have 3 data flows of 10 minutes each. If you run them sequentially it will take 30 minutes, but if you run them at the same time it would only take 10 minutes in total.... provided of course that your job server has sufficient resources to run all 3 at the same time.

Now you can either go about working out on your own which data flows can be run sequentially or in parallel, or you can let Data Services work it all out for you.

To get Data Services to do the hard work for you, read my blog post on developing Data Services jobs using components and how the execute only once option helps you out.


Friday, March 28, 2014

Improving Performance - Part 2 - Use the database

One of the clever things Data Services can do is send processing down to the source database rather than doing it itself on the job server. We call this pushdown SQL, because the query is pushed down to the database.

Why is this a good thing?

Imagine a simple query of 2 tables. One table has 1 million rows; this is joined to a second table with 10 rows, and then you put a filter on the 10-row table. So maybe the large table is sales figures, and the small table is cities that you do business in, and you only want sales from 1 city.

Without pushdown SQL, Data Services would first have to retrieve all 1 million sales records from the database, and that data would have to be moved over the network to the job server. Then the 10 city records would be retrieved, filtered down to the one city you want, and joined back to the sales data to get the 100,000 or so sales records you actually want.

With pushdown SQL, the query will be sent to the database where the database will run the query to retrieve the 100,000 records required and just those will be sent over the network to the job server.

Now let's say you were just going to move those records to another table in the same database; in that case DS can create an insert query that runs on the database directly, so no data at all needs to be moved over the network.

So how do you know whether data services is pushing the query down to the database or not? Well in the data flow go to the menu heading Validation and then the sub-menu Display Optimized SQL.


A window will then open showing you the actual SQL that will be run against the source database.


So what are some of the techniques you can use to make sure that as much as possible is being pushed down to the database?

Well first of all, don't create unnecessary datastores. I've often seen people create datastores for different groups of tables that all come from the same database. When you do this DS thinks the tables are in different databases and doesn't realize that it can join them together on the source.

If you are using Oracle, just create one datastore per database, that way you can take advantage of cross schema joins.

Also consider using schemas in SQL Server to logically separate out pieces of your database, such as having a stg and a dm schema instead of placing them in separate databases.

Anything that enables DS to contain as many of your tables as possible in one datastore will improve its ability to push down SQL to the database.

Within your datastores you can also tell DS that you have linked databases set up; this way DS can take advantage of any linked servers as well. Just go to your datastore, click Edit, and the last field under Advanced is where you tell it that you have a linked server set up to another datastore.

Also look at how you are building your dataflows. If you have a long complex dataflow that joins lots of stuff together, then does a bunch of processing, and then joins in another large table, consider splitting it out into two dataflows. Send the data back to the database mid-stream and then pick it up again. Often it turns out that having a few smaller dataflows that can be pushed down to the database runs a lot faster than having one large complex dataflow where DS is pulling all the data onto the server.

Now that you've got as much as possible running on the database, you can take full advantage of all those indexes and partitions my interviewees were telling me about in part one!

Thursday, March 27, 2014

Improving Performance - Part 1 - Only process what you need

I've interviewed quite a few people for positions as Data Services consultants and one of my standard questions is -
                     "Say you have a slow job, what are some of the things you could do with Data Services to improve its performance?"

So many people start off by telling me about indexes and partitions and query optimization, and almost never talk about the way the data flows themselves are designed and the huge impact that can have on performance.

So this series of posts is going to be about how to get the best performance out of your DS jobs using functionality available to you in Data Services, starting with the things you can do that will have the most impact on performance.

I have found that the number one thing you can do to impact your job's performance is to make sure you are only bringing through the data that you need to process in that load. I have seen so many data flows that reprocess entire tables every time they are run. Often they are this way because when the jobs were first created the source systems were fairly small, so the impact on performance was minimal. But as the source systems have grown over the years, DS has had to process more and more data, and the impact on execution time becomes immense.

So how do you make sure you are only processing what you need to process? Well, the first thing is to look for an insert and update datestamp on your source system data. Then use this to make sure you are only bringing through the data you need for that load. To do this, keep a record of the max update date of each load, and then only bring through data with a more recent update date the next time you run the job. This is often referred to as doing a delta load; a sketch of the idea follows below.
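
Here is a minimal sketch of the idea in a script, assuming a control table called SYS_LOAD that stores the high-water mark, a LAST_UPDATE_DATE column on the source data, and a $G_JobName variable (all of these names are assumptions):

#Pick up the high-water mark recorded by the previous run
$G_LastLoadDate = sql('Datawarehouse','SELECT MAX(LAST_LOAD_DATE) FROM SYS_LOAD WHERE JOB_NAME = {$G_JobName}');

#The data flow's where clause then only brings through new or changed rows:
#   SOURCE_TABLE.LAST_UPDATE_DATE > $G_LastLoadDate

#After a successful load, record the max update date just processed as the new high-water mark
$G_NewLoadDate = sql('Datawarehouse','SELECT MAX(LAST_UPDATE_DATE) FROM STG_SOURCE_TABLE');
sql('Datawarehouse','INSERT INTO SYS_LOAD (JOB_NAME, LAST_LOAD_DATE) VALUES ({$G_JobName}, {$G_NewLoadDate})');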

To give you an example of the impact this can have, just yesterday I reduced a single data flow's run time from 20 minutes down to 4 minutes by adding a check to only bring through data amended or added since the last job execution.

What if there is no update date in the source system? Well in that case you may have to look for alternatives, such as keeping track of the highest primary key since the last run, but this only works for inserts. Another option is to use CDC (Change Data Capture). There is documentation that comes with Data Services that explains how to use this, but basically the database keeps a log of all the changes that occurred, and then you use the Map CDC Operation transform to load them into your warehouse. It's not entirely straightforward, so if an update date is available on the table, I always go for that option first.

Wednesday, March 26, 2014

5 Myths about SAP BI

1.       You need SAP ERP to use SAP BI

This is one of the most pervasive myths out there and it just isn't true. In the 15 years I have been working with SAP BI, I have only once used SAP's ERP system as a data source. In fact SAP BI can connect with over 160 different data sources, from big data solutions like Hadoop, through traditional databases like Oracle and SQL Server, right down to a basic Excel spreadsheet. Indeed, one of its greatest strengths is that it can connect to so many different data sources and offers the ability to integrate those sources without having to do any ETL.
 

2.       It's just a reporting tool

For years the primary tool of Business Objects, which SAP bought in 2007, was a report-writing application with excellent ad-hoc and highly formatted reporting capabilities, and that tool still exists today in the form of Web Intelligence. But SAP BI is so much more than Web Intelligence. It offers data discovery in the form of Explorer; highly formatted reports built with Crystal; its dashboard builder is world class; and it's incredibly easy to deploy any and all content to mobile. All of the reporting tools sit on top of a unified and consistent data layer, allowing the same interface to be used for data access across the entire tool-set. Throw in SAP BI's ETL and Data Quality tool, Data Services, and its impact and lineage capabilities in the form of Information Steward, and you have a complete end-to-end BI and Data Management solution.
 

3.       It's too expensive for small to medium sized companies

SAP has offered the Edge version of its BI tools for a number of years now. This provides a very cost effective solution for small to medium companies, while at the same time providing something that is very scalable. Some tools, like Lumira, are even free and come with cloud storage and access.
 

4.       It won't integrate with Office or SharePoint

SAP BI comes with Live Office that allows you to share any report table or graph within Outlook, Excel, Word or PowerPoint. You can also create a query directly in Excel against a Universe. A SharePoint integration kit is also available for full integration with SharePoint.
 

5.       It's difficult and expensive to deploy

Pre-built cloud solutions are available, making it very easy to get an installation up and running. A standard installation can be done on a single server running a Windows OS. Having said that, a big advantage of SAP BI is that you can also install it in a clustered environment running on your choice of OS, from Windows through to your favorite Unix\Linux flavor. And don't forget that Lumira Cloud is ready to go and free, you just need to create an account.