Monday, April 7, 2014

Improving Performance - Part 5 - Table comparison transform

How you use the table comparison transform can have a huge impact on the performance of your job.

There are 3 options available for the comparison method used by the transform:

  • Row-by-row select
  • Cached comparison table
  • Sorted input
Use row-by-row select when you have very few rows to compare in relation to the size of your target table. Doing the individual comparisons for a few dozen rows will be significantly quicker than a full table comparison against a large table.

But if you have a lot of rows to compare in relation to the size of the table, don't use this method, as it will take significantly longer.
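
To make the trade-off concrete, here is a minimal Python sketch of the logic involved — this is what the engine is effectively doing, not actual DS code, and the table, column names and row counts are made up for illustration:

```python
import sqlite3

# A stand-in target table with 100,000 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (key INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO target VALUES (?, ?)",
                 ((k, k * 1.5) for k in range(100_000)))

def row_by_row(rows):
    """One SELECT per incoming row: cheap when there are only a few rows."""
    for key, amount in rows:
        hit = conn.execute("SELECT amount FROM target WHERE key = ?",
                           (key,)).fetchone()
        yield ("update" if hit else "insert", key, amount)

# A few dozen indexed lookups are quick...
ops = list(row_by_row([(5, 7.5), (200_001, 1.0)]))
# ...but millions of round trips like this would swamp the job.
```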

If you have a large number of rows to compare with the target table, use either cached comparison table or sorted input. In general I find sorted input preferable because it tends to be quicker and doesn't need as much memory. When it isn't possible to sort the input for the comparison, that leaves cached comparison table as your only option.
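
The memory difference comes from how the two methods work: cached comparison pulls the whole comparison table into memory, while sorted input only ever needs the current row of each stream, because both sides arrive ordered by the key. A rough sketch of the sorted-input idea (the shape of the algorithm, not DS internals), assuming two key-sorted streams of (key, amount) tuples:

```python
def sorted_compare(source_rows, target_rows):
    """Single merge pass over two key-sorted streams; O(1) memory."""
    target = iter(target_rows)
    current = next(target, None)
    for key, amount in source_rows:            # must be sorted by key
        while current is not None and current[0] < key:
            current = next(target, None)       # skip keys below this one
        if current is not None and current[0] == key:
            if current[1] != amount:
                yield ("update", key, amount)  # matched but changed
        else:
            yield ("insert", key, amount)      # no match in the target

print(list(sorted_compare([(1, 9.5), (3, 5.0)], [(1, 9.0), (2, 2.0)])))
# [('update', 1, 9.5), ('insert', 3, 5.0)]
```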

You may have noticed that the table comparison transform now has a place where you can add a filter to the rows from the target table. Let's say you know that only sales figures for 2014 are coming through, and therefore you only need to compare against the 2014 data in the table: add that condition to the filter box on the transform.


Now you'll only be comparing against the 2014 subset of the data. If you have many years of data in your target table this can also result in a massive increase in performance.

A word of caution, though: be very sure that you are applying the filter correctly. If the filter excludes rows that really do exist in the target table, the transform won't find them during the comparison and will flag the incoming rows as inserts, giving you false inserts (and typically duplicates or primary key failures on load).
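
In Python terms, the filter's effect (and the risk of over-filtering) looks roughly like this; the year/key structure here is hypothetical:

```python
# Hypothetical target rows: (year, key) -> amount.
target = {(2013, 101): 500.0, (2014, 102): 750.0}

def compare_ops(incoming, year):
    # Only target rows passing the filter take part in the comparison,
    # mirroring the filter box on the transform.
    subset = {k: v for k, v in target.items() if k[0] == year}
    for key, amount in incoming.items():
        yield ("update" if key in subset else "insert"), key, amount

batch = {(2014, 102): 800.0}
print(list(compare_ops(batch, year=2014)))  # correct filter -> an update
print(list(compare_ops(batch, year=2013)))  # wrong filter -> a *false insert*
```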

Sometimes, even when you have chosen the appropriate comparison method, processing all the rows is still too slow. In this situation I put an extra step in the process to reduce the number of rows I need to compare against, and this can often give me as much as a 90% improvement in performance.

To use this method you need three data flows instead of the one you'd normally have, but even so, the total process will still be much quicker.

Let's take the example of loading a sales fact table. The fact table has over one hundred million rows in it, and you are trying to add another ten thousand. The table comparison takes ages because DS is comparing ten thousand rows against one hundred million rows in the fact table.

Step 1 - Create a data flow that does all the processing you'd normally do to load your target table, but instead of loading the ultimate target, load the data into a new table. I often prefix these tables with DT (for Data Transfer). This is the temporary holding place for the data you will ultimately load into the FACT_SALES table.


Step 2 - Create a comparison data set. To do this, join the DT table to the actual target table and load the common rows into another table, which we will call the comparison table. So here I join FACT_SALES to DT_FACT_SALES and load all the matching rows from FACT_SALES into the COMPARE_FACT_SALES table.
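
Step 2 is a straightforward join; here is a minimal sketch of the equivalent SQL, run through sqlite3 so it's self-contained, and assuming the two tables share a SALE_ID key (your key columns will differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Stand-ins; in reality both tables already exist and are loaded.
    CREATE TABLE FACT_SALES    (SALE_ID INTEGER PRIMARY KEY, AMOUNT REAL);
    CREATE TABLE DT_FACT_SALES (SALE_ID INTEGER PRIMARY KEY, AMOUNT REAL);

    -- Keep only the FACT_SALES rows that have a counterpart in the new
    -- batch: the result can never be larger than the DT table itself.
    CREATE TABLE COMPARE_FACT_SALES AS
        SELECT f.*
        FROM FACT_SALES f
        JOIN DT_FACT_SALES d ON d.SALE_ID = f.SALE_ID;
""")
```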


Step 3 - Now use the COMPARE_FACT_SALES table as the compare table in your table comparison instead of FACT_SALES. Rather than comparing against all one hundred million rows, the transform now only needs to compare against at most ten thousand. Where it finds a matching row it flags an update, and where it doesn't, that's an insert. Then load the result into your fact table.
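
In the same illustrative terms, step 3 boils down to this (tiny hypothetical tables as plain dicts, SALE_ID -> AMOUNT):

```python
dt_fact_sales      = {1: 10.0, 2: 20.0, 3: 30.0}  # the new batch (step 1)
compare_fact_sales = {1: 10.0, 2: 25.0}           # matching rows (step 2)

# Step 3: the comparison now touches at most len(dt_fact_sales) rows.
for sale_id, amount in dt_fact_sales.items():
    if sale_id not in compare_fact_sales:
        print("insert", sale_id, amount)          # brand-new row
    elif compare_fact_sales[sale_id] != amount:
        print("update", sale_id, amount)          # existing row has changed
    # identical rows fall through: no operation needed
```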


So the final data flow reads from the DT table, runs it through the table comparison against COMPARE_FACT_SALES, and loads the result into FACT_SALES.



This method is best used when the number of incoming rows is significantly smaller than the size of the target table. The smaller the number of input rows in comparison to the size of the target table, the greater the performance improvement.

On the other hand, if you actually need to compare your input with most of the rows in the target table, this method could actually be slower, because you have to build the time it takes to create the DT and COMPARE tables into the overall process.

When used appropriately, this method can make a massive difference to the overall performance of your job.


3 comments:

  1. Wow...very clear explanation...thanks bro

  2. I agree with your approach, but I have used a similar approach in my project: I joined source and target, identified the new rows and the existing rows, then updated the old rows with the source data and inserted the new rows directly. I did not use a separate compare table or a table comparison. Is my approach not good? Could you please guide me?

    Replies
    1. Your approach is fine, but if you don't use the TC to check for updates, you can be firing off updates for rows that haven't actually changed. Not a big deal, but if you are trying to keep track of true updates via some kind of auditing then it could be a problem.
