Monday, April 14, 2014

Improving Performance - Part 6 - Using the lookup_ext function

The lookup_ext function is often used to lookup foreign keys for your fact table, and to return a default if no value can be found.

One setting that is often overlooked when using the function is the caching method, labeled as Cache spec.

There are 3 options you can choose from:
Most people just leave the default (PRE_LOAD_CACHE) in place. But what is the difference between the 3 options?

NO_CACHE - This option reads the values from the lookup table for each individual row, but does not cache the value. 

PRE_LOAD_CACHE - This option will load the entire lookup table into memory first before looking up the values you need.

DEMAND_LOAD_CACHE - This option will load the individual rows into memory as they are identified. So the first time DS needs a row it looks into the database, finds it and places it in memory. The second and subsequent times it looks into its own memory for the value.

So which option should you use, and what impact on performance will this have?

DEMAND_LOAD_CACHE - Use this option when you have a small number of input rows, but the table you are looking up values in is large. In this instance, using DEMAND_LOAD_CACHE over PRE_LOAD_CACHE can have a huge impact on performance.

Here is a great real life example. I recently had to run a data flow where the source table had 42 records, and the lookup table had 3,390,904 records.

I ran the job once with PRE_LOAD_CACHE to see how long it would take:

I then ran the job with DEMAND_LOAD_CACHE to see the difference:

As you can see the performance improvement in this instance is dramatic.197 seconds vs just 5 seconds.

PRE_LOAD_CACHE - You should use this option when you anticipate accessing a large number of rows in the lookup table.

I needed to use the same 3 million row table as a lookup table again, but this time the source table had 161,280 rows.

This time I ran it with DEMAND_LOAD_CACHE first:

Then I ran it with PRE_LOAD_CACHE:

In this case using PRE_LOAD_CACHE was the better choice as my source table had a lot of values when compared to the lookup table, so in effect I was looking up a far greater number of rows in the lookup table than I was in the previous example.

NO_CACHE - The only reason I can think of to use NO_CACHE is if you are running short on memory and are only accessing very few rows in the lookup table.

So as you can see from the details above, choosing the right cache option can have a huge impact on your jobs performance.

In summary the rule is, if you are going to access most of the rows in your lookup table, then use PRE_LOAD_CACHE, if you are only going to access a few rows, then use DEMAND_LOAD_CACHE. 

And if you can't figure out which one to use? Well just run the job once with DEMAND_LOAD_CACHE set, and once with PRE_LOAD_CACHE set and see which one runs the fastest for you :-)

Monday, April 7, 2014

Improving Performance - Part 5 - Table comparison transform

How you use the table comparison transform can have a huge impact on the performance of your job.

There are 3 options available for the comparison method used by the transform:

  • Row-by-row select
  • Cached comparison table
  • Sorted input
Use row-by-row select when you have very few rows to compare in relation to the size of your target table. Doing the individual comparisons for a few dozen rows will be significantly quicker than a full table comparison against a large table.

But, if you have a lot of rows to compare in relation to the size of the table, then don't use this method as it will take significantly longer.

If you have large number of rows to compare with the target table, then use either cached comparison table or sorted input. In general I find that sorted input is preferable because it seems to be quicker, and also it doesn't require as much memory to work. When it is not possible to sort the table for comparison then that leaves you with cached comparison table as your only option.

You may have noticed that the table comparison transform now has a place where you can add a filter to the rows from the target table. Lets say know that you are only getting sales figures for 2014 coming through, and therefore you only need to compare against data in the table for 2014, then add that to the filter box on the transform.

Now you'll only be comparing against the 2014 subset of the data. If you have many years of data in your target table this can also result in a massive increase in performance.

A word of caution though, be very sure that you are applying the filter correctly, because if you filter too much out of the target table, you may get false inserts.

Sometimes when you have chosen the appropriate the comparison method the time to process all the rows is still too slow. In this situation I put an extra step in the process to reduce the number of rows I need to compare against, and this can often give me as much as a 90% increase in performance.

To do this method you need to have 3 data flows, as opposed to just the one you'd normally need, but even though you'll now have 3 data flows, the total process will still be much quicker.

Lets take the example of where you are trying to load a sales fact table. The sales fact table has over one hundred million rows in it, and you are trying to add another ten thousand. The table comparison is taking ages as DS is comparing ten thousand rows to one hundred millions rows in the fact table.

Step 1 - Create a data flow where you do all the processing you need to do to load your target table, but instead of loading your ultimate target table, just load the data to a new table. I often prefix these tables with DT (for Data Transfer). This is the temporary holding place for the data you will ultimately load into the FACT_SALES table. 

Step 2 - Create a comparison data set. To do this, join the DT table to the actual target table and load the common rows into another table we will call the comparison table. So here I will join FACT_SALES to DT_FACT_SALES and load all the matching rows from FACT_SALES into the COMPARE_FACT_SALES table. 

Step 3 - Now use the COMPARE_FACT_SALES table in your table comparison as the compare table instead of the FACT_SALES table. Instead of comparing against all one hundred million rows, the table comparison will now only need to compare against no more than ten thousand rows. When it finds a row it will do an update, and where it doesn't, well that will be an insert. Then load what it finds into your fact table.

So your final data flow looks like this.

This method is best used when the number of incoming rows is significantly smaller than the size of the target table. The smaller the number of input rows in comparison to the size of the target table, the greater the performance improvement.

On the other hand, if you actually need to compare your input with most of the rows in the target table then it could actually be slower to use this method as you have to build in the time in takes to create the DT and COMPARE tables into the process.

When used appropriately, this method can make a massive difference to the overall performance of your job,.

Tuesday, April 1, 2014

Improving Performance - Part 4 - Bulk loading

Bulk loading is a great way to quickly load a large amount of data into your warehouse. The extra speed is achieved because the database will not log each individual row or apply row locks while the data is being loaded.

I find bulk loading especially useful when loading staging tables. As I typically truncate and load my staging tables I am just interested in getting my data in there as fast as possible, with the more complicated data processing taking place in subsequent data flows. So as I am just dumping the data into the staging tables I don't care about logging or row locking.

There is one situation where I tend to avoid bulk loading, and that is where the source and target tables are in the same database, and Data Services is able to push down the entire INSERT..SELECT statement to the database. This is because it tends to be much quicker to let the database run this internally, because when you turn on bulk loading all the source data is brought onto the server, and then pushed over to the target, regardless of whether the source and target are from the same datastore or not.

To use bulk loading double click on the target table in your data flow, and then click on the Bulk Loader Options tab.

Under the bulk load option change it from none, and then continue to set the rest of the settings to achieve the best performance for the database platform you are using. In this case I was using Oracle, so the options on this tab will be different depending on which database you are using.

You may need to work with your DBA to make sure that you database user has sufficient privileges to run bulk loading.

I have found that using bulk loading can have up to a 50% improvement in speed, which is especially noticeable when you are loading large tables.

Monday, March 31, 2014

Improving Performance - Part 3 - Parallel processing

Running at the same time instead of in sequence can sometimes be one of the biggest gains in performance you can make to your job.

Imagine you have 3 data flows of 10 minutes each. If you run them sequentially it will take 30 minutes, but if you run them at the same time it would only take 10 minutes in total.... provided of course that your job server has sufficient resources to run all 3 at the same time.

Now you can either go about working out on your own which data flows can be run sequentially or in parallel, or you can let Data Services work it all out for you.

To get Data Services to do the hard work for you, read my blog post on developing Data Services jobs using components and how the execute only once option helps you out.

Friday, March 28, 2014

Improving Performance - Part 2 - Use the database

One of the clever things that Data Services can do, is it can send processing down to the source database rather than doing it itself on the job server. We call this pushdown SQL, because the query is pushed down to the database.

Why is this a good thing?

Imagine a simple query of 2 tables. 1 table has 1 million rows, this is joined to a second table with 10 rows, and then you put a filter on the 10 row table. So maybe the large table is sales figures, and the small table is cities that you do business in, and you only want sales from 1 city.

Without pushdown SQL Data Services would first have to retrieve all 1 million sales records from the database and that data would have to be moved over the network to the job server. Then the 10 city records would be retrieved and it would have to filter the 10 city records to the city you want, and then would have to join that back to sales to get the 100,000 or so sales records you actually want.

With pushdown SQL, the query will be sent to the database where the database will run the query to retrieve the 100,000 records required and just those will be sent over the network to the job server.

Now lets say you were just going to move those records to another table in the same database, well in that case DS can create an insert query that will get run on the database directly, so no data at all needs to be moved over the network.

So how do you know whether data services is pushing the query down to the database or not? Well in the data flow go to the menu heading Validation and then the sub-menu Display Optimized SQL.

A window will then open showing you the actual SQL that will be run against the source database.

So what are some of the techniques you can use to make sure that as much as possible is being pushed down to the database?

Well first of all, don't create unnecessary datastores. I've often seen people create datastores for different groups of tables that all come from the same database. When you do this DS thinks the tables are in different databases and doesn't realize that it can join them together on the source.

If you are using Oracle, just create one datastore per database, that way you can take advantage of cross schema joins.

Also consider using schemas in SQL Server to logically separate out pieces of your database such as having an stg, and a dm schema instead of placing them in separate databases.

Anything which enables to DS to contain as many of your tables in one datastore as possible, will improve its ability to push down sql to the database.

Within your datastores you can also tell DS that you have linked databases setup, this way DS can take advantage of any linked server you have setup as well. Just go to your datastore, click edit, and the last field under advanced is where you tell it that you have a linked server setup to another datastore.

Also look at how you are building your dataflows. If you have a long complex dataflow that joins lots of stuff together, then does a bunch of processing, and then joins in another large table; consider splitting this out into two dataflows. Send the data back to the database mid-stream and then pick it up again. Often it turns out that having a few smaller dataflows that can be pushed down to the database runs a lot faster than having one large complex dataflow where DS is pulling all the data onto the server.

Now that you've got as much as possible running on the database, you can take full advantage of all those indexes and partitions my interviewees were telling me about in part one!

Thursday, March 27, 2014

Improving Performance - Part 1 - Only process what you need

I've interviewed quite a few people for positions as Data Services consultants and one of my standard questions is -
                     "Say you have a slow job, what are some of the things you could do with Data Services to improve its performance?"

So many people start off by telling me about indexes and partitions and query optimization, and almost never talk about the way the data flows themselves are designed and the huge impact that can have on performance.

So this series of posts is going to be about how to get the best performance out of your DS jobs using functionality available to you in Data Services, starting with the things you can do that will have the most impact on performance.

I have found that the number one thing you can do to impact your jobs performance is to make sure you are only bringing though the data that you need to process in that load. I have seen so many data flows that reprocess entire tables every time they are run. Often they are this way because when the jobs were first created the source systems were fairly small, so the impact on performance was minimal. But as the source systems have grown over the years DS is having to process more and more data, and the impact on execution time becomes immense.

So how do you make sure you are only processing what you need to process? Well the first thing is to look for an insert and update datestamp on your source system data. Then use this to make sure you are only bringing though the data you need for that load. To do this, keep a record of the max update date of each load, and then only bring through data with a more recent update date the next time you run the job. This is often referred to as doing a delta load.

To give you an example of the impact this can have, just yesterday I reduced a single data flow's run time from 20 minutes down to 4 minutes by adding a check to only bring through data amended or added since the last job execution.

What if there is no update date in the source system? Well in that case you may have to look for alternatives such as keeping track of the highest primary key since last run, but this only works on inserts. Another option is to use CDC (Change Data Capture). There is documentation that comes with Data Services that explains how to use this, but basically, the database keeps a log of all the changes that occurred, and then you use the Map CDC Operation transform to load them into your warehouse. Its not entirely straightforward, so if an update date is available on the table, I always go for that option first.

Wednesday, March 26, 2014

5 Myths about SAP BI

1.       You need SAP ERP to use SAP BI

This is one of the most pervasive myths out there and it just isn't true. In the 15 years I have been working with SAP BI, I have only once used SAP's ERP system as a data source. In fact SAP BI can connect with over 160 different data sources from big data solutions like Hadoop, through the traditional databases like Oracle and SQL Server, right down to a basic Excel spreadsheet. In fact on of its greatest strengths is that it can connect to so many different data sources, and offers the ability to integrate those sources without having to do any ETL.

2.       It's just a reporting tool

For years Business Objects, which SAP bought in 2006, primary tool was a report writing application with excellent ad-hoc and highly formatted report capabilities, and that tool still exists today in the form of Web Intelligence. But SAP BI is so much more the Web Intelligence. It offers data discovery in the form of Explorer; highly formatted reports built with Crystal; its dashboard builder is world class and its incredibly easy to deploy any and all content to mobile. All of the reporting tools sit on top of a unified and consistent data layer, allowing for the same interface to be used for data access across the entire tool-set. Throw in SAP BI's ETL and Data Quality tool Data Services, and its impact and lineage capabilities in the form of Information Steward, and you have an end-to-end complete BI and Data Management solution.

3.       It's too expensive for small to medium sized companies

SAP has offered the Edge version of its BI tools for a number of years now. This provides a very cost effective solution for small to medium companies, while at the same time provides for something that is very scale-able. Some tools, like Lumira, are even free and come with cloud storage and access.

4.       It won't integrate with Office or SharePoint

SAP BI comes with Live Office that allows you to share any report table or graph within Outlook, Excel, Word or PowerPoint. You can also create a query directly in Excel against a Universe. A SharePoint integration kit is also available for full integration with SharePoint.

5.       It's difficult and expensive to deploy

Pre-built cloud solutions are available making it very easy to get an installation up and running. A standard installation can be a against a single server running a Windows OS. Having said that, a big advantage of SAP BI is that you can also install it in a clustered environment running on your choice of OS from Windows through to your favorite Unix\Linux flavor. And don't forget that Lumira Cloud is ready to go and free, you just need to create an account.