I've interviewed quite a few people for positions as Data Services consultants and one of my standard questions is -
"Say you have a slow job, what are some of the things you could do with Data Services to improve its performance?"
So many people start off by telling me about indexes and partitions and query optimization, and almost never talk about the way the data flows themselves are designed and the huge impact that can have on performance.
So this series of posts is going to be about how to get the best performance out of your DS jobs using functionality available to you in Data Services, starting with the things you can do that will have the most impact on performance.
I have found that the number one thing you can do to impact your jobs performance is to make sure you are only bringing though the data that you need to process in that load. I have seen so many data flows that reprocess entire tables every time they are run. Often they are this way because when the jobs were first created the source systems were fairly small, so the impact on performance was minimal. But as the source systems have grown over the years DS is having to process more and more data, and the impact on execution time becomes immense.
So how do you make sure you are only processing what you need to process? Well the first thing is to look for an insert and update datestamp on your source system data. Then use this to make sure you are only bringing though the data you need for that load. To do this, keep a record of the max update date of each load, and then only bring through data with a more recent update date the next time you run the job. This is often referred to as doing a delta load.
To give you an example of the impact this can have, just yesterday I reduced a single data flow's run time from 20 minutes down to 4 minutes by adding a check to only bring through data amended or added since the last job execution.
What if there is no update date in the source system? Well in that case you may have to look for alternatives such as keeping track of the highest primary key since last run, but this only works on inserts. Another option is to use CDC (Change Data Capture). There is documentation that comes with Data Services that explains how to use this, but basically, the database keeps a log of all the changes that occurred, and then you use the Map CDC Operation transform to load them into your warehouse. Its not entirely straightforward, so if an update date is available on the table, I always go for that option first.
No comments:
Post a Comment