Friday, March 28, 2014

Improving Performance - Part 2 - Use the database

One of the clever things that Data Services can do is send processing down to the source database rather than doing it itself on the job server. We call this pushdown SQL, because the query is pushed down to the database.

Why is this a good thing?

Imagine a simple query of two tables: one table has 1 million rows, it is joined to a second table with 10 rows, and you put a filter on the 10-row table. So maybe the large table is sales figures, the small table is the cities you do business in, and you only want sales from one city.

Without pushdown SQL, Data Services would first have to retrieve all 1 million sales records from the database and move that data over the network to the job server. Then it would retrieve the 10 city records, filter them down to the city you want, and join the result back to sales to get the 100,000 or so sales records you actually want.

With pushdown SQL, the whole query is sent to the database; the database runs it, retrieves just the 100,000 records required, and only those are sent over the network to the job server.
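For the example above, the pushed-down query might look something like this (the table and column names are invented for illustration):

    SELECT s.SALE_ID, s.CITY_ID, s.AMOUNT
    FROM   SALES s
    INNER JOIN CITY c
       ON  s.CITY_ID = c.CITY_ID
    WHERE  c.CITY_NAME = 'London'

The database does the join and the filter, and only the matching sales rows travel over the network.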

Now let's say you were just going to move those records to another table in the same database. In that case DS can create an insert statement that gets run on the database directly, so no data at all needs to be moved over the network.
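Assuming a hypothetical target table called CITY_SALES in the same datastore, the statement DS generates could be a single INSERT ... SELECT along these lines:

    INSERT INTO CITY_SALES (SALE_ID, CITY_ID, AMOUNT)
    SELECT s.SALE_ID, s.CITY_ID, s.AMOUNT
    FROM   SALES s
    INNER JOIN CITY c
       ON  s.CITY_ID = c.CITY_ID
    WHERE  c.CITY_NAME = 'London'

Everything runs inside the database engine; the job server just issues the statement and waits for it to finish.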

So how do you know whether Data Services is pushing the query down to the database or not? In the data flow, go to the Validation menu and then the sub-menu Display Optimized SQL.


A window will then open showing you the actual SQL that will be run against the source database.
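A rough rule of thumb for reading that window: if you see a separate SELECT for each source table, something like the sketch below, DS is reading the tables individually and doing the join itself on the job server; if you see a single joined SELECT like the one earlier, the work has been pushed down (names invented again):

    SELECT SALE_ID, CITY_ID, AMOUNT FROM SALES
    SELECT CITY_ID, CITY_NAME FROM CITY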


So what are some of the techniques you can use to make sure that as much as possible is being pushed down to the database?

Well first of all, don't create unnecessary datastores. I've often seen people create separate datastores for different groups of tables that all come from the same database. When you do this, DS thinks the tables are in different databases and doesn't realize that it can join them together at the source.

If you are using Oracle, just create one datastore per database; that way you can take advantage of cross-schema joins.
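With one datastore covering the whole database, DS is free to generate a cross-schema join, something along these lines (the STG and DM schema names are just examples):

    SELECT s.SALE_ID, s.AMOUNT
    FROM   STG.SALES s
    INNER JOIN DM.CITY c
       ON  s.CITY_ID = c.CITY_ID

If STG and DM had been defined as separate datastores, DS would instead pull both tables back to the job server and join them there.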

Also consider using schemas in SQL Server to logically separate out pieces of your database, such as having a stg and a dm schema instead of placing them in separate databases.
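Setting that up is straightforward. A minimal sketch in SQL Server (schema and table names are examples; each CREATE SCHEMA has to run in its own batch):

    CREATE SCHEMA stg;
    GO
    CREATE SCHEMA dm;
    GO
    -- Move an existing table from the default dbo schema into stg
    ALTER SCHEMA stg TRANSFER dbo.SALES;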

Anything that enables DS to see as many of your tables as possible in one datastore will improve its ability to push SQL down to the database.

Within your datastores you can also tell DS that you have linked databases set up, so DS can take advantage of any linked servers you have as well. Just go to your datastore, click Edit, and the last field under Advanced is where you tell it that you have a linked server set up to another datastore.
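With the link declared, DS can generate queries that use SQL Server's four-part naming to join across servers, roughly like this (the REMOTESRV linked server and RefDB database are made-up names):

    SELECT s.SALE_ID, s.AMOUNT
    FROM   dbo.SALES s
    INNER JOIN REMOTESRV.RefDB.dbo.CITY c
       ON  s.CITY_ID = c.CITY_ID
    WHERE  c.CITY_NAME = 'London'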

Also look at how you are building your dataflows. If you have a long, complex dataflow that joins lots of tables together, does a bunch of processing, and then joins in another large table, consider splitting it into two dataflows: send the data back to the database mid-stream and then pick it up again. It often turns out that a few smaller dataflows that can each be pushed down to the database run a lot faster than one large, complex dataflow where DS pulls all the data onto the job server.
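The payoff is that each of the smaller dataflows can collapse into its own pushed-down statement, conceptually something like this (the staging and target table names are hypothetical):

    -- Dataflow 1: the heavy join, landed back into a staging table
    INSERT INTO STG_SALES_ENRICHED (SALE_ID, PRODUCT_ID, CITY_NAME, AMOUNT)
    SELECT s.SALE_ID, s.PRODUCT_ID, c.CITY_NAME, s.AMOUNT
    FROM   SALES s
    INNER JOIN CITY c ON s.CITY_ID = c.CITY_ID
    WHERE  c.CITY_NAME = 'London';

    -- Dataflow 2: picks the staged rows up and joins in the other large table
    INSERT INTO DM_SALES_FINAL (SALE_ID, PRODUCT_NAME, CITY_NAME, AMOUNT)
    SELECT e.SALE_ID, p.PRODUCT_NAME, e.CITY_NAME, e.AMOUNT
    FROM   STG_SALES_ENRICHED e
    INNER JOIN PRODUCT p ON e.PRODUCT_ID = p.PRODUCT_ID;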

Now that you've got as much as possible running on the database, you can take full advantage of all those indexes and partitions my interviewees were telling me about in part one!
