Tuesday, April 1, 2014

Improving Performance - Part 4 - Bulk loading

Bulk loading is a great way to quickly load a large amount of data into your warehouse. The extra speed is achieved because the database will not log each individual row or apply row locks while the data is being loaded.
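As a purely conceptual illustration (this is not the SQL that Data Services generates, and the table names are made up), the difference is similar to a conventional INSERT versus an Oracle direct-path INSERT:

    -- Conventional insert: every row goes through the buffer cache and is fully logged.
    INSERT INTO stg_sales SELECT * FROM src_sales;

    -- Direct-path insert: blocks are written above the table's high-water mark,
    -- generating far less undo (and minimal redo if the table is NOLOGGING).
    -- Bulk loading relies on this kind of minimally logged, minimally locked path.
    INSERT /*+ APPEND */ INTO stg_sales SELECT * FROM src_sales;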

I find bulk loading especially useful when loading staging tables. Since I typically truncate and reload my staging tables, I am only interested in getting the data in as fast as possible; the more complicated processing takes place in subsequent data flows. Because I am just dumping data into the staging tables, I don't need per-row logging or row locking.

There is one situation where I tend to avoid bulk loading: when the source and target tables are in the same database and Data Services can push the entire INSERT..SELECT statement down to the database. In that case it is usually much quicker to let the database run the statement internally, because turning on bulk loading forces all the source data to be brought onto the job server and then pushed back over to the target, regardless of whether the source and target come from the same datastore.
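As a rough sketch of what full pushdown looks like (the schema, table, and column names here are hypothetical), Data Services sends a single statement to the database and no rows ever travel through the job server:

    -- Fully pushed-down data flow: the database does all the work internally.
    INSERT INTO dwh.stg_customer (customer_id, customer_name, load_date)
    SELECT customer_id,
           UPPER(customer_name),   -- database function, so pushdown is still possible
           SYSDATE
    FROM   src.customer;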

To use bulk loading, double-click on the target table in your data flow and then click on the Bulk Loader Options tab.


Change the Bulk load option from None, then adjust the remaining settings to get the best performance for the database platform you are using. In this case I was using Oracle, so the options on this tab will differ depending on which database you are loading into.

You may need to work with your DBA to make sure that your database user has sufficient privileges to run bulk loading.
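As a rough, Oracle-flavoured example of the kind of grants involved (the user, schema and table names are made up, and the exact privileges depend on your database and the bulk load method you choose):

    -- Hypothetical grants so the Data Services login can bulk load the staging table.
    GRANT CREATE SESSION TO ds_loader;
    GRANT INSERT ON dwh.stg_customer TO ds_loader;
    -- Direct-path or file-based methods may have extra requirements (for example
    -- NOLOGGING on the table, or access to a directory for the data files);
    -- check the details with your DBA.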

I have found that bulk loading can improve load speed by up to 50%, which is especially noticeable when you are loading large tables.

2 comments:

  1. Regarding, "Data Services is able to push down the entire INSERT..SELECT statement to the database", I have the source and target in the same database in Netezza, but it does not push down the entire INSERT..SELECT statement. What checks do I need to carry out? What am I doing wrong? Please help.

    Replies
    1. This all depends on what is in your data flow. If you have a source table, a target table and a query transform in between them, and all the functions in the query transform are database functions, then it should use pushdown.

      The moment you use a Data Services-only function, or use a transform other than the Query transform, Data Services will need to bring the data onto the DS server to complete the data flow.

      Hope that explanation helped!
