One setting that is often overlooked when using the function is the caching method, labeled as Cache spec.
There are 3 options you can choose from:
- DEMAND_LOAD_CACHE
- NO_CACHE
- PRE_LOAD_CACHE
Most people just leave the default (PRE_LOAD_CACHE) in place. But what is the difference between the 3 options?
NO_CACHE - This option reads the values from the lookup table for each individual row, but does not cache the value.
PRE_LOAD_CACHE - This option will load the entire lookup table into memory first before looking up the values you need.
DEMAND_LOAD_CACHE - This option will load the individual rows into memory as they are identified. So the first time DS needs a row it looks into the database, finds it and places it in memory. The second and subsequent times it looks into its own memory for the value.
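To make the difference concrete, here is a minimal Python sketch of the three behaviours described above. It is not how DS implements them internally; `FakeLookupTable` and the three `lookup_*` functions are hypothetical names made up for illustration, and the "database" is just a dictionary that counts how often it is read.

```python
from typing import Dict, Hashable, Iterable, List, Optional


class FakeLookupTable:
    """Hypothetical stand-in for the database lookup table; counts its reads."""

    def __init__(self, rows: Dict[Hashable, str]) -> None:
        self._rows = rows
        self.single_row_reads = 0  # number of one-key queries issued
        self.full_table_reads = 0  # number of bulk reads of every row

    def fetch_one(self, key: Hashable) -> Optional[str]:
        self.single_row_reads += 1
        return self._rows.get(key)

    def fetch_all(self) -> Dict[Hashable, str]:
        self.full_table_reads += 1
        return dict(self._rows)


def lookup_no_cache(table: FakeLookupTable, keys: Iterable[Hashable]) -> List[Optional[str]]:
    # NO_CACHE: every input row goes to the table, even for repeated keys.
    return [table.fetch_one(key) for key in keys]


def lookup_pre_load_cache(table: FakeLookupTable, keys: Iterable[Hashable]) -> List[Optional[str]]:
    # PRE_LOAD_CACHE: read the whole table into memory once, then look up locally.
    cache = table.fetch_all()
    return [cache.get(key) for key in keys]


def lookup_demand_load_cache(table: FakeLookupTable, keys: Iterable[Hashable]) -> List[Optional[str]]:
    # DEMAND_LOAD_CACHE: go to the table only the first time a key is seen.
    cache: Dict[Hashable, Optional[str]] = {}
    results = []
    for key in keys:
        if key not in cache:
            cache[key] = table.fetch_one(key)
        results.append(cache[key])
    return results
```

With five input rows that contain only two distinct keys, `lookup_no_cache` issues five single-row reads, `lookup_demand_load_cache` issues two, and `lookup_pre_load_cache` issues none but reads the entire table once, however large it is.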
So which option should you use, and what impact on performance will this have?
DEMAND_LOAD_CACHE - Use this option when you have a small number of input rows, but the table you are looking up values in is large. In this instance, using DEMAND_LOAD_CACHE over PRE_LOAD_CACHE can have a huge impact on performance.
Here is a real-life example. I recently had to run a data flow where the source table had 42 records and the lookup table had 3,390,904 records.
I ran the job once with PRE_LOAD_CACHE to see how long it would take:
I then ran the job with DEMAND_LOAD_CACHE to see the difference:
As you can see, the performance improvement in this instance is dramatic: 197 seconds with PRE_LOAD_CACHE vs just 5 seconds with DEMAND_LOAD_CACHE.
PRE_LOAD_CACHE - You should use this option when you anticipate accessing a large number of rows in the lookup table.
I needed to use the same 3 million row table as a lookup table again, but this time the source table had 161,280 rows.
This time I ran it with DEMAND_LOAD_CACHE first:
Then I ran it with PRE_LOAD_CACHE:
In this case PRE_LOAD_CACHE was the better choice. My source table had far more rows than in the previous example, so I was looking up a much larger number of rows in the lookup table.
NO_CACHE - The only reason I can think of to use NO_CACHE is if you are running short on memory and are only accessing very few rows in the lookup table.
So as you can see from the details above, choosing the right cache option can have a huge impact on your job's performance.
In summary, the rule is: if you are going to access most of the rows in your lookup table, use PRE_LOAD_CACHE; if you are only going to access a few rows, use DEMAND_LOAD_CACHE.
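The rule can be turned into a rough back-of-the-envelope estimate. The sketch below is only a toy cost model, not anything DS provides: the function names are hypothetical, and the default cost ratio (one individual lookup costing as much as reading 50 rows in bulk) is a made-up placeholder rather than a measurement, so replace it with a figure from your own environment.

```python
from typing import Dict


def estimate_cache_costs(
    distinct_keys: int,
    lookup_rows: int,
    single_lookup_cost: float = 50.0,  # assumed cost of one individual lookup
    bulk_row_cost: float = 1.0,        # assumed cost per row of a bulk read
) -> Dict[str, float]:
    """Return a rough relative cost for each caching option (arbitrary units)."""
    return {
        # Pre-loading pays for every row in the lookup table, used or not.
        "PRE_LOAD_CACHE": lookup_rows * bulk_row_cost,
        # Demand loading pays one individual lookup per distinct key requested.
        "DEMAND_LOAD_CACHE": distinct_keys * single_lookup_cost,
    }


def suggest_cache_spec(distinct_keys: int, lookup_rows: int, **costs: float) -> str:
    """Pick the option with the lower estimated cost under the assumed ratio."""
    estimates = estimate_cache_costs(distinct_keys, lookup_rows, **costs)
    return min(estimates, key=estimates.get)


# Row counts from the two examples above; source row count is used as an
# upper bound on the number of distinct keys.
print(suggest_cache_spec(42, 3_390_904))       # DEMAND_LOAD_CACHE
print(suggest_cache_spec(161_280, 3_390_904))  # PRE_LOAD_CACHE
```

The crossover point depends entirely on the assumed ratio, which is why an estimate like this is only a starting point.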
And if you can't figure out which one to use? Well, just run the job once with DEMAND_LOAD_CACHE set and once with PRE_LOAD_CACHE set, and see which one runs faster for you :-)
Great blog!
I'm a DS Certified too and I liked the way you share your knowledge.
If you need some help, please let me know.
Best regards from Brazil,
Vinícius Fossaluza.
Thanks Vinícius, glad you like it.
Impressed with the explanation....i have used lookup_ext many times but i did not test in this way...next time i will test like this and come back with the results
Hello Sean A. Hayward,
It's a Great Blog.....
Thank You So Much....