Monday, May 12, 2014

History preservation and using the filter in the table comparison transform

I just found another really handy use for the new filter field in the table comparison transform in the latest versions of Data Services.

If you are loading a dimension, and you are using history preservation, add a line in the filter field to only bring through current records.


This helps you with two things, one obvious, one not so obvious.

The first thing thing it helps you with is a little bit of a performance boost. If you think about it, you only need to compare against the current record of a given dimension value. Setting the generated key column makes sure the comparison only takes place against the most recent row, but the table comparison transform still has to go through all the records to get there. If you filter only for the current records, then you can compare against a smaller record set and get through it just a little bit quicker.

Imagine you have a roughly ten rows of history per dimension, this can result in a record set one tenth of what you had previously. It may not get you that much of a performance gain, but when you are trying to get your jobs down to near real time, then every second counts.

The second thing this helped with is dealing with expired historical data that becomes current again. Let me give you and example.

This company had an asset team that existed between 1-JAN-2013 and 1-JUL-2013. So I ended up with a record in the asset team table that looked something like:


Team A became active again on 1-JAN-2014.

The problem I had was that before I put the filter in the table comparison to only bring through current records, the table comparison transform was ignoring Team A. It was doing this because it would look in the target table for a record with a natural key of A. It found the old record there, and because the effective dates are set by the history preserving transform, it was ignoring it because everything else was the same. I needed it to process this record as an insert so that the history preservation transform would process it correctly with the new effective date and CURR_IND equal to 'Y'.

By putting the filter in, I was filtering out the old Team A record from my comparison data set. That way the table comparison saw this as a new record and put it through as an insert. This is a fairly unique case where you have historical records for a dimension value but none of them are current. 

I hadn't actually encountered this scenario while working in previous versions of Data Services, but if I had I think the best workaround would have been to create a view where you filter for current records, and then use that as the comparison data set.

Monday, May 5, 2014

SQL Transform - When should you use it?

Never!

You were expecting a longer post about this? Well alright then, never ever!

Whenever I see tons of SQL Transforms around a Data Services job, it is normally a sign of a lazy developer. Some-one who had written out the queries in SQL, and instead of building proper data flows using the Query transform and other built in Data Services functionality, just copied and pasted the sql into the SQL transform.

But what's wrong with that? Well plenty!

DS is NOT a glorified scheduler

For starters, you just bought a not-inexpensive ETL tool that you are now just using as a query scheduler. If all you want to do is automate your queries then write them into a  stored procedure and schedule it on the database. Don't waste your money on a world class ETL tool.

You've just broken Impact and Lineage analysis

One of the biggest selling points of using a tool like DS, is you can visually see where a field is used in your ETL and where its data ultimately ends up. Likewise, you can see where a particular piece of data came from by tracing it all the way back to the source system.

This is a fantastic governance tool for managing your warehouse. You can see if, and where, making a change to a source system will affect your data warehouse.

The second you put a SQL transform in the way, you've just broken your impact and lineage analysis. DS treats it as a black box, so whether you are using the lineage in the management console, Data Services Designer View Where Used function, or in Information Steward, you have just rendered it useless.

In the Datastore below it appears is if none of the tables are being used in any jobs.


But then after a little searching I find an SQL transform.


I look inside it and what do I find?


And this query is hardly complex. It wouldn't have taken very long to develop as a proper data flow with the Query transform, and now some future developer could make changes to the job thinking that they have everything covered, but not realize that 3 extra tables were hidden inside an SQL transform.

I can't tell you how many times I've needed to make a change, right clicked on the table in the data store and chosen View Were Used, made all my changes, only to later discover I've missed one that a developer hid in a SQL transform.

Squinting at code

One of the great things about using a GUI based ETL tool is that you can open a data flow and immediately get an idea of what it is doing. You can see the tables on the screen, see which transforms they flow through and understand what is happening to the data and where it is going. With a SQL transform you have to open it up, squint at the code to try and figure out what it is up to.

For simple SQL that's not a big deal, but a complicated query with plenty of tables and joins.... well now you're wasting my time, and my client's money too!

Should you really never use the SQL transform?

I worked in a company where they had a ban on using the SQL transform. Turn in your code with one in it, and it got returned to you to have it rewritten without the SQL transform. No exceptions.

I will admit, there are times when you will just have to use the SQL transform, but these should be rare exceptions. Sometimes you need to do something very complex, that's maybe only available on the database, so you have to use the SQL transform to take advantage of it.

Before you do it though, think really hard about whether you couldn't achieve the same thing using Data Services built in transforms. Even if you have to split it out over multiple data flows, it will still be better than using the SQL transform.