SAP Data Services: September 2011

Friday, September 23, 2011

HAVING CLAUSE

So lets say you had a table of sales figures by date and customer, and you are asked to populate a new table that will contain just those customers who have bought more than $10,000 worth of merchandise from the company.

If you were going to do this in SQL it would be pretty straight forward. You would sum up the total sales per customer, and then only return those customers having total sales of more than $10,000.

eg. SELECT customer, sum(salesamount) FROM sales GROUP BY customer HAVING sum(salesamount) > 10000;

Now Data Services has the very useful Query transform that does nearly everything we need, but not quite.

If we have a look at the tabs available to us in the query transform we have a GROUP BY tab, but no HAVING tab.

So how do we get around this?

Well the first step is to create a query the gets the total sales per customer, grouping on customer name, and summing the sales amount.

Then pass the results from that query into a second query that filters down the results based on the results from the first query.

Not exactly rocket science, which is why I wish that SAP would just add the HAVING clause into the query transform. There are two main reasons why having the clause in the transform would be nice:

It would make data flows a bit neater. When you have a massive collection of transforms in a complex data flow, the last thing you need is extra transforms on the screen just to carry out a fairly trivial task.
The HAVING clause does not get pushed down. Take a look at the SQL produced by the above data flow:

Just the group by clause is getting pushed down, so you may end up bringing millions of rows onto the Data Services server, and then only return a handful into your target.

So if any-one from SAP is reading this blog, please give us a HAVING clause.

Wednesday, September 21, 2011

Scientific Notation

So the other day is was busy processing a text file which had a column full of numbers. Everything was going fine until I the numbers changed from this:

234.8909
3.998
4456.88929

to this:

34e-9980
2e253
27e-224

Data Services immediately objected that these weren't proper numbers and refused to continue. So what to do?

Well I decided to write a function that can convert varchars to decimals whether they are in regular or scientific notation. Here it is:

if (index($VarcharNumber,'E',1) IS NULL)
begin
Return to_decimal_ext($VarcharNumber,'.',',',38,12);
end
else
begin
Return to_decimal_ext(word_ext($VarcharNumber,1,'E'),'.',',',38,24) * power(10, to_decimal_ext(word_ext($VarcharNumber,2,'E'),'.',',',38,12));
end

Here's the breakdown of what this function does:

if (index($VarcharNumber,'E',1) IS NULL) - First of all, work out whether this number is in regular or scientific notation. If it a regular number then there is not much to do but convert it from a varchar to a decimal.
to_decimal_ext($VarcharNumber,'.',',',38,12); - This bit converts the varchar to a decimal using the to_decimal_ext function. The last two parameters are the scale and precision of the decimal. You might want to change these in your version of the function, or perhaps turn these into parameters to make the function a bit more flexible.
to_decimal_ext(word_ext($VarcharNumber,1,'E'),'.',',',38,24) * power(10, to_decimal_ext(word_ext($VarcharNumber,2,'E'),'.',',',38,12)); - This is the code that converts the scientific notation varchar into a decimal. I'm going to break this down into its component pieces.

So first of all we need to get the number prior to the e. We do this using the word_ext function. This function is really helpful when you have a string come through that is divided up by set characters. In our case the letter e fulfills just that purpose. So word_ext($VarcharNumber,1,'E') will return the first part of the varchar in the string divided by the letter e.

We then want to multiply the number we have just got by 10 to the power of the number after the letter e. word_ext($VarcharNumber,2,'E') will give us everything after the letter e.

Now that we have the number before the e and the numbers after the e, we can go ahead and work out what our number should look like in standard notation.

Lets say I had the number 23e12 and I wanted to convert it to standard notation. Here is what would happen:

to_decimal_ext(word_ext('23e12',1,'E'),'.',',',38,24) = 23
power(10, to_decimal_ext(word_ext('23e12',2,'E'),'.',',',38,12)) = 10 ^ 12 = 1000000000000
23 * 1000000000000 = 23000000000000

This works just fine for really tiny numbers like 23e-12 as well:

to_decimal_ext(word_ext('23e-12',1,'E'),'.',',',38,24) = 23
power(10, to_decimal_ext(word_ext('23e-12',2,'E'),'.',',',38,12)) = 10 ^ (-12) = 0.0000000000001
23 * 0.0000000000001 = 0.00000000000023

Execute Only Once - How to use components to improve your Data Services job design

One of my standard interview questions that stumps nearly every-one is- Which two objects have the 'Execute Only Once' property, and what does this do?

And of the very few people that get this question right, the next question completely stumps them - Why would you want to set an object to 'Execute Only Once'?

So, the first part of the question is easy to answer, in that data flows and work flows have the execute only once property. Right click on any work flow or data flow and take a look:

So what does this do? Well, pretty much what it says on the box, it only allows an object to execute once within a single run of a job.

So if we set this property to true for the DF_CUSTOMERS_Load data flow, it would only execute once, despite being in the job six times:

I know you would never create a job like this in real life, so when would be a good time to use this property?

Well lets say you have a job that loads 3 fact tables, eg FACT_SALES, FACT_RETURNS and FACT_ENQUIRIES, and each one of these facts shares a common dimension called DIM_CUSTOMER.

Now you could build your job by running all the staging tables, then all the dimension tables and then all the fact tables:

This method of building jobs has 3 disadvantages:

You can't start building any fact until ALL the dimensions have been loaded. It may be that your slowest fact table build only requires the fastest dimension build to complete before it can start, so you are wasting time by making that fact table wait for all the dimensions to be built first before it can run.
Lets say you want to split out the 3 fact tables into separate jobs? You'd have to go through each fact data flow and click 'View Where Used' on each source table to make sure you get all the right dimensions and their associated data flows out of the job and into the new job.
When it comes to debugging, if you have a bug in your fact table, you'll need to run all the staging and dimension tables, even if they aren't needed for the fact you are debugging.

So how should you do it differently and how does the 'Execute only once' property help?

Instead of having 3 big workflows for all the staging, then dimension and then fact builds, I would suggest creating a workflow for each complete table build and calling that workflow a component. Within each component will be two further work flows, one containing all the dependencies for that table ie the components for the dimension tables, and one containing the actual data flows needed to build that table.

Perhaps a few pictures would help.

So first off I'd create 3 work flow components called C_FACT_SALES, C_FACT_RETURNS and C_FACT_ENQUIRIES. My definition of a component is a work flow containing everything that is required to build the table it refers to.

Within each component create two work flows. One of these work flows will contain all the data flows needed to build the table itself, and the other will contain all the components needed to build the tables that need to be built first before the table build can start - usually all the staging and dimension tables needed for that fact table.

So in the above example the work flow WF_FACT_SALES_Dependencies would look like this:

Each of the components above will have the 'Execute only once' option set. So the C_DIM_CUSTOMER component can be in each of the fact table components, but it will only get executed once.

So in this example the data flows to build the FACT_SALES table will be able to run as soon as the 4 component work flows above have completed. We can also now also run the C_FACT_SALES component on its own and get everything we need to build the FACT_SALES table.

I create all my Data Services using this component methodology, taking full advantage of the 'Execute only once' option. It means that my jobs:

Run faster
Allow for a quick view of the dependencies required to build a table
Are easier to split into separate jobs
And are easier to debug as I can just run the offending component, rather than the entire job