Monday, April 15, 2013

Job Control Table - Part 2 - Auditing

So you may have noticed I called my job control table SYS_AUDIT in the previous post, as opposed to JOB_RUN or JOB_NUMBER.

The other names would be fine, and might make more sense today, but the name relates back to why I first created this kind of table.


There was a requirement to know which rows were loaded by each run so that we could trace any data issue back to a particular job run. Perhaps there was a data issue in the source system at load time, or a particular load had an error and we needed to clear things out.


In order to do this, I created the SYS_AUDIT table, and then I stamped the AUDIT_ID against every row loaded by the job. 
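
A minimal version of such a table might look something like this (just a sketch; your real table will probably carry more columns, such as a status flag or an end time):

Create table SYS_AUDIT (
    AUDIT_ID            integer primary key,  -- one id per job run
    JOB_NAME            varchar(128),         -- populated from $G_JobName
    EFFECTIVE_DATETIME  datetime              -- populated from $G_EffectiveDateAndTime
)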


In my previous post I created a new row in the SYS_AUDIT table and captured its audit ID in a global variable called $G_AuditID, like this:


$G_AuditID = NewAuditRecord($G_JobName, $G_EffectiveDateAndTime);
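
NewAuditRecord is a custom function from that post; I won't repeat it here, but roughly speaking the work it does amounts to the SQL below. This is only a sketch: 'MY_JOB' and the timestamp stand in for $G_JobName and $G_EffectiveDateAndTime, and exactly how the new id gets generated will depend on your database.

Insert into SYS_AUDIT (AUDIT_ID, JOB_NAME, EFFECTIVE_DATETIME)
Select coalesce(max(AUDIT_ID), 0) + 1, 'MY_JOB', '2013-04-15 06:00:00'
from SYS_AUDIT

The function then hands the new AUDIT_ID back, and that is what lands in $G_AuditID. In practice a database sequence or identity column is a safer way to generate the id than max() + 1, especially if two jobs can start at once.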


I then use that variable in my dataflows to assign an audit ID to every row. To do that, I add a field called AUDIT_ID to every target table and map $G_AuditID to it.
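
On the database side that is just a one-off schema change per table, something along these lines (MY_TARGET_TABLE is a placeholder; repeat for every table the job loads):

Alter table MY_TARGET_TABLE add AUDIT_ID integer

After that, the mapping for the AUDIT_ID column in each dataflow is simply $G_AuditID.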


Now if I need to get just the rows that were updated by a particular load, I can find them with some pretty simple SQL.


Select * from TABLE where AUDIT_ID = 32


Also, if a load has just inserted a whole bunch of dodgy data, I can get rid of it.


Delete from TABLE where AUDIT_ID = 32

Comments:

  1. Hi Sean,

    I have a scenario where 4 source tables (datastore1) in one dataflow are loaded to 4 different stage tables in parallel, like:
    Source1 --> Query Transform1 --> Stage1,
    Source2 --> Query Transform2 --> Stage2,
    Source3 --> Query Transform3 --> Stage3 &
    Source4 --> Query Transform4 --> Stage4

    In the 2nd dataflow, the data from these stage tables is loaded to target tables (datastore2) in parallel, like:
    Stage1 --> Qry Transform1 --> Target1,
    Stage2 --> Qry Transform2 --> Target2,
    Stage3 --> Qry Transform3 --> Target3 &
    Stage4 --> Qry Transform4 --> Target4

    Now, I want to get the Source Table Name, Source Table Row Count, Source Extracted Datetime, Target Table Name, Target Table Row Count, Target Load Datetime in a separate database table (say ROW_COUNT_TBL).

    Can you please suggest how to achieve this?

  2. Just feed all the tables you want to count the rows for into queries where you have fields for count, table name and sysdate, with ROW_COUNT_TBL as the target.

    BTW I don't recommend putting so many parallel flows into a single data flow; it's much better to split your flows out into one data flow per target table.

  3. Thanks for your response, Sean,

    But the scenario is such that the source flows are in parallel.
    As I understand it, using a SQL transform as the source for ROW_COUNT_TBL, I can get the source and target table row counts with simple SQL like 'select count(*) from Source1' and 'select count(*) from Target1', and map these to the Source Table Row Count and Target Table Row Count fields. But since I have 4 source tables in a single dataflow and 4 target tables in another single dataflow, how can I get the Source Table Name, Source Table Extracted Datetime, Target Table Name and Target Table Load Datetime fields?

  4. Like I said previously, don't use a single data flow with multiple parallel flows; use multiple data flows instead.

    Don't use the SQL transform either; it's very bad practice.

    Just do the counts, and hard-code the table name and sysdate for the rest of your fields (see the sketch below). You are overcomplicating something that is really very simple to do.

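    As a rough illustration: in Data Services this is just Query transform mappings rather than a SQL transform, but each little dataflow ends up producing a row equivalent to the SQL below (assuming ROW_COUNT_TBL has columns along these lines; the names are placeholders):

    Select 'SOURCE1' as TABLE_NAME, count(*) as ROW_COUNT, sysdate as COUNT_DATETIME
    from SOURCE1

    One such dataflow per table, each with ROW_COUNT_TBL as the target, gives you all the fields you asked for.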