A script to create a sample data warehouse – part 2 the fact

I have created an example data warehouse with just one dimension table and one fact table. We have discussed how the dimension table must be loaded. Once the dimension table is loaded, we may start loading the fact table.

Let us see how that is done.

The fact table looks like:

create table devADW_BAN_MI.factTARGET
(PI DECIMAL(15,0) GENERATED  BY DEFAULT AS IDENTITY
           (START WITH 1
            INCREMENT BY 1
            MINVALUE 1
           NO CYCLE
            ),
 SK decimal(15,0),
 SBLrowid integer,
 getal integer,
 updat  timestamp(0) ,
 VERSION_SDATE timestamp(0)  ,
 version_edate  timestamp(0)   );

Here, we have a SBLrowid that provides us with an indicator on the event. This indicator will be used as a business key of the event. This enable us to differentiate between a renewed recording of an event that we already know of and a comnpletely new event. Whenever a new recording of an event is received, we would like to label the former recording as being obsolete. Likewise, the newest recording should be marked as a new recording with times attached as from what point in time the recording is valid.
The periods in which the data of a record reflect an actual situation is indicated by version_sdate and version_edate. The version_sdate stands for the start of the validity period whereas the version_edate stands for the end of the validity period.
If a record is received, it is assumed that version_sdate can be filled out with a sysdate to indicate a start of the validity period. Likewise, if the record is superseded, the point in time at which the record is superseded is included as “version_edate”.
I have included “getal” as an attribute that describes the features of an event. Also this event is assumed to be described by an attribute that stems from the dimension table. Therefore a reference to that dimension table is included. This reference is done via the use of the surrogate key of the dimension table. This provides us with all details from the dimension table that are linked to an event.

We also include an attribute “updat”. This indicates the time when a record is created. This attribute allows us to select a subset of the input records. Records that were processed earlier can then be skipped.

Much like the loading of the dimension, a seperate table is created that contains source records. Let us assume that at a certain point of time, we have these records available in the source:

	BK	SBLrowid	getal	updat
1	PIP	3	25	10/3/2012 15:08:29
2	BKO	2	1,200	10/3/2012 15:08:29
3	BKO	1	1,000	10/3/2012 15:08:29
4	XYZ	4	400	10/3/2012 15:08:30
5	TBO	2	1,300	10/3/2012 15:11:13
6	BKO	5	95	10/3/2012 15:11:14
7	XYZ	4	400	10/3/2012 15:11:14

Let us assume that the first 4 records are already loaded in the target fact table. That leaves 3 records to be loaded. These three records represent three different situations.

The first situation are records that are updates on an event that is reported earlier. In this case, the event is indicated by SBLrowid. We notice that for SBLrowid=2, we received a record with [BKO, 1200] whereas an update is sent with [TBO, 1300]. To handle this, we first create an intermediate table that only contains updates on events that are not yet processed in the data warehouse.

--temp tabel maken
drop table devADW_BAN_MI.temp;
create table devADW_BAN_MI.temp
( SK decimal(15,0),
 SBLrowid integer,
 getal integer,
 updat  timestamp(0) 
);
insert into  devADW_BAN_MI.temp 
  select A.SK,A.SBLrowid,A.getal,A.updat
  from (select SK, SBLrowid, getal, updat  
        from devADW_BAN_MI.factSOURCE
        left join devADW_BAN_MI.dimTARGET 
        on factSOURCE.BK = dimTARGET.BK
        where version_edate > CURRENT_DATE
        and (updat > (select max(updat) 
             from devADW_BAN_MI.factTARGET)
             )
        ) A, 
  (select SK, SBLrowid, getal, updat  
   from devADW_BAN_MI.factTARGET 
   where version_edate > CURRENT_DATE) B
where A.SBLrowid=B.SBLrowid 
   and 
(A.SK||A.getal != B.SK||B.getal);

These records are subsequently used in two different actions. The first action is to update the records in the target table for which an update is received

update devADW_BAN_MI.factTARGET
set version_edate = CURRENT_TIME 
where  (SBLrowid  in 
(select SBLrowid from devADW_BAN_MI.temp))
 and version_edate > CURRENT_DATE;

The second action is to insert the receords with the newest version of the events:

insert into devADW_BAN_MI.factTARGET
(SK,SBLrowid, getal,updat,version_sdate,version_edate) 
  select SK,
  SBLrowid, 
  getal,
  updat, 
  CURRENT_time,  
  cast ('99991231000000' as TIMESTAMP(0) FORMAT 'yyyymmddhhmiss') 
 from devADW_BAN_MI.temp
 ;

The second situation is a situation that records on events are received from the source that are not yet processed into the target table. This can be inserted by:
```
--nieuwe records
insert into devADW_BAN_MI.factTARGET
(SK,SBLrowid, getal,updat,version_sdate,version_edate) 
 select 
  B.SK,
  A.SBLrowid, 
  A.getal,
  A.updat, 
  current_time, 
  cast ('99991231000000' as TIMESTAMP(0) FORMAT 'yyyymmddhhmiss')
 from 
  (select BK, SBLrowid, getal, updat  
   from devADW_BAN_MI.factSOURCE 
   where  SBLrowid not in 
   (select SBLrowid from devADW_BAN_MI.factTARGET)) A 
   left join 
   (select BK, SK from devADW_BAN_MI.dimTARGET 
   where version_edate > CURRENT_DATE) B
   on A.BK=B.BK
  ;
```
This insert statement has two characteristics. The first characteristic is to select records on events from the source that are not yet included in the target table. The second characteristic is to look up teh surrogate key from the dimensional table with the business key that is delivered. From the source, we get a business key on items that we have stored in the dimension table. This is subsequently translated into a surrogate key with this dimension table. We need that surrogate key to identify exactly one record within the dimension table in which the correct versions of the dimension is stored.
The third situation is whereby records are sent from the source on events that are already stored in the target table. The content is also exactly equal between source and target. In that case, no action needs to be taken.

Breaking

A script to create a sample data warehouse – part 2 the fact

Door tom

Gerelateerd bericht

Je miste

Flask and JSON

A webserver from the command line

Use the node.js server as restful app server

Reading a CSV file and translate into dataframe

A script to create a sample data warehouse – part 2 the fact

Door tom

Gerelateerd bericht

Oracle create loader file

Useful Scala Programme

Joining files by Pig

Je miste

Flask and JSON

A webserver from the command line

Use the node.js server as restful app server

Reading a CSV file and translate into dataframe