A script to create a sample data warehouse – part 1 the dimension

In the last week, I have written a sample script to create a skeleton data warehouse. I will use this as a head start for future work where the skeleton will be used to implement an actual data warehouse.

The skeleton is written for a Teradata DBMS. This is the platform that I currently use.

The example data warehouse has two tables: a dimension table and a fact table. Hence we have a small example that can be expanded in the future.

Let us start with the dimension table.
See also the script with ExampleDW
The dimension table is created with:

drop table devADW_BAN_MI.dimTARGET;

create table devADW_BAN_MI.dimTARGET
(SK DECIMAL(15,0) GENERATED  BY DEFAULT AS IDENTITY
           (START WITH 1
            INCREMENT BY 1
            MINVALUE 1
           NO CYCLE
            ),
 BK varchar(3),
 extra varchar(256),
 VERSION_SDATE timestamp(0)  ,
 version_edate  timestamp(0)   );

This dimension has a surrogate key (SK) that is loaded with the DBMS facility. This leads to the situation that every record is given an attribute that can be used as a primary key. This SK is the primary key. It uniquely identifies a record within a table.
The table also has a business key (BK). This is something by which the business identifies an object (say a product, a contract or a customer). If we have a business key, we are able to know whether we have an earlier version of the object. If we get a new record on an object that we already have in the table, we may act accordingly.
In our case, we assume that we want to act in the following way whenever a new record is found on an object that we already have in the data warehouse.

  • The record is added to the table.
  • The record that describes the old state of the object is labelled as ‘old version’ .

In the table we have an attribute “extra”, that acts as an example of a descriptive attribute. It can be a colour of the object, the size, the owner etc.

We also encounter an attribute “version_sdate”  which indicates the starting point of validity of the record. We assume that an object generates over time a series of records. Once the record is sent as the newest version of the object, it is stored in the data warehouse. We use the attribute version_sdate as the starting point in time after which the record is seen as the valid version of the object.

 Likewise, we have a “version_edate”  that indicates the finishing point in time. If a record is stored in the data warehouse and if a new version is included in the data warehouse, the old record is updated with version_edate equals finishing point in time.

A source table is created with:

create table devADW_BAN_MI.dimSOURCE
(BK varchar(3),
 extra varchar(256));

This represents the source of data. This source contains two attributes: a business key (BK) that identifies an object and an additional attribute (extra) that contains a description of the object.

Let us then load the source:

insert into devADW_BAN_MI.dimSOURCE(BK, extra) values ('BKO','BKO extra');
insert into devADW_BAN_MI.dimSOURCE(BK, extra) values ('TBO','TBO extra');
insert into devADW_BAN_MI.dimSOURCE(BK, extra) values ('SRM','SRM extra');
insert into devADW_BAN_MI.dimSOURCE(BK, extra) values ('XYZ','XYZ extra');

Ok. we now have a source that is loaded and a dimension table that is still empty. This can be initially loaded with:

insert into devADW_BAN_MI.dimTARGET(BK,extra,version_sdate,version_edate)
select BK,extra, current_time, cast ('99991231000000' as TIMESTAMP(0) FORMAT 'yyyymmddhhmiss')  
from devADW_BAN_MI.dimSOURCE

This gives:

	
          SK  	BK	extra	 VERSION_SDATE	 version_edate
1	100,001	XYZ	XYZ extra 10/2/2012 16:29:15 12/31/9999 00:00:00
2	300,001	BKO	BKO extra 10/2/2012 16:29:15 12/31/9999 00:00:00
3	1	SRM	SRM extra 10/2/2012 16:29:15 12/31/9999 00:00:00
4	200,001	TBO	TBO extra 10/2/2012 16:29:15 12/31/9999 00:00:00

Let us continue with a new dataload for the dimesnion table:

delete  from devADW_BAN_MI.dimSOURCE;
insert into devADW_BAN_MI.dimSOURCE(BK, extra) values ('SRM','SRM nog meer');
insert into devADW_BAN_MI.dimSOURCE(BK, extra) values ('ABC','ABC extra');
insert into devADW_BAN_MI.dimSOURCE(BK, extra) values ('TBO','TBO extra');

We first deleted all records in the source. This is followed by loading three records.

In an incremental load step, we may have three situations:

  • the record represents an object that is not yet included in the dimensional table. Whether or not this is the case can be assessed with the business key. If a new business key is encountered, we have this situation. The algorithm is quite simple: do we have a record is the source for which the business key is not yet included in the target dimension table. The sql reads as:
--NEW records in source
insert into devADW_BAN_MI.dimTARGET
 (BK,extra,version_sdate,version_edate)
 select 
  BK,
  extra, 
  CURRENT_time, 
  cast ('99991231000000' as TIMESTAMP(0) 
     FORMAT 'yyyymmddhhmiss') 
 from devADW_BAN_MI.dimSOURCE 
  where BK not in 
  (select BK 
   from devADW_BAN_MI.dimTARGET 
   where version_edate > CURRENT_DATE
  );
  •  the record in the source is included exactly in the target table. In our case, we have a combination (BK, extra) that may already be present in the target table. In that case we have no reason to apply any change in the target table: the data are already present in the target. We may go on looking for records in the source that may need action.
  • the record in the source table represents an update on an object that is already present in the data warehouse. In that case we have a business key (BK) that is already present in the dimensional table. We also have a situation whereby the content in additional attributes are different from what we have in the dimensional table. In this case “extra” in the source is not equal to “extra”  in the target table. We identify these records as:
    create table devADW_BAN_MI.temp
    (BK varchar(3),
     extra varchar(256));
    insert into  devADW_BAN_MI.temp 
    select B.BK, A.extra
    from 
    (select BK, extra from devADW_BAN_MI.dimSOURCE) A,
    (select BK, extra from devADW_BAN_MI.dimTARGET 
      where version_edate > CURRENT_DATE) B
    where A.BK=B.BK and A.extra != B.extra;
  • Once the records that contain an update are identified, we must take 2 actions:

    1. Update records in the data warehouse that represent an obsolete version of the object. This update consists of setting version_edate to a date that indicates the end of a validity period.
      update devADW_BAN_MI.dimTARGET
      set version_edate = CURRENT_TIME 
      where  
      (BK  in (select BK from devADW_BAN_MI.temp))  
      and 
      version_edate > CURRENT_DATE; 
    2. Insert records that represent the newest update on the objects.
      insert into devADW_BAN_MI.dimTARGET
      (BK,extra,version_sdate,version_edate)
      select BK,extra, CURRENT_time,  
      cast ('99991231000000' as TIMESTAMP(0) 
         FORMAT 'yyyymmddhhmiss')  
      from devADW_BAN_MI.temp
      where (BK  in (select BK from devADW_BAN_MI.Temp)) ;