In the last week, I have written a sample script to create a skeleton data warehouse. I will use this as a head start for future work where the skeleton will be used to implement an actual data warehouse.
The skeleton is written for a Teradata DBMS. This is the platform that I currently use.
The example data warehouse has two tables: a dimension table and a fact table. Hence we have a small example that can be expanded in the future.
Let us start with the dimension table.
See also the script with ExampleDW
The dimension table is created with:
drop table devADW_BAN_MI.dimTARGET; create table devADW_BAN_MI.dimTARGET (SK DECIMAL(15,0) GENERATED BY DEFAULT AS IDENTITY (START WITH 1 INCREMENT BY 1 MINVALUE 1 NO CYCLE ), BK varchar(3), extra varchar(256), VERSION_SDATE timestamp(0) , version_edate timestamp(0) );
This dimension has a surrogate key (SK) that is loaded with the DBMS facility. This leads to the situation that every record is given an attribute that can be used as a primary key. This SK is the primary key. It uniquely identifies a record within a table.
The table also has a business key (BK). This is something by which the business identifies an object (say a product, a contract or a customer). If we have a business key, we are able to know whether we have an earlier version of the object. If we get a new record on an object that we already have in the table, we may act accordingly.
In our case, we assume that we want to act in the following way whenever a new record is found on an object that we already have in the data warehouse.
- The record is added to the table.
- The record that describes the old state of the object is labelled as ‘old version’ .
In the table we have an attribute “extra”, that acts as an example of a descriptive attribute. It can be a colour of the object, the size, the owner etc.
We also encounter an attribute “version_sdate” which indicates the starting point of validity of the record. We assume that an object generates over time a series of records. Once the record is sent as the newest version of the object, it is stored in the data warehouse. We use the attribute version_sdate as the starting point in time after which the record is seen as the valid version of the object.
Likewise, we have a “version_edate” that indicates the finishing point in time. If a record is stored in the data warehouse and if a new version is included in the data warehouse, the old record is updated with version_edate equals finishing point in time.
A source table is created with:
create table devADW_BAN_MI.dimSOURCE (BK varchar(3), extra varchar(256));
This represents the source of data. This source contains two attributes: a business key (BK) that identifies an object and an additional attribute (extra) that contains a description of the object.
Let us then load the source:
insert into devADW_BAN_MI.dimSOURCE(BK, extra) values ('BKO','BKO extra'); insert into devADW_BAN_MI.dimSOURCE(BK, extra) values ('TBO','TBO extra'); insert into devADW_BAN_MI.dimSOURCE(BK, extra) values ('SRM','SRM extra'); insert into devADW_BAN_MI.dimSOURCE(BK, extra) values ('XYZ','XYZ extra');
Ok. we now have a source that is loaded and a dimension table that is still empty. This can be initially loaded with:
insert into devADW_BAN_MI.dimTARGET(BK,extra,version_sdate,version_edate) select BK,extra, current_time, cast ('99991231000000' as TIMESTAMP(0) FORMAT 'yyyymmddhhmiss') from devADW_BAN_MI.dimSOURCE
This gives:
SK BK extra VERSION_SDATE version_edate 1 100,001 XYZ XYZ extra 10/2/2012 16:29:15 12/31/9999 00:00:00 2 300,001 BKO BKO extra 10/2/2012 16:29:15 12/31/9999 00:00:00 3 1 SRM SRM extra 10/2/2012 16:29:15 12/31/9999 00:00:00 4 200,001 TBO TBO extra 10/2/2012 16:29:15 12/31/9999 00:00:00
Let us continue with a new dataload for the dimesnion table:
delete from devADW_BAN_MI.dimSOURCE; insert into devADW_BAN_MI.dimSOURCE(BK, extra) values ('SRM','SRM nog meer'); insert into devADW_BAN_MI.dimSOURCE(BK, extra) values ('ABC','ABC extra'); insert into devADW_BAN_MI.dimSOURCE(BK, extra) values ('TBO','TBO extra');
We first deleted all records in the source. This is followed by loading three records.
In an incremental load step, we may have three situations:
- the record represents an object that is not yet included in the dimensional table. Whether or not this is the case can be assessed with the business key. If a new business key is encountered, we have this situation. The algorithm is quite simple: do we have a record is the source for which the business key is not yet included in the target dimension table. The sql reads as:
--NEW records in source insert into devADW_BAN_MI.dimTARGET (BK,extra,version_sdate,version_edate) select BK, extra, CURRENT_time, cast ('99991231000000' as TIMESTAMP(0) FORMAT 'yyyymmddhhmiss') from devADW_BAN_MI.dimSOURCE where BK not in (select BK from devADW_BAN_MI.dimTARGET where version_edate > CURRENT_DATE );
create table devADW_BAN_MI.temp (BK varchar(3), extra varchar(256)); insert into devADW_BAN_MI.temp select B.BK, A.extra from (select BK, extra from devADW_BAN_MI.dimSOURCE) A, (select BK, extra from devADW_BAN_MI.dimTARGET where version_edate > CURRENT_DATE) B where A.BK=B.BK and A.extra != B.extra;
Once the records that contain an update are identified, we must take 2 actions:
- Update records in the data warehouse that represent an obsolete version of the object. This update consists of setting version_edate to a date that indicates the end of a validity period.
update devADW_BAN_MI.dimTARGET set version_edate = CURRENT_TIME where (BK in (select BK from devADW_BAN_MI.temp)) and version_edate > CURRENT_DATE;
- Insert records that represent the newest update on the objects.
insert into devADW_BAN_MI.dimTARGET (BK,extra,version_sdate,version_edate) select BK,extra, CURRENT_time, cast ('99991231000000' as TIMESTAMP(0) FORMAT 'yyyymmddhhmiss') from devADW_BAN_MI.temp where (BK in (select BK from devADW_BAN_MI.Temp)) ;