A theory of data warehousing

This overview gives you a practical introduction to the creation of a data warehouse. It assumes that the reader has basic knowledge of data warehouse systems. Hence, the purpose of a data warehouse is not discussed here; instead we focus on how the data warehouse is built.

In a series of contributions we develop a data warehouse around a realistic application. We first discuss this application from a technical point of view. We give you the so-called DDL, the actual code from which the tables can be created. We then show what happens when a user wants to do something tangible, like adding an order, by investigating which records are inserted as a result of that action.
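To make this concrete, here is a minimal sketch of the idea, assuming a hypothetical order schema (the table and column names are illustrative, not the DDL developed in the series; SQLite stands in for the application database):

```python
import sqlite3

# Hypothetical DDL for a simple order application.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        order_date  TEXT    NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE order_lines (
        order_id INTEGER NOT NULL REFERENCES orders(order_id),
        product  TEXT    NOT NULL,
        quantity INTEGER NOT NULL
    )
""")

# A tangible user action - adding an order with one order line -
# results in one record in each table.
conn.execute("INSERT INTO orders VALUES (1, 42, '2024-01-15')")
conn.execute("INSERT INTO order_lines VALUES (1, 'widget', 3)")

print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])       # 1
print(conn.execute("SELECT COUNT(*) FROM order_lines").fetchone()[0])  # 1
```

Tracing a user action down to the inserted records in this way is exactly the exercise we repeat, on the real schema, in the contributions that follow.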

We then continue by sketching an architecture in which a data warehouse can be built on top of such a business application. This architecture describes the functionalities that are needed to create a separate business intelligence environment. One functionality is the storage of history: information from the applications is collected and stored. Another is a separate environment where business analysts can do their analysis work without disturbing other business processes. A third is support for the creation of reports that are meant for business analysis. Hence we need an architecture that combines these three functions (history, separation and report creation) into one physical environment.

Once the architecture is ready, we can continue by filling in the building blocks that constitute it. Each building block can be described in terms of its purpose, input and output.

REASONS TO START A DATA WAREHOUSE

I would like to start the discussion of the architecture of a business intelligence environment with the question of why such an environment is created.

The first raison d’être is to create an environment where all relevant history of an organization is stored. Most organizations would like to have an environment where all interesting copies of records are stored in one common framework. This already started about twenty years ago, when so-called information warehouses were created. I have seen one organization where regular copies of databases were collected in one large store, the idea being that analysis could be undertaken on these copies. To facilitate that analysis, copies of different databases were brought together in this one storage room.

The advantage of this approach is threefold:

[1]: The collection of regular copies creates a situation where the dismissal of old records in a business application does not affect the storage of historical records in the organization, as these records are already included in the business intelligence environment. Let us assume for this discussion that daily copies of records are sent to the business intelligence environment. If the business application decides to delete records that are older than 3 months, we still have the copies of the older records within the business intelligence environment.
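This retention argument can be demonstrated with a small sketch, assuming daily copies have already been loaded; the table names (source_orders, bi_orders) and the 3-month cut-off date are illustrative:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# The business application table and the business intelligence copy.
db.execute("CREATE TABLE source_orders (order_id INTEGER, order_date TEXT)")
db.execute("CREATE TABLE bi_orders (snapshot_date TEXT, order_id INTEGER, order_date TEXT)")

# Three records, each already copied to the BI table on its own day.
for oid, day in [(1, '2024-01-01'), (2, '2024-03-01'), (3, '2024-06-01')]:
    db.execute("INSERT INTO source_orders VALUES (?, ?)", (oid, day))
    db.execute("INSERT INTO bi_orders VALUES (?, ?, ?)", (day, oid, day))

# The business application purges records older than 3 months
# (here: everything before 2024-04-01).
db.execute("DELETE FROM source_orders WHERE order_date < '2024-04-01'")

# The source has lost two records; the BI environment still holds all three.
print(db.execute("SELECT COUNT(*) FROM source_orders").fetchone()[0])  # 1
print(db.execute("SELECT COUNT(*) FROM bi_orders").fetchone()[0])      # 3
```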

[2]: The logic to extract the copies from the business applications and to store them in a separate environment is relatively simple. If a copy is extracted and sent to the business intelligence environment every day, the logic looks like this:

· Select the records within the business application that belong to a certain date (today or yesterday). This could be done with a SQL statement that looks like: select * from relevant_table where date = today.

· Store these records in files. In an Oracle environment this could be implemented with the spool command of the sqlplus client application. We start with a command like “spool <filename>”; subsequently all output is stored in a file with the name <filename>. If we use a SQL command like select <attribute> ||’;’|| <attribute> from relevant_table where date = today, the output is stored as a CSV file. We finish the collection of output with the command “spool off”.

· Send these files to the business intelligence environment. This could be done with ftp, a service that allows us to send a file from one server to another. If the file with data from the business application is stored on server A and the business intelligence environment resides on server B, we can use ftp to send the file from A to B.

· Store the file in the business intelligence environment. We could store the copy of the data in a table in the database that constitutes the business intelligence environment, for example with a SQL statement that looks like: insert into business_intelligence_table values (today, Attribute1, Attribute2, …).
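The four steps above can be sketched end-to-end as follows. This is a minimal sketch under some stated assumptions: SQLite stands in for both the source and the business intelligence databases, a local file copy stands in for the ftp transfer, and all table and column names are illustrative.

```python
import csv
import os
import shutil
import sqlite3
import tempfile

today = '2024-01-15'
workdir = tempfile.mkdtemp()

# Source database holding records for several dates.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE relevant_table (order_date TEXT, attr1 TEXT, attr2 TEXT)")
src.executemany("INSERT INTO relevant_table VALUES (?, ?, ?)",
                [('2024-01-14', 'a', 'b'), (today, 'c', 'd'), (today, 'e', 'f')])

# Steps 1 and 2: select today's records and spool them to a semicolon-separated
# CSV file (the equivalent of "spool <filename>" ... "spool off" in sqlplus).
extract = os.path.join(workdir, "extract.csv")
with open(extract, "w", newline="") as f:
    writer = csv.writer(f, delimiter=";")
    for row in src.execute("SELECT * FROM relevant_table WHERE order_date = ?", (today,)):
        writer.writerow(row)

# Step 3: send the file to the BI server; shutil.copy stands in for ftp here.
landed = os.path.join(workdir, "landed.csv")
shutil.copy(extract, landed)

# Step 4: load the file into the business intelligence table,
# prefixed with the snapshot date.
bi = sqlite3.connect(":memory:")
bi.execute("CREATE TABLE business_intelligence_table "
           "(snapshot_date TEXT, attr1 TEXT, attr2 TEXT)")
with open(landed, newline="") as f:
    for order_date, attr1, attr2 in csv.reader(f, delimiter=";"):
        bi.execute("INSERT INTO business_intelligence_table VALUES (?, ?, ?)",
                   (order_date, attr1, attr2))

# Only the two records dated 'today' were extracted and loaded.
print(bi.execute("SELECT COUNT(*) FROM business_intelligence_table").fetchone()[0])  # 2
```

Run daily, this loop accumulates one dated snapshot per day in the business intelligence table, which is precisely the history-keeping behaviour described under [1].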


[3]: We have a separate environment for business intelligence, where analysts can do their work without disturbing other business processes.