A checklist for the assessment of a data warehouse

 

Yet another checklist

Regularly, one comes across an assignment in which a data warehouse has to be assessed (at least in my line of work). I found a nice checklist that allows a quick overview of a data warehouse. It is composed of several sections, each covering a certain area. Going through this list then provides the reader with a blueprint of how the data warehouse is doing.

 

Requirements

The first area is “requirements”. It is composed of 4 items:

  1. Do we have a concise architecture that shows the main components of the data warehouse? It should show what the major data blocks are and how these data blocks are related to the solution that is required by the business. This architecture can be captured on one A4 sheet that may be used in discussions on the data warehouse.
  2. What are the major security aspects? Who is given access?
  3. What functional requirements do we have? This could show a list of reports that are required.
  4. What non-functional requirements do we have? This covers timing issues, latency etc. Questions are raised on the performance of loading, reporting, extracting etc. Here it may be stated that the data should cover a ten-year period, with the most recent data being the data of last month.

These items must be agreed upon by the business. They need to recognise the requirements, and they should have the idea that these lists offer a comprehensive overview.

 

The sources

Another area is a list of items on the usage of the data warehouse as a source. These items centre on the question of whether the data warehouse is fit for purpose to supply the necessary data.

  1. (If the data warehouse is used as a source), do we have approval from the owner of the data warehouse? Data in the data warehouse can be used for a variety of purposes. One should have a formal approval for this usage: it should not interfere with other processes in the data warehouse.
  2. Is the purpose of the application that uses data from the data warehouse in the realm of reporting and/or planning? A data warehouse is created for reporting purposes. The architecture reflects this. If applications use data from the data warehouse, they should respect that architecture. The purpose of data usage should be in accordance with it. As an example: a real-time transaction system cannot use the data from a data warehouse.
  3. Is the latency in the data warehouse compliant with the reporting requirement? It takes some time before data from a source system are reflected in the reports. The business requirements for the reports' timeliness must be in accordance with these technical constraints.
  4. Does the reporting system need conformed data? A data warehouse is set up to provide a single version of the facts. As an example one may think of a customer that is represented in a single version in the data warehouse, despite the fact that multiple versions can be stored in the source systems. The translation from multiple versions in the source systems to a single version is done in the data warehouse. This is an important purpose of the data warehouse. The reporting requirements should take this mechanism into account. If the requirements do not need such a single version, then one may question whether the reports actually need to use the data warehouse as a source. A different source might be more applicable. As an example: if the business requests an overview of purchases from one business process, one may question whether a data warehouse is the most appropriate source.
  5. The service levels of the reporting systems should be lower than the service levels of the data warehouse. The data within the data warehouse should only be used for purposes that are aligned with the stated service levels of the data warehouse. An example may clarify this. Suppose the data warehouse has a service level agreement that only allows for maintenance during office hours. One may not use the data warehouse as a source for an application that requires 7*24 support (such as an ATM application). Intuitively, it does make sense not to allow a data warehouse to act as a source for on-line banking. This item formally deals with this issue.
  6. Do we have an integrated usage of fact and dimension data from the data warehouse? Most likely, so-called fact data are stored with foreign keys that point to dimensions within the data warehouse. One needs to use both the facts and the dimensions from the data warehouse. Most likely, it is almost impossible to use facts from the data warehouse in combination with dimensions from other sources.
  7. Does the reporting application require only dimensions that are generally used? In some cases, a reporting application requires dimensions that are only maintained in individual source systems. Most likely such local dimensions are not propagated into the data warehouse. It might even be that storage of such localised dimensions is not compliant with the data warehouse architecture. Hence, when these requirements exist, one may question whether the data warehouse is the most appropriate source.
  8. Do the technical means whereby data from the data warehouse are retrieved comply with the general mechanisms employed? One would like to have a single mechanism whereby data are retrieved from the data warehouse. As an example: a general mechanism might be that data are pushed out as flat CSV files (a minimal sketch of such a push is given after this list). If that general mechanism is employed, an individual reporting project should not deviate from it.
  9. Is the role of the data warehouse that of “System Of Record”? As an example, one may think of a data warehouse that has taken on the role of an overall general ledger because it receives data from several accounting systems and is used for financial reporting. In that case the data warehouse is used as a “System Of Record”. To elaborate this item, the definition of “System of Record” is repeated here: “A system of record (SOR) or Source System of Record (SSoR) is .. the authoritative data source for a given data element or piece of information.” (Definition from Wikipedia.) The most important word in this definition is authoritative. If the data get that role, we may have legal requirements that are related to this role.
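To make the “single retrieval mechanism” of item 8 a little more concrete, here is a minimal sketch of pushing data out of the data warehouse as a flat CSV file. It uses sqlite3 only as a stand-in for the actual DBMS, and the table and column names are invented for the example.

```python
import csv
import sqlite3  # stand-in for the actual data warehouse DBMS


def export_to_csv(connection, query, target_path):
    """Push the result of a query out of the data warehouse as a flat CSV file."""
    cursor = connection.execute(query)
    with open(target_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter=";")
        writer.writerow([column[0] for column in cursor.description])  # header row
        writer.writerows(cursor)


# Hypothetical usage with a tiny in-memory table standing in for a real dimension.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (customer_id INTEGER, customer_name TEXT)")
conn.execute("INSERT INTO dim_customer VALUES (1, 'Acme')")
export_to_csv(conn, "SELECT customer_id, customer_name FROM dim_customer", "dim_customer.csv")
```

The point is not the few lines of code, but that every project uses the same, agreed mechanism instead of inventing its own.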

 

Allocations

  1. An approval for allocation must exist. Such approval is necessary to avoid performance issues that may result from allocation processes. The idea is that allocation is used to spread records over lower levels of detail, using other tables in the data warehouse (a minimal example is sketched after this list). This may have effects on performance, capacity etc. An approval from the owner of the data warehouse is therefore necessary.
  2. The allocation process cannot be overly complicated. This is also directed at avoiding performance issues. As the data warehouse is a shared resource, each process must be limited to avoid performance and capacity bottlenecks. This then also holds for allocation processes.
  3. The allocation process must be decoupled from other processes. The idea is that processes in the data warehouse should not depend on processes outside it, so that the data warehouse can take responsibility for the outcomes of its own processes. Hence the allocation process should not rely on tables that are outside the data warehouse.
  4. Is the allocation process a scheduled / repeatable process? The idea is that all processes in the data warehouse should be scheduled and repeatable. No room exists for regular manual interventions as such processes may lead to non-traceable data; this then also holds for the allocations within the data warehouse.
  5. The data warehouse is created to support reporting processes. Hence, the allocated data can only be used for reporting purposes.
  6. It is possible to retrieve allocation factors from an external source. In that case the allocation factors are a normal source that should comply with the normal source requirements.
  7. It is possible to use allocated data from the data warehouse as a source for, say, reporting purposes. In that case the same rules apply as when the data warehouse is used as a source.
  8. It is possible that the allocated data get the role of “System Of Record”. In that case the legal requirements that are related to a “System Of Record” pertain.
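To make the notion of allocation concrete, the sketch below spreads an amount that is recorded at a high level over a lower level of detail, using allocation factors. The department names, the head-count factors and the amount are all invented for the example.

```python
def allocate(amount, factors):
    """Spread an amount over lower-level keys in proportion to the allocation factors."""
    total = sum(factors.values())
    return {key: amount * factor / total for key, factor in factors.items()}


# Hypothetical example: spread overhead costs booked at company level over departments,
# using head count as the allocation factor.
head_count = {"sales": 40, "operations": 50, "finance": 10}
allocated = allocate(100_000, head_count)
print(allocated)  # {'sales': 40000.0, 'operations': 50000.0, 'finance': 10000.0}
```

Even such a simple calculation, run over large fact tables, can consume considerable capacity, which is why the approval and decoupling items above exist.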

 

Data Acquisition

  1. Is the source identified? This refers to the situation that the new data that are necessary for the data warehouse are known and identified, and a source for them is known. The data source must also be investigated: a so-called profile must be created. Such a profile shows which values were encountered in the source.
  2. What is the impact of data capture on the source system? Is such impact in accordance with other service level agreements on this system?
  3. The decision to load new data must be weighed against the possibility of re-using existing data in the data warehouse. It might well be that the data that are required are either already in the data warehouse or already loaded in the staging area. Only if the data are not already in either of these environments is a new acquisition necessary.
  4. Do we recognise the special techniques that are needed to load dimension data? Dimension data are special, as most dimension data most likely exist both in the data warehouse and in the source system. The techniques that must be employed are directed at finding the differences between the source and the existing data in the data warehouse, along with a subsequent update of such data in the data warehouse (a minimal change-detection sketch follows this list).
  5. Do we recognise special techniques that are needed to load fact data? Fact data stem from new events in the source system. Such new events need to be adequately captured and propagated into the data warehouse. It might be that “Change Data Capture” is employed here.
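As a minimal illustration of the change detection mentioned in item 4, the sketch below compares a source extract of a dimension with the rows that are already present in the data warehouse and derives the rows to insert and the rows to update. The business key and the attribute names are invented; a real implementation would typically also handle deletes and slowly changing dimension history.

```python
def detect_dimension_changes(source_rows, warehouse_rows, business_key):
    """Compare a source extract with the dimension as stored in the data warehouse.

    Returns the rows to insert (new business keys) and the rows to update
    (existing business keys whose attributes differ).
    """
    existing = {row[business_key]: row for row in warehouse_rows}
    inserts, updates = [], []
    for row in source_rows:
        current = existing.get(row[business_key])
        if current is None:
            inserts.append(row)
        elif current != row:
            updates.append(row)
    return inserts, updates


# Hypothetical customer dimension: one new customer, one changed address.
source = [
    {"customer_id": 1, "name": "Acme", "city": "Utrecht"},
    {"customer_id": 2, "name": "Globex", "city": "Leiden"},
]
warehouse = [{"customer_id": 1, "name": "Acme", "city": "Delft"}]
inserts, updates = detect_dimension_changes(source, warehouse, "customer_id")
print(inserts)   # the new customer (id 2)
print(updates)   # the changed customer (id 1)
```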

 

ETL

  1. Do we have a dedicated staging area that allows us to store the files that come from the sources? The staging area is an exact copy of the files / data structures that are sent by the sources. It is subsequently employed to update the data warehouse.
  2. Is the “System Of Record” used to retrieve the data? The idea is that data must be retrieved from an authoritative source in order to be used in a data warehouse. As an alternative, data from an exact copy can be used, as long as the data do not get changed between the authoritative source and the copy environment. Doing so, we are sure that we only get reliable data in the data warehouse.
  3. Are the data retrieved via the fastest / best-known loaders? Also here, an example may clarify it. Oracle tables may be loaded via SQL insert statements or by a direct loader such as SQL*Loader. The decision on whether to use one loader or the other should be thought over; this item requests such a decision process: the designer should know the different options to load a table and should make a well-informed decision.
  4. Is a standard naming convention being used in the ETL process? As the mappings, jobs, workflows etc all become part of one comprehensive ETL structure, one recognisable naming convention must be adhered to.
  5. Are standard blocks used for standard operations? As an example: Informatica offers so-called “mapplets” that can be re-used for standard operations.
  6. The specific mappings are project specific. As each project has its own specific logic, a dedicated mapping should exist to capture it.
  7. Are general standards adhered to? As most organisations have a set of standards on ETL, such standards must be applied in the ETL procedures.
  8. Is error handling properly covered?
  9. Is process control in place? (A minimal sketch of both items is given after this list.)
  10. As far as possible, a single tool needs to be used for ETL.
  11. Whenever a choice must be made between processing in the ETL tool and processing in the DBMS, such a choice must be made carefully.
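To make items 8 and 9 a little more tangible, here is a minimal sketch of a wrapper that runs a load step, records start, end and status in a process-control table, and stops the flow when a step fails. The table layout and the function names are invented; ETL tools usually offer this functionality out of the box.

```python
import datetime
import sqlite3

# A minimal process-control table; in practice this would live in the ETL tool
# or in a dedicated audit schema of the data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE process_control "
    "(step TEXT, started TEXT, finished TEXT, status TEXT, message TEXT)"
)


def run_step(name, step):
    """Run one load step, record its outcome in the control table, and re-raise on failure."""
    started = datetime.datetime.now().isoformat()
    status, message = "FAILED", "aborted before completion"
    try:
        step()
        status, message = "OK", ""
    except Exception as exc:  # error handling: record the failure, then stop the flow
        message = str(exc)
        raise
    finally:
        conn.execute(
            "INSERT INTO process_control VALUES (?, ?, ?, ?, ?)",
            (name, started, datetime.datetime.now().isoformat(), status, message),
        )
        conn.commit()


# Hypothetical usage: two steps of a nightly load; the lambdas stand in for real load logic.
run_step("load_staging", lambda: None)
run_step("load_dim_customer", lambda: None)
print(conn.execute("SELECT step, status FROM process_control").fetchall())
```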

 

Data Modelling

  1. Do we have a full mapping of every attribute from the data warehouse model back to the source? In most organisations, we have an Excel document in which each row indicates how an attribute from the source is used to populate a data warehouse attribute.
  2. Do we have a document that explains the data model to the business? As the real data model might be too complicated, we should create a separate data model that leaves out unnecessary details. One might think of one A4 sheet that provides a high-level overview of the data model.
  3. If revisions have occurred on the data model for the data warehouse, are these revisions implemented in the high level overview as well? One may think of the situation that at some point in time, the high level overview and the data model are perfectly aligned. Each change in either of the two documents must later be aligned to the other document.
  4. If the organisation uses a standard data modeling tool, it must be used to capture the data model of the data warehouse. The usage of this tool must be in conformity with best practices.
  5. When the organisation uses other standard documents, such documents must have been used. This allows documentation that is recognisable for future maintenance.
  6. If the organisation has a central modeling agency, such an agency must have been informed and engaged.
  7. If codes are used in the model, such usage must have been aligned with other code usage.
  8. If a logical data model is created alongside a physical data model, such data models must be aligned. In some tools, this alignment is supported. Both Sybase Powerdesigner and CA Erwin have a capability where changes in the logical data model are automatically propagated into the physical data model.
  9. If a project only implements part of the total model, such an update must be aligned with the comprehensive data model. I could imagine that a project starts with an up-to-date copy of the data model. The project then continues by adding new entities/tables with their attributes/fields. When the project is finished, the new elements can be merged back into the comprehensive model.
  10. The data definition language (DDL) that is used to implement the addition of new tables/fields is generated according to standards that are in use in the organisation. It might be that audit tables must be updated upon the introduction of new tables; the generated DDL then needs to reflect this.
  11. A new table needs to be performant; some general rules may enhance performance. An example might be a check on having an index on primary and foreign keys in large tables. However, other checks may be applicable, such as having a primary index in Teradata tables.
  12. Is the standard security on new tables being implemented? I once worked in an organisation where each table had a view. Only the view was accessible to the end user. This was done for security reasons. Hence, in that organisation, a check was done on whether such a view was implemented for any new table (a minimal sketch of this pattern follows this list).
  13. When a datamart is created, the security that is mandatory must be adhered to. An example may clarify this. Assume that an organisation requires that tables in the data warehouse can only be accessed via views. In that case, the tables in a datamart can only be loaded via a view. Such security needs to be adhered to.
  14. The access privileges need to be explicitly mentioned, and these privileges need to be compliant with the general security policies in the organisation.
  15. Other structures such as triggers, stored procedures, database links etc. must be explicitly mentioned and they must be compliant with the organisation's policy.
  16. Naming standards for new tables, attributes, views in the data warehouse are adhered to.
  17. Naming standards for the data mart are adhered to.
  18. If groupings like schemas are used, the grouping standards must be complied with. It might be that dimension tables are positioned in dedicated schemas. The new tables must adhere to these standards.
  19. Derived data can only be used if there is a good reason to have them included. This rule prevents addition of intermediate values that have no value for future usage. This rule is also directed at keeping the data warehouse clean and neat. Also, the derived data must be stored in appropriate locations.
  20. If tables are split, we should properly document such a split. As an example: if a table is split according to regions, the regions should be reflected in a table in the data warehouse.
  21. If Control Tables are used, such Control Tables must be created according to general architecture.
  22. All service accounts that are used to start the ETL flows are identified. No personal account may be used to kick off the ETL flows.
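Items 12 and 13 describe a pattern in which end users never access base tables directly, but only views on top of them. The sketch below generates one view per base table and checks that no table is missing its view. The table names and the v_ naming convention are invented examples of such an organisation-specific standard; sqlite3 again stands in for the actual DBMS.

```python
import sqlite3

# Two invented base tables standing in for data warehouse tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dw_customer (customer_id INTEGER, name TEXT)")
conn.execute("CREATE TABLE dw_sales (sale_id INTEGER, customer_id INTEGER, amount REAL)")

# Generate one view per base table; access would then be granted on the views only.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
for table in tables:
    conn.execute(f"CREATE VIEW v_{table} AS SELECT * FROM {table}")

# The check from item 12: does every base table have its view?
views = {row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'view'")}
missing = [table for table in tables if f"v_{table}" not in views]
print("tables without a view:", missing)  # expected: []
```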

 

Data Mart

The set-up of the Data Mart is quite tool-dependent. The tool determines to a large extent the data model, as quite a few data models are not supported.

Generally speaking, most tools require a star-like model: one fact table with some dimension tables. Other data structures are not supported: a recursive relation is not supported; some tools do not support non-hierarchical tables; some tools do not support snowflakes; some tools are not able to deal with large dimension tables; other tools are not able to handle many dimension tables, whereas yet other tools are not able to handle dimension tables with too many fields.

Hence, many items in any checklist are related to limitations that are put on the data model in the Data Mart (a minimal automated check is sketched after the list below).

  1. Are recursive relations avoided?
  2. Are snowflakes avoided?
  3. Are unbalanced hierarchies avoided?
  4. Is the report related to one fact table only?
  5. Is the report related to dimension tables that do not transgress a certain threshold?
  6. Are the names of related attributes the same (as to avoid query cross-joins)?
  7. The number of dimensions should not be too large (recommendations mention 7 as an ideal number).
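Some of these checks can be automated against a simple description of the mart model. The sketch below verifies two of the items (exactly one fact table per report, a limited number of dimensions) and flags snowflaked dimensions. The model structure, the names and the threshold of seven are assumptions for the example.

```python
def check_mart_model(model, max_dimensions=7):
    """Return a list of checklist violations for a simple star-model description."""
    issues = []
    if len(model["fact_tables"]) != 1:
        issues.append("report should be related to exactly one fact table")
    if len(model["dimensions"]) > max_dimensions:
        issues.append(f"more than {max_dimensions} dimensions")
    for dimension, parent in model.get("snowflaked", {}).items():
        issues.append(f"snowflake: {dimension} hangs off {parent} instead of the fact table")
    return issues


# Hypothetical mart with one snowflaked dimension.
model = {
    "fact_tables": ["fact_sales"],
    "dimensions": ["dim_date", "dim_customer", "dim_product", "dim_store"],
    "snowflaked": {"dim_region": "dim_store"},
}
for issue in check_mart_model(model):
    print(issue)
```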

Other checklist items are related to the choice of where logic is implemented. We might have three separate locations where to put logic:

  • when data are stored in the data mart
  • when data are defined in a cube
  • when the final report is created

The idea is that each area has its advantages and we might use these:

  • when data are stored in the data mart, we might use the ETL tool, which allows us to use the power of the ETL tool or the power of the DBMS,
  • when data are defined in a cube, we might use the summary possibilities of the cube definition
  • when the final report is created, we might use the formatting possibilities of the reporting tool.

This implies that a well-thought-out choice must be made.

 

Tooling and infrastructure

  1. Is the set of tooling compliant with the general architecture? In many organisations, some tools are designated to be used for certain purposes: Informatica is meant for ETL; MicroStrategy for reporting etc. In general, an individual project may not deviate from such a general architecture.
  2. Is the hardware compliant with the general architecture? In many organisations, a certain server is dedicated to web reporting; another is designated for file storage. In general, a project should use such designated components.
  3. In many organisations, a reference architecture is created. This reference architecture indicates which logical components exist and where such components are located. This is quite understandable, as a data warehouse is a shared resource that must be used by different projects. A new project then needs to know where its items must be stored and how such items must be integrated in the existing structure.
  4. If components are decommissioned, such components need to be enumerated. The decommissioning needs to be described and the alternatives need to be indicated.
  5. The design needs to contain a list of external interfaces. All relevant sources need to be indicated on an inventory.
  6. An assessment on future support needs to be given. The main question as to whether the solution can be maintained is addressed here.
  7. Likewise, the costs of maintenance and monitoring need to be assessed.
  8. License costs need to be clear.

 

Security

  1.  Do we have a concise document that shows whether the data are sensitive? This requirement shows in one overview what type of data we have and what data should be accessible to whom.
  2. Do we have a person who has responsibility for security within the project?
  3. When data are retrieved from a third party, is that information stream compliant with the security demands as stated by the organisation? As an example, one might think of FTP usage; if FTP usage is prohibited by an organisation, an individual project should refrain from retrieving data via FTP.
  4. The design document should explicitly refer to the organisation's guidelines on security. This reference should be as explicit as possible. Such a reference demonstrates that an individual project is aware of such guidelines and complies with them.
  5. If sensitive data are used, a design should indicate how the general rules from the organisation on security are implemented in a data warehouse project.
  6. If the general rules on security do not cover all security arrangements within a project, the extension should be described explicitly.
  7. Do we have a mapping document that describes who may access which data?
  8. Which non-user system accounts do we have? Such accounts are needed for ETL flows and they should be described along with their privileges.
  9. Does the organisation check the implementation of security designs?
  10. If exceptions to policy are necessary, are these agreed upon by a central security department within the organisation?
  11. Are all exceptions to the policy explicitly shown in the design?
  12. Batch applications only use system accounts; no user accounts are used for this purpose.

 

Usage of dimensions

  1. Are standard dimensions used whenever possible?
  2. If additional dimensions are needed that are not yet available as standard dimensions, will such additional dimensions be available for shared usage?
  3. Is a path foreseen whereby dimensions created within a project can be promoted to standard dimensions?

 

Data quality

  1. Do we have a data quality environment (tooling/ hard disk space etc.)?
  2. Does the data quality environment have access to the full data volume?
  3. Do the users have access to the data quality tools?
  4. Is it possible to create the data quality reports?
  5. Do we have a set of data quality business rules that must be checked? An example might be that a certain event must always refer to a certain dimension. This can be translated into the rule that a foreign key cannot be NULL. A rule can be created saying that no records may exist that have NULL as the value of the foreign key. Finally, a report can be created that shows the records that violate such a rule (a minimal example follows this list). This item then requires that such a list of business rules exists.
  6. How often are the data quality rules checked?
  7. Has the format of reporting on data quality rules been agreed upon?
  8. Who will be contacted when data quality issues are discovered?
  9. Is the implementation of data quality rules part of a release (or is it part of an on-going data quality programme)?
  10. Are the data quality reports scheduled?
  11. Is every data quality rule related to a source system? After all, an error in the data stems from a certain source. Hence, upon discovery, such a violation must lead to a call to that source in order to improve data quality.
  12. Do we have KPIs that enable us to monitor data quality? As an example, one might think of a KPI that calculates the average number of wrong records. Such a KPI allows us to monitor the development of data quality over time.
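To illustrate items 5 and 12, the sketch below expresses one business rule (“the foreign key to the customer dimension may not be NULL”) as a query, reports the violating records and derives a simple KPI from the result. The table and column names are invented for the example, and sqlite3 again stands in for the actual DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (sale_id INTEGER, customer_key INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?)",
    [(1, 10, 99.0), (2, None, 15.0), (3, 11, 42.0)],
)

# Business rule: every fact record must refer to a customer dimension record.
violations = conn.execute(
    "SELECT sale_id FROM fact_sales WHERE customer_key IS NULL"
).fetchall()
print("records violating the rule:", violations)  # [(2,)]

# A simple KPI: the share of wrong records, to be tracked over time.
total = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
print("share of wrong records:", len(violations) / total)
```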

 

Unattended loading

  1. We need an automatic purge mechanism that moves data once such data pass the retention period. Such a mechanism needs to be automated.
  2. We need an automated process that copies data from the production environment to the development environment. It might well be that only a subset is copied. This can be done to save space.
  3. We need quality tests to see whether the loading process works correctly or not. To this end, we might need to check that we do not have duplicates in the target (due to double loading) or missing records (due to records not being picked up from the source). Such quality checks must be automated.
  4. We need quality checks to see whether totals from sources can be reconciled with totals from targets. Such tests are closely related to the previous item. The previous item counted numbers of records; this item addresses the content of records. This content needs to be reconciled between the different areas (a minimal reconciliation sketch follows this list).
  5. Files that are sent to the Staging Area and that are loaded into the data warehouse are subsequently sent to the archive. This should also be an automated process.
  6. A purging mechanism needs to exist for the files that are captured in the archive. After a predefined period, such files can be purged from the archive. This retention period needs to be parameterised to allow for changes in this policy.
  7. Credentials should not be hard coded. Ideally such credentials are stored as centrally managed parameters. This allows an easy update of such credentials.
  8. If audit logs and/or error logs are created, one needs to have a policy on the retention of such files. It might be that such policies already exist at programme level; if such a policy already exists, each project needs to adhere to it.
  9. All jobs need to be scheduled; the loading process should be an automatic process that only needs manual intervention in case of errors.
  10. The jobs need to be created at a low granular level. Such jobs need to concentrate on modifying one table only. Separate jobs can be combined into a batch to combine actions that involve updates on different tables.
  11. The tables that are loaded need to be checked as to whether their keys are still contributing to good performance.
  12. The schedule needs to be well documented to allow proper maintenance.
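Items 3 and 4 can be automated along the lines of the sketch below, which reconciles record counts and amount totals between a source extract and the loaded target. The record structure, the field name and the tolerance are assumptions for the example.

```python
def reconcile(source_rows, target_rows, amount_field):
    """Compare record counts and amount totals between a source extract and the target."""
    checks = {
        "source_count": len(source_rows),
        "target_count": len(target_rows),
        "source_total": sum(row[amount_field] for row in source_rows),
        "target_total": sum(row[amount_field] for row in target_rows),
    }
    checks["counts_match"] = checks["source_count"] == checks["target_count"]
    checks["totals_match"] = abs(checks["source_total"] - checks["target_total"]) < 1e-6
    return checks


# Hypothetical nightly check: one record was loaded twice into the target.
source = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 20.0}]
target = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 20.0}, {"id": 2, "amount": 20.0}]
print(reconcile(source, target, "amount"))
# counts_match and totals_match are both False -> the double loading should be investigated
```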

 

Standards

The final set of items refers to the standard architectures in an organisation to which a project must adhere. In most organisations, such architecture is translated into a set of documents/checklists. We should have an explicit list in which such documents are mentioned, along with a check as to whether the designs comply with these documents and an accompanying list of deviations from the documents.

In this list, we should also have proof that an authoritative person has signed off on the requirement of compliance.

 

By Tom