
Remove the foreign key constraint

May 18, 2006 on 8:42 pm

It is generally assumed that the foreign key constraint consumes time. In an OLTP environment this overhead is usually acceptable, but in a data warehouse environment it may become a burden, as millions of records might be loaded. If the foreign key constraint has to be checked for every record that is added, the check (although it may cost less than a millisecond per record) can turn a load into a lengthy task.

The check

[Figure: entity-relationship diagram of the tables "feit" and "dimensie", with "feit" holding a foreign key to "dimensie"]

Let us start with the database schema above. The table "dimensie" acts as a reference table that must be checked on every insert of a record into "feit". This is necessary because of the foreign key constraint.
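
The diagram only shows the table names. A minimal sketch of such a schema, with hypothetical column and constraint names, could look like this:

CREATE TABLE dimensie (
  id           NUMBER PRIMARY KEY,
  omschrijving VARCHAR2(50)
);

CREATE TABLE feit (
  id          NUMBER PRIMARY KEY,
  dimensie_id NUMBER,
  bedrag      NUMBER,
  CONSTRAINT fk_feit_dimensie
    FOREIGN KEY (dimensie_id) REFERENCES dimensie (id)
);

With this schema, every insert into feit forces a lookup in dimensie.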

Let us do some exercises with data loads. We use SQL*Loader to add records to a table. If we add a million records to feit with the constraint in place, we see (on average) an elapsed time of 00:01:05.15. If the foreign key constraint is removed, the elapsed time drops to 00:00:46.95. This decrease shows how much time the constraint check actually consumes.
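
A sketch of what such a SQL*Loader load could look like against the hypothetical schema above (the control file and the csv file name are mine, not taken from the original test):

LOAD DATA
INFILE feit.csv
APPEND
INTO TABLE feit
FIELDS TERMINATED BY ';'
(id, dimensie_id, bedrag)

This control file would be started with a command like sqlldr userid=scott/tiger control=feit.ctl.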

It is possible to disable the foreign key constraint with the command ALTER TABLE [table name] DISABLE CONSTRAINT [foreign key]; it can be switched on again later with ALTER TABLE [table name] ENABLE CONSTRAINT [foreign key]. Toggling this constraint off and on may then lead to substantial gains in time.
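
Applied to the schema above (again with the hypothetical constraint name fk_feit_dimensie), a load could be wrapped like this:

ALTER TABLE feit DISABLE CONSTRAINT fk_feit_dimensie;

-- bulk load into feit, for instance via SQL*Loader

ALTER TABLE feit ENABLE CONSTRAINT fk_feit_dimensie;

Be aware that re-enabling the constraint makes Oracle validate all existing rows. If that validation is itself too expensive and the loaded data are trusted, ENABLE NOVALIDATE skips it.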

A bootable USB stick

April 15, 2006 on 9:21 pm

Turn the USB stick into something more than mere file storage: turn it into something that may contain an Operating System. That is what I tried to accomplish.

Why is it nice to have a USB stick with an operating system?

My attempt was to store an Operating System on a USB stick, so as to have something that could be inserted into any machine, with all the applications that I like. The aim was that, at any site, I could have my own system (stored on a USB stick) that could be launched on every modern PC. In that case, I would not be dependent on the client system, which might have limitations that are too stringent. I read that a small Linux system could be stored on such a stick. That would be nice: having your own Firefox or OpenOffice without interference from a system owner.

It works rather straightforwardly: insert a USB stick into the machine, make sure that the machine can boot from the USB stick, and afterwards the Operating System on the stick takes control of the PC.

Most modern PCs allow booting from a USB stick. Whether this is possible can be checked by pressing the F2 or F12 key right after switching on the machine. On most machines, the boot sequence is shown and may be altered. Make sure that the USB stick is at the top. If the USB stick is then inserted, the Operating System on it will be launched.

I read that Feather Linux (hosted at http://featherlinux.berlios.de/) is such a small-sized Linux system. However, the article did not explain how to turn the USB stick into a bootable USB stick.

A standard USB stick is not directly fit for storing an operating system. During boot, it will only be recognized as a bootable device if it is partitioned as such. One can use a small utility from HP to create a bootable USB stick. This utility needs some system files. These can be created by formatting a floppy on an old Windows 98 machine with the old "format a: /s" command. You remember: the old method to create a bootable floppy, which fell into disuse from Windows ME onwards.

[Screenshot of the HP format utility]

A screenshot of this utility is given above: it shows that the system files are taken from an ordinary bootable floppy and stored on the USB stick.

 

From now on, the USB stick is bootable. Now the search for an Operating System starts.

 

Pascal - still going strong

April 7, 2006 on 10:04 pm

A problem
In some cases, ETL tooling does not give you an adequate means to transform the data as you would like. Let us imagine a totally bizarre data manipulation: every element of a matrix is to be multiplied by the sum of its row number and column number, and the results are to be summed. That is difficult to accomplish in Informatica, in SAS, or in any other ETL tool.

How to deal with this
It is possible to write a program in a third-generation language (C, Java, Pascal, Fortran, etc.) that works at element level.
I would advocate using Pascal. It is relatively easy to learn and it has a large user base. It is often used at schools to get a grasp of programming.
Moreover, it is free.
At http://www.freepascal.org/ you can get Free Pascal, an open-source compiler released under the GNU license. Free, and very good. It is largely compatible with Borland's Turbo Pascal version 7, which makes it feel familiar.
Let us first take a look at the bizarre data manipulation. It could be solved by the following Pascal program:

program aapje;

procedure complex;
type
  laag = record
    a: longint;    { weight: row number plus column number }
    b: single;     { value read from the input file }
  end;
  trij = array[1..5] of laag;
  prij = ^trij;
  tmat = array[1..5] of prij;
  pmat = ^tmat;
var
  mijnmat: pmat;
  i, j: longint;
  bijhoud: single;
  bestand, uitvoer: text;
begin
  new(mijnmat);
  for i := 1 to 5 do
    new(mijnmat^[i]);
  assign(bestand, 'c:\flip.txt');
  reset(bestand);
  for i := 1 to 5 do
    for j := 1 to 5 do
    begin
      mijnmat^[i]^[j].a := i + j;          { sum of row and column number }
      readln(bestand, mijnmat^[i]^[j].b);  { one value per line }
    end;
  bijhoud := 0;
  for i := 1 to 5 do
    for j := 1 to 5 do
      bijhoud := bijhoud + mijnmat^[i]^[j].a * mijnmat^[i]^[j].b;
  writeln('totaal= ', bijhoud);
  close(bestand);
  assign(uitvoer, 'c:\flipuit.txt');
  rewrite(uitvoer);
  writeln(uitvoer, bijhoud);
  close(uitvoer);
end;

begin
  complex;
end.

After compiling the program (for instance with fpc aapje.pas), the executable can be stored as a command in either Informatica or ETL Studio from SAS. Both ETL suites have workflow designers that enable you to include batch programs in the workflow.

Storing data in a DBMS

March 18, 2006 on 11:31 pm


General

There are many ways to store data in a DBMS. One can choose between different DBMS; one can also choose between different ways to connect to a DBMS. It makes a dramatic difference which DBMS you choose; it also makes a serious difference how the connection to the DBMS is made.

I have demonstrated this by uploading 2 million records into three DBMS (Oracle, Teradata and SQL Server). Moreover, within Oracle I have tried three means of connecting. I will show that the differences are enormous. This demonstrates that it is important to think carefully about how to upload records.

The test: ODBC

I have established an ODBC connection to three different DBMS: Teradata, Oracle and SQL Server. As client, I have used SAS. From the SAS client, I created three ODBC connections. Subsequently, I loaded 2 million records into each of the three DBMS. The results:

DBMS         Elapsed time (h:mm:ss.hh)
Oracle       34:02.72
Teradata     1:02:21.84 (stopped when 663968 records were loaded)
SQL Server   5:07.70

The differences are impressive: from slightly over 5 minutes in the case of SQL Server to over an hour in the case of Teradata. Note that in the latter case, not even half of the records were loaded.

It shows that the choice of DBMS is important: it may make the difference between a totally unworkable situation and a smooth load.

Oracle: ODBC, native and SQL*Loader

It is possible to establish different connections to Oracle. I have experimented with 3 different connections:

  1. an ODBC connection, which is a nice solution since it offers a common API to different DBMS;
  2. a native Oracle connection, which is nice since it is the most direct client-server connection;
  3. SQL*Loader, which is a bulk loading mechanism. Below, I elaborate on this somewhat further.

In three tests, I have loaded 2 million records through these three mechanisms: the results ranged from about one and a half minutes (SQL*Loader) to over half an hour (ODBC). This shows that it is worthwhile to experiment a bit with the connection mechanism once you have decided to use a certain DBMS.

DBMS                Elapsed time (h:mm:ss.hh)
Oracle ODBC         34:02.72
Oracle native       4:01.77
Oracle SQL*Loader   1:39.36

Oracle SQL*Loader

SQL*Loader is the appropriate connection for bulk loading in Oracle. It is not trivial, however: you need to create two files on the server.

  1. A control file that describes the data to be loaded. In my case, I used the following control file:

     LOAD DATA
     INFILE weg.csv
     INTO TABLE ff2
     FIELDS TERMINATED BY ';'
     (i, s)

  2. A data file (in my case the file weg.csv) that contains the data. The data are written in a text file with one line per record. The lines contain the attributes in the sequence given in the control file. In the control file, I have indicated that each record contains the attributes "i" and "s". The control file also indicates that the data are loaded into an existing table FF2. Finally, the table FF2 must be empty: if it already contains records, an error is generated (a control-file variant that avoids this is sketched below this list).
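
As an aside, SQL*Loader can also empty the table itself. A hedged sketch of such a control file, using the TRUNCATE load method instead of the default INSERT (which demands an empty table):

LOAD DATA
INFILE weg.csv
TRUNCATE
INTO TABLE ff2
FIELDS TERMINATED BY ';'
(i, s)

With this variant, a separate truncate step would not be needed; below, I use a separate batch job instead.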

The files were uploaded via FTP to the server. Then I created two batch jobs on the server: one to truncate FF2 and a second to load FF2. The first job looks like this:

@echo off
rem -----------------------------------------------------------------
rem Filename:   RunSql.bat
rem Purpose:    Run SQL*Plus script from DOS batch file
rem Date:       05-Mar-2005
rem Author:     Tom Van Maanen
rem -----------------------------------------------------------------
rem -- Accept command line arguments --
rem Note: %1 is the first command line argument, %2 the second, etc.
if "%1" == "" goto Usage
set TNAME=%1
echo Command Line Argument: %TNAME%
echo ff2 truncate >looprun
rem -- Create sql script --
rem Use double '%' when a real '%' character is needed
echo connect scott/tiger@tom8 >%0.tmp
echo truncate table %TNAME%; >>%0.tmp
echo commit; >>%0.tmp
echo exit; >>%0.tmp

rem -- Run sql script --
sqlplus /nolog @%0.tmp
goto End

:Usage
echo Usage: %0 TABLE_NAME

:End

 

This batch job generates a small SQL*Plus script that connects to Oracle and truncates FF2. It also creates a file (looprun) that I use in the second batch job, which loads the data via SQL*Loader:

IF NOT EXIST LOOPRUN GOTO END
SQLLDR USERID=SCOTT/TIGER CONTROL=WEG.CTL>F.TXT
DEL LOOPRUN
:END

This second job loads the data from the file weg.csv into table FF2. It only does so if the file looprun exists; this prevents double loading.

Now we have two files on the server and two batch jobs. The batch jobs are launched by a scheduler at fixed intervals. Thanks to the check on the file looprun, double loading is prevented.
 

Get data from Oracle

March 18, 2006 on 10:00 pm

There exist many ways to get data out of Oracle. After all, this is the core functionality of Oracle: getting data in and out. A nice possibility is to retrieve data by means of a small script. Such a script is straightforward:

set echo off
set space 0
set pagesize 0
set termout off
set linesize 500
set colsep ''
set recsep off
--set recsepchar ''
set head off
set feedback off
set verify off
set flush off
spool c:/onzin.txt
select i||';'||s from test;
spool off

The two most important lines are "spool", which designates the file that will receive the data, and the SQL query ("select ..."). Such a script can be executed from SQL*Plus with @"filename", where "filename" is the file that holds the script. In SQL*Plus, this looks like:

SQL*Plus: Release 10.1.0.2.0 - Production on Thu May 26 22:27:54 2005
Copyright (c) 1982, 2004, Oracle. All rights reserved.
Connected to:
Oracle9i Release 9.2.0.1.0 - Production
JServer Release 9.2.0.1.0 - Production
SQL> @ “D:\My Documents\Desktop\ora1.txt”
SQL>

When I go to the directory c:\, I see the file "onzin.txt", which contains the data:

1;fu
2;bar
3;baz
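
These lines match the select above: the values of i and s, separated by ';'. For reference, a minimal, purely hypothetical table that would produce exactly this output:

create table test (i number, s varchar2(20));
insert into test values (1, 'fu');
insert into test values (2, 'bar');
insert into test values (3, 'baz');
commit;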

I like this approach: you only need SQL*Plus as client application, which is available in most circumstances. Moreover, it is fast: it depends on the network, but it is possible to retrieve millions of records within a reasonable amount of time. Finally, this approach is repeatable, since the script can be stored in a file.

Final remark
It is possible to execute this approach in a scheduled environment. Suppose we have a script (say a.bat) that looks like this:

connect scott/tiger@tom8
@ “D:\My Documents\Desktop\ora1.txt”
exit

This script can be executed by a scheduler with the command "sqlplus /nolog @a.bat". Note that, despite the .bat name, the file contains SQL*Plus commands and is interpreted by SQL*Plus, not by DOS. Without human interference, SQL*Plus is started and, within SQL*Plus, a script is executed that retrieves the data.
