Skewness in teradata

Currently, I work on a Teradata machine. I understand that Teradata is based on the idea that processes work in parallel with each other. The idea is that data are divided over so-called amps that each process their share of the data.
If each so-called amp may process their share of data, the amount of data must be equally divided over the amps. If each amp take proportional part of data, each amp has a fair share of work. As the last amp decides on total elapse time, it is advisable to have an equal amount of data allocated to each amp.
To see as to what extent the data are equally divided, one may run this query:

SELECT databasename, tablename, creatorname
,dat_created*1 + 19000000 AS dat_created
,MAX(amps) AS amps
,MAX(R) AS Ratio
,MAX(AvgMB) AS AvgMB
,MAX(MaxMB) AS MaxMB
FROM (
SELECT t.tablename
, t.databasename
, ts.vproc
, ts.currentperm/1000000.0  AmpMB
, t.creatorname 
,CAST (t.createtimestamp AS DATE) AS dat_created
,AVG(ts.currentperm/1000000.0) OVER 
 (PARTITION BY ts.databasename, ts.tablename) AS AvgMB
,MAX(ts.currentperm/1000000.0) OVER 
 (PARTITION BY ts.databasename, ts.tablename) AS MaxMB
,COUNT(*) OVER (PARTITION BY ts.databasename, ts.tablename) AS AMPS
,CAST ( 10000* (maxmb-avgmb)/maxmb AS INTEGER )  AS R
FROM dbc.tablesize ts JOIN  dbc.tables t 
ON ts.databasename = t.databasename AND ts.tablename = t.tablename
WHERE
 t.databasename  IN  ('financial') 
) MY_DISPLACED
GROUP BY 1,2,3,4
ORDER BY Ratio DESC;

The higher, the ratio R, the more skewed the division of data over the amps is.
One influences the division of data with the addition of “primary index(attribute name)” in the ddl that creates the table.
Records that have the same value for the attribute are then allocated to the same amp. This implies that the primary index should have a many different values as possible. Different values implies different amps that get records allocated to.
An example. Table “slecht” is divided over two amps. One amp contains 0.08 MB, the second contains 0.08 MB. This is nicely distributed and the SQL that runs on this table will run smooth. If we look at the values, we notice that slecht has a primary index refers to an attribute that has many different values with no value standing out. If we update the table “slecht”by giving the primary index attribute the same value, we see that one amp gets 0.16 MB whereas the other amp gets no records at all. In that case, the sql will take more time as only one amp must do all work.

Breaking

Door tom

Gerelateerd bericht

Je miste

Flask and JSON

A webserver from the command line

Use the node.js server as restful app server

Reading a CSV file and translate into dataframe