Pig: yet another approach to handling big data

In another post, I discussed how Java can be used to analyse data in a Big Data environment. The problem then lies with Java itself. Java is not a tool for the faint-hearted; it is difficult. Moreover, one must comply with a fixed structure in which two programmes have to be written: a map programme and a reduce programme. These programmes communicate via key-value pairs. This structure might be too strict for the problem at hand.

Hence, Big Data development is difficult if Java is the vehicle for the analysis.
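To make that map/reduce structure concrete, here is a minimal sketch in plain Python (not Java, and not using Hadoop) of how a mapper and a reducer communicate through key-value pairs. The names mapper, reducer and run_job are hypothetical and only serve this illustration; run_job is a toy stand-in for the shuffle step the framework normally performs. The records are the sample rows used further down in this post.

```python
from collections import defaultdict

# Sample records as (voorraad, year, lokatie), taken from the dataset in this post.
RECORDS = [
    (10001, 42, 7),
    (10020, 42, 7),
    (10031, 42, 8),
    (10011, 42, 8),
    (10051, 42, 9),
]


def mapper(record):
    """Map step: emit one (key, value) pair per input record."""
    voorraad, year, lokatie = record
    yield (lokatie, 1)


def reducer(key, values):
    """Reduce step: collapse all values that share a key into one result."""
    return (key, sum(values))


def run_job(records):
    """Toy stand-in for the framework: group mapper output by key, then reduce."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            shuffled[key].append(value)
    # Sorting by key only makes the result reproducible in this sketch.
    return [reducer(key, values) for key, values in sorted(shuffled.items())]


COUNTS = run_job(RECORDS)
print(COUNTS)  # [(7, 2), (8, 2), (9, 1)]
```

Even for a simple count per location, the problem has to be forced into the shape "emit a key and a value, then combine the values per key". That is the constraint referred to above.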

Pig addresses these issues. This tool offers two advantages: it provides a relatively simple language, and it removes the need to express everything as key-value pairs.

The language is relatively simple to learn.

Let me show a simple programme that helped me to understand what Pig is all about. I used this dataset:

10001	42	07
10020	42	07
10031	42	08
10011	42	08
10051	42	09

The programme is as follows:

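A = LOAD '/infauser/ww-ii-data.txt' USING PigStorage('\t') AS (voorraad:int, year:int, lokatie:int);
describe A;  -- print the schema of A
X = GROUP A BY lokatie;  -- one tuple per distinct lokatie
describe X;  -- print the nested schema of X
B = FOREACH X GENERATE group AS lokatie, COUNT(A.voorraad) AS voorraad;  -- count records per lokatie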

The first statement reads the records from the flat file. It loads a relation A whose tuples have three fields: voorraad, year and lokatie. The second line (describe) verifies that structure. Its output is A: {voorraad: int,year: int,lokatie: int}, so A has the three fields that were expected.
As a next step, the set of tuples is grouped by lokatie. The result can be seen in the output of the subsequent describe. This shows:

X: {group: int,A: {(voorraad: int,year: int,lokatie: int)}}
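To make this nested schema tangible, the following is a small Python emulation of the GROUP and FOREACH steps on the five sample rows. It is only a sketch of the shape of the data, not of how Pig executes the script; the function names group_by_lokatie and count_voorraad are hypothetical. Pig does not guarantee the order of the groups, so the sorting here is purely for reproducibility, and the null check reflects my understanding that COUNT skips null values (the sample contains none).

```python
from collections import defaultdict

# Relation A: tuples of (voorraad, year, lokatie), as loaded from the sample file.
A = [
    (10001, 42, 7),
    (10020, 42, 7),
    (10031, 42, 8),
    (10011, 42, 8),
    (10051, 42, 9),
]


def group_by_lokatie(rows):
    """Mimic GROUP A BY lokatie: one (group, bag) pair per distinct lokatie."""
    bags = defaultdict(list)
    for row in rows:
        bags[row[2]].append(row)  # row[2] is lokatie
    return sorted(bags.items())


def count_voorraad(grouped):
    """Mimic FOREACH X GENERATE group, COUNT(A.voorraad)."""
    result = []
    for group, bag in grouped:
        # Count the voorraad values in the bag, skipping missing ones.
        count = sum(1 for row in bag if row[0] is not None)
        result.append((group, count))
    return result


X = group_by_lokatie(A)
B = count_voorraad(X)

for group, bag in X:
    print(group, bag)
print(B)
```

Each element of X pairs the grouping value with a bag holding the complete original tuples, which is why the original fields are still reachable after the grouping.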

Each tuple in X consists of two levels. On the top level, we have group and A. On the level beneath, inside the bag A, we have A.voorraad, A.year and A.lokatie. This implies that a subsequent step must refer to group and to A.voorraad etc. In the subsequent step the lower level is aggregated via a “COUNT” clause. The final step then shows the results as they are stored in structure B: