Recently, I revisited Pig. Pig is a language that allows you to analyse data sets in a Hadoop environment. It was created at Yahoo to circumvent the technicalities of writing a MapReduce job in Java. Yahoo claims that most of its queries on a Hadoop platform can be replaced by a Pig script. As Pig is much easier to write, this yields significant time savings.
After some study, I realised that two concepts are central in Pig: the tuple, which corresponds roughly to a line, and the bag, which stands for a set of tuples within such a line.
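In Python terms (an illustrative analogy of my own, not Pig syntax), a tuple can be pictured as a Python tuple of fields and a bag as a collection of such tuples:

```python
# A Pig tuple: an ordered set of fields, roughly one input line.
row = (1, "tom")

# A Pig bag: a collection of tuples, e.g. all tuples that share
# a grouping key after a GROUP BY.
bag = [(1, "tom"), (2, "ine"), (3, "paula")]

# A grouped relation pairs each group key with its bag.
grouped = (1, bag)

print(grouped)
```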
Let us analyse a Pig script. The script reads as:
A = load '/user/gpadmin/testtom/data-00000' USING PigStorage(',') as (id:int,name:chararray);
B = foreach A generate id, name, 1 as een;
C = group B by een;
D = foreach C generate group, AVG(B.id) as gem;
dump D;
The first line reads a dataset. It generates a set of tuples, where each tuple has two fields: id and name. One could think of it as a set of lines, where each line contains an id and a name.
(1,tom) (2,ine) (3,paula) (4,stella) (5,bart)
The second line goes through the set of tuples from A and adds to each tuple a field een that has one value: 1. Hence each tuple has three fields: id, name and een. Again, we have different lines; each line contains an id, a name and een. We have:
(1,tom,1) (2,ine,1) (3,paula,1) (4,stella,1) (5,bart,1)
The third line groups the tuples by the value of een. For each distinct value of een, a new tuple is created that contains two parts: the group value itself and a bag that collects all source tuples with that value. Here, een=1 is the only group value, so we get one tuple whose bag contains all five source tuples. Conceptually, the grouped relation looks like:
(1,{(1,tom,1),(2,ine,1),(3,paula,1),(4,stella,1),(5,bart,1)})
We may then apply an aggregate function to that bag. This is done in the fourth line: for each value of een, the average of id within the bag is calculated.
The end result is:
2015-10-04 21:34:59,868 [main] INFO org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2015-10-04 21:34:59,872 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2015-10-04 21:34:59,872 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(1,3.0)
grunt>
The end result is 1 (the grouping value) and 3.0 (the average of id within the bag).
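The whole first script can be mimicked in plain Python (an illustrative sketch; the data and names follow the script above):

```python
# Tuples from relation A: (id, name)
A = [(1, "tom"), (2, "ine"), (3, "paula"), (4, "stella"), (5, "bart")]

# B = foreach A generate id, name, 1 as een
B = [(id_, name, 1) for id_, name in A]

# C = group B by een: one bag per distinct value of een
C = {}
for t in B:
    C.setdefault(t[2], []).append(t)

# D = foreach C generate group, AVG(B.id) as gem
D = [(een, sum(t[0] for t in bag) / len(bag)) for een, bag in C.items()]

print(D)  # [(1, 3.0)]
```

Since every tuple has een=1, there is a single bag, and the average of the ids 1..5 is 3.0 — exactly the (1,3.0) that Pig dumps.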
Another Pig program:
A = load '/user/gpadmin/name.txt' USING PigStorage(',') as (name:chararray,age:int);
B = filter A by age > 15;
C = load '/user/gpadmin/drink.txt' USING PigStorage(',') as (klant:chararray,drank:chararray);
D = join B by name, C by klant;
E = foreach D generate name, drank, 1 as een;
F = group E by name;
G = foreach F generate group, COUNT(E.een) as totaal;
dump G;
This program reads two files from an HDFS platform and filters the first one on age. It then inner joins the two relations on name and klant. Grouping E by name creates one bag per name; the fields inside such a bag can be referenced as E.name, E.drank and E.een. Finally, COUNT(E.een) counts the tuples in each bag, yielding a total per name.
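As before, the logic can be sketched in Python. The sample data here is made up, since the contents of name.txt and drink.txt are not shown:

```python
# Hypothetical file contents (the real files are not shown in the post).
names = [("tom", 20), ("ine", 14), ("paula", 30)]              # name.txt: (name, age)
drinks = [("tom", "beer"), ("tom", "wine"), ("paula", "tea")]  # drink.txt: (klant, drank)

# B = filter A by age > 15
B = [(name, age) for name, age in names if age > 15]

# D = join B by name, C by klant (inner join)
D = [(name, drank) for name, _ in B
     for klant, drank in drinks if klant == name]

# F = group E by name; G = COUNT(E.een) per bag
G = {}
for name, drank in D:
    G[name] = G.get(name, 0) + 1

print(sorted(G.items()))  # [('paula', 1), ('tom', 2)]
```

With this sample data, ine (age 14) is filtered out, tom matches two drink records and paula one, so the counts per name are 2 and 1.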