Joining files with Pig

By tom, 30 August 2018

In the previous post, we used Scala to merge two files. The interesting feature there was that Scala bypasses MapReduce. Pig, by contrast, uses MapReduce. If we run the same example in Pig, we will see a serious performance difference.

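-- Load both pipe-delimited input files with an explicit schema.
fooOriginal = LOAD '/user/prut/foo' USING PigStorage('|')
    AS (id:long, foo:long);
barOriginal = LOAD '/user/prut/bar' USING PigStorage('|')
    AS (id:long, bar:long);
-- Inner join on the shared id column.
joinedValues = JOIN fooOriginal BY id, barOriginal BY id;
-- Write the joined records back to HDFS, again pipe-delimited.
STORE joinedValues INTO '/user/pig' USING PigStorage('|');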

The end result is written to HDFS, as one or more part files in the directory /user/pig.
Of course, the results are the same as before. The performance, however, is seriously different, because under the hood Pig uses MapReduce.
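
One way to check this yourself is with Pig's diagnostic operators. The lines below are a minimal sketch, assuming the script above has already been entered in the Grunt shell and that Pig runs in its MapReduce execution mode. DESCRIBE shows the schema of the joined relation, and EXPLAIN prints the plans Pig would execute, without running the job.

```pig
-- Show the schema of the join result. Both id columns are kept,
-- each prefixed with the relation it came from.
DESCRIBE joinedValues;

-- Print the logical, physical, and MapReduce plans for the join.
-- This only displays the plans; no job is submitted.
EXPLAIN joinedValues;
```

The MapReduce section of the EXPLAIN output should show the map and reduce stages generated for the JOIN, which is where the extra run time comes from.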