Hadoop: my first Java programme

Today, I created a Java programme to get acquainted with Hadoop. I started from an existing Java programme, which can be found at https://github.com/tomwhite/hadoop-book/blob/master/ch02/src/main/java/OldMaxTemperature.java, and tweaked it to fit my own situation.

The idea is as follows. I have a very long file of temperatures, and I want to retrieve the maximum temperature per year. The concept of Big Data is to split this task over different servers. Each server takes a chunk of the file and extracts the relevant temperatures from it. After that, the results are combined to calculate the maximum temperature per year.
Within the Java programme, the first step looks as follows:

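	// Characters 7-8 of the line hold the two-digit year
	String year = line.substring(7, 9);
	// Characters 36-38 hold the temperature as text
	String tussen = line.substring(36, 39).trim();
	// isInteger is a helper whose definition is not shown here
	if (isInteger(tussen)) {
		int airTemperature = Integer.parseInt(tussen);
		output.collect(new Text(year), new IntWritable(airTemperature));
	}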

This step selects a year and a temperature from each line within its chunk of the file. The year and temperature are then written as a key-value pair to the next stage. The logic in this step is executed at line level. Such logic can run in parallel because it does not depend on other lines or other parts of the file.
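The line-level logic above can be tried outside Hadoop. What follows is a minimal sketch in plain Java, not the original mapper. The class MapStepSketch, the methods mapLine and sampleLine, and the sample lines are all made up for illustration. The isInteger shown here is only an assumed implementation, since the real helper is not shown in this post.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.Map;
import java.util.Optional;

class MapStepSketch {

    // Assumed implementation: the real isInteger is not shown in the post.
    static boolean isInteger(String s) {
        try {
            Integer.parseInt(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // Same column positions as the excerpt: year at 7-8, temperature at 36-38.
    // Returns an empty Optional when the line is too short or the field is not a number.
    static Optional<Map.Entry<String, Integer>> mapLine(String line) {
        if (line.length() < 39) {
            return Optional.empty();
        }
        String year = line.substring(7, 9);
        String tussen = line.substring(36, 39).trim();
        if (!isInteger(tussen)) {
            return Optional.empty();
        }
        return Optional.of(new SimpleEntry<String, Integer>(year, Integer.parseInt(tussen)));
    }

    // Hypothetical helper that builds a 40-character test line with the
    // year and temperature placed at the columns the map step reads.
    static String sampleLine(String year, String temp) {
        char[] filler = new char[40];
        Arrays.fill(filler, '-');
        StringBuilder sb = new StringBuilder(new String(filler));
        sb.replace(7, 9, year);                          // expects a two-character year
        sb.replace(36, 39, String.format("%3s", temp));  // right-aligned in three characters
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(mapLine(sampleLine("42", "106"))); // Optional[42=106]
        System.out.println(mapLine(sampleLine("42", "n/a"))); // Optional.empty
    }
}
```

Each call to mapLine looks at one line only, which is why Hadoop is free to hand different lines to different servers.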
The key-value pairs are then sent to the second part of the programme, where the maximum is taken per key. The code in that second part is:

      // Start below any possible temperature, then keep the largest value seen
      int maxValue = Integer.MIN_VALUE;
      while (values.hasNext()) {
        maxValue = Math.max(maxValue, values.next().get());
      }
      // Emit one (year, maximum) pair for this key
      output.collect(key, new IntWritable(maxValue));

The first part is called the map step, which can be run in parallel. The second part is called the reduce step, which collects all intermediate results (the key-value pairs) and combines them.
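The reduce step can be imitated in plain Java without a cluster. The sketch below is an illustration only, using made-up (year, temperature) pairs. The class MaxByYearSketch and the methods pair, reduce and maxPerYear are hypothetical names. The grouping by key that Hadoop performs before the reduce is done here with an ordinary TreeMap.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MaxByYearSketch {

    // Hypothetical convenience method to build one (year, temperature) pair.
    static Map.Entry<String, Integer> pair(String year, int temperature) {
        return new SimpleEntry<String, Integer>(year, temperature);
    }

    // Same loop as the reduce excerpt, applied to the values of a single key.
    static int reduce(Iterator<Integer> values) {
        int maxValue = Integer.MIN_VALUE;
        while (values.hasNext()) {
            maxValue = Math.max(maxValue, values.next());
        }
        return maxValue;
    }

    // Groups the pairs by year (a stand-in for what Hadoop does between
    // the map and the reduce), then reduces each group to its maximum.
    static Map<String, Integer> maxPerYear(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<String, List<Integer>>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<Integer>()).add(p.getValue());
        }
        Map<String, Integer> result = new TreeMap<String, Integer>();
        for (Map.Entry<String, List<Integer>> g : grouped.entrySet()) {
            result.put(g.getKey(), reduce(g.getValue().iterator()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up pairs, as if emitted by the map step on two different chunks.
        List<Map.Entry<String, Integer>> pairs = Arrays.asList(
                pair("40", 71), pair("41", 88),   // chunk 1
                pair("40", 95), pair("41", 64));  // chunk 2
        System.out.println(maxPerYear(pairs)); // {40=95, 41=88}
    }
}
```

The order in which the pairs arrive does not matter for the result, because taking a maximum gives the same answer whichever chunk a value came from.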
The map and reduce tasks are general tasks that the programme implements by extending existing Hadoop classes. Hence one must import these existing Java classes into the programme. This can be seen in the import statements at the beginning of the programme and in the compiler command, which puts the Hadoop jar on the classpath: “javac -classpath /usr/local/hadoop/hadoop-core-1.2.1.jar -d dMaxTemperature OldMaxTemperature1.java”. The result is a mapper class and a reducer class that carry out the two steps. They are packaged into one jar file with “jar -cvf /home/hduser/OldMaxTemperature1.jar -C dMaxTemperature/ .”. One then has a jar file that can be run to carry out the task. This is done with “/usr/local/hadoop/bin/hadoop jar /home/hduser/OldMaxTemperature1.jar OldMaxTemperature1 /user/ww-ii-data5.txt /user/output22”.

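After these steps, the results can be seen in the output directory (/user/output22 in the command above). The first column is the two-digit year and the second is the maximum temperature: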

40	95
41	97
42	106
43	117
44	122
45	121

It took me quite a while to get these results. Much time was lost because my knowledge of Java is a bit flimsy.
I am not the only one who appreciates the possibilities of Java but is hampered by the technical problems. Much effort is being spent on making Big Data easier to use.