Python: another language to access Big Data

In an earlier post, I showed how Java could be used to access Big Data. I also stated that I had many problems with Java itself, and I noted that I was not the only one to have issues with it. A much easier language is Python. It is really easy to learn and it can be used in more or less the same situation as the Java programme.

The mapper step is implemented with a Python programme that contains:

import sys

for line in sys.stdin:
  val = line.strip()
  (year, temp) = (val[7:9], val[36:39])
  try:
    int(year)
    int(temp)
    print "%s\t%s" % (year, temp)
  except ValueError:
    # skip header lines and empty or non-numeric values
    pass

This part retrieves a year and a temperature from the data stream. It checks whether the values are actually years and temperatures (the lines may contain header values or empty fields).
The core of the summarising reduce job is:

   (last_key, max_val) = (key, max(max_val, int(val)))
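
Only this core line is shown here; a minimal sketch of a complete reducer around it could look as follows (the tab-separated key/value input and the variable names are assumptions based on the mapper output):

import sys

(last_key, max_val) = (None, -sys.maxint)

for line in sys.stdin:
  (key, val) = line.strip().split("\t")
  if last_key and last_key != key:
    # a new year starts: emit the maximum of the previous year
    print "%s\t%s" % (last_key, max_val)
    (last_key, max_val) = (key, int(val))
  else:
    (last_key, max_val) = (key, max(max_val, int(val)))

if last_key:
  print "%s\t%s" % (last_key, max_val)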

These jobs are executed with:

hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar \
 -input /user/ww-ii-data.txt \
 -output /user/output31 \
 -mapper /home/hduser/max_temperature_map.py \
 -reducer /home/hduser/max_temperature_reduce.py

The first part of this command turns the file that is stored in Hadoop into a stream. The stream is split over different servers. On each server the lines are inspected to retrieve the year and temperature; this is done with the programme given in the -mapper option. Subsequently, the results are merged with the programme given in the -reducer option.
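
Because Hadoop Streaming simply feeds lines through stdin and stdout, the two scripts can also be tested locally with a plain Unix pipe before submitting the job. The sketch below assumes a local copy of the input file in the current directory and the script paths used above:

cat ww-ii-data.txt | python /home/hduser/max_temperature_map.py \
 | sort | python /home/hduser/max_temperature_reduce.py

The sort in the middle mimics the shuffle phase that Hadoop performs between the map and reduce steps.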
The reason to use Python instead of Java is Python's user friendliness. It took me seven days to get the Java programme running, whereas the Python programme only took me a few hours to write. My knowledge of Java is very limited; the same holds for Python. As Python is much easier to use, the time gain was enormous. This matches other experiences in which Python was much preferred over Java.

By tom