Hive, SQL on Hadoop

In a previous post, I discussed the difficulty to use Hadoop with its Big Data structure. One must write two different Java programmes. One programme is a so-called mapping programme; another is the reduce programme.

The problem one has with Java coding can be avoided with Pig. This is a relative simple scripting tool that provides a user means to access a Big Data environment.
However, is can be simpler. Hive is a tool that allows using SQL on a Big Data structure. This is really super, sexy. One has a (more or less) standard set of SQL statement that be used to undertake the analysis. As an example, I will show how an external table can be created that be analysed.
I will start two terminal sessions that access the Big Data environment. The first session is used to manipulate the hadoop file structure:
terminal.
The second session is started from the command line. It can be started from the command line with the command „hive“. The provides the user a (CLI) interface to hive. hive.

The first step is to create a file within the first session. On the standard linux environment, a file is created that contain data that will be stored in a table. This file looks like:

[infauser@sandbox ~]$ cat wwdata2
10001	42	07                                                                      
10020	42	07                                                                      
10031	42	08                                                                      
10011	42	09                                                                     
10051	42	09 
[infauser@sandbox ~]$ 

I then switch to the second session to create the metadata for an external table. I created an external table on the hive command line interface:

hive> create external table wwdataEXT3(voorraad INT, year INT, lokatie STRING)          
    > row format delimited                                                 
    > fields terminated by '\t'
    > stored as textfile
    > location '/data/wwdata3';
OK
Time taken: 0.059 seconds

At this moment, we cannot access the data. We do have a reference to a file on the HDFS environment (‚/data/wwdata3‘), but this does not yet contain data. We only have dataset on the Linux file system. We must copy the data from Linux filesystem to the HDFS environment. This copy command can be given in the first session. This can be done with:

[infauser@sandbox ~]$ hadoop dfs -put /infauser/wwdata2 /data/wwdata3
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

We may check as to whether the actions are successfull when a command is fired. This can done from the second session. From the hive prompt we start a SQL statement:

hive> select * from wwdataEXT3;
OK
10001	42	07                                                                      
10020	42	07                                                                      
10031	42	08                                                                      
10011	42	09                                                                     
10051	42	09 
Time taken: 0.203 seconds, Fetched: 5 row(s)

The concept is that the SQL statements are translated into a series of Java programmes that will be executed. This can be seen when the output is investigated. Within the output, terms like „mapreduce“ can be clearly seen.