Add data on Big Data in Hive and Impala

This post provides info on how data may be added on a Big Data platform with help of Hive and Impala. We start with a dataset that is stored on a Linux platform. We will show how these data can be stored on a HDFS/ Hadoop platform. After that, we will show how these data can be included in a Hive database. Subsequently it is shown how these data can be accessed from a client (like SQL Developer).
But the first step to have data on a Linux platform. Such data can be inserted via FTP or via de text editor like nano or vi. Let us assume we create a file with nano, inserting several comma-seperated records. I created a 5 record-file with two fields:

[cloudera@quickstart ~]$ nano invoer3.txt
[cloudera@quickstart ~]$ cat invoer3.txt
[cloudera@quickstart ~]$

The data are now on a Linux platform. They have to be moved to the HDFS platform. This can be be done with:

cat invoer3.txt|hdfs dfs -put - /user/cloudera/invoer/invoer3.txt

This command streams the data (using cat) and puts that stream into a file on the HDFS platform. To name the data lineage easy, I use the same name for the file as on Linux. In both cases, it is invoer3.txt. However, these are two physically different files. Once they are stored on the hfds platform, I would like to set the privilegs such that all users may access them. I use these commands to accomplish this:

sudo -u hdfs hadoop fs -chmod 777 /user/cloudera/invoer
sudo -u hdfs hadoop fs -chmod 777 /user/cloudera/invoer/invoer3.txt

In subsequent steps, the files will be read by Hive/ Impala which is technically another user. I created privileges which allow other users to read the data and to create other files if necessary.
I will subsequently create a table via Impala, using the HDFS file as an external file:

[cloudera@quickstart ~]$ impala-shell
Starting Impala Shell without Kerberos authentication
Connected to quickstart.cloudera:21000
Server version: impalad version 2.2.0-cdh5 RELEASE (build 2ffd73a4255cefd521362ffe1cfb37463f67f75c)
Welcome to the Impala shell. Press TAB twice to see a list of available commands.

Copyright (c) 2012 Cloudera, Inc. All rights reserved.

(Shell build version: Impala Shell v2.2.0-cdh5 (2ffd73a) built on Tue Apr 21 12:09:21 PDT 2015)
[quickstart.cloudera:21000] > create EXTERNAL TABLE invoer3
id INT,
LOCATION '/user/cloudera/sample_data/invoer3.txt';
Fetched 0 row(s) in 0.08s
[quickstart.cloudera:21000] > 

In Impala, a file is indicated as an external file that provides the data for the table. In Hive, a similar exercise is possible, see:

[cloudera@quickstart ~]$ hive

Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-1.1.0-cdh5.4.0.jar!/
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
    > (
    >    id INT,
    >    naam STRING
    > )
    > LOCATION '/user/cloudera/invoer';
Time taken: 0.772 seconds

However, in that case, a directory is used to indicate where the data come from.

A slightly differnt way is to use the load command that loads data directly into a Hive default database. Let us assume, we have on Linux a file that is called invoer4.txt and which reads:

[cloudera@quickstart ~]$ cat invoer4.txt
2	tom
3	ine
[cloudera@quickstart ~]$ 


This file can be loaded into a Hive database in two steps:
Step1: start Hive and create a table.
Step2: load the data into the Hive table.

hive> create table invoer4(k string, v string)row format delimited fields terminated by '\t'  stored as textfile;
Time taken: 0.785 seconds
hive> load data local inpath '/home/cloudera/invoer4.txt'  into table invoer4;
Loading data to table default.invoer4
Table default.invoer4 stats: [numFiles=1, totalSize=12]
Time taken: 0.603 seconds

In that case, the data are stored in a table – not as an external file as above.
This can be assessed if we see on the hdfs platform a file like: /user/hive/warehouse/invoer4/invoer4.txt. This is a copy that results from load command as given by hive.

As the data are now defined in Hive/ Impala, it is possible to access these data with a client that has a Hive connector. An example is the Oracle SQL Developer, that has a Hive connector. Once installed, one may access this file, see:
In this case, I used ordinary SQL to access the table and to fetch the records.

I realise, I use Hive and Impala as synonymns to each other, where they are in reality two different tools. Hive uses the map reduce logic to derive the results, while Impala is able to circumvent this bottleneck by their own deamons. As a result, Impala returns the results faster than Hive. Impala has been designed from its start as a better and faster tool than Hive. Hive (like Pig) is build on Map Reduce. Impala is completely new and is not built on Map Reduce. More information can be found in a wonderful article from Cloudera on Hive and Impala.