In this little note, I want to show three different ways to create a table in Hive. The first one starts with a file that is already available on HDFS; on top of this file we create a table. The table is defined as an external table that exposes the content of the file.
The code to be executed in Hive is:
CREATE EXTERNAL TABLE fam (nummer SMALLINT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/user/hdfs';
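If we want to verify the definition, DESCRIBE FORMATTED shows, among other things, the table type (EXTERNAL_TABLE) and the location:

-- inspect the table type and the storage location of the fam table
DESCRIBE FORMATTED fam;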
And we may see the content of the table from, say, SQL Developer:
From the clause "LOCATION '/user/hdfs'", we know that the files that deliver the data are stored in the directory /user/hdfs. If we want to add data to the table, we only have to copy files to that location. Let us start with two files:
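A minimal sketch of how two such files could be created and copied to HDFS; the file names fam1 and fam2 and their content are assumptions made here for illustration:

# create two small comma-separated files that match the (nummer, name) layout
echo "1,jan" > /tmp/fam1
echo "2,piet" > /tmp/fam2
# copy them into the directory that backs the external table
hadoop dfs -put /tmp/fam1 /user/hdfs
hadoop dfs -put /tmp/fam2 /user/hdfs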
We then copy an additional file to this environment ( hadoop dfs -put /tmp/fam3 /user/hdfs ). This shows:
When we then start the Hive environment ( beeline -u jdbc:hive2://192.168.2.60:10000 -n training -p training ), we may issue the command select * from fam; . This returns the content of the fam table, including the records from the newly added fam3 file. This shows clearly how easy it is to add data to a Hadoop platform: the only thing we have to do is add files to the correct directory, and these files are automatically recognised as part of the table.
An alternative way of adding data to a table is to move data within the HDFS environment. Suppose we have a file on HDFS that contains data that should be added to the table. First, we start Hive:
beeline -u jdbc:hive2://192.168.2.60:10000 -n training -p training
I then moved data into the table:
LOAD DATA INPATH '/user/training/fam4' INTO TABLE fam;
I then saw the file being added to the table's directory on HDFS:
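Note that LOAD DATA INPATH moves the file rather than copying it: after the statement, fam4 is gone from /user/training and lives in the directory behind the fam table. A sketch of how to confirm this:

# the source directory no longer contains fam4
hadoop dfs -ls /user/training
# the file is now in the location of the fam table
hadoop dfs -ls /user/hdfs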
Another way is a direct import from, say, Oracle, which creates a file on HDFS on the fly and adds the table definition to Hive. The code is:
sqoop import \
--connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=192.168.2.2)(port=1521))(connect_data=(service_name=orcl)))" \
--username scott --password binvegni \
--table fam \
--columns "NUMMER, NAAM" \
--m 1 \
--fields-terminated-by ',' \
--hive-import
This creates a file on HDFS:
and a table in Hive that can be seen from, say, Excel:
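To see what Sqoop has registered in the metastore, one can connect with beeline again; a sketch (with --hive-import the Hive table name defaults to the name of the source table, unless --hive-table is given):

-- list the tables that are now known to the metastore
show tables;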
A slightly different script is the following. It loads data from Oracle into a file on HDFS:
sqoop import \
--connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=192.168.2.2)(port=1521))(connect_data=(service_name=orcl)))" \
--table COUNTRIES \
--columns "NAME, LOCAL" \
--num-mappers 1 \
--compression-codec=snappy \
--username ANONYMOUS \
--password ine13758 \
--warehouse-dir=/user/hive/warehouse
with the following code in Hive as a subsequent step:
CREATE EXTERNAL TABLE COUNTRIES ( NAME STRING, LOCAL STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/user/hive/warehouse/COUNTRIES';
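The result can be verified on both levels. The file that Sqoop wrote under the warehouse directory:

# list the data file created by the import
hadoop dfs -ls /user/hive/warehouse/COUNTRIES

and, from beeline, the external table that was defined on top of it:

select * from COUNTRIES;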
It is good to realise that we must undertake two actions (see the sketch after this list):
1: an action to store the data (in a file that is stored in an HDFS directory);
2: an action to create a table definition that is stored in the metastore.
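The separation becomes visible when the external table is dropped: only the definition in the metastore disappears, while the files in the HDFS directory remain untouched. A sketch:

-- removes only the metastore definition of the external table
DROP TABLE fam;

# the data files are still present and can be exposed again with a new CREATE EXTERNAL TABLE
hadoop dfs -ls /user/hdfs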
A third way is to use HCatalog:
hcat -e "CREATE TABLE prut \
(id INT, company STRING) \
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' \
LOCATION '/user/prut'"
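The hcat client only takes care of the definition in the metastore; data is added in the same way as before, by placing files in the table's location, and the table is queried from beeline. A sketch, in which the file name prut1 is an assumption:

# copy a comma-separated data file into the directory behind the prut table
hadoop dfs -put /tmp/prut1 /user/prut

after which select * from prut; in beeline shows its content.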