In Hive, a table definition is stored in the metastore and linked to a directory where the data are stored. Several file formats are possible here: plain text is one option, but there are others. One example is the Avro format, which combines a file definition (the schema) with the actual data.
Let us see how such data can be imported.
One possibility is to use the sqoop command to import data in Avro format:
sqoop import \
  --connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=192.168.2.2)(port=1521))(connect_data=(service_name=orcl)))" \
  --username scott --password binvegni \
  --table fam \
  --columns "NUMMER, NAAM" \
  --m 1 \
  --target-dir /loudacre/fam_avro \
  --null-non-string '\\N' \
  --as-avrodatafile
The result is composed of two parts:
1: The data. The table content is stored on HDFS in directory /loudacre/fam_avro. This can be verified with "hdfs dfs -ls /loudacre/fam_avro", which lists the files in which the data are stored. When such a file is opened, one sees:
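Files written by --as-avrodatafile are Avro object container files: each one starts with the four magic bytes "Obj" plus 0x01 and embeds the writer schema alongside the data. A minimal sketch to recognise such a file from its first bytes (the file name used in the comment is hypothetical):

```python
# Per the Avro specification, an object container file begins with
# the magic bytes b"Obj" followed by the version byte 0x01.
AVRO_MAGIC = b"Obj\x01"

def looks_like_avro(header: bytes) -> bool:
    """Return True if the given leading bytes mark an Avro container file."""
    return header[:4] == AVRO_MAGIC

# Usage sketch: fetch a file locally first, e.g.
#   hdfs dfs -get /loudacre/fam_avro/part-m-00000.avro
# then read its first four bytes:
#   with open("part-m-00000.avro", "rb") as f:
#       print(looks_like_avro(f.read(4)))
print(looks_like_avro(b"Obj\x01\x02\x03"))  # True for a genuine Avro header
```

This is only a sanity check on the header, not a full parse of the container.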
2: The data definition. So, we have the data; the schema is stored in the local directory from which the sqoop command was launched. To get it onto HDFS, we issue "hdfs dfs -put sqoop_import_fam.avsc /loudacre/". This stores the definition on HDFS. This can be investigated:
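The .avsc file that Sqoop generates is plain JSON describing a record type. The exact content depends on how Sqoop maps the Oracle column types; the version below is a hypothetical sketch for the two imported columns, shown together with a simple way to inspect it:

```python
import json

# Hypothetical content of sqoop_import_fam.avsc; the file Sqoop
# actually generates may differ in record name and type mapping.
schema_text = """
{
  "type": "record",
  "name": "fam",
  "fields": [
    {"name": "NUMMER", "type": ["null", "long"],   "default": null},
    {"name": "NAAM",   "type": ["null", "string"], "default": null}
  ]
}
"""

schema = json.loads(schema_text)
field_names = [f["name"] for f in schema["fields"]]
print(field_names)  # the two columns selected in the sqoop import
```

Because the schema is ordinary JSON, any JSON tool can be used to check which columns and types ended up in the definition before wiring it into Hive.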
With the data (in directory /loudacre/fam_avro) and the definition (in file /loudacre/sqoop_import_fam.avsc) in place, we may proceed with a table definition in Hive. Hive is accessed via "beeline -u jdbc:hive2://192.168.2.60:10000 -n training -p training", followed by:
CREATE EXTERNAL TABLE fam_avro STORED AS AVRO LOCATION '/loudacre/fam_avro' TBLPROPERTIES ('avro.schema.url'= 'hdfs:/loudacre/sqoop_import_fam.avsc');
A check with "select * from fam_avro" showed the data as expected.