Sqoop and Hive

It is possible to use Sqoop to load data directly from an RDBMS into Hive. This opens interesting possibilities: data that are stored in an RDBMS and need to be analysed on a cheaper platform can be migrated via Sqoop to a Hadoop platform. Sqoop is generally seen as a reliable medium for such a migration. The command is straightforward:

sqoop import --connect jdbc:mysql:// --table persons -m 1 --username thom --password thom24257  --hive-import

I noticed from the log that the import is done in three steps:

  • In a first step, the data are imported from the RDBMS and stored as HDFS datasets.
  • In a second step, the table is defined in the Hive metastore.
  • In a third step, the data are moved from the HDFS staging location into the Hive warehouse.
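Under the hood, these three steps correspond roughly to the following manual commands. This is a sketch only: the staging path and the column definitions below are illustrative assumptions, not taken from the actual log.

```shell
# Step 1 (done by Sqoop's MapReduce job): the rows land in a staging
# directory on HDFS, e.g. /user/thom/persons (path is an assumption)
hadoop fs -ls /user/thom/persons

# Step 2: register the table definition in the Hive metastore
# (column names and types are illustrative)
hive -e "CREATE TABLE persons (id INT, name STRING)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001';"

# Step 3: move the staged files from HDFS into the Hive warehouse
hive -e "LOAD DATA INPATH '/user/thom/persons' INTO TABLE persons;"
```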

An alternative script is provided below:

sqoop import-all-tables \
    -m 1 \
    --connect jdbc:mysql://quickstart:3306/retail_db \
    --username=retail_dba \
    --password=cloudera \
    --compression-codec=snappy \
    --as-parquetfile \
    --warehouse-dir=/user/hive/warehouse

This alternative shows how the HDFS datasets can be stored in a compressed format; here Snappy is used. The file format is Parquet. An alternative would be Avro, which is selected by supplying "--as-avrodatafile" instead of "--as-parquetfile".
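For completeness, the same import with Avro output would look like this (the connection details are repeated from the example above):

```shell
sqoop import-all-tables \
    -m 1 \
    --connect jdbc:mysql://quickstart:3306/retail_db \
    --username=retail_dba \
    --password=cloudera \
    --compression-codec=snappy \
    --as-avrodatafile \
    --warehouse-dir=/user/hive/warehouse
```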

Another alternative is importing from Oracle, which can be done as:

sqoop import \
--connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=" \
--table COUNTRIES \
--columns "NAME, LOCAL"  \
--num-mappers 1 \
--compression-codec=snappy \
--as-avrodatafile \
--username ANONYMOUS \
--password xxx99998 \
--warehouse-dir=/user/hive/warehouse

Once the data are registered in the Hive metastore, they can be accessed via a jar-based JDBC connector or an ODBC connector. For the JDBC connector, I used Cloudera_HiveJDBC4_2.5.4.1006, which contains the necessary jars. For ODBC, I used ClouderaHiveODBC64.msi. That way, I could see the data in, say, Excel:
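As a quick check from the command line, the imported tables can also be queried over the same JDBC route with beeline. The host and port below (quickstart:10000, HiveServer2's default port) are assumptions and need to be adapted to the actual cluster:

```shell
# Connect to HiveServer2 over JDBC and query an imported table
beeline -u jdbc:hive2://quickstart:10000/default \
        -n retail_dba \
        -e "SELECT COUNT(*) FROM categories;"
```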