Sqoop and Hive

It is possible to use Sqoop to load data directly from an RDBMS into Hive. This opens interesting possibilities: data that are stored in an RDBMS and need to be analysed on a cheaper platform can be migrated via Sqoop to a Hadoop platform. Sqoop is generally seen as a reliable medium for such a migration. The command is straightforward:

sqoop import --connect jdbc:mysql:// --table persons -m 1 --username thom --password thom24257  --hive-import

I noticed from the log that the import is done in three steps:

  • In a first step, the data are imported from the RDBMS and stored as HDFS datasets.
  • In a second step, the table is defined in the Hive metastore.
  • In a third step, the data are moved from the HDFS staging location into the Hive warehouse.
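Under the hood, these three steps correspond roughly to the following manual commands. This is a sketch only: the staging path and the column definitions below are illustrative assumptions, not taken from the actual log.

```shell
# Step 1 (done by Sqoop's MapReduce job): the rows land in a staging
# directory on HDFS, e.g. /user/thom/persons (path is an assumption)
hadoop fs -ls /user/thom/persons

# Step 2: register the table definition in the Hive metastore
# (column names and types are illustrative)
hive -e "CREATE TABLE persons (id INT, name STRING)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001';"

# Step 3: move the staged files from HDFS into the Hive warehouse
hive -e "LOAD DATA INPATH '/user/thom/persons' INTO TABLE persons;"
```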

An alternative script is provided below:

sqoop import-all-tables \
    -m 1 \
    --connect jdbc:mysql://quickstart:3306/retail_db \
    --username=retail_dba \
    --password=cloudera \
    --compression-codec=snappy \
    --as-parquetfile \
    --warehouse-dir=/user/hive/warehouse

This alternative shows how the HDFS datasets can be stored in a compressed format; here Snappy is used. The file format is Parquet. An alternative would be Avro, which is selected by supplying "--as-avrodatafile" instead of "--as-parquetfile".
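For completeness, the same import with Avro output would look like this (the connection details are repeated from the example above):

```shell
sqoop import-all-tables \
    -m 1 \
    --connect jdbc:mysql://quickstart:3306/retail_db \
    --username=retail_dba \
    --password=cloudera \
    --compression-codec=snappy \
    --as-avrodatafile \
    --warehouse-dir=/user/hive/warehouse
```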

Another alternative is importing from Oracle, which can be done as:

sqoop import \
--connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=" \
--table COUNTRIES \
--columns "NAME, LOCAL"  \
--num-mappers 1 \
--compression-codec=snappy \
--as-avrodatafile \
--username ANONYMOUS \
--password xxx99998 \
--warehouse-dir=/user/hive/warehouse

Once the data are registered in the Hive metastore, they can be accessed via a jar-based JDBC connector or an ODBC connector. For the JDBC connector, I used Cloudera_HiveJDBC4_2.5.4.1006, which contains the necessary jars. For ODBC, I used ClouderaHiveODBC64.msi. That way, I could see the data in, say, Excel:
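As a quick check from the command line, the imported tables can also be queried over the same JDBC route with beeline. The host and port below (quickstart:10000, HiveServer2's default port) are assumptions and need to be adapted to the actual cluster:

```shell
# Connect to HiveServer2 over JDBC and query an imported table
beeline -u jdbc:hive2://quickstart:10000/default \
        -n retail_dba \
        -e "SELECT COUNT(*) FROM categories;"
```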