In my impression, the big development that takes place now in the world of Big Data is the creation of connectors. Such connectors enable us to continue using standard tools (R for example) with the data being stored in Hadoop. I am very much impressed with Hive. Hive allows us to access data being stored in Hadoop while we use a standard SQL tool.
As an example, I use Oracle SQL Developer as a SQL tool. My aim will be to access from SQL Developer a dataset that is stored on Hadoop.
Of source, SQL Developer needs to be provided with a couple of drivers to access Hadoop. Such drivers can be found on the Cloudera; Mirror Cloudera_HiveJDBC_184.108.40.2066. These drivers can be stored in SQL Developer. SQL Developer allows this via Tools>Preferences>Database>Third Party JDBC Driver , then select [Add Entry]. Now the client is ready. All drivers are included. One may add a connection via “Hive”, much like any other connection.
On the other hand, the server must have a Hive server running. If one uses a Cloudera distribution, Hive is already included. If Hive is not running, one should install the Hive server. If that is ready, the tables need to added to the Hive metastore – Hive needs to be made aware of the existence of the tables.
Adding a table to Hive metastore is done in 2 steps. The first step is to define a table. An example might be:
CREATE TABLE TableName3 (id int, name string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
This can be started from the client. Within the SQL Developer, one may fire such SQL to the Hive environment.
It is a bit unfortunate that it is not possible to use an insert statement. We now have to load data on the server directly. This can be done at the server to start the hive tool. Just type in “hive” and the system responds with hive>. A command to load data is “LOAD DATA LOCAL INPATH ‘/home/cloudera/a.txt’ INTO TABLE tablename3;”. This loads a text file (“a.txt”) into th table that is created. The fields in this file are seperated with a comma and the records are limited by an “end of line” character. This is part of the table definition.
It can be checked how the data are stored on the Big Data hadoop platform. With the standard statements, it is possible to verify where the data are. In my case they were stored in /user/hive/warehouse/ that was verified with “hadoop dfs -ls /user/hive/warehouse/tablename3″. The actual data can then be accessed via: ” hadoop dfs -cat /user/hive/warehouse/tablename3/a.txt”.