Using Python with Hive

This small note describes how an HDFS file can be stored in a Hive context. Once it is stored in a Hive context, it can be accessed from outside via ODBC, and the data can be queried as if it were a SQL-compliant database. The idea is that an abstraction is created on top of the HDFS datasets; one may then access the HDFS datasets much like an ordinary database.
We will use the Python language via Spark. This avoids the performance bottleneck that MapReduce creates, since Spark keeps intermediate results in memory.
One starts Python via Spark with the command “pyspark”. If everything goes well, we see:
[screenshot: the pyspark startup screen]
Two variables are important: sc, which is an anchor point for methods that can be used within Spark, and HiveContext, which can be used as a starting point for Hive methods.

We first import the relevant libraries and create the context:

from pyspark.sql import HiveContext
# wrap the SparkContext (sc) that pyspark created at startup, so HiveQL can be issued
sqlContext = HiveContext(sc)

Then the table is defined:

sqlContext.sql("CREATE TABLE IF NOT EXISTS HiveTom (key STRING, value STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'")

In the last step, an existing HDFS file is loaded into that table definition. Note that LOAD DATA INPATH moves the file from its original location into the table's warehouse directory:

sqlContext.sql("LOAD DATA INPATH 'hdfs:/Chapter5/uit2' INTO TABLE HiveTom")

We may now approach this dataset as a table; the table name is HiveTom. One possibility is to access the table via ODBC. We can download an ODBC connector; each distribution (Cloudera, MapR, Hortonworks) provides one. Once it is installed, we may retrieve the data in any ODBC-compliant tool. As an example, we may do this in Excel:
[screenshot: retrieving the HiveTom table in Excel via ODBC]
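The same retrieval can also be scripted from Python with the pyodbc package. The sketch below is only illustrative: the data source name 'HiveDSN' is a hypothetical DSN that would first have to be configured in the ODBC manager with the distribution's connector.

import pyodbc

# 'HiveDSN' is a hypothetical data source name pointing at the Hive server
conn = pyodbc.connect("DSN=HiveDSN", autocommit=True)
cursor = conn.cursor()
cursor.execute("SELECT key, value FROM HiveTom")
for row in cursor.fetchall():
    print(row)
conn.close()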

By tom