The tango between Hive and Impala

By tom, 2 September 2018

An interesting tango exists between Hive and Impala.
The situation is as follows. Hive acts as a layer on top of MapReduce. It provides an interface through which table definitions can be stored in a so-called metastore. See below:

This metastore allows us to interpret directories on HDFS as tables. Because they can be interpreted as tables, we can run SQL against them. Hive translates each SQL statement into one or more MapReduce jobs, which return the results we expect from SQL.
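To make the idea of "a directory interpreted as a table" concrete, here is a minimal toy sketch in Python. It is not how Hive is implemented: the `metastore` dictionary, the `create_external_table` and `scan` functions, and the `users` table are all hypothetical names chosen for illustration, and a local temporary directory stands in for an HDFS directory.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical stand-in for the Hive metastore:
# table name -> directory location and column names.
metastore = {}


def create_external_table(name, location, columns):
    """Register a directory as a table; no data is moved or copied."""
    metastore[name] = {"location": Path(location), "columns": columns}


def scan(name):
    """Yield every row of every file in the table's directory as a dict."""
    table = metastore[name]
    for path in sorted(table["location"].iterdir()):
        with path.open(newline="") as handle:
            for values in csv.reader(handle):
                yield dict(zip(table["columns"], values))


# Demo: a local temporary directory plays the role of an HDFS directory.
data_dir = Path(tempfile.mkdtemp())
(data_dir / "part-00000.csv").write_text("1,alice\n2,bob\n")
(data_dir / "part-00001.csv").write_text("3,carol\n")

create_external_table("users", data_dir, ["id", "name"])
rows = list(scan("users"))
print(rows)
```

The point of the sketch is that the files themselves carry no schema. Only the entry in the metastore tells the reader which columns to expect, which is why several engines can share the same table definitions.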
The advantage is obvious: we can use SQL instead of writing MapReduce programs.
However, because SQL statements are translated into MapReduce jobs, we are stuck with MapReduce's weaknesses: it regularly writes intermediate data to disk and it is rather slow to start.
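As a rough illustration, a query such as `SELECT country, COUNT(*) FROM users GROUP BY country` can be thought of as a map phase followed by a reduce phase. The Python sketch below is a simplification under stated assumptions, not Hive's actual query plan: the sample rows, the `map_phase` and `reduce_phase` functions, and the `spill_file` are all made up for illustration. The intermediate result is deliberately written to disk and read back, to mirror the disk round trip described above.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path

# Illustrative input rows, as if scanned from a table with a "country" column.
rows = [
    {"id": 1, "country": "NL"},
    {"id": 2, "country": "DE"},
    {"id": 3, "country": "NL"},
]


def map_phase(rows):
    """Emit one (key, 1) pair per row: the GROUP BY column becomes the key."""
    return [(row["country"], 1) for row in rows]


def reduce_phase(pairs):
    """Group the pairs by key and sum the values, i.e. COUNT(*) per group."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


# The map output is written to disk and read back before the reduce phase,
# imitating the intermediate disk write between MapReduce phases.
spill_file = Path(tempfile.mkdtemp()) / "map_output.json"
spill_file.write_text(json.dumps(map_phase(rows)))
pairs = [tuple(pair) for pair in json.loads(spill_file.read_text())]

result = reduce_phase(pairs)
print(result)
```

A query with several joins or aggregations becomes a chain of such jobs, and each link in the chain pays the disk round trip and the start-up cost again.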
Impala circumvents these MapReduce bottlenecks. It nevertheless uses a SQL dialect that is largely compatible with Hive's, and it also uses the Hive metastore. So it is relatively easy to switch from Hive to Impala.
But circumventing the MapReduce bottlenecks is exactly what newer versions of Hive do as well. Hive can now use Tez instead of MapReduce. For some workloads this can make Hive faster than Impala, so we may return from Impala to Hive.

Etc. To be continued.
