It is possible to partition tables in Hive. Remember that the data are stored in files, so we expect the files themselves to be partitioned. This is accomplished by splitting the files over different directories: one directory serves one partition, a second directory another partition, and so on.
Let us take the example of 7 records that are split over 5 partitions according to their number. We have:
We see, in the directory that corresponds to the table's file location, that 5 directories are present, each corresponding to one partition.
Let us switch to partition 2, which stores (2, ine), (2, ellen) and (2, henkjan). When the directory that holds the data of partition 2 is investigated, we see a file with the data inside:
In that file, the nummer that was used to create the partitions is omitted; the nummer is stored in the directory name instead.
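Because the partition value lives only in the directory name, Hive reconstructs it at query time. As a sketch (using the table and column names of this example):

```sql
-- nummer is not present in the data files; Hive derives it
-- from the directory name (nummer=2) when the table is queried.
SELECT naam, nummer
FROM fam_by_nummer
WHERE nummer = 2;
```

A filter on the partition column also lets Hive read only the nummer=2 directory instead of scanning all partitions (partition pruning).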
To create this table, we use this command:
CREATE EXTERNAL TABLE fam_by_nummer(
  naam STRING)
PARTITIONED BY (nummer INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/loudacre/fam_by_nummer';
To load the table, we take a detour.
First, the data are loaded into HDFS:
sqoop import \
  --connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=192.168.2.2)(port=1521))(connect_data=(service_name=orcl)))" \
  --username scott --password bunvegni \
  --table fam \
  --columns "NUMMER, NAAM" \
  -m 1 \
  --target-dir /loudacre/fam_by_nummer_temp1
Then a table is created:
CREATE EXTERNAL TABLE fam_by_nummer_temp(
  nummer INT,
  naam STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/loudacre/fam_by_nummer_temp1';
And the data are loaded from this table into the partitioned table:
INSERT INTO TABLE fam_by_nummer PARTITION (nummer)
SELECT naam, nummer FROM fam_by_nummer_temp;
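This is a dynamic partition insert: Hive determines the target partition per row from the nummer value. Depending on the Hive configuration it may be rejected in strict mode, which demands at least one static partition; these standard settings enable it in the same session:

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
```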
The exact order of the fields wasn't directly clear to me, but it could be derived from the Hue interface: in a dynamic partition insert, the partition column (nummer) must come last in the SELECT list.
Adding data to an existing partition is easy.
First create a file and then upload it to the correct directory:
hdfs dfs -put /tmp/extra /loudacre/fam_by_nummer/nummer=2
Adding a new partition requires action on the Hive side:
ALTER TABLE fam_by_nummer ADD PARTITION (nummer=6);
This must be accompanied by the creation of the corresponding directory on HDFS:
hdfs dfs -mkdir /loudacre/fam_by_nummer/nummer=6
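Alternatively, ADD PARTITION accepts an explicit LOCATION clause, so the partition can be tied to its directory in one statement (a sketch using the same paths as above):

```sql
ALTER TABLE fam_by_nummer ADD PARTITION (nummer=6)
LOCATION '/loudacre/fam_by_nummer/nummer=6';
```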
And by storing a file with data in that directory:
hdfs dfs -put /tmp/extra /loudacre/fam_by_nummer/nummer=6
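To check that Hive sees the new partition and its data, the usual statements can be used:

```sql
SHOW PARTITIONS fam_by_nummer;
SELECT naam, nummer FROM fam_by_nummer WHERE nummer = 6;
```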