It is possible to partition tables in Hive. Remember that the data are stored in files, so we expect the files themselves to be partitioned. This is accomplished by splitting the files over different directories: one directory serves one partition, a second directory another partition, and so on.
Let us take the example of 7 records that are split over 5 partitions according to their number. We have:
We see, in the directory that corresponds to the table's file location, that 5 directories are present, each corresponding to one partition.
Let us switch to partition 2, which stores (2, ine), (2, ellen) and (2, henkjan). When the directory that holds the data of partition 2 is investigated, we see a file with the data inside:
In that file, the nummer that was used to create the partitions is omitted; the nummer is stored in the directory name instead.
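Because the partition value lives only in the directory name, Hive reconstructs it at query time. As a sketch (using the table and column names of this example):

```sql
-- nummer is not present in the data files; Hive derives it
-- from the directory name (nummer=2) when the table is queried.
SELECT naam, nummer
FROM fam_by_nummer
WHERE nummer = 2;
```

A filter on the partition column also lets Hive read only the nummer=2 directory instead of scanning all partitions (partition pruning).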
To create this table, we use this command:
CREATE EXTERNAL TABLE fam_by_nummer(
  naam STRING)
PARTITIONED BY (nummer INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/loudacre/fam_by_nummer';
To load the table, we take a detour.
First, the data are loaded into HDFS:
sqoop import \
  --connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=192.168.2.2)(port=1521))(connect_data=(service_name=orcl)))" \
  --username scott --password bunvegni \
  --table fam \
  --columns "NUMMER, NAAM" \
  -m 1 \
  --target-dir /loudacre/fam_by_nummer_temp1
Then a table is created:
CREATE EXTERNAL TABLE fam_by_nummer_temp(
  nummer INT,
  naam STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/loudacre/fam_by_nummer_temp1';
And the data are loaded from this table into the partitioned table:
INSERT INTO TABLE fam_by_nummer PARTITION (nummer)
SELECT naam, nummer FROM fam_by_nummer_temp;
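This is a dynamic partition insert: Hive determines the target partition per row from the nummer value. Depending on the Hive configuration it may be rejected in strict mode, which demands at least one static partition; these standard settings enable it in the same session:

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
```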
The exact order of the fields wasn't directly clear to me, but it could be derived from the Hue interface: in a dynamic partition insert, the partition column (nummer) must come last in the SELECT list.
Adding data to an existing partition is easy.
First create a file and then upload it to the correct directory:
hdfs dfs -put /tmp/extra /loudacre/fam_by_nummer/nummer=2
Adding a new partition requires action on the Hive side:
ALTER TABLE fam_by_nummer ADD PARTITION (nummer=6);
This must be accompanied by the creation of the corresponding directory on HDFS:
hdfs dfs -mkdir /loudacre/fam_by_nummer/nummer=6
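Alternatively, ADD PARTITION accepts an explicit LOCATION clause, so the partition can be tied to its directory in one statement (a sketch using the same paths as above):

```sql
ALTER TABLE fam_by_nummer ADD PARTITION (nummer=6)
LOCATION '/loudacre/fam_by_nummer/nummer=6';
```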
And by storing a file with data in that directory:
hdfs dfs -put /tmp/extra /loudacre/fam_by_nummer/nummer=6
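To check that Hive sees the new partition and its data, the usual statements can be used:

```sql
SHOW PARTITIONS fam_by_nummer;
SELECT naam, nummer FROM fam_by_nummer WHERE nummer = 6;
```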