Category: Uncategorized
-
Joining files with PySpark
PySpark allows us to process files in a big data/Hadoop environment. In another post I showed how PySpark can be started and used. The concept behind PySpark is interesting: it allows us to circumvent the limitations of the MapReduce framework. MapReduce is somewhat limiting as we have two steps:…
-
Flume: sending data via stream
It is possible to capture streaming data in HDFS files. One tool for this is Flume. The idea is that we have three elements: a source that provides a stream, a channel that transports the stream, and a sink where the stream ends up in a file. This can already be seen if we look at…
-
Partitioned Table in Hive
It is possible to partition tables in Hive. Remember that the data are stored in files, so we expect the files to be partitioned as well. This is accomplished by splitting the files over different directories: one directory holds one partition, a second directory holds another partition, and so on. Let us take the example of 7 records that…
-
Manipulating Avro
Avro files are binary files that contain both the data and a description of that data. This makes Avro a very interesting file format: one may send such a file to any application that is able to read Avro files. Just as an example: one may write the file in (say) PHP and send it to (say) Java.…
-
Parquet format
As we know, we may store table definitions in the metastore. These table definitions then refer to a location where the data are stored. The data might be stored as an ordinary text file or as an Avro file. Another possibility is a Parquet file. The Parquet format is an example of…
-
Avro format
In Hive, a table definition is stored in a metastore. This table definition is linked to a directory where the data are stored. Different formats can be used for these data. One may think of a text format, but other formats are possible too. One example is the Avro format.…
-
Create a Hive table – 3 ways
In this short note, I want to show three different ways to create a table in Hive. The first one starts with a file that is already available on HDFS, and we create a table on top of this file. This table is defined as an external table, so that the file is exposed as a table. The code to be executed…
-
Oracle ODI
The successor to OWB is the Oracle Data Integrator (ODI). This tool offers more functionality than OWB. In addition, it has an interface that more or less steers the user through a series of steps. The idea is that one starts with a technical view where the file locations, databases and schemas are declared. Once…
-
Docker container
Only this weekend I downloaded a Docker package from https://docs.docker.com/docker-for-windows. This package allows you to run very small, lightweight containers on your server that act as components to perform a certain task. In a way, a container looks like a virtual machine. It has no direct connection to the host machine and it runs…
-
Putting a file on HDFS
Putting a file on HDFS is relatively easy. There are a few steps to take. Let us assume the file is on a Linux system. The first step is to copy the file to an area where it can be stored with the hdfs user as its owner. On my system, I have /tmp that…