Author: tom
-
Useful Scala Programme
I saw a small Scala programme that calculates subtotals. The idea is that a flat file is provided in which each line holds a name and a subtotal. A given name can occur more than once, and so contribute more than one subtotal. The task is to calculate the total per name. Let me provide some…
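The post itself uses Scala, and its code is not shown in this excerpt. As a rough illustration of the same idea, here is a minimal Python sketch. It assumes each line has the form name|subtotal; the separator, the function name `totals_per_name` and the sample data are all assumptions made for this example.

```python
from collections import defaultdict


def totals_per_name(lines, sep="|"):
    """Sum the subtotals for each name; a name may occur on several lines."""
    totals = defaultdict(float)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumed layout: name, separator, subtotal
        name, amount = line.split(sep, 1)
        totals[name.strip()] += float(amount)
    return dict(totals)


# Hypothetical sample data, not taken from the post.
sample = ["tom|10", "ann|5", "tom|2.5"]
print(totals_per_name(sample))
```

To read from a real file, the open file object can be passed directly as `lines`, since it yields one line at a time.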
-
The tango between Hive and Impala
An interesting tango exists between Hive and Impala. The situation is as follows. Hive acts as a layer on top of MapReduce. It provides an interface whereby table definitions can be stored in a so-called metastore. See below: This metastore allows us to interpret directories on HDFS as tables. As they can be interpreted as…
-
Joining files by Pig
In the previous post, we used Scala to merge two files. The interesting point is that Scala bypasses MapReduce, whereas Pig uses it. If we run the same example in Pig, we will see a serious performance difference. fooOriginal = LOAD '/user/prut/foo' USING PigStorage('|') AS (id:long, foo:long); barOriginal = LOAD '/user/prut/bar' USING PigStorage('|') AS (id:long, bar:long);…
-
Joining files by Scala
In this note, I will provide another script to join two files. The files are foo and bar. They contain lines in which the fields are separated by a bar (|). An example of such a line is 1|110. So the first step is to split the lines. Then the file is indexed on one element. Subsequently,…
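The Scala script is not included in this excerpt. The steps it describes (split each line, index one file on a field, then match the other file against that index) can be sketched in Python as follows. This is only an illustration under assumptions: the key is taken to be the first field, the join is taken to be an inner join, and all names and sample lines other than 1|110 are hypothetical.

```python
def index_by_key(lines, sep="|"):
    """Split each line and index the remainder on its first field."""
    index = {}
    for line in lines:
        line = line.strip()
        if line:
            key, value = line.split(sep, 1)
            index[key] = value  # a repeated key keeps the last value seen
    return index


def join_lines(foo_lines, bar_lines, sep="|"):
    """Inner join on the first field; returns (id, foo, bar) tuples."""
    bar_index = index_by_key(bar_lines, sep)
    joined = []
    for line in foo_lines:
        line = line.strip()
        if not line:
            continue
        key, foo_value = line.split(sep, 1)
        if key in bar_index:  # keep only ids present in both files
            joined.append((key, foo_value, bar_index[key]))
    return joined


# "1|110" comes from the post; the other lines are made up.
foo = ["1|110", "2|120", "3|130"]
bar = ["1|210", "3|230", "4|240"]
for row in join_lines(foo, bar):
    print("|".join(row))
```

Indexing one file in a dictionary keeps the lookup per line cheap, at the cost of holding that file in memory.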
-
Flume used for logs
In an earlier post, I showed how one may send a stream via netcat to HDFS using Flume. Another possibility is to set up a stream that is received by a server on which the data are shown directly. The idea is that the client connects with telnet or netcat and sends the data. On the…
-
Datamodelling in Hadoop
Before data modelling in Hadoop can be discussed, one needs to realise that Hadoop is about files. It is not about tables and relations; it is the files that are central. This implies that we have different means at our disposal than in a traditional RDBMS. As Hadoop is about files, we…
-
HBase
HBase is a database system that is built on top of HDFS. However, the term ‘database’ might be a bit misleading. It is not a traditional SQL database that can be accessed by a traditional SQL client, such as SQL Developer. HBase is a technique that falls in the area of NoSQL, –…
-
Write an AVRO file
Below, I provide some Python code to write an Avro file. An Avro file consists of a schema and a set of records. The records are written in binary format. The schema is as follows: {"type": "record", "name": "StringPair", "doc": "A pair of strings.", "fields": [ {"name": "left", "type": "string"}, {"name": "right", "type": "string"}]} The…
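The schema above is plain JSON, so it can be inspected with the Python standard library before any Avro library is involved. The sketch below only parses the schema and checks that a record has the declared fields. It does not write the binary Avro format, which requires an Avro library. The helper `matches_schema` and the sample records are assumptions made for this example.

```python
import json

# The schema from the post, with straight quotes so that it is valid JSON.
SCHEMA_JSON = """
{"type": "record", "name": "StringPair", "doc": "A pair of strings.",
 "fields": [{"name": "left", "type": "string"},
            {"name": "right", "type": "string"}]}
"""

schema = json.loads(SCHEMA_JSON)


def matches_schema(record, schema):
    """Return True if record has exactly the schema's fields, all strings.

    Hypothetical helper: it only handles the "string" type used here.
    """
    names = [field["name"] for field in schema["fields"]]
    return (set(record) == set(names)
            and all(isinstance(record[name], str) for name in names))


print(matches_schema({"left": "a", "right": "b"}, schema))
print(matches_schema({"left": "a"}, schema))
```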
-
Copying content of an Oracle table into an avro file
Below, you will find a listing that copies the content of an Oracle table into an Avro file. The trick is quite straightforward. A table is read via a cursor. Each record is then appended to an Avro file. The schema is: { "namespace": "example.avro", "type": "record", "name": "User", "fields": [ {"name":…
-
Sending Avro file via HTTP
It is possible to send an Avro file via HTTP. The idea is that one sets up a server process. Once the server process runs, a client call is made. I found a neat scheme showing how such a process works. On the server side, a socket must be set up. This set-up must also…