Category: Allgemein
-
Analysing JSON and database tables in Spark
In a previous note, I showed how CSV files can be analysed. One may use the same technique to analyse JSON files or tables in a database. First, JSON files can be analysed with code that looks like:
val jsonRDD = sc.wholeTextFiles("/user/tom/baby_names.json").map(x => x._2)
val namesJson = sqlContext.read.json(jsonRDD)
namesJson.registerTempTable("names")
sqlContext.sql("select * from names").collect.foreach(println)
Going…
-
Install Spark on Windows
I found an excellent YouTube video that shows how Spark can be installed on Windows: https://www.youtube.com/watch?v=WlE7RNdtfwE . The video provides a clear, step-by-step guide on how to set this up. The first step is to install the JDK. I installed this from https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html. This allowed me to install version 8,…
-
The tango between Hive and Impala
An interesting tango exists between Hive and Impala. The situation is as follows. Hive acts as a layer on top of MapReduce. It provides an interface whereby table definitions can be stored in a so-called metastore. See below: This metastore allows us to interpret directories on HDFS as tables. As they can be interpreted as…
-
Flume used for logs
In an earlier post, I showed how one may send a stream via netcat to HDFS using Flume. Another possibility is to set up a stream that is received by a server, where the data are shown directly. The idea is that the client connects with telnet or netcat and sends the data. On the…
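The client/server idea can be illustrated without Flume. The sketch below is a plain Python stand-in, not a Flume configuration: a small TCP server plays the receiving side and prints every line as it arrives, while a client socket plays the role of telnet or netcat. The function names `serve_once`, `send_lines` and `demo`, and the use of localhost with a port chosen by the operating system, are assumptions made for this illustration.

```python
import socket
import threading


def serve_once(server_sock, received):
    """Accept one client and collect every line it sends."""
    conn, _addr = server_sock.accept()
    with conn, conn.makefile("r", encoding="utf-8") as stream:
        for line in stream:
            text = line.rstrip("\r\n")
            print(text)            # show the data directly, as they arrive
            received.append(text)


def send_lines(port, lines):
    """Play the role of telnet/netcat: connect and send some lines."""
    with socket.create_connection(("127.0.0.1", port)) as client:
        for line in lines:
            client.sendall((line + "\n").encode("utf-8"))


def demo(lines):
    """Start the server on a free local port, send the lines, return what arrived."""
    received = []
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server_sock:
        server_sock.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
        server_sock.listen(1)
        port = server_sock.getsockname()[1]
        worker = threading.Thread(target=serve_once, args=(server_sock, received))
        worker.start()
        send_lines(port, lines)               # closing the client ends the stream
        worker.join()
    return received
```

Calling `demo(["first event", "second event"])` prints both lines on the server side and returns them as a list.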
-
Data modelling in Hadoop
Before data modelling in Hadoop can be discussed, one needs to realise that Hadoop is about files. It is not about tables and relations – it is the files that are central. This implies that we have other means at our disposal than we have in a traditional RDBMS. As Hadoop is about files, we…
-
HBase
HBase is a database system that is built on top of HDFS. However, the term ‘database’ might be a bit misleading. It is not a traditional SQL database that can be accessed by a traditional SQL client, such as SQL Developer. HBase is a technique that falls in the area of NoSQL, –…
-
Write an AVRO file
Below, I provide some Python code to write an AVRO file. An AVRO file consists of a schema and a set of records. The records are written in binary format. The schema is as follows: {"type": "record", "name": "StringPair", "doc": "A pair of strings.", "fields": [ {"name": "left", "type": "string"}, {"name": "right", "type": "string"}]} The…
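To make "written in binary format" concrete, here is a minimal sketch in plain Python of how one StringPair record body is encoded according to the Avro binary encoding rules: a string is its UTF-8 byte length (a zig-zag, variable-length long) followed by the bytes, and a record is simply its fields concatenated in schema order. This is not the code from the post and it does not produce a complete AVRO file; the file header, embedded schema and sync markers are left to the avro package. The helper names `encode_long`, `encode_string` and `encode_record` are hypothetical.

```python
import json

# The StringPair schema from the text, parsed as ordinary JSON.
SCHEMA = json.loads("""
{"type": "record", "name": "StringPair", "doc": "A pair of strings.",
 "fields": [{"name": "left", "type": "string"},
            {"name": "right", "type": "string"}]}
""")


def encode_long(n):
    """Avro long: zig-zag encode, then emit 7 bits per byte, low bits first."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)   # high bit set: more bytes follow
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s):
    """Avro string: byte length as a long, followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_record(record, schema=SCHEMA):
    """Concatenate the fields in schema order (every field is a string here)."""
    return b"".join(encode_string(record[f["name"]]) for f in schema["fields"])
```

For example, `encode_record({"left": "a", "right": "bc"})` yields the five bytes `b"\x02a\x04bc"`: the lengths 1 and 2 become 2 and 4 after zig-zag encoding.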
-
Copying content of an Oracle table into an Avro file
Below, you will find a listing on how to copy the content of an Oracle table into an Avro file. The trick is quite straightforward. A table is read via a cursor. Each record is then appended to an Avro file. The schema is: { "namespace": "example.avro", "type": "record", "name": "User", "fields": [ {"name":…
-
Sending Avro file via HTTP
It is possible to send an AVRO file via HTTP. The idea is that one sets up a server process. Once the server process runs, a client call is made. I found a neat diagram of how such a process works. We see that on the server side, a socket must be set up. This set-up must also…
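The server/client sequence can be sketched with the Python standard library alone. This is a generic stand-in that posts raw bytes over HTTP, not the Avro RPC mechanism and not the code from the post: the server binds a socket on localhost and waits for one request, after which the client makes its call. The names `UploadHandler` and `send_bytes`, the `application/octet-stream` content type and the reply text are assumptions for illustration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class UploadHandler(BaseHTTPRequestHandler):
    """Server side: accept a POST and keep the request body."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self.server.received = self.rfile.read(length)
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"stored %d bytes" % length)

    def log_message(self, *args):
        pass  # keep the example quiet


def send_bytes(payload):
    """Start a one-request server, POST the payload to it, return both sides' view."""
    server = HTTPServer(("127.0.0.1", 0), UploadHandler)  # socket set up here
    server.received = None
    worker = threading.Thread(target=server.handle_request)  # serve one request
    worker.start()

    # Client side: one HTTP POST carrying the file content as the body.
    conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
    conn.request("POST", "/", body=payload,
                 headers={"Content-Type": "application/octet-stream"})
    reply = conn.getresponse().read()
    conn.close()

    worker.join()
    server.server_close()
    return server.received, reply
```

In practice the payload would be the bytes of an AVRO file, for instance read with `open(path, "rb").read()`, where `path` is whatever file one wants to send.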
-
Show content of an AVRO file with Python
This note describes how we can show the content of an AVRO file with Python. We use Python 3 here, from an Anaconda installation. I checked whether this installation already contained an AVRO package, but this wasn’t the case. Therefore, AVRO was downloaded (as avro-python3-1.8.2.tar.gz) and the following command was issued: C:\ProgramData\Anaconda3\python.exe…