Category: data warehousing

  • Oracle create loader file

    Creating a loader file is tedious. The syntax is complicated, and small errors lead to rejected records. Luckily, we can generate such files from SQL Developer. One possibility is to use the SQL Loader. This facility has an import/export module that allows you to create such files. If one would like…

  • Useful Scala Programme

    I saw a small Scala programme that allows you to calculate subtotals. The idea is that a flat file is provided with a name and a subtotal. A given name can occur more than once, thus providing a subtotal more than once. The question is to calculate the total per name. Let me provide some…
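
    The idea of a total per name can be illustrated with a minimal sketch in plain Scala collections. This is not necessarily the programme from the post: the comma delimiter, the sample lines, and the names Subtotals and totalsPerName are assumptions made for illustration.

```scala
object Subtotals {
  // Assumption: each line has the form "name,subtotal"; the delimiter is hypothetical.
  def totalsPerName(lines: Seq[String]): Map[String, Long] =
    lines
      .map(_.split(","))
      // Keep only well-formed lines with exactly two fields.
      .collect { case Array(name, amount) => (name.trim, amount.trim.toLong) }
      // Group the subtotals by name and add them up.
      .groupBy(_._1)
      .map { case (name, pairs) => name -> pairs.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample data: "tom" occurs twice.
    val sample = Seq("tom,10", "ina,5", "tom,7")
    totalsPerName(sample).toSeq.sortBy(_._1).foreach {
      case (name, total) => println(s"$name $total")
    }
  }
}
```

    A name that occurs several times ends up with a single entry in the resulting map, holding the sum of its subtotals.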

  • Joining files by Pig

    In the previous post, we used Scala to merge two files. The interesting feature is that Scala bypasses MapReduce, whereas Pig uses MapReduce. If we undertake the same example, we will see a serious performance difference. fooOriginal = LOAD '/user/prut/foo' USING PigStorage('|') AS (id:long, foo:long); barOriginal = LOAD '/user/prut/bar' USING PigStorage('|') AS (id:long, bar:long);…

  • Joining files by Scala

    In this note, I will provide another script to join two files. The files are foo and bar. They contain lines in which the elements are separated by a bar (|). One example of such a line is 1|110. So the first step is to split the lines. Then the file is indexed on one element. Subsequently,…
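
    The steps named here (split each line, index on one element, then combine) can be sketched in plain Scala collections. This is only an illustration under assumptions, not the script from the post: the names JoinFiles, parse and join are hypothetical, the first element is assumed to be the key, keys in bar are assumed to be unique, and all sample lines except 1|110 are invented.

```scala
object JoinFiles {
  // Split a line such as "1|110" into (key, value).
  // The bar is escaped because String.split takes a regular expression.
  def parse(line: String): Option[(String, String)] =
    line.split("\\|") match {
      case Array(id, value) => Some(id -> value)
      case _                => None // skip malformed lines
    }

  // Inner join on the first element of each line.
  def join(foo: Seq[String], bar: Seq[String]): Seq[(String, String, String)] = {
    // Index bar on its key (assumes each key occurs once in bar).
    val barById: Map[String, String] = bar.flatMap(parse).toMap
    for {
      (id, fooValue) <- foo.flatMap(parse)
      barValue       <- barById.get(id)
    } yield (id, fooValue, barValue)
  }

  def main(args: Array[String]): Unit = {
    val foo = Seq("1|110", "2|120") // "2|120" is hypothetical sample data
    val bar = Seq("1|210", "3|230") // hypothetical sample data
    join(foo, bar).foreach { case (id, f, b) => println(s"$id|$f|$b") }
  }
}
```

    Only keys present in both inputs survive, so the sample data yields a single joined line for key 1.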

  • OBIEE First report

    Oracle has a very nice reporting tool, called OBIEE. It is positioned on top of their database, which allows you to exploit the data. To do so, a separate (WebLogic OBIEE) server is created that processes the data for reporting purposes. So on the server side, at least two server processes are running: the DBMS and…

  • ElasticSearch: Restful services

    As we have seen in a previous post, we communicate with the ElasticSearch server via messages that are sent to a server. In turn, the server responds with messages that are received by the client. This system of messages is labelled a “RESTful” structure. This RESTful structure is based on messages that…

  • Scala and RDDs

    RDDs are the basic unit in Scala on Spark. The abbreviation stands for Resilient Distributed Dataset. This shows that we are talking about full data sets that are stored persistently on a distributed network. So the unit of work is comparable to a table. We have two different operations on this RDD. These are a…

  • A Python script with many steps

    PySpark is the Python language applied to Spark. It therefore allows a wonderful merge between Spark, with its possibilities to circumvent the limitations that are set by the MapReduce framework, and Python, which is relatively simple. In the scheme below, some steps are shown that might be used. sc.textFile allows you to read a…

  • Dataflow in Oracle Warehouse Builder

    I know that Oracle Warehouse Builder (OWB) is at end of life. On the other hand, I encounter OWB quite often and it is interesting to see how it works. So, to investigate how it works, I created a dataflow. It is a trivial one: it consists of a file that must be read into Oracle.…

  • Reading an HDFS file in Python

    In this note, I show you how to get data from an HDFS platform into a Python programme. The idea is that we have data on HDFS and we would like to use these data in a Python programme. So, we must connect to HDFS from within a Python programme, read the data, transform…