Author: tom

  • Scala merging files

    In a previous post, I showed how two files can be merged in Scala. The idea was to convert the RDDs to data frames and join those. In this post, the approach is slightly different: each RDD is rewritten as key-value pairs with a unique key. This then allows…

  • Scala and RDDs

    RDDs are the basic unit of data when working with Spark from Scala. The abbreviation stands for Resilient Distributed Dataset: a complete data set that is partitioned across a cluster and can be rebuilt if part of it is lost. So the unit of work is comparable to a table. There are two kinds of operations on an RDD. These are a…

  • Merging files in Scala

    I understand that Scala may be used in an ETL context. An important element of ETL is the merging of two files: data arrive from different sources and must be combined into a single file. As an example, we may think of two files, one containing a number and a name, another…

  • Getting a histogram from Big Data with Scala

    Scala can be used as a tool to manipulate big data. Used together with Spark, it combines two strong tools: Spark, with its ability to bypass the MapReduce bottleneck, and Scala, with its short learning curve. The idea that Scala can be closely integrated with Spark is…

  • Scala

    Scala is a general-purpose language. One may use it as a statistical tool, as a tool for pattern matching and so on, just as one might use Java, C++ or Fortran. On top of that, Scala is used as a means to steer Big Data on a Hadoop…

  • Network reaction from Python

    I have a PHP script that runs as CGI on a web server. The programme is quite simple. First it asks for a userid and password, which are sent as parameters. If these values match the expected values, the system returns a page where the user may click on a hyperlink to…

  • Another Pyspark script

    In this note, I show yet another Pyspark script with slightly different methods to filter. The idea is that a file is read into an RDD and subsequently cleaned. The cleaning removes lines that are too long. The lines are split on the character that is at the twentieth position. Then the…

  • A Python script with many steps

    Pyspark is the Python interface to Spark. It therefore allows a wonderful merge between Spark, which can circumvent the limitations set by the MapReduce framework, and Python, which is relatively simple. The scheme below shows some steps that might be used. sc.textFile allows us to read a…

  • The 1000th wordcount example

    I just discovered the 1000th wordcount example. It is based on Pyspark. The idea is actually quite simple. One creates a script, which can be written in any editor. The programme can then be run from the terminal with spark-submit [programme]. As an example, one may start the programme below with: spark-submit --master yarn-cluster…

  • Joining files with Pyspark

    Pyspark allows us to process files in a big data/Hadoop environment. I showed in another post how Pyspark can be started and how it can be used. The concept of Pyspark is very interesting. It allows us to circumvent the limitations of the MapReduce framework. MapReduce is somewhat limiting as we have two steps:…