Category: Uncategorized
-
Oh my God: how good is open source
A few days ago, I had to write a Python script to write some data to an Oracle database. At first, I had no idea where to start, so I found a simple example script via Google, downloaded it, and tried to run it. The script looked like this: import…
-
Curl and ElasticSearch
One of the most useful utilities is “curl”. This wonderful tool can be used to transfer data from one platform to another. It is relatively easy to install on Windows, whereas on Linux it is often already installed. It must be run from the terminal on Linux or from the command line on Windows. One example…
-
ElasticSearch
A new and popular NoSQL database is ElasticSearch. This database is easy to install and easy to run. But is it easy to insert data and extract the outcomes? The principle of inserting data into ElasticSearch looks rather straightforward: one inserts JSON files. On the other hand, with filters, one may…
-
Scala merging files
In a previous post, I showed how two files can be merged in Scala. The idea was that the RDDs were converted to data frames and a join was performed on these. In this post, the approach is slightly different. Now each RDD is rewritten as key-value pairs with a unique key. This then allows…
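
The key-value idea can be illustrated outside Spark. The following is a minimal sketch in plain Python (not the Scala code from the post): each line is turned into a (key, value) pair and the two collections are joined on the key. The helper names `to_pairs` and `join_on_key`, the comma separator, and the sample records are all hypothetical, and the join assumes that keys are unique on the right-hand side.

```python
# Plain-Python illustration of a join on key-value pairs.
# This is NOT Spark or Scala; names and sample data are hypothetical.

def to_pairs(lines, sep=","):
    """Turn 'key<sep>value' lines into (key, value) pairs."""
    pairs = []
    for line in lines:
        key, value = line.split(sep, 1)
        pairs.append((key.strip(), value.strip()))
    return pairs

def join_on_key(left, right):
    """Inner join of two lists of (key, value) pairs.

    Assumes keys are unique in `right`; a duplicate key there would
    silently keep only its last value.
    """
    lookup = dict(right)
    return [(k, (v, lookup[k])) for k, v in left if k in lookup]

names = to_pairs(["1,Alice", "2,Bob", "3,Carol"])
cities = to_pairs(["1,Utrecht", "3,Leiden"])
joined = join_on_key(names, cities)
print(joined)
```

Keys that appear in only one of the two inputs (here key 2) are dropped, which is the behaviour of an inner join.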
-
Merging files in Scala
I understand that Scala may be used in an ETL context. In ETL, an important element is the merge of two files: we get data from different sources and they must be merged into a single file. As an example, we may think of two files, one containing a number and a name, another…
-
Getting a histogram from Big Data with Scala
Scala can be used as a tool to manipulate big data. If it is used in the Spark context, we can combine two strong tools: Spark, with its ability to bypass the MapReduce bottleneck, and Scala, with its short learning curve. The idea that Scala can be closely integrated with Spark is…
-
Scala
Scala is a general-purpose language. One may use it as a statistical tool, as a tool for pattern matching, and so on, just as one might use any other programming language such as Java, C++ or Fortran. But on top of that, Scala is used as a means to steer Big Data on a Hadoop…
-
Another Pyspark script
In this note, I show yet another Pyspark script with slightly different methods to filter. The idea is that a file is read into an RDD. Subsequently, it is cleaned. That cleaning process involves removing lines that are too long. The lines are split on the character that is in the twentieth position. Then the…
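
The cleaning steps named in this excerpt can be sketched in plain Python (not Pyspark). The sketch rests on two assumptions that the excerpt does not confirm: that "too long" means longer than a fixed limit (the hypothetical `MAX_LEN`), and that the separator is whatever character sits at the twentieth position (index 19) of each line. The function names and sample lines are made up for illustration.

```python
# Plain-Python sketch of the two cleaning steps; NOT the post's Pyspark code.
MAX_LEN = 40  # hypothetical limit; the excerpt does not give the real one

def drop_long_lines(lines, max_len=MAX_LEN):
    """Keep only the lines that are at most max_len characters long."""
    return [line for line in lines if len(line) <= max_len]

def split_on_20th_char(line):
    """Split a line on the character found at its twentieth position.

    Assumption: that character (index 19) is the field separator.
    Lines shorter than 20 characters are returned as a single field.
    """
    if len(line) < 20:
        return [line]
    separator = line[19]
    return line.split(separator)

raw = [
    "abcdefghijklmnopqrs;tail;more",  # ';' sits at position 20
    "x" * 50,                         # too long, will be dropped
]
cleaned = drop_long_lines(raw)
fields = [split_on_20th_char(line) for line in cleaned]
print(fields)
```

In Pyspark the same two steps would typically be a `filter` followed by a `map` on the RDD.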
-
The 1000th wordcount example
I just discovered the 1000th wordcount example. It is based on Pyspark. The idea is actually quite simple. One creates a script, which can be written in any editor. The programme can then be run from the terminal with spark-submit [programme]. As an example, one may start the programme below with: spark-submit --master yarn-cluster…
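
The counting logic behind any wordcount can be shown without a cluster. The following is a minimal sketch in plain Python using `collections.Counter`. It is not the Pyspark programme from the post and is not meant to be run with spark-submit. The function name `word_count` and the sample lines are hypothetical, and words are split on whitespace after lowercasing.

```python
# Plain-Python wordcount sketch; NOT the Pyspark programme from the post.
from collections import Counter

def word_count(lines):
    """Count how often each whitespace-separated word occurs (case-insensitive)."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

sample = ["To be or not to be", "that is the question"]
counts = word_count(sample)
for word, n in counts.most_common(3):
    print(word, n)
```

In Pyspark the same steps are usually expressed as a `flatMap` to split lines into words, a `map` to (word, 1) pairs, and a `reduceByKey` to sum the counts.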