Scala and RDDs

RDDs are the basic unit in Scala on Spark. The abbreviation stands for Resilient Distributed Dataset, This shows that we are talking on full data sets that are stored persistently on a distributed network. So the unit of work is comparable to a table. We have two different operations on this RDD. These are a filter, where some rows are left out and a map where the rows are manipulated. Below, I collected two sets of examples of scala statements on such RDD. This might then be used as a cookbook for future use.

The mapping functions:

val mydata_uc = => line.toUpperCase())
val myrdd2 = => JSON.parseFull(pair._2).get.asInstanceOf[Map[String,String]])
val myrdd2 = => line.split(' '))
val myrdd2 = mydata.flatMap(line => line.split(' '))
var pipo = => line.length)
var pipo = => (line.split(' ')(0),line.split(' ')(2)))
val users = sc.textFile("/user/hdfs/keyvalue").map(line => line.split(',')).map(fields => (fields(0),(fields(0),fields(1)))) 
val users = sc.textFile("/user/hdfs/keyvalue").keyBy(line => line.split(',')(0))
val counts = sc.textFile("/user/hdfs/keyvalue").flatMap(line => line.split(',')).map(fields => (fields,1))
val counts = sc.textFile("/user/hdfs/keyvalue").flatMap(line => line.split(',')).map(fields => (fields,1)).reduceByKey((v1,v2) => v1+v2)
val avglens = sc.textFile("/user/hdfs/naamtoev").flatMap(line => line.split(" ")).map(word => (word,word.length)).groupByKey().map(pair => (pair._1, pair._2.sum/pair._2.size.toDouble)) 

The filter function:

val mydata_uc = mydata.filter(line => line.startsWith("I"))
var jpglogs = logs.filter(line => line.contains(".jpg"))