RDDs are the basic unit of work when using Scala on Spark. The abbreviation stands for Resilient Distributed Dataset: the data is distributed over the nodes of the cluster (the source usually lives on a distributed file system such as HDFS), and it is resilient because a lost partition can be recomputed. So the unit of work is comparable to a table of rows. Two kinds of operations on an RDD are shown here: a filter, where some rows are left out, and a map, where the rows are transformed. Below I collected two sets of examples of Scala statements on such an RDD. They can serve as a cookbook for future use.
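
The RDDs named mydata and logs in the examples below are assumed to have been created from text files beforehand. A minimal sketch of how that could look (the paths here are placeholders of my own):

val mydata = sc.textFile("/user/hdfs/mydata")   // every element is one line of the file
val logs = sc.textFile("/user/hdfs/logs")       // idem, for a web-server style log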

The mapping functions:

// uppercase every line
val mydata_uc = mydata.map(line => line.toUpperCase())
// parse the value of a (key,value) pair as JSON; needs: import scala.util.parsing.json.JSON
val myrdd2 = myrdd1.map(pair => JSON.parseFull(pair._2).get.asInstanceOf[Map[String,String]])
// split every line on spaces: the result is an RDD of Array[String]
val myrdd2 = mydata.map(line => line.split(' '))
// the same split, but flattened: the result is an RDD of single words
val myrdd2 = mydata.flatMap(line => line.split(' '))
// map every line to its length
val pipo = logs.map(line => line.length)
// take the first and the third field of every line as a pair
val pipo = logs.map(line => (line.split(' ')(0),line.split(' ')(2)))
// key/value pairs from a comma separated file: (key, (key, value))
val users = sc.textFile("/user/hdfs/keyvalue").map(line => line.split(',')).map(fields => (fields(0),(fields(0),fields(1))))
// the same keying with keyBy: the first field becomes the key, the whole line stays the value
val users = sc.textFile("/user/hdfs/keyvalue").keyBy(line => line.split(',')(0))
// counting, step 1: every comma separated field becomes a (field, 1) pair
val counts = sc.textFile("/user/hdfs/keyvalue").flatMap(line => line.split(',')).map(fields => (fields,1))
// counting, step 2: sum the ones per field with reduceByKey
val counts = sc.textFile("/user/hdfs/keyvalue").flatMap(line => line.split(',')).map(fields => (fields,1)).reduceByKey((v1,v2) => v1+v2)
// average word length per word: group the (word, length) pairs and divide sum by count
val avglens = sc.textFile("/user/hdfs/naamtoev").flatMap(line => line.split(" ")).map(word => (word,word.length)).groupByKey().map(pair => (pair._1, pair._2.sum/pair._2.size.toDouble))
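
Note that all of the statements above are transformations: Spark only records them, and nothing is computed until an action is called. As a small sketch (assuming the /user/hdfs/keyvalue file from the examples exists, and with an output path of my own choosing), the counts RDD can be materialized like this:

// bring a few results back to the driver and print them
counts.take(5).foreach(println)
// or write the whole result back to HDFS (the output directory must not exist yet)
counts.saveAsTextFile("/user/hdfs/keyvalue_counts")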

The filter function:

// keep only the lines that start with "I"
val mydata_uc = mydata.filter(line => line.startsWith("I"))
// keep only the lines that contain ".jpg"
val jpglogs = logs.filter(line => line.contains(".jpg"))
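
Map and filter combine naturally into one chain. A sketch that reuses the logs RDD from above, with variable names of my own (jpgIps, jpgCounts): it keeps the .jpg requests and counts them per first field (for instance the client IP):

// keep the .jpg requests, then take the first field of every matching line
val jpgIps = logs.filter(line => line.contains(".jpg")).map(line => line.split(' ')(0))
// count how many .jpg requests each of those values made
val jpgCounts = jpgIps.map(ip => (ip,1)).reduceByKey((v1,v2) => v1+v2)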

By tom