Processing CSV files in Spark

It is possible to process CSV files in Spark as if they were tables. That looks very promising: read a file, translate it into something that can be treated as a table, and then apply SQL to it. We need some additional jars, however. First, download two jars: http://mvnrepository.com/artifact/com.databricks/spark-csv_2.10 and http://mvnrepository.com/artifact/org.apache.commons/commons-csv/1.2. I stored these files in SPARK_HOME/lib, alongside the jars that were already there. I then started Spark from SPARK_HOME with ./bin/spark-shell --jars lib/spark-csv_2.10-1.5.0.jar,lib/commons-csv-1.2.jar, which puts the freshly downloaded jars on the classpath.
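
As a side note, Spark (1.3 and later) can also fetch such dependencies from Maven automatically via the --packages flag, which saves the manual download; a sketch, assuming network access to the Maven repository:

  • ./bin/spark-shell --packages com.databricks:spark-csv_2.10:1.5.0

This should pull in commons-csv as a transitive dependency of spark-csv, so a single coordinate suffices.
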
The spark-shell itself provides a so-called SQL context (sqlContext); with these jars on the classpath, it can read CSV via the com.databricks.spark.csv data source.
I could then issue two lines of code:

  • val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("/user/tom/baby_names.csv")
  • df.registerTempTable("names")

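Before querying, it does no harm to verify that the header and the inferred schema were picked up; printSchema and show are standard DataFrame methods (the column names displayed depend on the file's header):

  • df.printSchema()
  • df.show(5)
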
In this situation the file (/user/tom/baby_names.csv) can be queried as a table (names). One may then apply SQL to it:

  • val Som = sqlContext.sql("select sum(CAST(Count AS INT)) from names")

Som is a DataFrame. One may display its content with Som.collect.foreach(println).
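
Since the query yields a single aggregate row, the value can also be extracted directly; a minimal sketch, assuming a non-empty result (a sum over integer values comes back as a bigint, i.e. a Long):

  • val total = Som.first().getLong(0)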

By tom