In this note, I show yet another PySpark example with a slightly different way to filter. The idea is that a file is read into an RDD and then cleaned. The cleaning removes lines that are too short, splits each remaining line on the character found at its twentieth position, and filters out lines that do not yield exactly 14 fields. After that, a subset of the columns is kept. Finally, the values are joined into comma-separated lines and written to disk.
We have:
devstatus = sc.textFile("/loudacre/devicestatus.txt")

# Filter out lines shorter than 20 characters, use the 20th character as the
# delimiter, parse the line, and filter out lines that do not split into 14 fields
cleanstatus = devstatus. \
    filter(lambda line: len(line) > 20). \
    map(lambda line: line.split(line[19:20])). \
    filter(lambda values: len(values) == 14)

# Create a new RDD containing date, manufacturer, device ID, latitude and longitude
devicedata = cleanstatus. \
    map(lambda words: (words[0], words[1].split(' ')[0], words[2], words[12], words[13]))

# Save to a CSV-style text file; joining the tuple avoids the parentheses
# that the tuple's string representation would otherwise include
devicedata. \
    map(lambda values: ','.join(values)). \
    saveAsTextFile("/loudacre/devicestatus_etl")
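The per-line cleaning logic can be sketched as a plain Python function, runnable without a Spark cluster. The sample line below is hypothetical and only illustrates the assumed record layout (a 19-character timestamp followed by the delimiter, 14 fields in total, with latitude and longitude in the last two positions):

```python
def parse_line(line):
    """Return (date, manufacturer, device ID, latitude, longitude), or None for bad lines."""
    if len(line) <= 20:          # too short to contain the delimiter position
        return None
    delimiter = line[19]         # the 20th character is the field delimiter
    values = line.split(delimiter)
    if len(values) != 14:        # malformed record
        return None
    # manufacturer is the first word of the model field (e.g. "Sorrento F41L" -> "Sorrento")
    return (values[0], values[1].split(' ')[0], values[2], values[12], values[13])

# Hypothetical sample record; the real devicestatus.txt fields may differ.
sample = ("2014-03-15:10:10:20,Sorrento F41L,"
          "8cc3b47e-bd01-4482-b500-28f2342679af,"
          "7,24,39,enabled,disabled,connected,55,67,12,"
          "33.6894754264,-117.543308253")

print(parse_line(sample))
```

Because the timestamp is 19 characters long, the character at index 19 is the comma, so the same logic works unchanged for records that use a different single-character delimiter in that position.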