In this note, I show yet another PySpark example with a slightly different way to filter. The idea is that a file is read into an RDD and then cleaned. The cleaning removes lines that are too short, splits each remaining line on the character found at its twentieth position, and filters out lines that do not yield exactly 14 fields. After that, a subset of the columns is kept. Finally, the values are joined into comma-separated lines and written to disk.
We have:
devstatus = sc.textFile("/loudacre/devicestatus.txt")

# Filter out lines shorter than 20 characters, use the 20th character as the
# delimiter, parse the line, and filter out lines that do not split into 14 fields
cleanstatus = devstatus. \
    filter(lambda line: len(line) > 20). \
    map(lambda line: line.split(line[19:20])). \
    filter(lambda values: len(values) == 14)

# Create a new RDD containing date, manufacturer, device ID, latitude and longitude
devicedata = cleanstatus. \
    map(lambda words: (words[0], words[1].split(' ')[0], words[2], words[12], words[13]))

# Save to a CSV-style text file; joining the tuple avoids the parentheses
# that the tuple's string representation would otherwise include
devicedata. \
    map(lambda values: ','.join(values)). \
    saveAsTextFile("/loudacre/devicestatus_etl")
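The per-line cleaning logic can be sketched as a plain Python function, runnable without a Spark cluster. The sample line below is hypothetical and only illustrates the assumed record layout (a 19-character timestamp followed by the delimiter, 14 fields in total, with latitude and longitude in the last two positions):

```python
def parse_line(line):
    """Return (date, manufacturer, device ID, latitude, longitude), or None for bad lines."""
    if len(line) <= 20:          # too short to contain the delimiter position
        return None
    delimiter = line[19]         # the 20th character is the field delimiter
    values = line.split(delimiter)
    if len(values) != 14:        # malformed record
        return None
    # manufacturer is the first word of the model field (e.g. "Sorrento F41L" -> "Sorrento")
    return (values[0], values[1].split(' ')[0], values[2], values[12], values[13])

# Hypothetical sample record; the real devicestatus.txt fields may differ.
sample = ("2014-03-15:10:10:20,Sorrento F41L,"
          "8cc3b47e-bd01-4482-b500-28f2342679af,"
          "7,24,39,enabled,disabled,connected,55,67,12,"
          "33.6894754264,-117.543308253")

print(parse_line(sample))
```

Because the timestamp is 19 characters long, the character at index 19 is the comma, so the same logic works unchanged for records that use a different single-character delimiter in that position.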