PySpark is the Python API for Spark. It makes for a wonderful combination: Spark, which gets around limitations imposed by the MapReduce framework, and Python, which is relatively simple to use.
The scheme below shows some of the steps that might be used.
flatMap creates multiple lines from one line.
map processes one line at a time. Here, two fields are created from each word: the original word and the length of that word.
filter keeps only the lines that satisfy a condition.
groupByKey aggregates the lines by the first field, which acts as the key.
map then translates the aggregate into something that is human-readable.
collect returns the results to the driver program, where they can be displayed.