The 1000th wordcount example

I just discovered the 1000th wordcount example. It is based on PySpark. The idea is quite simple: one writes a script in any editor and then runs it from the terminal with spark-submit [programme]. As an example, the programme below may be started with: spark-submit --master yarn-cluster /user/hdfs/fam /user/hdfs/prog1A. The option --master yarn-cluster indicates how the programme is run; in this case the cluster is used. It can also be run locally, in which case --master local is used.

import sys
from pyspark import SparkContext

if __name__ == "__main__":
    # Expect an input file and an output directory as arguments.
    if len(sys.argv) < 3:
        print("Usage: WordCount <input file> <output dir>", file=sys.stderr)
        sys.exit(-1)
    sc = SparkContext()
    counts = sc.textFile(sys.argv[1]) \
        .flatMap(lambda line: line.split(',')) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda v1, v2: v1 + v2)
    counts.saveAsTextFile(sys.argv[2])
    sc.stop()
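The flatMap/map/reduceByKey pipeline can be sketched in plain Python, which may help clarify what each step produces. The sample lines below are made up for illustration; they stand in for the lines Spark would read from the input file.

```python
# Made-up sample input standing in for the lines of the text file.
lines = ["a,b,a", "b,c"]

# flatMap: split each line on commas and flatten into one list of words.
words = [word for line in lines for word in line.split(',')]

# map: pair each word with the count 1.
pairs = [(word, 1) for word in words]

# reduceByKey: sum the counts per distinct word.
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n

print(sorted(counts.items()))  # [('a', 2), ('b', 2), ('c', 1)]
```

The difference is that Spark performs each of these steps in parallel across the cluster, shuffling pairs with the same key to the same worker before summing.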