Below, I provide some Python code to write an Avro file. An Avro file consists of a schema and a set of records. The schema is stored as JSON, while the records are written in a compact binary format. The schema is as follows:
    {
      "type": "record",
      "name": "StringPair",
      "doc": "A pair of strings.",
      "fields": [
        {"name": "left", "type": "string"},
        {"name": "right", "type": "string"}
      ]
    }
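To make "binary format" a little more concrete, here is a minimal sketch of how a single StringPair record body is laid out according to the Avro specification: each string is its UTF-8 byte length (a zig-zag, variable-length encoded long) followed by the bytes, and a record is simply its fields concatenated in schema order. The function names are my own and are not part of the avro library. The sketch covers only one record body, not the container file around it (header, embedded schema, sync markers), which the library writes for you.

```python
def encode_long(n: int) -> bytes:
    """Encode a 64-bit integer the way Avro encodes a long:
    zig-zag first, then a base-128 varint with the low 7 bits first."""
    z = (n << 1) ^ (n >> 63)  # zig-zag: small magnitudes become small unsigned values
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string(s: str) -> bytes:
    """An Avro string is its UTF-8 length (as a long) followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_string_pair(left: str, right: str) -> bytes:
    """A record body is its fields, encoded in the order the schema lists them."""
    return encode_string(left) + encode_string(right)


if __name__ == "__main__":
    # Hex dump of the record {"left": "a", "right": "b"}
    print(encode_string_pair("a", "b").hex())  # prints 02610262
```

Note that no field names appear in the output. That is why a reader needs the schema to make sense of the bytes.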
The code to write such a file is as follows:
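    import sys

    from avro import schema
    from avro import io
    from avro import datafile

    if __name__ == '__main__':
        # Exactly one argument is expected: the path of the Avro file to write.
        if len(sys.argv) != 2:
            sys.exit('Usage: %s <data_file>' % sys.argv[0])
        avro_file = sys.argv[1]
        writer = open(avro_file, 'wb')
        datum_writer = io.DatumWriter()
        # Load the StringPair schema shown above from its .avsc file.
        # schema.Parse is the avro-python3 spelling; the newer "avro" package uses schema.parse.
        with open(r'C:\Users\tmaanen\.spyder-py3\tom.avsc', 'r') as schema_file:
            schema_object = schema.Parse(schema_file.read())
        dfw = datafile.DataFileWriter(writer, datum_writer, schema_object)
        # Each input line has the form "left,right"; drop the trailing newline before splitting.
        for line in sys.stdin:
            (left, right) = line.rstrip('\r\n').split(',')
            dfw.append({'left': left, 'right': right})
        # Closing the DataFileWriter flushes the last block and closes the output file.
        dfw.close()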
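One thing to watch in the loop over standard input: every line keeps its trailing newline, and a line with no comma (or more than one) cannot be unpacked into exactly two values. A minimal sketch of a helper that handles both cases is shown below. The name parse_pair is hypothetical, and rejecting malformed lines with a ValueError is my assumption about the desired behavior, not something the original script does.

```python
def parse_pair(line: str) -> dict:
    """Turn one 'left,right' input line into a StringPair record (a dict)."""
    # Remove the line ending so it does not end up inside the 'right' field.
    fields = line.rstrip("\r\n").split(",")
    if len(fields) != 2:
        raise ValueError("expected exactly one comma, got: %r" % line)
    left, right = fields
    return {"left": left, "right": right}
```

With such a helper, the body of the loop would reduce to dfw.append(parse_pair(line)).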
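The script takes the path of the output file as its only argument and reads the pairs from standard input, one comma-separated pair per line. It can be run on the command line as:

    C:\ProgramData\Anaconda3\python.exe C:\Users\tmaanen\.spyder-py3\TomHdfs.py C:\Users\tmaanen\.spyder-py3\a.avro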