Below is some Python code that writes an Avro file. An Avro file consists of a schema and a set of records; the records are written in binary format. The schema is as follows:
{
  "type": "record",
  "name": "StringPair",
  "doc": "A pair of strings.",
  "fields": [
    {"name": "left", "type": "string"},
    {"name": "right", "type": "string"}
  ]
}
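To make "binary format" concrete, here is a minimal sketch of how the body of one StringPair record is laid out under Avro's binary encoding: each string is written as its UTF-8 byte length (a zigzag, variable-length long) followed by the bytes themselves, and a record is simply its fields in schema order. The function names are hypothetical and not part of the avro package, and the sketch covers a single record only, not the container file's header, embedded schema, or sync markers.

```python
def encode_long(n):
    """Encode an integer as an Avro long: zigzag first, then 7 bits per byte."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes become small unsigned values
    out = bytearray()
    while z & ~0x7F:
        # More than 7 bits left: emit the low 7 bits with the continuation bit set.
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s):
    """Encode a string as its UTF-8 byte length (a long) followed by the bytes."""
    data = s.encode('utf-8')
    return encode_long(len(data)) + data


def encode_string_pair(record):
    """Encode a StringPair record: its fields in schema order, nothing in between."""
    return encode_string(record['left']) + encode_string(record['right'])


if __name__ == '__main__':
    print(encode_string_pair({'left': 'a', 'right': 'b'}))
```

For example, encode_string('foo') gives the bytes 06 66 6f 6f: the length 3 is zigzag-encoded as 6, followed by the three characters.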
The code to write such a file is as follows:
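import sys

from avro import datafile, io, schema

if __name__ == '__main__':
    if len(sys.argv) != 2:
        sys.exit('Usage: %s <avro_file>' % sys.argv[0])
    avro_file = sys.argv[1]

    # Read and parse the schema. avro-python3 names this function schema.Parse;
    # newer releases of the avro package name it schema.parse.
    with open(r'C:\Users\tmaanen\.spyder-py3\tom.avsc', 'r') as schema_file:
        schema_object = schema.Parse(schema_file.read())

    # The container file is binary, so the output must be opened in 'wb' mode.
    writer = open(avro_file, 'wb')
    datum_writer = io.DatumWriter()
    dfw = datafile.DataFileWriter(writer, datum_writer, schema_object)

    # Each input line is expected to look like "left,right".
    for line in sys.stdin:
        # Strip the line ending so it does not end up in the 'right' field.
        left, right = line.rstrip('\r\n').split(',')
        dfw.append({'left': left, 'right': right})

    # Closing the DataFileWriter flushes the last block and closes the file.
    dfw.close()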
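The script reads comma-separated pairs from standard input, one pair per line, and takes the output file as its only argument. It can be run on the command line as:
C:\ProgramData\Anaconda3\python.exe C:\Users\tmaanen\.spyder-py3\TomHdfs.py C:\Users\tmaanen\.spyder-py3\a.avro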
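The loop in the script assumes that every line on standard input has the form left,right. As a hedged sketch, the parsing step could be pulled into a small helper that drops the line ending and rejects lines without a comma. The name parse_pair is hypothetical, and splitting on the first comma only (so that the right-hand value may itself contain commas) is an assumption about the input, not something the original script does.

```python
def parse_pair(line):
    """Turn one 'left,right' input line into a StringPair-shaped dict."""
    # Drop the line terminator so it does not end up in the 'right' field.
    text = line.rstrip('\r\n')
    # Assumption: split on the first comma only; later commas stay in 'right'.
    left, sep, right = text.partition(',')
    if not sep:
        raise ValueError('expected "left,right", got %r' % text)
    return {'left': left, 'right': right}


if __name__ == '__main__':
    for sample in ['a,1\n', 'b,2,3\r\n']:
        print(parse_pair(sample))
```

With this helper, the body of the loop would reduce to dfw.append(parse_pair(line)).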