Manipulating Avro

Avro files are binary files that contain both the data and the schema that describes the data. This makes Avro a very interesting file format: one may send such a file to any application that is able to read Avro files. As an example, one may write the file in (say) PHP and read it in (say) Java. In previous posts I showed how such a file can be written and read with PHP. See a post here.
In this note I show how one may use a jar file to create and read an Avro file. The jar file is avro-tools-1.8.1.jar. It enables us to create an Avro file from a schema definition and a JSON file. The schema file looks like:

{
  "type" : "record",
  "name" : "twitter_schema",
  "namespace" : "com.miguno.avro",
  "fields" : [ {
    "name" : "username",
    "type" : "string",
    "doc"  : "Name of the user account on Twitter.com"
  }, {
    "name" : "tweet",
    "type" : "string",
    "doc"  : "The content of the user's Twitter message"
  }, {
    "name" : "timestamp",
    "type" : "long",
    "doc"  : "Unix epoch time in seconds"
  } ],
  "doc" : "A basic schema for storing Twitter messages"
}

whereas the JSON data file looks like:

{"username":"miguno","tweet":"Rock: Nerf paper, scissors is fine.","timestamp":1366150681}
{"username":"BlizzardCS","tweet":"Works as intended.  Terran is IMBA.","timestamp":1366154481}
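
To make the relation between the schema and the data file concrete, here is a minimal sketch in Python. It is not part of avro-tools. It assumes one JSON record per line and checks that every record has the fields the schema declares, with a matching primitive type. The name check_records and the small type map are my own illustration; unions and nested types are simply skipped.

```python
import json

# Avro primitive type name -> Python type produced by json.loads.
# Only a few primitives are covered; this is an illustrative subset.
AVRO_TO_PYTHON = {"string": str, "int": int, "long": int, "boolean": bool}

SCHEMA = """
{
  "type": "record",
  "name": "twitter_schema",
  "namespace": "com.miguno.avro",
  "fields": [
    {"name": "username", "type": "string"},
    {"name": "tweet", "type": "string"},
    {"name": "timestamp", "type": "long"}
  ]
}
"""

DATA = """\
{"username":"miguno","tweet":"Rock: Nerf paper, scissors is fine.","timestamp":1366150681}
{"username":"BlizzardCS","tweet":"Works as intended.  Terran is IMBA.","timestamp":1366154481}
"""


def check_records(schema_text, data_text):
    """Return a list of problems; an empty list means every line fits the schema."""
    schema = json.loads(schema_text)
    problems = []
    for lineno, line in enumerate(data_text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        for field in schema["fields"]:
            name, avro_type = field["name"], field["type"]
            if name not in record:
                problems.append(f"line {lineno}: missing field {name!r}")
                continue
            # Unions (lists) and nested types (dicts) are not checked here.
            if not isinstance(avro_type, str) or avro_type not in AVRO_TO_PYTHON:
                continue
            expected = AVRO_TO_PYTHON[avro_type]
            value = record[name]
            # bool is a subclass of int in Python, so reject it explicitly.
            wrong_bool = isinstance(value, bool) and expected is not bool
            if not isinstance(value, expected) or wrong_bool:
                problems.append(f"line {lineno}: field {name!r} is not a {avro_type}")
    return problems


if __name__ == "__main__":
    print(check_records(SCHEMA, DATA) or "all records match the schema")
```

Such a check only catches obvious mismatches before the conversion is run; the real validation is still done by avro-tools itself.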

These can then be combined into an Avro file with:

java -jar "C:/Program Files/Java/avro-tools-1.8.1.jar" fromjson --schema-file D:\Users\tmaanen\CloudStation\java\avro2\user.avsc D:\Users\tmaanen\CloudStation\java\avro2\user.json > D:\Users\tmaanen\CloudStation\java\avro2\user.avro

We now have an Avro file, which is a binary file. This file can be translated back to a JSON file with:

java -jar "C:/Program Files/Java/avro-tools-1.8.1.jar" tojson D:\Users\tmaanen\CloudStation\java\avro2\user.avro > D:\Users\tmaanen\CloudStation\java\avro2\user2.json
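
A quick way to convince oneself that the round trip lost nothing is to compare the original user.json with the regenerated user2.json. The sketch below is one possible check in Python, under the assumption that both files hold one JSON record per line and that the records come back in the same order. The function names are hypothetical. Records are compared as parsed objects, so differences in whitespace or key order do not matter.

```python
import json


def load_json_lines(text):
    """Parse newline-delimited JSON into a list of Python objects."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def same_records(original_text, roundtrip_text):
    """True when both texts contain the same records in the same order."""
    return load_json_lines(original_text) == load_json_lines(roundtrip_text)


if __name__ == "__main__":
    # File names as used in this post; adjust the paths to your own setup.
    with open("user.json", encoding="utf-8") as f:
        original = f.read()
    with open("user2.json", encoding="utf-8") as f:
        roundtrip = f.read()
    print("identical" if same_records(original, roundtrip) else "different")
```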

Likewise, the schema can be derived with:

java -jar "C:/Program Files/Java/avro-tools-1.8.1.jar" getschema D:\Users\tmaanen\CloudStation\java\avro2\user.avro > D:\Users\tmaanen\CloudStation\java\avro2\user2.avsc

For me, this utility is very handy for inspecting the result of a Sqoop command. Roughly stated, such a Sqoop command imports the contents of a database table into HDFS. Such a command may look like:

sqoop import \
--connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=192.168.2.2)(port=1521))(connect_data=(service_name=orcl)))" \
--username scott --password binvegni \
--table fam \
--columns "NUMMER, NAAM" \
--m 1 \
--target-dir /loudacre/fam_avro \
--null-non-string '\\N' \
--as-avrodatafile

The output of such a command is an Avro file, which might be called part-m-00000.avro. The question is: how do I know that this file contains the correct data? I can copy the Avro file to Windows and translate it with:

java -jar "C:/Program Files/Java/avro-tools-1.8.1.jar" tojson D:\Users\tmaanen\CloudStation\java\avro2\part-m-00000.avro > D:\Users\tmaanen\CloudStation\java\avro2\part-m-00000.json

Inspecting the resulting JSON file confirms that the Avro file contains the expected data.
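
For larger tables, reading the JSON by eye becomes tedious. The sketch below is one possible way to summarise the tojson output in Python: it counts the records and shows the first few rows. It rests on an assumption: if the schema declares the columns as nullable unions (which I would expect Sqoop to do for nullable columns), the Avro JSON encoding writes a non-null value wrapped as {"type": value}, so the sketch unwraps such single-key objects. That shortcut would also unwrap a genuine map with exactly one key. The function names and the sample rows are made up for illustration; only the column names NUMMER and NAAM come from the command above.

```python
import json


def unwrap(value):
    """Turn {"string": "x"} into "x"; leave null and plain values untouched."""
    if isinstance(value, dict) and len(value) == 1:
        return next(iter(value.values()))
    return value


def summarise(json_lines_text, head=3):
    """Return (record count, first `head` rows with union wrappers removed)."""
    rows = [
        {name: unwrap(value) for name, value in json.loads(line).items()}
        for line in json_lines_text.splitlines()
        if line.strip()
    ]
    return len(rows), rows[:head]


if __name__ == "__main__":
    # Hypothetical sample; in practice read part-m-00000.json from disk.
    sample = (
        '{"NUMMER":{"long":1},"NAAM":{"string":"Jan"}}\n'
        '{"NUMMER":{"long":2},"NAAM":null}\n'
    )
    count, first_rows = summarise(sample)
    print(count, first_rows)
```

The record count can then be compared with a count on the source table.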

By tom