Avro files are binary files that contain both the data and a description of that data. That makes Avro a very interesting file format: one may send such a file to any application that is able to read Avro files. As an example: one may write the file in (say) PHP and read it in (say) Java. In previous posts I showed how such a file can be written and read by PHP. See a post here.
In this note I show how one may use a jar file to create and to read an Avro file. The jar file is avro-tools-1.8.1.jar. It enables us to create an Avro file from a schema definition and a JSON file. The schema file looks like:
{
  "type": "record",
  "name": "twitter_schema",
  "namespace": "com.miguno.avro",
  "fields": [
    { "name": "username",  "type": "string", "doc": "Name of the user account on Twitter.com" },
    { "name": "tweet",     "type": "string", "doc": "The content of the user's Twitter message" },
    { "name": "timestamp", "type": "long",   "doc": "Unix epoch time in seconds" }
  ],
  "doc": "A basic schema for storing Twitter messages"
}
whereas the JSON data file looks like:
{"username":"miguno","tweet":"Rock: Nerf paper, scissors is fine.","timestamp":1366150681}
{"username":"BlizzardCS","tweet":"Works as intended. Terran is IMBA.","timestamp":1366154481}
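Before converting, one can sanity-check that each JSON record carries exactly the fields the schema declares. A minimal sketch using only the standard library (the schema and records are inlined here to keep it self-contained; the type mapping covers just the two primitive types this schema uses):

```python
import json

# A trimmed-down copy of the schema above (docs omitted for brevity).
schema = json.loads("""
{"type": "record", "name": "twitter_schema",
 "fields": [{"name": "username",  "type": "string"},
            {"name": "tweet",     "type": "string"},
            {"name": "timestamp", "type": "long"}]}
""")

# The two records from the JSON data file.
records = [
    {"username": "miguno", "tweet": "Rock: Nerf paper, scissors is fine.",
     "timestamp": 1366150681},
    {"username": "BlizzardCS", "tweet": "Works as intended. Terran is IMBA.",
     "timestamp": 1366154481},
]

# Which Python type a JSON value must have for each Avro primitive type.
python_type = {"string": str, "long": int}

def matches_schema(record, schema):
    fields = schema["fields"]
    # Same field names, and each value of the declared type.
    if set(record) != {f["name"] for f in fields}:
        return False
    return all(isinstance(record[f["name"]], python_type[f["type"]])
               for f in fields)

for r in records:
    assert matches_schema(r, schema)
```

If a record has a missing or extra field, or a string where a long is expected, avro-tools fromjson will refuse it, so a quick check like this saves a round trip.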
These can then be combined into an Avro file with:
java -jar "C:/Program Files/Java/avro-tools-1.8.1.jar" fromjson --schema-file D:\Users\tmaanen\CloudStation\java\avro2\user.avsc D:\Users\tmaanen\CloudStation\java\avro2\user.json > D:\Users\tmaanen\CloudStation\java\avro2\user.avro
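To give a feel for what ends up inside the binary file: per the Avro specification, a long is written as a zigzag-encoded variable-length integer, and a string as a long length followed by the UTF-8 bytes. A record is just its fields in schema order, with no per-field names or tags. The sketch below implements that primitive encoding (it covers the field values only, not the container-file framing that avro-tools adds around them):

```python
def encode_long(n: int) -> bytes:
    """Avro long: zigzag, then little-endian base-128 varint."""
    z = (n << 1) ^ (n >> 63)          # zigzag: small magnitudes -> few bytes
    out = bytearray()
    while True:
        b = z & 0x7F
        z >>= 7
        if z:
            out.append(b | 0x80)       # set continuation bit
        else:
            out.append(b)
            return bytes(out)

def decode_long(data: bytes) -> int:
    z, shift = 0, 0
    for b in data:
        z |= (b & 0x7F) << shift
        shift += 7
        if not b & 0x80:
            break
    return (z >> 1) ^ -(z & 1)         # undo zigzag

def encode_string(s: str) -> bytes:
    raw = s.encode("utf-8")
    return encode_long(len(raw)) + raw

# One record from the data file: fields concatenated in schema order.
record = (encode_string("miguno")
          + encode_string("Rock: Nerf paper, scissors is fine.")
          + encode_long(1366150681))
```

This compactness (no field names repeated per record) is one reason Avro files are so much smaller than the equivalent JSON.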
We now have an Avro file, which is binary. It can be translated back to a JSON file with:
java -jar "C:/Program Files/Java/avro-tools-1.8.1.jar" tojson D:\Users\tmaanen\CloudStation\java\avro2\user.avro > D:\Users\tmaanen\CloudStation\java\avro2\user2.json
Likewise, the schema can be derived with:
java -jar "C:/Program Files/Java/avro-tools-1.8.1.jar" getschema D:\Users\tmaanen\CloudStation\java\avro2\part-m-00000.avro > D:\Users\tmaanen\CloudStation\java\avro2\user2.avsc
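The getschema command works because an Avro object container file stores the writer's schema in its header, in a metadata map under the key "avro.schema". The sketch below builds a minimal header and reads the schema back out of it; it is an illustration of the header layout (magic bytes, metadata map, 16-byte sync marker), not a full container-file reader:

```python
import io
import json
import os

def encode_long(n: int) -> bytes:
    """Avro long: zigzag, then little-endian base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        b = z & 0x7F
        z >>= 7
        if z:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def decode_long(f) -> int:
    z, shift = 0, 0
    while True:
        b = f.read(1)[0]
        z |= (b & 0x7F) << shift
        shift += 7
        if not b & 0x80:
            break
    return (z >> 1) ^ -(z & 1)

def read_sized(f) -> bytes:
    """A length-prefixed value (used for both map keys and values)."""
    return f.read(decode_long(f))

def write_header(schema: dict) -> bytes:
    meta = {"avro.schema": json.dumps(schema).encode(), "avro.codec": b"null"}
    out = b"Obj\x01" + encode_long(len(meta))       # magic + map block count
    for k, v in meta.items():
        out += encode_long(len(k)) + k.encode() + encode_long(len(v)) + v
    return out + encode_long(0) + os.urandom(16)     # map end + sync marker

def get_schema(header: bytes) -> dict:
    f = io.BytesIO(header)
    assert f.read(4) == b"Obj\x01", "not an Avro container file"
    meta = {}
    while (count := decode_long(f)) != 0:            # map comes in blocks
        for _ in range(count):
            key = read_sized(f).decode()
            meta[key] = read_sized(f)
    return json.loads(meta["avro.schema"])

schema = {"type": "record", "name": "twitter_schema",
          "fields": [{"name": "username", "type": "string"}]}
assert get_schema(write_header(schema)) == schema
```

Because the schema travels inside the file itself, a reader never needs the original .avsc file — which is exactly what makes the cross-language exchange described at the top possible.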
For me, this utility is very handy to investigate the result of a sqoop command. Roughly stated, such a sqoop command imports the contents of a database table into an HDFS platform. Such a command may look like:
sqoop import \
  --connect "jdbc:oracle:thin:@(description=(address=(protocol=tcp)(host=192.168.2.2)(port=1521))(connect_data=(service_name=orcl)))" \
  --username scott --password binvegni \
  --table fam \
  --columns "NUMMER, NAAM" \
  --m 1 \
  --target-dir /loudacre/fam_avro \
  --null-non-string '\\N' \
  --as-avrodatafile
The output of such a command is an Avro file, typically named part-m-00000.avro. The question is: how do I know that this file contains the correct data? I can copy the Avro file to Windows and translate it with:
java -jar "C:/Program Files/Java/avro-tools-1.8.1.jar" tojson D:\Users\tmaanen\CloudStation\java\avro2\part-m-00000.avro > D:\Users\tmaanen\CloudStation\java\avro2\part-m-00000.json
This gives me the confirmation that the Avro file contains the correct data.