Google Refine

I subscribe to several blogs. One of them is Dylan Jones's blog, which is maintained by the “Data Quality Pro community”. That blog suggested I investigate Google Refine, and it was certainly a worthwhile investment.

Google Refine is a tool, acquired by Google, for improving data quality. When used for this purpose, it is immensely powerful.

Google Refine can be downloaded here. It is a download of approximately 30 MB. It can be installed on Windows, Linux and Apple platforms. Installation is straightforward, although a Java Runtime Environment is required. It is also free: there are no costs and no registration. After launching, the application uses the web browser as its interface.

The application can read and write text-formatted files. If the data are stored in an Oracle database, the tables should first be extracted to external text files, after which they can be loaded into Google Refine. Loading data is easy: the interface has an “open” button that reads a file. Likewise, “export” writes the data to an external file. I really appreciate such a well-thought-out interface: trivial functions such as reading and writing should be easily accessible.
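The extraction step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `export_table_to_csv` and the table `claims` are hypothetical, and an in-memory SQLite database stands in for Oracle so that the example runs anywhere. With a real Oracle driver, the connection would be created differently, but the cursor calls used here are the generic Python database ones.

```python
import csv
import io
import sqlite3


def export_table_to_csv(connection, table_name, out_file):
    """Write all rows of table_name to out_file as CSV, header row first.

    table_name is inserted into the SQL text as-is, so it must come
    from a trusted source (hypothetical helper, for illustration only).
    """
    cursor = connection.cursor()
    cursor.execute(f"SELECT * FROM {table_name}")
    writer = csv.writer(out_file)
    # cursor.description holds one tuple per column; item 0 is the column name.
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor.fetchall())


# Assumption: SQLite stands in for the Oracle database mentioned in the text.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE claims (id INTEGER, cause TEXT)")
connection.executemany(
    "INSERT INTO claims VALUES (?, ?)",
    [(1, "Transport Accident"), (2, "Transport accident")],
)

# A StringIO buffer keeps the example self-contained; a real export would
# pass a file opened with open(path, "w", newline="").
buffer = io.StringIO()
export_table_to_csv(connection, "claims", buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

The resulting text file is the kind of input that can then be opened in Google Refine.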

The data can be analysed by positioning the mouse over the header of a column. A small drop-down list appears with “facet” and then “text facet”. Selecting this creates a frequency overview of the column, which gives a quick view of the domain of the column values. This functionality is also known as creating a pivot table. From this point on, we get access to functionality that is really awesome. The frequency overview can be examined with a cluster analysis that clearly displays values that closely resemble each other. We know that in many tables different spellings are used to express the same thing: “Transport Accident” can also be written as “Transport accident”. A frequency overview then contains two lines: one for “Transport Accident” and another for “Transport accident”. Google Refine detects such small differences automatically in the cluster analysis. The deviations can also be harmonised automatically, which spares us the tedious manual adjustment of values.
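To make the idea concrete, here is a minimal Python sketch of a frequency overview followed by a simple clustering step. It is not Google Refine's own code: the function names are hypothetical, and the `fingerprint` function is a simplified assumption in the spirit of grouping values by a normalised key (lower-cased, punctuation removed, words sorted). Refine's actual clustering methods may differ in detail.

```python
import string
from collections import Counter, defaultdict


def text_facet(values):
    """Frequency overview of a column: how often each distinct value occurs."""
    return Counter(values)


def fingerprint(value):
    """Normalised key: trim, lower-case, drop punctuation, sort unique words."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group spellings that share a fingerprint; keep groups with 2+ variants."""
    groups = defaultdict(Counter)
    for value in values:
        groups[fingerprint(value)][value] += 1
    return [counts for counts in groups.values() if len(counts) > 1]


def merge_clusters(values):
    """Replace every variant in a cluster by the cluster's most frequent spelling."""
    replacement = {}
    for counts in cluster(values):
        canonical = counts.most_common(1)[0][0]
        for variant in counts:
            replacement[variant] = canonical
    return [replacement.get(value, value) for value in values]


causes = [
    "Transport Accident",
    "Transport accident",
    "Transport Accident",
    "Fire",
    "Transport Accident",
]

facet = text_facet(causes)       # two separate lines for the two spellings
merged = merge_clusters(causes)  # both spellings collapsed into one
print(facet)
print(text_facet(merged))
```

In this sketch the most frequent spelling wins when a cluster is merged. That is a design choice made for the example; in Google Refine the user decides which value a cluster is merged into.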