Data modelling in Hadoop

Before data modelling in Hadoop can be discussed, one needs to realise that Hadoop is about files. It is not about tables and relations: files are central. This implies that we have other means at our disposal than we have in a traditional RDBMS.
As Hadoop is about files, we have to realise that data that are stored in two different files are difficult to combine. One has to read the two files, join them on some common key and subsequently deliver the end result.
Therefore, a first lesson can be drawn. The concept of full normalisation doesn't work with Hadoop, as we have no efficient means to combine data from different files. It is therefore good to denormalise the data. This creates a situation whereby data are stored redundantly, each time in a different context. In the case of an order, the customer information is stored with the order. If the same customer opens another order, the customer information is repeated with the new order.
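The effect of denormalisation can be sketched as follows. This is a minimal illustration in Python; the customer and order fields are hypothetical, not a prescribed Hadoop format.

```python
import json

# Hypothetical customer and order records (illustrative fields only).
customer = {"customer_id": 42, "name": "Jansen", "city": "Utrecht"}
orders = [
    {"order_id": 1001, "amount": 250.0},
    {"order_id": 1002, "amount": 75.5},
]

# Denormalised: the customer fields are repeated inside every order record,
# so each line of the file is self-contained and no join is needed at read time.
lines = [json.dumps({**order, **customer}) for order in orders]
for line in lines:
    print(line)
```

Note that the customer's name and city appear in every order line; that redundancy is the price paid for being able to read a single file.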
A second lesson is that it is beneficial to create small files instead of large ones. If one needs to join information from two different files, it is better to read small files than large ones. This can be implemented by splitting the data into files per day. Instead of taking one large file with data from all days, one only takes the file that pertains to the date for which the join must be made.
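Splitting by day can be sketched like this. The `dt=YYYY-MM-DD` directory names below are an assumed convention for the example, written to a temporary directory rather than to HDFS.

```python
import os
import tempfile

# Sketch: one small file per day instead of one large file for all days.
# The layout and names are illustrative assumptions, not a Hadoop requirement.
base = tempfile.mkdtemp()

for day in ["2015-03-01", "2015-03-02", "2015-03-03"]:
    path = os.path.join(base, "orders", f"dt={day}")
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "orders.csv"), "w") as f:
        f.write("order_id,amount\n")

# To join data for one day, open only that day's small file
# instead of scanning a file that covers all days.
target = os.path.join(base, "orders", "dt=2015-03-02", "orders.csv")
print(os.path.exists(target))
```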
A third lesson is that the directory structure must be clear. The same holds for filenames. A clear directory structure and clear filenames allow you to retrieve the data quickly.
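As a sketch of why clear names pay off: with a predictable, self-describing layout, a day's data can be located with a simple pattern instead of scanning everything. The `sales/<year>/<month>/<day>` layout below is an assumed example.

```python
import glob
import os
import tempfile

base = tempfile.mkdtemp()

# Assumed self-describing layout: <dataset>/<year>/<month>/<day>/<file>.
for day in ("01", "02"):
    d = os.path.join(base, "sales", "2015", "03", day)
    os.makedirs(d, exist_ok=True)
    open(os.path.join(d, f"sales_2015-03-{day}.csv"), "w").close()

# Because the structure is predictable, one pattern finds the right file.
matches = glob.glob(os.path.join(base, "sales", "2015", "03", "02", "*.csv"))
print(matches)
```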

It is good to realise that this data modelling is necessary to retrieve the data. In a traditional RDBMS, we need a data model to ingest the data. With Hadoop, data can be ingested without a model; just create a set of files. It is possible. However, the price to be paid is that it becomes difficult to retrieve the information. If one simply creates a set of unrelated files, the modelling must be done by the end user at the moment he starts using the data.
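This "model at read time" idea can be sketched as follows: the data is ingested as plain text with no declared structure, and the reader supplies the model. The field names are illustrative assumptions.

```python
import csv
import io

# Ingested as-is: raw text, no schema declared at write time.
raw = "42,Jansen,250.0\n43,Peters,75.5\n"

# The end user imposes the "model" only at the moment of reading.
# Column names and types here are the reader's assumption, not the file's.
reader = csv.reader(io.StringIO(raw))
records = [
    {"customer_id": int(cid), "name": name, "amount": float(amt)}
    for cid, name, amt in reader
]
print(records[0]["name"])
```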