When building databases we should be careful to reduce the amount of redundant data – in an ideal world, the tables of our database should not contain any repeated data, ensuring that data points are not repeated in our database.
This is called data normalisation. It is the process of correctly disaggregating data sets, such that each data point is atomic – represented only a single time and efficiently represented by the relational model.
The principal benefits of this update your database will be logically arranged such that the data has no repetition or redundancy. This in turn, allows it to be efficiently stored and queried in most modern database systems.
The objectives of normalisation beyond 1NF (first normal form) were stated as follows by Codd:
1NF (first normal form) – this requires that our data is indexed with a unique primary key, that there are no repeating unit groups, and that our data is atomic.
Consider this table:
It would make sense for the attributes to be taken out and given their own table. Also, the factory could come up again and again, so it would be better to create a factory table and link it back with a foreign key.
Also – each gene could have some or all of the attributes, so to prevent further redundancy we can create a new table linking the genes and the attributes:
This now brings us into 1NF – data is indexed with a unique primary key, there are no repeating unit groups, and that our data is atomic.
2NF (second normal form) requires that we remove any columns which may predict each other. Consider another problem, with this new table:
You can see that two columns – EC Number and EC Name – seem to depend on each other. We can predict the content of EC Name from EC Number, and vice versa.
We should be able to extract these columns into a new table and use a foreign key to link them, as follows:
This ensures our dataset is now in 3NF (third normal form).
We end up with a series of tables that remove any data redundancy:
Here is how we can set these tables up referencing each other as explained above:
CREATE DATABASE gene_test; CREATE TABLE ec(pk SERIAL PRIMARY KEY, EC_name VARCHAR(256)); CREATE TABLE sequencing(pk SERIAL PRIMARY KEY, sequencing_factory VARCHAR(256), factory_location VARCHAR(256)); CREATE TABLE genes (pk SERIAL PRIMARY KEY, gene_id VARCHAR(256) NOT NULL, entity VARCHAR(256), source VARCHAR(256), start INT, stop INT, sequencing_pk INT, ec_pk INT, FOREIGN KEY (sequencing_pk) REFERENCES sequencing(pk), FOREIGN KEY (ec_pk) REFERENCES ec(pk)); CREATE TABLE products (genes_pk INT, type VARCHAR(256), product VARCHAR(256), FOREIGN KEY (genes_pk) REFERENCES genes(pk)); CREATE TABLE attributes (pk SERIAL PRIMARY KEY, key VARCHAR(256), value VARCHAR(256)); CREATE TABLE gene_attribute_link(genes_pk INT, attributes_pk INT, FOREIGN KEY (genes_pk) REFERENCES genes(pk), FOREIGN KEY (attributes_pk) REFERENCES attributes(pk));