23. Adding to the database by writing a script

Using Django Admin (part 22) is good if we want to survey the state of the database, check out what our users are doing, correct small errors etc.

It’s often not the best way to add data en masse to our database. Records have to be created individually page by page, as you’ve seen. And if we needed to add a very large data set to the database for users interact with, we’d be there for a long time.

So it often pays to be able to write scripts. And in this case, ones that will use the Django classes and our model class to interact with and build up our database. Or we could just write a script that directly inserts database data into our Postgres system.

We’re going to populate the database with the data that we’ve already seen from the normalized database lessons in the previous examples.

We’ve got a sample csv file with data to add in, called exampledatatoload.csv. This is in our bioweb directory.

gene_id,entity,start,stop,sequencing_factory,factory_locaton,sense,start_codon,enzyme_class,gene_type,gene_product,attributes
Gene1,Chromosome,190,255,Sanger,UK,+,M,oxidoreductase,gene:mrna;cds:protein,ID:b001;Name:thrL;biotype:protein_coding;description:Fructokinase
Gene2,Chromosome,375,566,Sanger,UK,-,M,transferase,gene:mrna;cds:protein,ID:b002;Name:fasT;biotype:protein_coding
Gene3,Chromosome,780,980,Sanger,UK,-,M,oxidoreductase,gene:mrna;cds:protein,ID:b003;Name:speK;biotype:protein_coding
Gene4,Plasmid,558,696,Sanger,UK,U,M,transferase,gene:mrna;cds:protein,biotype:protein_coding
Gene5,Plasmid,786,888,Sanger,UK,U,M,transferase,gene:mrna;cds:protein,biotype:protein_coding

The data is separated by ; and some of the individual data items are pairs separated by : so we will need to incorporate this into our code.

In bioweb/bioweb we can create a new folder called scripts and then create a new file called populate_bioweb.py. This creates a blank .py file that we need to add to.

The first thing we need to do is import various Python packages that we’re going to need to let the Django system initialize, parse the file and insert the data.

import os
import sys
import django
import csv
from collections import defaultdict

And then we need to initialise the Django system.

sys.path.append('/Users/paulg.DESKTOP-AU8JD0H/bioweb/bioweb')
os.environ.setdefault('DJANGO_SETTINGS_MODULE','bioweb.settings')
django.setup()

(don’t use C:\ etc, it won’t work) Once we’ve saved this we need to check it’s all working, so go to CMD and navigate to that directory

(env) $ python scripts/populate_bioweb.py

If it works, no error will be sent to the screen.

Now back in our .py file we need to import the various models that we wrote that we want to insert data into.

from genedata.models import *

And create a string which points at our data file:

data_file = '/Users/paulg.DESKTOP-AU8JD0H/bioweb/exampledatatoload.csv';

and then set up various empty data structures:

genes = defaultdict(list)
sequencing = set()
ec = set()
products = defaultdict(dict)
attributes = defaultdict(dict)

We need to open the file and pass it to the standard CSV reader:

with open(data_file) as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')

The first line in the CSV was just the headings and we don’t want that so we can skip that line like this:

    header = csv_reader.__next__()

We can then iterate over all the remaining rows:

    for row in csv_reader:
        product_pairs = row[9].split(';')
        attribute_pairs = row[10].split(';')
        for pair in product_pairs:
            tupple = pair.split(":")
            products[row[0]][tupple[0]] = tupple[1]
        for pair in attribute_pairs:
            tupple = pair.split(":")
            attributes[row[0]][tupple[0]] = tupple[1]
        ec.add(row[8])
        sequencing.add((row[4], row[5]))
        genes[row[0]] = row[1:4]+row[6:9]

I should point out this does seem like a complicated example which is splitting the data line into the info we need. Parts like ec.add(row[8]) are a simpler demonstration of what’s going on here.

For safety’s sake we can clear out any data already existing in the database just in case there was some left. For each table in turn we can delete all the objects in it:

GeneAttributeLink.objects.all().delete()
Gene.objects.all().delete()
EC.objects.all().delete()
Sequencing.objects.all().delete()
Attribute.objects.all().delete()
Product.objects.all().delete()

Some tables have foreign key relationships, so we need to keep tracks of which tables have which foreign keys. We do that like this.

ec_rows = {}
sequencing_rows = {}
gene_rows = {}

Now we can loop over the entries in the ec table and the entries in the sequencing table and add them to the database.

for entry in ec:
    row = EC.objects.create(ec_name=entry)
    row.save()
    ec_rows[entry] = row

This creates a new row and then saves it to the database.

We can do the same thing for the sequencing centres:

for seq_centre in sequencing:
    row = Sequencing.objects.create(sequencing_factory=seq_centre[0],factory_location=seq_centre[1])
    row.save()
    sequencing_rows[seq_centre[0]] = row

This part evaluates a foreign key relationship:

        sequencing = sequencing_rows['Sanger'],
        ec=ec_rows[data[5]])

Here is the whole script:

import os
import sys
import django
import csv
from collections import defaultdict

sys.path.append('/Users/paulg.DESKTOP-AU8JD0H/bioweb/bioweb')
os.environ.setdefault('DJANGO_SETTINGS_MODULE','bioweb.settings')
django.setup()

from genedata.models import *

data_file = '/Users/paulg.DESKTOP-AU8JD0H/bioweb/exampledatatoload.csv';

genes = defaultdict(list)
sequencing = set()
ec = set()
products = defaultdict(dict)
attributes = defaultdict(dict)

with open(data_file) as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    header = csv_reader.__next__()
    for row in csv_reader:
        product_pairs = row[9].split(';')
        attribute_pairs = row[10].split(';')
        for pair in product_pairs:
            tupple = pair.split(":")
            products[row[0]][tupple[0]] = tupple[1]
        for pair in attribute_pairs:
            tupple = pair.split(":")
            attributes[row[0]][tupple[0]] = tupple[1]
        ec.add(row[8])
        sequencing.add((row[4], row[5]))
        genes[row[0]] = row[1:4]+row[6:9]

# only do this to clear out any data that is in the tables that you don't already want!
GeneAttributeLink.objects.all().delete()
Gene.objects.all().delete()
EC.objects.all().delete()
Sequencing.objects.all().delete()
Attribute.objects.all().delete()
Product.objects.all().delete()

ec_rows = {}
sequencing_rows = {}
gene_rows = {}

for entry in ec:
    row = EC.objects.create(ec_name=entry)
    row.save()
    ec_rows[entry] = row

for seq_centre in sequencing:
    row = Sequencing.objects.create(sequencing_factory=seq_centre[0],factory_location=seq_centre[1])
    row.save()
    sequencing_rows[seq_centre[0]] = row

for gene_id, data in genes.items():
    row = Gene.objects.create(gene_id=gene_id, entity = data[0],
        start=data[1], stop=data[2], sense=data[3], start_codon=data[4],
        sequencing = sequencing_rows['Sanger'],
        ec=ec_rows[data[5]])
    row.save()
    gene_rows[gene_id] = row

for gene_id, data_dict in products.items():
    for key in data_dict.keys():
        row = Product.objects.create(type=key, product=data_dict[key],
            gene=gene_rows[gene_id])
        row.save()

for gene_id, data_dict in attributes.items():
    for key in data_dict.keys():
        row = Attribute.objects.create(key=key, value=data_dict[key])
        row.gene.add(gene_rows[gene_id])
        row.save()
Wednesday 3 November 2021, 19 views


Leave a Reply

Your email address will not be published. Required fields are marked *