23. Adding to the database by writing a script
Using Django Admin (part 22) is good if we want to survey the state of the database, check out what our users are doing, correct small errors etc.
It’s often not the best way to add data en masse to our database. Records have to be created individually page by page, as you’ve seen. And if we needed to add a very large data set to the database for users interact with, we’d be there for a long time.
So it often pays to be able to write scripts. And in this case, ones that will use the Django classes and our model class to interact with and build up our database. Or we could just write a script that directly inserts database data into our Postgres system.
We’re going to populate the database with the data that we’ve already seen from the normalized database lessons in the previous examples.
We’ve got a sample csv file with data to add in, called exampledatatoload.csv. This is in our bioweb directory.
gene_id,entity,start,stop,sequencing_factory,factory_locaton,sense,start_codon,enzyme_class,gene_type,gene_product,attributes Gene1,Chromosome,190,255,Sanger,UK,+,M,oxidoreductase,gene:mrna;cds:protein,ID:b001;Name:thrL;biotype:protein_coding;description:Fructokinase Gene2,Chromosome,375,566,Sanger,UK,-,M,transferase,gene:mrna;cds:protein,ID:b002;Name:fasT;biotype:protein_coding Gene3,Chromosome,780,980,Sanger,UK,-,M,oxidoreductase,gene:mrna;cds:protein,ID:b003;Name:speK;biotype:protein_coding Gene4,Plasmid,558,696,Sanger,UK,U,M,transferase,gene:mrna;cds:protein,biotype:protein_coding Gene5,Plasmid,786,888,Sanger,UK,U,M,transferase,gene:mrna;cds:protein,biotype:protein_coding
The data is separated by ; and some of the individual data items are pairs separated by : so we will need to incorporate this into our code.
In bioweb/bioweb we can create a new folder called scripts and then create a new file called populate_bioweb.py. This creates a blank .py file that we need to add to.
The first thing we need to do is import various Python packages that we’re going to need to let the Django system initialize, parse the file and insert the data.
import os import sys import django import csv from collections import defaultdict
And then we need to initialise the Django system.
sys.path.append('/Users/paulg.DESKTOP-AU8JD0H/bioweb/bioweb') os.environ.setdefault('DJANGO_SETTINGS_MODULE','bioweb.settings') django.setup()
(don’t use C:\ etc, it won’t work) Once we’ve saved this we need to check it’s all working, so go to CMD and navigate to that directory
(env) $ python scripts/populate_bioweb.py
If it works, no error will be sent to the screen.
Now back in our .py file we need to import the various models that we wrote that we want to insert data into.
from genedata.models import *
And create a string which points at our data file:
data_file = '/Users/paulg.DESKTOP-AU8JD0H/bioweb/exampledatatoload.csv';
and then set up various empty data structures:
genes = defaultdict(list) sequencing = set() ec = set() products = defaultdict(dict) attributes = defaultdict(dict)
We need to open the file and pass it to the standard CSV reader:
with open(data_file) as csv_file: csv_reader = csv.reader(csv_file, delimiter=',')
The first line in the CSV was just the headings and we don’t want that so we can skip that line like this:
header = csv_reader.__next__()
We can then iterate over all the remaining rows:
for row in csv_reader: product_pairs = row[9].split(';') attribute_pairs = row[10].split(';') for pair in product_pairs: tupple = pair.split(":") products[row[0]][tupple[0]] = tupple[1] for pair in attribute_pairs: tupple = pair.split(":") attributes[row[0]][tupple[0]] = tupple[1] ec.add(row[8]) sequencing.add((row[4], row[5])) genes[row[0]] = row[1:4]+row[6:9]
I should point out this does seem like a complicated example which is splitting the data line into the info we need. Parts like ec.add(row[8]) are a simpler demonstration of what’s going on here.
For safety’s sake we can clear out any data already existing in the database just in case there was some left. For each table in turn we can delete all the objects in it:
GeneAttributeLink.objects.all().delete() Gene.objects.all().delete() EC.objects.all().delete() Sequencing.objects.all().delete() Attribute.objects.all().delete() Product.objects.all().delete()
Some tables have foreign key relationships, so we need to keep tracks of which tables have which foreign keys. We do that like this.
ec_rows = {} sequencing_rows = {} gene_rows = {}
Now we can loop over the entries in the ec table and the entries in the sequencing table and add them to the database.
for entry in ec: row = EC.objects.create(ec_name=entry) row.save() ec_rows[entry] = row
This creates a new row and then saves it to the database.
We can do the same thing for the sequencing centres:
for seq_centre in sequencing: row = Sequencing.objects.create(sequencing_factory=seq_centre[0],factory_location=seq_centre[1]) row.save() sequencing_rows[seq_centre[0]] = row
This part evaluates a foreign key relationship:
sequencing = sequencing_rows['Sanger'], ec=ec_rows[data[5]])
Here is the whole script:
import os import sys import django import csv from collections import defaultdict sys.path.append('/Users/paulg.DESKTOP-AU8JD0H/bioweb/bioweb') os.environ.setdefault('DJANGO_SETTINGS_MODULE','bioweb.settings') django.setup() from genedata.models import * data_file = '/Users/paulg.DESKTOP-AU8JD0H/bioweb/exampledatatoload.csv'; genes = defaultdict(list) sequencing = set() ec = set() products = defaultdict(dict) attributes = defaultdict(dict) with open(data_file) as csv_file: csv_reader = csv.reader(csv_file, delimiter=',') header = csv_reader.__next__() for row in csv_reader: product_pairs = row[9].split(';') attribute_pairs = row[10].split(';') for pair in product_pairs: tupple = pair.split(":") products[row[0]][tupple[0]] = tupple[1] for pair in attribute_pairs: tupple = pair.split(":") attributes[row[0]][tupple[0]] = tupple[1] ec.add(row[8]) sequencing.add((row[4], row[5])) genes[row[0]] = row[1:4]+row[6:9] # only do this to clear out any data that is in the tables that you don't already want! GeneAttributeLink.objects.all().delete() Gene.objects.all().delete() EC.objects.all().delete() Sequencing.objects.all().delete() Attribute.objects.all().delete() Product.objects.all().delete() ec_rows = {} sequencing_rows = {} gene_rows = {} for entry in ec: row = EC.objects.create(ec_name=entry) row.save() ec_rows[entry] = row for seq_centre in sequencing: row = Sequencing.objects.create(sequencing_factory=seq_centre[0],factory_location=seq_centre[1]) row.save() sequencing_rows[seq_centre[0]] = row for gene_id, data in genes.items(): row = Gene.objects.create(gene_id=gene_id, entity = data[0], start=data[1], stop=data[2], sense=data[3], start_codon=data[4], sequencing = sequencing_rows['Sanger'], ec=ec_rows[data[5]]) row.save() gene_rows[gene_id] = row for gene_id, data_dict in products.items(): for key in data_dict.keys(): row = Product.objects.create(type=key, product=data_dict[key], gene=gene_rows[gene_id]) row.save() for gene_id, data_dict in attributes.items(): for key in data_dict.keys(): row = Attribute.objects.create(key=key, value=data_dict[key]) row.gene.add(gene_rows[gene_id]) row.save()
Wednesday 3 November 2021, 552 views
Next post: 24. Using the ORM in views.py Previous post: 22. Adding to the database with Django Admin
Advanced Web Development index
- 38. Writing API tests
- 37. Testing in Django
- 36. Class-based views in the Django REST framework
- 35. Building a RESTful web service in Django
- 34. Introduction to CRUD, REST and APIs
- 33. Refactoring with generic views in Django
- 32. Django validators
- 31. Django forms (2) – using the ModelForm class
- 30. Django forms (1)
- 29. JavaScript basics
- 28. Adding CSS to the template
- 27. Django templating
- 26. Deleting and updating records
- 25. Joins, filters and chaining commands
- 24. Using the ORM in views.py
- 23. Adding to the database by writing a script
- 22. Adding to the database with Django Admin
- 21. Migrations
- 20. ORM – work through example
- 19. An introduction to the Object-Relational Mapper
- 18. Altering the database
- 17. SQL functions and summaries
- 16. SQL Query performance
- 15. Queries and table joins in SQL
- 14. Inserts and queries in SQL
- 13. Good practice in relational database design
- 12. Limitations to database modelling
- 11. Building a database using SQL
- 10. Introduction to PostgreSQL
- 9. How to start writing a new application in Django
- 8. Building a lightweight project
- 7. Django URLs
- 6. Django templates
- 5. Django models
- 4. Django views
- 3. Creating a new hello app
- 2. Creating a new virtual environment
- 1. Setting up Django
Leave a Reply