the red penguin
HOME ABOUT SITEMAP BLOG LOGIN

16. SQL Query performance

We’ve now looked at how to design a normalised database, some simple queries and some types of table joining operation that you can perform.

What’s probably not obvious from our example database and its small amount of data is that querying a database can be a slow operation.

You can conceptually imagine that the database system builds the Cartesian product of any tables it’s joining, and then filters them to produce only the rows we’re requesting. If the database grows to the point of having multiple tables with many millions of rows of data, then building the product to two tables and then filtering it can require large amounts of memory or CPU time.

If we were to add third or more adjoint tables to our queries, we quickly get into exponential increases in size for our Cartesian products. Very large table joins are slow operations and this can be an issue. Thankfully, though database systems have many optimisations to ameliorate this.

Consider selecting everything from the genes table:

example=# SELECT * FROM genes;

 pk | gene_id |   entity   | source | start | stop | sequencing_pk | ec_pk
----+---------+------------+--------+-------+------+---------------+-------
  1 | Gene1   | Chromosome | ena    |   190 |  255 |             1 |     1
  2 | Gene2   | Chromosome | ena    |   375 |  566 |             1 |     2
  3 | Gene3   | Chromosome | ena    |   780 |  980 |             1 |     1
  4 | Gene4   | Plasmid    | ena    |  1001 | 1111 |             1 |     2
  5 | Gene5   | Plasmid    | ena    |   786 |  888 |             1 |     2
(5 rows)

Imagine we are filtering for the entity data, like this:

example=# SELECT * FROM genes where entity='Chromosome';

 pk | gene_id |   entity   | source | start | stop | sequencing_pk | ec_pk
----+---------+------------+--------+-------+------+---------------+-------
  1 | Gene1   | Chromosome | ena    |   190 |  255 |             1 |     1
  2 | Gene2   | Chromosome | ena    |   375 |  566 |             1 |     2
  3 | Gene3   | Chromosome | ena    |   780 |  980 |             1 |     1
(3 rows)

If we’re doing this often we can consider adding a table index. The simplest optimisation is adding column indices. So this is how we would create an index on the entity field in the genes table:

CREATE INDEX entity_index ON genes (entity);

We can consider optimising simple joins too. So we know if we join two tables we get the cartesian product of those tables, like this:

example=# SELECT * FROM genes, ec;
 pk | gene_id |   entity   | source | start | stop | sequencing_pk | ec_pk | pk |     ec_name
----+---------+------------+--------+-------+------+---------------+-------+----+-----------------
  1 | Gene1   | Chromosome | ena    |   190 |  255 |             1 |     1 |  1 | oxidioreductase
  2 | Gene2   | Chromosome | ena    |   375 |  566 |             1 |     2 |  1 | oxidioreductase
  3 | Gene3   | Chromosome | ena    |   780 |  980 |             1 |     1 |  1 | oxidioreductase
  4 | Gene4   | Plasmid    | ena    |  1001 | 1111 |             1 |     2 |  1 | oxidioreductase
  5 | Gene5   | Plasmid    | ena    |   786 |  888 |             1 |     2 |  1 | oxidioreductase
  1 | Gene1   | Chromosome | ena    |   190 |  255 |             1 |     1 |  2 | transferase
  2 | Gene2   | Chromosome | ena    |   375 |  566 |             1 |     2 |  2 | transferase
  3 | Gene3   | Chromosome | ena    |   780 |  980 |             1 |     1 |  2 | transferase
  4 | Gene4   | Plasmid    | ena    |  1001 | 1111 |             1 |     2 |  2 | transferase
  5 | Gene5   | Plasmid    | ena    |   786 |  888 |             1 |     2 |  2 | transferase
(10 rows)

If we want to find out the rows where the primary keys match, as before:

example=# SELECT * FROM genes, ec WHERE genes.ec_pk=ec.pk;
 pk | gene_id |   entity   | source | start | stop | sequencing_pk | ec_pk | pk |     ec_name
----+---------+------------+--------+-------+------+---------------+-------+----+-----------------
  3 | Gene3   | Chromosome | ena    |   780 |  980 |             1 |     1 |  1 | oxidioreductase
  1 | Gene1   | Chromosome | ena    |   190 |  255 |             1 |     1 |  1 | oxidioreductase
  5 | Gene5   | Plasmid    | ena    |   786 |  888 |             1 |     2 |  2 | transferase
  4 | Gene4   | Plasmid    | ena    |  1001 | 1111 |             1 |     2 |  2 | transferase
  2 | Gene2   | Chromosome | ena    |   375 |  566 |             1 |     2 |  2 | transferase
(5 rows)

We can re-run the query prefixing with EXPLAIN to get more information:

example=# EXPLAIN SELECT * FROM genes, ec WHERE genes.ec_pk=ec.pk;

                             QUERY PLAN
--------------------------------------------------------------------
 Hash Join  (cost=1.11..13.09 rows=5 width=2088)
   Hash Cond: (ec.pk = genes.ec_pk)
   ->  Seq Scan on ec  (cost=0.00..11.40 rows=140 width=520)
   ->  Hash  (cost=1.05..1.05 rows=5 width=1568)
         ->  Seq Scan on genes  (cost=0.00..1.05 rows=5 width=1568)
(5 rows)

We can also get a time esimate with EXPLAIN ANALYZE:

example=# EXPLAIN ANALYZE SELECT * FROM genes, ec WHERE genes.ec_pk=ec.pk;

                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1.11..13.09 rows=5 width=2088) (actual time=0.022..0.024 rows=5 loops=1)
   Hash Cond: (ec.pk = genes.ec_pk)
   ->  Seq Scan on ec  (cost=0.00..11.40 rows=140 width=520) (actual time=0.007..0.008 rows=2 loops=1)
   ->  Hash  (cost=1.05..1.05 rows=5 width=1568) (actual time=0.006..0.006 rows=5 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Seq Scan on genes  (cost=0.00..1.05 rows=5 width=1568) (actual time=0.003..0.004 rows=5 loops=1)
 Planning Time: 0.088 ms
 Execution Time: 0.038 ms
(8 rows)<

This is a useful way to find out if your joins and queries are "expensive" or not.

Let's go back to SELECT * FROM genes WHERE entity='Chromosome'. We know this returned three rows instead of the five rows originally on the table.

This in essence is returning a smaller table. We can instead nest this into our query. Instead of joining the whole of the genes table to the ec table, we can join this filtered bit of the genes table to the ec table.

What we can do is have a nested sub-query. Consider, you'll say select everything from the sub-table, and select everything from the ec table. Now usually we would specify tables, but here we can specify the results of the query. We can add the query that we know returns only three rows.

SELECT sub.*, ec.* FROM (SELECT * FROM genes WHERE entity='Chromosome') sub JOIN ec ON sub.ec_pk=ec.pk;

We have to give it a name, so we'll call it sub here. Then we're going to join our sub-query results to the ec table. Then we've got to match up the relevant foreign and primary keys, just like normal.

When we run this, we get the same result with the relevant ec_name column too.

example=# SELECT sub.*, ec.* FROM (SELECT * FROM genes WHERE entity='Chromosome') sub JOIN ec ON sub.ec_pk=ec.pk;

 pk | gene_id |   entity   | source | start | stop | sequencing_pk | ec_pk | pk |     ec_name
----+---------+------------+--------+-------+------+---------------+-------+----+-----------------
  1 | Gene1   | Chromosome | ena    |   190 |  255 |             1 |     1 |  1 | oxidioreductase
  2 | Gene2   | Chromosome | ena    |   375 |  566 |             1 |     2 |  2 | transferase
  3 | Gene3   | Chromosome | ena    |   780 |  980 |             1 |     1 |  1 | oxidioreductase
(3 rows)

What this query does is run the original SELECT * FROM genes WHERE entity='Chromosome' query first, puts the result in a new temporary table called sub, and then declaring an inner join with ec where the primary keys match, i.e. sub.ec_pk=ec.pk.

In other words we're not joining two large tables, we've filtered the first large table with the results we want, making it much smaller, and then joined this to the second table. It's a specific kind of nested query.

This will make it quicker - although our tables are very small here so we wouldn't notice much difference.

Performing the same table joins over and over may introduce substantial overhead and query resources and the time it takes to complete those queries. In a highly concurrent application with few CPU resources, some queries may end up having to wait for others to complete; so different users are waiting for other users.

If our genes and ec table had many millions of rows and users were frequently having to join them together to get results. We could address this by denormalising our database a little. There are two possible approaches here. The first is to actually alter our database design and we could consider adding a joint table.

We could create a new table, genes with ec:

CREATE TABLE genes_with_ec (
pk SERIAL PRIMARY KEY,
gene_id VARCHAR(256) NOT NULL, 
entity VARCHAR(256), 
source VARCHAR(256), 
start INT, 
stop INT,
sequencing_pk INT, 
ec_name VARCHAR(256), 
FOREIGN KEY (sequencing_pk) REFERENCES sequencing(pk)
);

An alternative approach is to create what's called the materialized view. This is a form of adjuncts table that contains the contents of a database query. Let's have a look at what that looks like.

example=# CREATE MATERIALIZED VIEW gene_ec_view AS SELECT gene_id, entity, source, start, stop, sequencing_pk, ec_name FROM genes, ec WHERE genes.ec_pk=ec.pk;

SELECT 5

Then we can use the materialized view table as a normal table, so select from it as normal:

example=# SELECT * FROM gene_ec_view;

 gene_id |   entity   | source | start | stop | sequencing_pk |     ec_name
---------+------------+--------+-------+------+---------------+-----------------
 Gene3   | Chromosome | ena    |   780 |  980 |             1 | oxidioreductase
 Gene1   | Chromosome | ena    |   190 |  255 |             1 | oxidioreductase
 Gene5   | Plasmid    | ena    |   786 |  888 |             1 | transferase
 Gene4   | Plasmid    | ena    |  1001 | 1111 |             1 | transferase
 Gene2   | Chromosome | ena    |   375 |  566 |             1 | transferase
(5 rows)

It's worth noting in postgres that materialized views do not refresh when the underlying data in the genes or EC table is updated. When you change the data in a main data table, you also need to refresh the materialized view, and there's a command for that which is REFRESH MATERIALIZED VIEW. Then you name the table that you wish to refresh and it will update the results if you change the base data.

 REFRESH MATERIALIZED VIEW gene_ec_view;

Obviously, nothing will change in this instance.

Quiz:

You have the following database tables

What do the following SQL statements do?

1. SELECT * FROM car_models;

(a) Returns all the data in all tables of the database
(b) Returns all the data in the first row of the car_models table
(c) Returns all the data, in rows, from the car_models table

2. SELECT manufacturer, seats FROM car_models;

(a) All data from both the car_models and manufacturer tables will be returned
(b) This query will not run

3. SELECT model_name, countryOfOrigin FROM car_models, manufacturer WHERE car_models.manufacturer_pk=manufacturer.pk;

(a) This will join the car_models and manufacturer tables then return a list of model_names and the linked countryOfOrigin
(b) This query will not run

Answers:

1 (c)
2 (b) - manufacturer is the name of another table, not a field in car_models
3 (a)

Wednesday 27 October 2021, 449 views


Leave a Reply

Your email address will not be published. Required fields are marked *