15. Queries and table joins in SQL

We previously saw that we can use simple queries to obtain data from the tables and database. In those instances, we were looking up relevant primary keys, independent tables such that we could add new rows to our genes tables. The power of a relational database is in building relational queries.

Using our previous example:

We can query individual tables like this:

example=#  SELECT * FROM genes;

 pk | gene_id |   entity   | source | start | stop | sequencing_pk | ec_pk
----+---------+------------+--------+-------+------+---------------+-------
  1 | Gene1   | Chromosome | ena    |   190 |  255 |             1 |     1
  2 | Gene2   | Chromosome | ena    |   375 |  566 |             1 |     2
  3 | Gene3   | Chromosome | ena    |   780 |  980 |             1 |     1
  4 | Gene4   | Plasmid    | ena    |  1001 | 1111 |             1 |     2
  5 | Gene5   | Plasmid    | ena    |   786 |  888 |             1 |     2
(5 rows)


example=#  SELECT * FROM ec;

 pk |     ec_name
----+-----------------
  1 | oxidioreductase
  2 | transferase
(2 rows)
If the tables are linked, we can also query two tables together:

example=#  SELECT * FROM genes, ec;

 pk | gene_id |   entity   | source | start | stop | sequencing_pk | ec_pk | pk |     ec_name
----+---------+------------+--------+-------+------+---------------+-------+----+-----------------
  1 | Gene1   | Chromosome | ena    |   190 |  255 |             1 |     1 |  1 | oxidioreductase
  2 | Gene2   | Chromosome | ena    |   375 |  566 |             1 |     2 |  1 | oxidioreductase
  3 | Gene3   | Chromosome | ena    |   780 |  980 |             1 |     1 |  1 | oxidioreductase
  4 | Gene4   | Plasmid    | ena    |  1001 | 1111 |             1 |     2 |  1 | oxidioreductase
  5 | Gene5   | Plasmid    | ena    |   786 |  888 |             1 |     2 |  1 | oxidioreductase
  1 | Gene1   | Chromosome | ena    |   190 |  255 |             1 |     1 |  2 | transferase
  2 | Gene2   | Chromosome | ena    |   375 |  566 |             1 |     2 |  2 | transferase
  3 | Gene3   | Chromosome | ena    |   780 |  980 |             1 |     1 |  2 | transferase
  4 | Gene4   | Plasmid    | ena    |  1001 | 1111 |             1 |     2 |  2 | transferase
  5 | Gene5   | Plasmid    | ena    |   786 |  888 |             1 |     2 |  2 | transferase
(10 rows)

Here every row of data in the genes table, rows 1-5, has been paired with every row of data in the ec table, rows 1-2.

This isn’t particularly informative but just an illustration of how it all works.

We can restrict our queries so that it only allows rows where the primary and foreign keys are correctly matched. So let’s do this:

example=# SELECT * FROM genes, ec WHERE genes.ec_pk=ec.pk;

 pk | gene_id |   entity   | source | start | stop | sequencing_pk | ec_pk | pk |     ec_name
----+---------+------------+--------+-------+------+---------------+-------+----+-----------------
  3 | Gene3   | Chromosome | ena    |   780 |  980 |             1 |     1 |  1 | oxidioreductase
  1 | Gene1   | Chromosome | ena    |   190 |  255 |             1 |     1 |  1 | oxidioreductase
  5 | Gene5   | Plasmid    | ena    |   786 |  888 |             1 |     2 |  2 | transferase
  4 | Gene4   | Plasmid    | ena    |  1001 | 1111 |             1 |     2 |  2 | transferase
  2 | Gene2   | Chromosome | ena    |   375 |  566 |             1 |     2 |  2 | transferase
(5 rows)

This is returning all the rows in the 10 row table where ec_pk and pk match. We’re talking about the second pk, as that’s the one that belongs to ec (ec.pk).

We can refine the search further omitting the various primary key and foreign key values, as that’s the data we probably really want.

example=# SELECT gene_id, entity, source, start, stop, ec_name FROM genes, ec WHERE genes.ec_pk=ec.pk;

 gene_id |   entity   | source | start | stop |     ec_name
---------+------------+--------+-------+------+-----------------
 Gene3   | Chromosome | ena    |   780 |  980 | oxidioreductase
 Gene1   | Chromosome | ena    |   190 |  255 | oxidioreductase
 Gene5   | Plasmid    | ena    |   786 |  888 | transferase
 Gene4   | Plasmid    | ena    |  1001 | 1111 | transferase
 Gene2   | Chromosome | ena    |   375 |  566 | transferase
(5 rows)

We can make a more complex query by searching from three tables:

example=#  SELECT gene_id, entity, source, start, stop, ec_name, sequencing_factory,
example-# factory_location FROM genes, ec, sequencing WHERE genes.ec_pk=ec.pk AND
example-# genes.sequencing_pk=sequencing.pk;

 gene_id |   entity   | source | start | stop |     ec_name     | sequencing_factory | factory_location
---------+------------+--------+-------+------+-----------------+--------------------+------------------
 Gene3   | Chromosome | ena    |   780 |  980 | oxidioreductase | Sanger             | UK
 Gene1   | Chromosome | ena    |   190 |  255 | oxidioreductase | Sanger             | UK
 Gene5   | Plasmid    | ena    |   786 |  888 | transferase     | Sanger             | UK
 Gene4   | Plasmid    | ena    |  1001 | 1111 | transferase     | Sanger             | UK
 Gene2   | Chromosome | ena    |   375 |  566 | transferase     | Sanger             | UK
(5 rows)

You’ll note that this linking together tables is essentially the reverse operation to the normalizing we previously thought through in an earlier video. Our attributes were linked in many ternary relationships through a link table. This requires slightly more careful handling. Let’s look at how are we going to do that.

example=#  SELECT gene_id, key, value FROM genes, attributes, gene_attribute_link WHERE
example-# genes.pk=gene_attribute_link.genes_pk AND
example-# attributes.pk=gene_attribute_link.attributes_pk;

 gene_id |     key     |           value
---------+-------------+---------------------------
 Gene1   | ID          | gene:b001
 Gene1   | Name        | thrL
 Gene1   | biotype     | protein_coding
 Gene1   | description | thr operon leader peptide
 Gene2   | biotype     | protein_coding
 Gene2   | Name        | fucA
 Gene2   | description | Fructokinase A
 Gene3   | biotype     | protein_coding
 Gene3   | Name        | timB
 Gene3   | description | Triosphosphare Isomerase
 Gene4   | biotype     | protein_coding
 Gene4   | Name        | gluA
 Gene4   | description | Glucose isomerase
(13 rows)

In these examples we are doing table joins – which is implicitly implied. We can explicitly mention which table joins we want to use. These are equivalent:

SELECT * FROM genes, ec WHERE genes.ec_pk=ec.pk;
SELECT * FROM genes INNER JOIN ec ON genes.ec_pk=ec.pk;

We can think of joins as set operations:

We can use a LEFT JOIN which requests that we return every row of data from the first or left-hand table in the select statement, plus only those rows from the right hand table that satisfy our joining requirement.

example=# SELECT * FROM genes LEFT JOIN ec ON genes.ec_pk=ec.pk AND
example-# ec_name='transferase';

 pk | gene_id |   entity   | source | start | stop | sequencing_pk | ec_pk | pk |   ec_name
----+---------+------------+--------+-------+------+---------------+-------+----+-------------
  1 | Gene1   | Chromosome | ena    |   190 |  255 |             1 |     1 |    |
  2 | Gene2   | Chromosome | ena    |   375 |  566 |             1 |     2 |  2 | transferase
  3 | Gene3   | Chromosome | ena    |   780 |  980 |             1 |     1 |    |
  4 | Gene4   | Plasmid    | ena    |  1001 | 1111 |             1 |     2 |  2 | transferase
  5 | Gene5   | Plasmid    | ena    |   786 |  888 |             1 |     2 |  2 | transferase
(5 rows)

Similarly:

example=#  SELECT * FROM genes RIGHT JOIN ec ON genes.ec_pk=ec.pk AND
example-# ec_name='transferase';

 pk | gene_id |   entity   | source | start | stop | sequencing_pk | ec_pk | pk |     ec_name
----+---------+------------+--------+-------+------+---------------+-------+----+-----------------
    |         |            |        |       |      |               |       |  1 | oxidioreductase
  5 | Gene5   | Plasmid    | ena    |   786 |  888 |             1 |     2 |  2 | transferase
  4 | Gene4   | Plasmid    | ena    |  1001 | 1111 |             1 |     2 |  2 | transferase
  2 | Gene2   | Chromosome | ena    |   375 |  566 |             1 |     2 |  2 | transferase
(4 rows)

And a full join:

example=# SELECT * FROM genes FULL JOIN ec ON genes.ec_pk=ec.pk AND
example-# ec_name='transferase';

 pk | gene_id |   entity   | source | start | stop | sequencing_pk | ec_pk | pk |     ec_name
----+---------+------------+--------+-------+------+---------------+-------+----+-----------------
    |         |            |        |       |      |               |       |  1 | oxidioreductase
  5 | Gene5   | Plasmid    | ena    |   786 |  888 |             1 |     2 |  2 | transferase
  4 | Gene4   | Plasmid    | ena    |  1001 | 1111 |             1 |     2 |  2 | transferase
  2 | Gene2   | Chromosome | ena    |   375 |  566 |             1 |     2 |  2 | transferase
  3 | Gene3   | Chromosome | ena    |   780 |  980 |             1 |     1 |    |
  1 | Gene1   | Chromosome | ena    |   190 |  255 |             1 |     1 |    |
(6 rows)
Wednesday 27 October 2021, 34 views


Leave a Reply

Your email address will not be published. Required fields are marked *