the red penguin

24. Triplestores and SPARQL

24.01 Introduction

The Semantic Web is hard to search efficiently, partly because there is no registry of information.

Finding something requires a lot of indexing, which means we need crawlers.

We can build a database of graphs from a cache of triples. Triples can be indexed, and inferences built from the triples can be cached as new triples.

A triplestore is a type of graph database that uses RDF to cache a chunk of the Semantic Web. We can search these databases using patterns, which the search engine uses to look for complete or partial matches, returning them as a list of results.
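
The core pattern-matching idea can be illustrated in plain Python. This is a toy sketch, not a real triplestore; the data and names are invented, and variables are marked with a leading "?", as in SPARQL:

```python
def match(pattern, triples):
    """Return one binding dict per triple that matches the pattern."""
    results = []
    for triple in triples:
        binding = {}
        for p, t in zip(pattern, triple):
            if p.startswith("?"):      # a variable matches anything
                binding[p] = t
            elif p != t:               # a constant must match exactly
                break
        else:
            results.append(binding)
    return results

triples = [
    ("ex:Alice", "foaf:knows", "ex:Bob"),
    ("ex:Alice", "foaf:knows", "ex:Carol"),
    ("ex:Bob",   "foaf:name",  "Bob"),
]

print(match(("ex:Alice", "foaf:knows", "?friend"), triples))
# → [{'?friend': 'ex:Bob'}, {'?friend': 'ex:Carol'}]
```

A real engine joins several such patterns together and uses indexes rather than scanning every triple, but the matching semantics are the same.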

The language used to specify such patterns is SPARQL. SPARQL stands for SPARQL Protocol And RDF Query Language.

A sample query is shown below:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX ex: <http://example.org/>
SELECT ?friend WHERE {
    ex:Alice foaf:knows ?friend.
}

The query above will produce a list of the URIs of the people Alice knows; each of these URIs is itself a node in the graph.

If we want to get the name of each person, rather than the URI, we modify the query like so:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX ex: <http://example.org/>
SELECT ?fName WHERE {
    ex:Alice foaf:knows ?friend.
    ?friend foaf:name ?fName.
}

SELECT works much as it does in MySQL and other SQL dialects, but we don’t use FROM because we’re selecting from the whole graph.

Variables in SPARQL begin with question marks.

If we want to produce a list of the names of people connected to Alice by any number of foaf:knows links, we use the + property path:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX ex: <http://example.org/>
SELECT ?fName WHERE {
    ex:Alice foaf:knows+ ?friend.
    ?friend foaf:name ?fName.
}
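
The + path operator asks for the transitive closure of the property: friends, friends of friends, and so on. A rough pure-Python equivalent is a breadth-first search (the data here is invented):

```python
from collections import deque

# Invented toy data: who knows whom, directly.
knows = {
    "Alice": ["Bob", "Carol"],
    "Bob":   ["Dan"],
    "Carol": ["Dan"],
    "Dan":   [],
}

def knows_plus(start):
    """Everyone reachable from `start` via one or more knows links."""
    seen, queue = set(), deque(knows.get(start, []))
    while queue:
        person = queue.popleft()
        if person not in seen:
            seen.add(person)
            queue.extend(knows.get(person, []))
    return seen

print(sorted(knows_plus("Alice")))   # → ['Bob', 'Carol', 'Dan']
```

Note that Dan is reachable via two different paths but appears only once, because results are collected into a set.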

To restrict the results to unique names, we add DISTINCT:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX ex: <http://example.org/>
SELECT DISTINCT ?fName WHERE {
    ex:Alice foaf:knows+ ?friend.
    ?friend foaf:name ?fName.
}

Question: What does the following query do?

SELECT DISTINCT ?eatenBy WHERE {
    ?cheese ?eatenBy ?penguins
} LIMIT 10

(a) Returns a list of the names of cheeses that are eaten by penguins (there are probably fewer than 10)
(b) Return a list of 10 distinct URLs for cheeses that the Triplestore records as being eaten by penguins
(c) Returns 10 distinct URLs from the database that appear as predicates in any triple

Answer: (c) – Variable names have no effect on the semantics of the statement. It’s good practice to make them helpful, though.

Not (a) or (b) – if you chose one of those, check the difference between variables and URIs in SPARQL: ?cheese, ?eatenBy and ?penguins are all variables, so the pattern matches every triple in the store.

This was a bit of a trick question. It looks like we’re querying for cheese eaten by penguins, but all three terms are variables, so we’re just selecting distinct predicates.

Question: Which of the following are serialisations of RDF?

These are:

  • Turtle (ttl) – My favourite RDF serialisation – succinct and easy to read.
  • N-Triples (nt) – very easy to construct programmatically, and very simple (but if you’re using a library you shouldn’t care, and if you’re doing it by hand, use Turtle). N-Quads adds an extra value to each triple that specifies a named graph, a subset of the whole knowledge graph. This allows you to indicate data that you might want to query separately.
  • RDF/XML – this is generally agreed to be horrible to look at, but machines can read it, and it was the first serialisation format to be standardised

These aren’t:

  • TEI – this is a data model for text, now serialised in XML (previously in SGML)
  • Elephant (et) – Elephant is not a format (to my knowledge)
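
The point above about N-Triples being easy to construct programmatically is easy to demonstrate: each line is just subject, predicate, object, full stop. A minimal sketch (it only handles full URIs and plain string literals, with no language tags or datatypes):

```python
def ntriple(subject, predicate, obj, literal=False):
    """Serialise one triple as an N-Triples line."""
    if literal:
        o = '"%s"' % obj.replace('"', '\\"')   # quote and escape literals
    else:
        o = "<%s>" % obj                       # URIs go in angle brackets
    return "<%s> <%s> %s ." % (subject, predicate, o)

print(ntriple("http://example.org/Alice",
              "http://xmlns.com/foaf/0.1/name",
              "Alice", literal=True))
# → <http://example.org/Alice> <http://xmlns.com/foaf/0.1/name> "Alice" .
```

Writing one of these lines per triple gives you a valid .nt file.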

24.02 Endpoints

A MySQL server listens on a port and communicates using its own special protocol. We usually put SQL ports behind the firewall so that they are only accessible internally – the last thing we want is for the whole world to be able to access our database.

The triplestore world is a little different. A SPARQL endpoint is a server listening on an HTTP connection, and the query is sent using a standard protocol based around HTTP.

The query itself is sent as a variable in the URL, and the results are sent back in the response. This is usually visible to the world, or at least often visible to the world as an option.

How much you choose to lock this down depends on how much you want the world to be able to do. But a SPARQL endpoint is quite often visible to the world, allowing direct querying of a database.

A SPARQL endpoint can be used programmatically. It might be that you want to search DBpedia or Wikidata, and you can do that from Python or whatever, just directly connecting to the SPARQL endpoint that they host. You can also send your own handwritten query directly.
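
As a sketch of that protocol, here is how a query can be sent from Python using only the standard library. The query travels as a URL parameter, and we ask for the standard JSON results format; the example endpoint is DBpedia’s public one, which may of course be down or rate-limited when you try it:

```python
import json
import urllib.parse
import urllib.request

def build_request(endpoint, query):
    """Encode the query as a URL parameter and ask for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"})

def run_query(endpoint, query):
    """Send the request and decode the JSON results format."""
    with urllib.request.urlopen(build_request(endpoint, query)) as resp:
        data = json.load(resp)
    return data["results"]["bindings"]

# Example (needs network access):
#   for row in run_query("https://dbpedia.org/sparql",
#                        "SELECT * WHERE { ?s ?p ?o } LIMIT 5"):
#       print(row["s"]["value"], row["p"]["value"], row["o"]["value"])
```

In practice you might prefer a library such as SPARQLWrapper, which handles encoding and result parsing for you, but nothing more than HTTP is going on underneath.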

Quite a lot of these endpoints will also provide a web page with a text box that allows you to run your own queries – a simple interface for sending SPARQL queries.

There is also, however, a third-party way of doing it: a tool called Yasgui, which allows you to send your query to whatever SPARQL endpoint you want.

It can pull in the vocabularies that you use in your prefix section and do auto-completion. For doing rapid testing of queries, it’s actually quite a useful tool.

Typical SPARQL queries:

To look at all the data and limit the result set returned:

SELECT * WHERE {
    ?subject ?predicate ?object
} LIMIT 50

To look at all the types used in the data:

SELECT DISTINCT ?type WHERE {
    ?subject a ?type
}

Question: What is a triplestore? (Select one answer)

(a) Three times bigger than singlestores
(b) A database for RDF triples
(c) Any web-accessible location where RDF is published

Answer: (b)

Question: Why is there no FROM clause in SPARQL? (Select one answer)

(a) There are no tables to group data in a graph database, only patterns of subject, predicate, object (and classes and properties).
(b) The role of the FROM clause is taken by the PREFIX section of a query
(c) Because the triplestore is sophisticated enough to work out which tables to join.

Answer: (a). This makes querying much less constrained, but harder for the query optimiser.

Question: If I connect to DBpedia and want to explore the dataset, why would I use the LIMIT clause in the query below?

SELECT * WHERE {
    ?subject ?predicate ?object
} LIMIT 50

(a) We are only interested in the first 50 rows, since usually important data is at the beginning of databases
(b) To be fair to others, we use LIMIT 50 to use only 50% of our allocated resources
(c) Without the LIMIT clause, this would ask for a complete dump of DBpedia, which is rather large

Answer: (c)

24.03 Dereferencing URIs and following your nose

SPARQL is a spectacularly useful tool for allowing us to treat the linked data world as if it were a single database. Not only can we query a triplestore using the SPARQL endpoint for whatever it is that we need, there’s also the facility for what are called federated queries where we create one query that addresses multiple SPARQL endpoints.

e.g. a query sent to Wikidata can be joined with results from a library’s SPARQL endpoint, a commercial organisation’s SPARQL endpoint, or something local to us.

Very few SPARQL endpoints allow federated queries, due to the costs involved, so we need another strategy for getting Linked Data.

This strategy is “follow your nose” – Dereferencing.

Dereferencing is not a necessary part of Linked Data, although it is a very useful one.

We know that a URI uniquely identifies a resource. When we want to find out more about that resource, we can dereference the URI, i.e. request the RDF document at the end of it.

That document references other URIs; we can request the RDF documents for those in turn, and so on, slowly following our nose and building up a local knowledge graph.
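
A follow-your-nose crawl is essentially a loop: fetch a URI, collect its triples, queue any new URIs they mention, repeat. In this self-contained sketch the web is faked with an in-memory dict, and all URIs and data are invented:

```python
from collections import deque

# A fake web: each URI "dereferences" to a list of triples about it.
FAKE_WEB = {
    "ex:Alice": [("ex:Alice", "foaf:knows", "ex:Bob")],
    "ex:Bob":   [("ex:Bob", "foaf:knows", "ex:Carol"),
                 ("ex:Bob", "foaf:name", "Bob")],
    "ex:Carol": [("ex:Carol", "foaf:name", "Carol")],
}

def follow_your_nose(start, limit=100):
    """Dereference URIs breadth-first, building a local set of triples."""
    graph, seen, queue = set(), set(), deque([start])
    while queue and len(seen) < limit:
        uri = queue.popleft()
        if uri in seen:
            continue
        seen.add(uri)
        for s, p, o in FAKE_WEB.get(uri, []):   # the "HTTP request"
            graph.add((s, p, o))
            if o.startswith("ex:") and o not in seen:
                queue.append(o)                 # a new URI to dereference
    return graph

print(len(follow_your_nose("ex:Alice")))   # → 4 triples gathered
```

In a real crawler the dict lookup would be an HTTP GET with an Accept header asking for RDF, and the limit matters: without it, following every link could pull in a large fraction of the Linked Data web.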

Question: Why might an organisation that publishes Linked Open Data not provide a SPARQL endpoint?

(a) Because they worry that running a public endpoint will mean their servers and networks will be overstretched
(b) Because they don’t have a triplestore running for their data
(c) Because it represents a security risk
(d) Because it risks exposing private or commercially-sensitive data

Answer: (a) and (b)

(a) Some queries are very expensive and slow; others involve waiting for other servers to respond. Most public endpoints have timeouts built in, but it is still difficult to provision a popular endpoint on a large dataset. The situation is improving with better software and cheaper hardware.
(b) A triplestore takes maintenance and a server with a reasonable amount of memory. Static RDF is easy to serve.
(c) This is a reasonable assumption, but in general the risk is low – not much higher than running a standard web server.
(d) Since the organisation is already publishing Linked Open Data, we can assume that this isn’t the case.

Question: Why are dereferenceable URLs useful in Linked Data?

(a) Because then there’s a single place for all information about the thing the URL refers to
(b) Because we have to reference everything in academia
(c) We can’t know all the places where triples are published about a given URL, but if the URL itself can be dereferenced to more data about itself, that is a useful start

Answer: (c). Dereferenceable URIs are one of Tim Berners-Lee’s 4 requirements for Linked Open Data.

Question: Why would Linked Data have been seen as a good way for web developers to publish information for use by search engines?

(a) Data can be embedded in web pages using RDFa and JSON-LD without the need for new technologies
(b) There is shared meaning for a set of core concepts
(c) Ontologies can be extended without the need to change the HTML standard
(d) Because Microsoft and Google have a long history of supporting Linked Open Data

Answer: (a, b, c).
(a) Previously, microdata was used, but RDFa and JSON-LD are more flexible (and more compatible with Linked Data)
(b) This is a core requirement (though there are other ways of delivering it)
(c) This presumably was a consideration. The ontology is community-maintained and updated as necessary, steered by Microsoft, Google and others
(d) There’s not much prior history of significant involvement, and their use of LOD is fairly limited elsewhere.

Thursday 3 March 2022
