25. Information Retrieval
So far, we have treated data as synonymous with facts, e.g. Marlon Brando starred in whatever film, the moon has a diameter of X km, etc.
But the information that people work with often comes in a less tangible form: documents. A document might be an entire novel, a piece of music or a collection of music, a recording, or a video, and all of these are definitely data.
It is very difficult, however, to decompose such documents into individual facts. When we are trying to answer a user’s need for information from a database of documents, or from a single document, we need a slightly different approach.
We need to consider the approaches to document-based, and particularly multimedia-based, retrieval provided by information retrieval as a discipline.
25.02 Querying Rich Data
When querying rich data, such as music, some of the requests coming from users may not be clearly reducible to a simple yes/no answer.
Different parts or regions of the data may answer the query differently, which means that we need to know more details about the user’s needs before we can find an acceptable answer to the query.
The information we need may not be explicit in the signal (the data retrieved), which creates further difficulties.
For example, if I request a quiet piece of piano music, the system could retrieve a quiet passage that quickly turns into a loud one.
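The point about different regions answering the query differently can be sketched in code. The example below is only an illustration under stated assumptions: the signal is synthetic, and the window size, the `QUIET_THRESHOLD` value and the function names are hypothetical choices, not part of any particular retrieval system.

```python
import math

def rms(window):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in window) / len(window))

def window_levels(samples, size):
    """RMS level of each consecutive, non-overlapping window."""
    return [rms(samples[i:i + size])
            for i in range(0, len(samples) - size + 1, size)]

# Synthetic signal (assumption): one second of quiet tone, then one second of loud tone.
SAMPLE_RATE = 8000
quiet = [0.01 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
loud = [0.8 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
signal = quiet + loud

QUIET_THRESHOLD = 0.1  # arbitrary cut-off chosen for this sketch

# One answer for the whole piece...
is_quiet_overall = rms(signal) < QUIET_THRESHOLD
# ...versus one answer per region of the piece.
levels = window_levels(signal, 1000)
quiet_windows = [level < QUIET_THRESHOLD for level in levels]
```

Here the whole-piece answer is "not quiet", while the per-window answers show that the first half is quiet and the second half is not. Which of these is the right answer depends on what the user actually needs.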
Vannevar Bush, an American engineer, inventor, and science administrator, wrote in his article “As We May Think”, published in The Atlantic in July 1945:
The real heart of the matter of selection, however, goes deeper than a lag in the adoption of mechanisms by libraries, or a lack of development of devices for their use.
Our ineptitude in getting at the record is largely caused by the artificiality of systems of indexing. When data of any sort are placed in storage, they are filed alphabetically or numerically, and information is found (when it is) by tracing it down from subclass to subclass.
The human mind does not work that way. It operates by association. With one item in its grasp, it snaps instantly to the next that is suggested by the association of thoughts, in accordance with some intricate web of trails carried by the cells of the brain.
The core ideas of Information Retrieval are:
- User has an information need
- A need is expressed as a query
- Query is executed over data by an Information Retrieval system
- Results are usually returned in the form of a Ranked List. We expect the results to contain a group of things that are true.
Even if the query itself is correctly satisfied, a retrieved document may still not be relevant to the user, because it does not match the underlying information need.
This suggests that the way in which we get the user to formulate queries may not be well suited to the user’s information needs.
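As a rough illustration of the query-to-ranked-list flow described above, here is a minimal sketch that scores documents by how many query terms they share. The scoring rule, the function names and the tiny document collection are all assumptions made for the example; real systems use considerably richer models.

```python
def tokenize(text):
    """Lower-case the text and split it on whitespace."""
    return text.lower().split()

def rank(query, documents):
    """Return (score, doc_id) pairs, best match first.

    The score is simply the number of distinct query terms
    that also appear in the document.
    """
    query_terms = set(tokenize(query))
    scored = []
    for doc_id, text in documents.items():
        score = len(query_terms & set(tokenize(text)))
        if score > 0:
            scored.append((score, doc_id))
    # Sort by descending score, then by id to keep the order stable.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))

# Hypothetical document collection.
documents = {
    "d1": "quiet piano music for the evening",
    "d2": "loud piano concerto",
    "d3": "quiet guitar music",
    "d4": "bagpipe marches",
}

results = rank("quiet piano music", documents)
# results == [(3, "d1"), (2, "d3"), (1, "d2")]
```

Note that "d2" satisfies part of the query as written, yet a loud concerto is unlikely to meet the need behind it. That is the gap between satisfying a query and being relevant.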
Question: In some cases, summarising data means averaging over the whole thing, in other cases, it involves responding to extreme or outlier values.
In which of the following scenarios is a single local value of crucial importance to the summary? (Select all that apply)
I can’t stand bloody violence or gore. Is a given film suitable?
Yes – one scene is probably enough to cause nightmares
I have a morbid fear of the sound of bagpipes. Am I safe to listen to a given recording?
Yes – as with extreme violence one localised incident of bagpipe playing is likely to be enough
I will be travelling a long distance driving a tall, wide vehicle. Is a given route suitable?
Yes – a route that is mostly suitable, but which has a low bridge in the middle is wholly unsuitable, and would have to be changed.
Is it safe to give a given patient penicillin?
No – one allergic reaction would probably be enough for penicillin never to be administered again, so we should only expect one example.
I will be driving a long distance. What is the fastest route to my destination?
No – an average over the whole journey is probably good enough – usually this is what we would optimise for.
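The contrast between averaging and responding to a single extreme value can be made concrete with the route example. The sketch below uses made-up segment data and hypothetical function names. It shows that the travel-time question aggregates over the whole journey, whereas the suitability question is decided by the single worst segment.

```python
# Hypothetical route: each segment has a travel time and a lowest clearance.
segments = [
    {"name": "motorway", "minutes": 30, "clearance_m": 5.0},
    {"name": "old town", "minutes": 10, "clearance_m": 3.2},  # low bridge
    {"name": "ring road", "minutes": 45, "clearance_m": 4.8},
]

def total_minutes(route):
    """Whole-journey summary: every segment contributes."""
    return sum(seg["minutes"] for seg in route)

def is_passable(route, vehicle_height_m):
    """Local-value summary: one low bridge rules out the whole route."""
    return all(seg["clearance_m"] > vehicle_height_m for seg in route)

def mean_clearance(route):
    """An average that hides the problem segment."""
    return sum(seg["clearance_m"] for seg in route) / len(route)

vehicle_height_m = 4.0
journey_time = total_minutes(segments)                 # 85
average_ok = mean_clearance(segments) > vehicle_height_m  # True, but misleading
route_ok = is_passable(segments, vehicle_height_m)        # False
```

The average clearance suggests the route is fine, while the minimum correctly rejects it.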
25.03 What is a feature?
Much of what we do in information retrieval tasks is based not around searching the documents themselves, but using something called a feature.
So what is a feature? Let’s start from the point of view of the documents that we have to search. If we have an audio signal, it will typically be two tracks (left and right) at around 44,100 samples per second (the CD-audio rate), each sample a pressure reading. If we look at those samples individually, in the same way that we would look at, say, a row of a database, we are not going to get something very meaningful.
If we search for all pressure readings that are above 5 dB, that won’t tell us much about the music or the speech that we’re trying to analyse.
The same is true of natural language processing. If we’re looking at some text, and we have every byte of the text, or every couple of bytes as a character, then taking the characters individually doesn’t tell us very much. The same is true for an image. Individual pixels will not tell us a lot.
What we need to do is construct higher-level structures above the data itself, and those are what we call features.
The idea is to look at one step higher level than the raw data. For example, we might count the number of times that the signal goes across zero in an audio signal (zero-crossings). For NLP we might look at whole words, anything surrounded by spaces or punctuation marks, so we can recognise the difference between an apostrophe at the end, and a quotation mark.
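Two of these first-level features can be sketched in a few lines. This is a minimal illustration, assuming a synthetic tone and a deliberately simple word pattern; the function names and the regular expression are choices made for the example, not a standard.

```python
import math
import re

def count_zero_crossings(samples):
    """Count how often consecutive samples change sign."""
    crossings = 0
    for previous, current in zip(samples, samples[1:]):
        if (previous >= 0) != (current >= 0):
            crossings += 1
    return crossings

def words(text):
    """Extract words, keeping an apostrophe only when it sits inside a word."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)

# One second of a 5 Hz tone sampled at 1000 Hz (assumed values).
# The small phase offset keeps samples from landing exactly on zero.
sample_rate = 1000
tone = [math.sin(2 * math.pi * 5 * n / sample_rate + 0.5)
        for n in range(sample_rate)]

crossings = count_zero_crossings(tone)   # 10: two crossings per cycle
tokens = words("'Books' aren't pixels,")  # ['Books', "aren't", 'pixels']
```

The pattern treats the apostrophe in "aren't" as part of the word but drops the surrounding quotation marks, which is the distinction mentioned above.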
We break the text down into tokens and aggregate them into words. Similarly, instead of looking at individual pixels, we can look at regions that have the same colour, e.g. little islands of black pixels that are separated from each other.
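Finding "islands" of same-coloured pixels can be sketched as a flood fill over a small grid. The example below is an assumption-laden illustration: the image is a hand-written grid where 1 stands for a black pixel, and pixels count as connected only when they touch horizontally or vertically.

```python
def count_regions(grid):
    """Count groups of 1-valued cells connected horizontally or vertically."""
    rows, cols = len(grid), len(grid[0])
    seen = set()
    regions = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or (r, c) in seen:
                continue
            # Found an unvisited black pixel: flood-fill its whole island.
            regions += 1
            seen.add((r, c))
            stack = [(r, c)]
            while stack:
                y, x = stack.pop()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and grid[ny][nx] == 1 and (ny, nx) not in seen:
                        seen.add((ny, nx))
                        stack.append((ny, nx))
    return regions

# Hypothetical 4x5 image: 1 = black pixel, 0 = white pixel.
image = [
    [1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0],
]

region_count = count_regions(image)  # 3 separate islands
```

The count of regions is a single number that summarises thousands of pixel values, which is the kind of simplification a feature provides.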
Once we have those, we can start to answer slightly more relevant, more interesting questions, but they will not get us all the way to answering most information needs. We need to step back again. We might, say, estimate the pitch of the sounds that we are hearing, sometimes called f0 estimation. We might look at combinations of words, or do some stemming to work out not just that this says book and this says books, but that they are both expressions of the same core word. We might, having identified these areas of similar colour, start to analyse shapes. These are slightly higher-level features.
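As a hedged sketch of two of these higher-level features, the code below estimates f0 by autocorrelation (one common approach among several) and applies a deliberately crude plural-stripping rule in place of a real stemmer. The search range, the test tone and both function names are assumptions made for the example.

```python
import math

def estimate_f0(samples, sample_rate, f_min=50.0, f_max=500.0):
    """Estimate pitch as the lag at which the signal best matches itself."""
    min_lag = int(sample_rate / f_max)
    max_lag = int(sample_rate / f_min)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Autocorrelation at this lag over the overlapping part of the signal.
        score = sum(samples[n] * samples[n + lag]
                    for n in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

def naive_stem(word):
    """Toy stemmer (assumption): strip a trailing 's' from longer words only."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s"):
        return word[:-1]
    return word

# Synthetic 200 Hz tone, 400 samples at 8000 Hz (assumed values).
sample_rate = 8000
tone = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(400)]

f0 = estimate_f0(tone, sample_rate)                     # 200.0
same_stem = naive_stem("books") == naive_stem("book")   # True
```

Real recordings contain harmonics and noise, and real words have irregular forms, so production systems use more robust pitch trackers and stemmers than this sketch.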
A feature is, therefore, a measurable property derived entirely from the signal. It could be Boolean (e.g. “is a letter followed by a space?”), numerical or categorical; a simple scalar or a vector.
Features help us move from low-level signals to high-level concepts; simplify the data itself, which reduces the amount of data we have to process; embody our expectations of salience; and can be re-weighted based on task and user.
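The different kinds of feature, and the idea of re-weighting them per task, can be sketched as follows. The particular features, the weights and the function names are all hypothetical, chosen only to show one Boolean, one numerical, one categorical and one vector-valued feature side by side.

```python
def extract_features(text):
    """Derive a few features of different types from a piece of text."""
    word_list = text.split()
    return {
        "ends_with_question_mark": text.strip().endswith("?"),       # Boolean
        "word_count": len(word_list),                                # numerical
        "length_class": "short" if len(word_list) < 5 else "long",   # categorical
        "vowel_counts": [text.lower().count(v) for v in "aeiou"],    # vector
    }

def weighted_score(features, weights):
    """Combine scalar features using task-specific weights."""
    return sum(weight * float(features[name])
               for name, weight in weights.items())

features = extract_features("Is it quiet?")

# Two hypothetical tasks that value the same features differently.
question_detection = {"ends_with_question_mark": 2.0, "word_count": 0.5}
length_estimation = {"ends_with_question_mark": 0.0, "word_count": 1.0}

score_a = weighted_score(features, question_detection)  # 3.5
score_b = weighted_score(features, length_estimation)   # 3.0
```

The same extracted features serve both tasks; only the weights change, which is what makes re-weighting by task and user cheap.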
At some point we are going to have to bridge what is called the Semantic Gap, i.e. connect the user’s queries to the information system so that data can be retrieved. Usually, this is where Machine Learning comes into the picture, due to the complexity of the task.
Friday 4 March 2022