1. Sources of data
1.01 Key Concepts
- Find, describe and evaluate sources of data
- Understand different forms in which data may come
- Evaluate data-related access and reuse rights
1.02 Where does data come from?
Data can come from:
- New data – created for the sole purpose of the current application
- Pre-existing data – data that already existed before the application was created. It may be internal legacy data, or external data that can be acquired from a supplier
When it comes to new data, we can take different approaches:
- Adding data on-demand – For example, a hairdresser has bookings with clients. Each of these appointments is a new datum that gets added to the database on-demand, i.e. only when a customer makes an appointment
- Bulk data entry – Some systems can’t afford to have only parts of the data available. In such cases, we can either pay for data entry services or rely on some form of crowd-sourcing
When it comes to pre-existing data, it usually needs to be manipulated in some way in order to fit the new system. Some forms of data manipulation are:
- Extraction – data may already be in a spreadsheet or database and needs to be recovered, or extracted from the original source.
- Conversion – data may need to be converted into a new format or structure in order to fit new requirements.
- Cleaning – data may contain erroneous or unnecessary information. These need to be removed in order to prevent problems.
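As a rough illustration of these three steps, the Python sketch below runs extraction, conversion and cleaning over a small invented CSV export. The column names, sample rows and function names are all assumptions made for the example; a real legacy source would need its own rules.

```python
import csv
import io
from datetime import datetime

# Hypothetical legacy export; every column name and value here is invented.
LEGACY_CSV = """Client Name,Appointment,Phone
Ann Lee,12/10/2021,01234 567890
,13/10/2021,01234 000000
Bob Ray,not set,
Ann Lee,12/10/2021,01234 567890
"""


def extract(text):
    """Extraction: recover the rows from the original source (CSV text here)."""
    return list(csv.DictReader(io.StringIO(text)))


def convert(row):
    """Conversion: rename fields and turn DD/MM/YYYY dates into ISO 8601."""
    try:
        date = datetime.strptime(row["Appointment"], "%d/%m/%Y").date().isoformat()
    except ValueError:
        date = None  # unparseable dates are flagged for the cleaning step
    return {
        "client": row["Client Name"].strip(),
        "date": date,
        "phone": row["Phone"].replace(" ", ""),
    }


def clean(rows):
    """Cleaning: drop rows with no client or no valid date, and exact duplicates."""
    seen = set()
    result = []
    for r in rows:
        key = (r["client"], r["date"], r["phone"])
        if not r["client"] or r["date"] is None or key in seen:
            continue
        seen.add(key)
        result.append(r)
    return result


records = clean(convert(r) for r in extract(LEGACY_CSV))
print(records)
```

Keeping the three steps as separate functions is only one possible design; it makes it easier to swap the extraction step when the source is a database rather than a spreadsheet.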
External sources of data are attractive because they can reduce the cost of data entry and quality checks. When data is purchased from a supplier, it often comes pre-cleaned and in a format that’s easy to consume. Moreover, we may gain access to data produced by experts in a given field.
Conversely, when we acquire data from an external source, we have little or no control over the quality of the data and its structure. The data may also be incomplete and/or ambiguous from our point of view; i.e. the level of detail to which a particular piece of information is encoded may be different from what we need. Finally, there may be concerns about the trustworthiness of the data.
Where can you find usable data?
Whilst many organisations and individuals make large amounts of data openly available, it can be hard to find. The Open Data Institute, founded by Sir Tim Berners-Lee and Sir Nigel Shadbolt, is dedicated to getting large-scale open publication of useful data started.
1.02 What does your data look like?
When modelling real-life data, we must consider what sort of information is necessary for the application. For example, the data required for a book may be:
- Type – Book
- Weight – 557g
- Height – 172mm
- Colour – Red and Green
- Title – Gardener's Calendar
- Authors – Thomas Mawe, John Abercrombie
- Date – 1803
- Edition – 17th
Questions then arise about which form of each value to store, the title being one example. From the point of view of finding the book on a shelf, “Gardener’s Calendar” is enough; from the point of view of comparing it against other similar titles, a longer form may be required.
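One way to make these modelling decisions explicit is to write the book down as a typed record. The sketch below is only an illustration: the class name, the field names, the choice to put units in the field names, and the optional `long_title` field are all assumptions, not a prescribed schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Book:
    title: str                         # short form, enough to find it on a shelf
    authors: List[str]
    year: int
    edition: str
    weight_g: int                      # unit kept in the field name
    height_mm: int
    colours: List[str]
    long_title: Optional[str] = None   # fuller form, only if titles must be compared


book = Book(
    title="Gardener's Calendar",
    authors=["Thomas Mawe", "John Abercrombie"],
    year=1803,
    edition="17th",
    weight_g=557,
    height_mm=172,
    colours=["Red", "Green"],
)
print(book)
```

Storing weight and height as numbers with an explicit unit, rather than as strings such as "557g", is a design choice that makes later comparison and sorting simpler.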
1.03 Licenses, sharing and ethics
In academic and government circles, it’s common to make data as openly available as possible. That, however, doesn’t apply to all parts of government or the commercial world.
There are legal restrictions regarding the use of data which need to be considered.
The Linked Open Data Cloud project produces a graph of the openly available data published in the Linked Data format.
Considering the size of the graph, which contains only a subset of all openly available data, the question to ask is: why is so much data being shared for free if information is so valuable?
To put this into perspective, a furniture catalogue from any given furniture company will contain many details about every item: price, sizes, materials, photos. In principle, the furniture could be copied using information gathered from catalogues and manuals. However, the furniture company needs its products to be easy to find if it wants to sell them.
The same argument can be used for many other industries: music industry, electronics, streaming services, etc.
To summarise, some of the reasons to share open data:
- To drive sales
- For the common good
- Contract requirements
Conversely, here are some reasons not to share open data:
- Restrictions on source data
- Control of use
- Value of the data
Q. What is copyleft?
A. Copyleft is a licensing approach which requires that any work derived from the thing being licensed, or any redistribution of it, uses the same licence. In other words, the copyright holder permits the work to be copied and modified, on condition that derived works are shared under the same terms.
Q. What is CC0?
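A. CC0 is the Creative Commons public domain dedication. The rights holder uses it to waive copyright and related rights in a work as far as the law allows, so that the work can be reused with minimal restrictions.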
Tuesday 12 October 2021