16. Distributed databases and alternative database models
A distributed database is basically a type of database which consists of multiple databases that are connected with each other and are spread across different physical locations. The data that is stored on various physical locations can thus be managed independently of other physical locations. The communication between databases at different physical locations is thus done by a computer network.
- This database can be easily expanded as data is already spread across different physical locations.
- The distributed database can easily be accessed from different networks.
- This database is more secure in comparison to centralised databases.
- This database is very costly and it is difficult to maintain because of its complexity.
- In this database, it is difficult to provide a uniform view to user since it is spread across different physical locations.
With the rise of Web and Big Data grew the need for distributed databases. In a distributed database system, the user should not be aware that a distributed database is being used. It should behave the same way as if data was held all in one place.
There are a few reasons why distributing the database may become an interesting proposition:
- Parallelization – By distributing the data, we can enable parallelization of our operations without locks getting in the way.
- No single point of failure – If one instance of the database fails, we can replace it with another system on the fly with minimal to no user impact.
- Dividing large dataset – In case of a large dataset, we may not want to process it all in one place.
Why might it be important for there to be no difference (from a user’s point of view) between a distributed and a non-distributed database?
Many of the benefits of distributing a database are associated with being able to move processing and storage around as needed. The less you have to know about the architecture to use the database, the more flexible the system is to be changed dynamically.
16.02 Approaches to distributing RDBMS
The user should see no difference between a distributed and a non-distributed system.
So what are the requirements for a distributed database system?
- Local autonomy – Sites should operate independently, i.e. one site should not be able to interfere with another’s operations. Moreover, no site should rely on another for its own operation.
- No centralization – No single site should be able to control transactions or operations. No single site failure should break the system.
- Continuous operation – The system is available most of the time and reliable.
- Location independence – The user should not know where the data might be located. Data can be moved from one location to another without changing functionality.
- Partition independence – The user doesn’t need to know how the data is partitioned and the data can be partitioned without changing functionality.
- Replication independence – Distributed databases often require duplicate copies of data. The user need not be aware that replication is used.
- Distributed queries – Execute queries close to the data.
- DBMS independence – In theory, we should be able to distribute data over different DBMS systems.
- Other important requirements
- Hardware independence
- OS independence
- Network independence
A few important concepts are summarised below:
- Partitioning – How will the data be divided? Sometimes called fragmentation.
- Vertical partitioning – Dividing table by columns. Normalisation is an example of vertical partitioning.
- Horizontal partitioning – Dividing table by rows. eg if we have a table with 1,000,000 data points we could set up 10 nodes which carry 100,000 data points each instead.
Example. I have a table of customers’ names and details and partition them for distribution based on what letter their family name begins with. This is an example of horizontal partitioning. You can remember this by visualising cutting a paper copy of the table with a pair of scissors – the cuts are horizontal.
- Catalogue management – How is information about the data distributed?
- Recovery control – Transactions usually use two-phase protocol. This requires on site to act as a co-ordinator in any given transaction.
- Brewer’s Conjecture – Brewer said we have three goals which are “in tension” – each goal fights against the other ones. The goals pull in different directions and no distributed system can fully satisfy all three.
- Consistency – All parts of the database should converge on a consistent state.
- Availability – Every request should result in a response eventually.
- Partition tolerance – If a network flaw breaks the network into separate subnets, the database should run and recover.
Tuesday 11 January 2022, 216 views
Next post: 17. Key/Value databases and MapReduce
Previous post: 15. Query efficiency and denormalisation
- 26. A very good guide to linked data
- 25. Information Retrieval
- 24. Triplestores and SPARQL
- 23. Ontologies – RDF Schema and OWL
- 22. RDF – Remote Description Framework
- 21. Linked Data – an introduction
- 20. Transforming XML databases
- 19. Semantic databases
- 18. Document databases and MongoDB
- 17. Key/Value databases and MapReduce
- 16. Distributed databases and alternative database models
- 15. Query efficiency and denormalisation
- 14. Connecting to SQL in other JS and PHP
- 13. Grouping data in SQL
- 12. SQL refresher
- 11. Malice and accidental damage
- 10. ACID: Guaranteeing a DBMS against errors
- 9. Normalization example
- 8. Database normalization
- 7. Data integrity and security
- 6. Database integrity
- 5. Joins in SQL
- 4. Introduction to SQL
- 3. Relational Databases
- 2. What shape is your data?
- 1. Sources of data