17. Key/Value databases and MapReduce
17.01 Key/Value databases
If distributed databases are complex, simplify your data structures. Simpler structures can bring simpler management and simpler parallelisation, and key-value databases take this idea almost to its extreme.
When using key-value databases, we have only two columns: one for the key and one for the value associated with that key. By doing this, we do away with the concepts of integrity constraints and foreign keys, which greatly simplifies the process of distributing the database.
An example of a Key/Value database:
| Key | Value |
|---|---|
| Bob | (123) 456-7890 |
| Jane | (234) 567-8901 |
| Tara | (345) 678-9012 |
| Tiara | (456) 789-0123 |
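In code, a key-value store behaves like a dictionary or hash map: the only operations we rely on are get and put by key. A minimal sketch in Python, using the table above (the added "Ada" entry is illustrative):

```python
# A key-value store behaves like a hash map: one key, one opaque value.
phone_book = {
    "Bob":   "(123) 456-7890",
    "Jane":  "(234) 567-8901",
    "Tara":  "(345) 678-9012",
    "Tiara": "(456) 789-0123",
}

phone_book["Ada"] = "(567) 890-1234"  # put (illustrative entry)
print(phone_book.get("Jane"))         # get -> (234) 567-8901
```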
Key-value databases imply the following:
- Easy parallel processing
- Easy partitioning
- Partitioning is always horizontal (see the sketch after this list)
- Processing must happen near the data where possible
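Because each row is independent of every other row, the partition can be decided from the key alone. A minimal sketch of horizontal partitioning by key hash, assuming a cluster of three nodes (the node names and hash choice are illustrative):

```python
import hashlib

NODES = ["node-0", "node-1", "node-2"]  # assumed three-node cluster

def node_for(key: str) -> str:
    """Pick a node from the key alone; no other row is ever consulted."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]

for key in ["Bob", "Jane", "Tara", "Tiara"]:
    print(key, "->", node_for(key))
```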
17.02 MapReduce
When we are doing row processing, we expect to keep the processing really close to the data. Hand in hand with the key-value database goes multi-stage processing: we do the data-intensive work locally, then move the derived data elsewhere for the later stages of processing.
One example of an algorithm for processing key-value datasets is MapReduce, an algorithm devised by Google.
The idea behind it is that many queries – including SQL queries – can be broken down into two phases:
- You gather and process information from a base table; and
- You process and summarise the result.
It consists of a map procedure, which performs filtering and sorting, and a reduce procedure, which performs a summary operation.
The map phase is carried out with direct access to the data. Usually, it loops over all the data. The output of this phase is a new key-value set.
The reduce phase is carried out by reducer workers, which summarise the data based on a key. We can assign reducer workers by key, which enables highly parallel processing.
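To make the two phases concrete, here is a minimal single-machine sketch of the pattern in Python, using word counting as the example (in a real deployment the map and reduce calls would run on many workers, with the grouping done by the framework):

```python
from collections import defaultdict

def map_phase(record):
    """Map: read raw data, emit intermediate (key, value) pairs."""
    for word in record.split():
        yield (word, 1)

def reduce_phase(key, values):
    """Reduce: summarise all the values that share a key."""
    return (key, sum(values))

records = ["the cat sat", "the cat ran"]

# Shuffle: group intermediate pairs by key so each reducer sees one key.
groups = defaultdict(list)
for record in records:
    for key, value in map_phase(record):
        groups[key].append(value)

results = [reduce_phase(key, values) for key, values in groups.items()]
print(results)  # [('the', 2), ('cat', 2), ('sat', 1), ('ran', 1)]
```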
Example. Here is a query that gets a list of actors (those born after 1950) and the number of films for each, from a movies database:

```sql
SELECT Actor1, COUNT(Title)
FROM MovieLocations
INNER JOIN Actors ON Actor1 = Name
WHERE DateOfBirth > '1950-01-01'
GROUP BY Actor1;
```
We can look at this query as happening in two phases:
The initial stage – gathering
```sql
FROM MovieLocations
INNER JOIN Actors ON Actor1 = Name
WHERE DateOfBirth > '1950-01-01'
```
We get the tables and combine and process them, so we need access to the raw data. We filter the data, perform the join, throw away any columns we will not need in the next stage, and produce an intermediate table of data.
The second stage – summarise result
```sql
SELECT Actor1, COUNT(Title) ... GROUP BY Actor1;
```
We group the data from the first phase by the new key (here, Actor1), then count and aggregate the results.
This is the two-stage process of the MapReduce algorithm: the map phase corresponds to the FROM ... WHERE clauses, and the reduce phase corresponds to the SELECT ... GROUP BY Actor1.
As before, the map phase is carried out with direct access to the data. Its output, in SQL terms, is another table; in a key-value database it is another key-value set, because everything is a key-value set.
In the MapReduce algorithm, you take key-value pairs in and give key-value pairs out, but they need not have the same keys or the same format of values.
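The actors query above could be sketched the same way. Assuming each input record is a pre-joined MovieLocations/Actors row (the record layout and the data here are hypothetical):

```python
from collections import defaultdict
from datetime import date

def map_movie(record):
    """Map = FROM ... WHERE: filter rows, emit (Actor1, Title) pairs."""
    if record["DateOfBirth"] > date(1950, 1, 1):
        yield (record["Actor1"], record["Title"])

def reduce_actor(actor, titles):
    """Reduce = SELECT ... GROUP BY: count the titles for one actor."""
    return (actor, len(titles))

# Hypothetical pre-joined records standing in for the INNER JOIN.
records = [
    {"Actor1": "A", "Title": "Film 1", "DateOfBirth": date(1960, 5, 2)},
    {"Actor1": "A", "Title": "Film 2", "DateOfBirth": date(1960, 5, 2)},
    {"Actor1": "B", "Title": "Film 3", "DateOfBirth": date(1940, 3, 9)},
]

groups = defaultdict(list)
for record in records:
    for actor, title in map_movie(record):
        groups[actor].append(title)

print([reduce_actor(actor, titles) for actor, titles in groups.items()])
# [('A', 2)] -- the 1940-born actor is filtered out in the map phase
```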
Why would you use MapReduce?
- It’s very simple
- It’s designed for distributed data
- It’s very good for parallel processing
- You can move the data you need, not the data you have
- It can recover easily from the failure of any node except the coordinating node
Tuesday 11 January 2022