the red penguin

17. Key/Value databases and MapReduce

17.01 Key/Value databases

If distributed databases are complex, simplify your data structures.

With simpler structures come simpler management and simpler parallelization. Key/value databases take this almost to the extreme.

A key/value database has only two columns: one for the key and one for the value associated with that key. By doing this, we do away with the concepts of referential integrity and foreign keys, which greatly simplifies the process of distributing the database.

An example of a Key/Value database:

Key     Value
Bob     (123) 456-7890
Jane    (234) 567-8901
Tara    (345) 678-9012
Tiara   (456) 789-0123
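The table above can be sketched as a plain Python dictionary, with horizontal partitioning done by hashing the key (the store and the partition count here are illustrative, not any particular product's API):

```python
# A toy key/value store: two "columns" only — a key and its value.
phone_book = {
    "Bob": "(123) 456-7890",
    "Jane": "(234) 567-8901",
    "Tara": "(345) 678-9012",
    "Tiara": "(456) 789-0123",
}

# Horizontal partitioning: each row's partition is chosen from its key alone,
# so rows can be spread across machines with no cross-partition constraints.
NUM_PARTITIONS = 3

def partition_for(key: str) -> int:
    return hash(key) % NUM_PARTITIONS

partitions = [{} for _ in range(NUM_PARTITIONS)]
for key, value in phone_book.items():
    partitions[partition_for(key)][key] = value

# A lookup touches exactly one partition.
print(partitions[partition_for("Jane")]["Jane"])  # (234) 567-8901
```

Because no row refers to any other row, each partition can be queried and processed independently — which is exactly what makes parallel processing easy.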

Key-value databases imply the following:

  • Easy parallel processing
  • Easy partitioning
  • Partitioning is always horizontal
  • Processing must happen near the data where possible

17.02 MapReduce

We expect to keep the processing close to the data when we are doing row processing. Hand in hand with the key/value database comes multi-stage processing: we do the data-intensive work locally, then move the derived data elsewhere for the later stages of processing.

One example of an algorithm for processing key/value datasets is MapReduce. MapReduce is an algorithm devised by Google.

The idea behind it is that many queries – including SQL queries – can be broken down into two phases:

  • You gather and process information from a base table; and
  • You process and summarise the result.

It consists of a map procedure, which performs filtering and sorting, and a reduce procedure, which performs a summary operation.

The map phase is carried out with direct access to the data. Usually, it loops over all the data. The output of this phase is a new key-value set.

The reduce phase is carried out by reducer workers, which summarise the data based on a key. We can assign reducer workers by key, which enables highly parallel processing.
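A minimal sketch of the two phases, using the classic word-count illustration (this is a single-process teaching sketch, not Google's implementation; in a real system the mapped pairs would be shuffled across machines):

```python
from collections import defaultdict

# Input is a key/value set: document id -> text.
documents = {"doc1": "red penguin", "doc2": "red red herring"}

# Map phase: runs with direct access to the data and emits a NEW key/value set.
def map_phase(doc_id, text):
    for word in text.split():
        yield (word, 1)

# Shuffle: group the mapped pairs by key, so each key's values land together.
grouped = defaultdict(list)
for doc_id, text in documents.items():
    for key, value in map_phase(doc_id, text):
        grouped[key].append(value)

# Reduce phase: one summary value per key; distinct keys can be
# assigned to different reducer workers in parallel.
def reduce_phase(key, values):
    return (key, sum(values))

result = dict(reduce_phase(k, v) for k, v in grouped.items())
print(result)  # {'red': 3, 'penguin': 1, 'herring': 1}
```

Note that only the small, derived key/value set produced by the map phase needs to move over the network; the raw documents stay where they are.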

Example. Here is a query against a movies database to get a list of actors born after 1950, together with the number of films each appears in:

SELECT Actor1, COUNT(Title)
FROM MovieLocations
JOIN Actors ON Actor1 = Name  -- assuming an Actors table with Name and DateOfBirth
WHERE DateOfBirth > "1950-01-01"
GROUP BY Actor1;

We can look at this query as happening in two phases:

The initial stage – gathering

FROM MovieLocations
JOIN Actors ON Actor1 = Name
WHERE DateOfBirth > "1950-01-01"

We get the tables, then combine and process them, so we need access to the raw data.

We filter the data, we perform a join, we throw away any columns we’re not going to need for the next stage, and we produce some intermediate table of data.

The second stage – summarise result

SELECT Actor1, COUNT(Title)
GROUP BY Actor1;

We group the data from the first phase based on some new conditions. Then we count and aggregate the results of that.

This is the two-stage process of the MapReduce algorithm: a map phase corresponding to the FROM and WHERE clauses, and a reduce phase corresponding to the SELECT and GROUP BY.

The map phase is carried out with direct access to the data. Its output in SQL terms is another table; in a key/value database it is another key/value set, because everything is a key/value set.

In the MapReduce algorithm, you take key/value pairs in and give key/value pairs out, but they do not have to use the same keys or the same format of values.
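Expressed as map and reduce steps, the actor query might look like this (the record layout and the sample rows are made up for illustration):

```python
from collections import defaultdict

# Hypothetical rows after joining MovieLocations with the actors table.
rows = [
    {"Actor1": "Tom Hanks", "Title": "Big", "DateOfBirth": "1956-07-09"},
    {"Actor1": "Tom Hanks", "Title": "Splash", "DateOfBirth": "1956-07-09"},
    {"Actor1": "Clint Eastwood", "Title": "Unforgiven", "DateOfBirth": "1930-05-31"},
]

# Map: the FROM/WHERE part — filter on date of birth, drop the columns the
# next stage won't need, and emit a new key/value set keyed by actor.
def map_row(row):
    if row["DateOfBirth"] > "1950-01-01":
        yield (row["Actor1"], row["Title"])

# Shuffle: group titles by actor key.
grouped = defaultdict(list)
for row in rows:
    for actor, title in map_row(row):
        grouped[actor].append(title)

# Reduce: the SELECT/GROUP BY part — one COUNT(Title) per actor key.
counts = {actor: len(titles) for actor, titles in grouped.items()}
print(counts)  # {'Tom Hanks': 2}
```

Note that the keys change between stages: the input rows are keyed however the store partitioned them, while the mapped output is keyed by actor name — exactly the "key/value pairs in, different key/value pairs out" idea above.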

Why would you use MapReduce?

  • It’s very simple
  • It’s designed for distributed data
  • It’s very good for parallel processing
  • You can move the data you need, not the data you have
  • It can easily recover from the failure of any node other than the coordinating node
Tuesday 11 January 2022
