MapReduce: A Gentle Tutorial with Examples

February 20, 2018

What is MapReduce?

MapReduce is a programming framework that allows us to perform parallel processing on large data sets in a distributed environment. It is a data processing paradigm for condensing large volumes of data into useful aggregated results.

As the name suggests, Map and Reduce functions are performed on the data. Map function processes a block of data and generates a set of intermediate key/value pairs. Reduce function merges all the intermediate values associated with the same intermediate key into a single value.

The two operations are each designed such that efficient implementation is possible in distributed systems. Moreover, the two types of operations together provide a lot of flexibility and expressibility, so that a variety of real-world tasks can be implemented within the MapReduce paradigm.

Why MapReduce?

When the MapReduce framework did not exist, parallel and distributed processing were done in a traditional way. In the traditional way, the data was split into smaller blocks and stored in different machines. The individual machines then performed calculations on the data. The individual results from each machine is combined to obtain the final output.

There are some challenges with this traditional approach. The reliability of machines may delay the outcome of whole system. The splitting of data into individual machines need to be managed. There is a need of a mechanism to aggregate the results from each machine to produce the final output.

MapReduce framework overcomes these issues. For each new task, one only needs to write the map function and the reduce functions, and specify the input location and output location. The rest of the details, of splitting the data, passing the data from one machine to another, handling failures, etc are handled by the framework. MapReduce provides the flexibility to write the code logic without caring about the design issues of a distributed system.

Parallel Processing: Data is divided among multiple nodes and each node works simultaneously. So, MapReduce is based on divide and conquer paradigm, helping to process data using different machines. Due to parallel processing with multiple machines simultaneously, the time taken to process the data gets reduced by a tremendous amount.

Data Locality: With MapReduce framework, instead of moving data to processing unit, processing unit is moved to the data. Traditionally, vice versa was done (moving data to processing unit) to process it. Moving huge data to processing is costly and effects the network performance and may become a bottleneck also causing the master to get over-burdened. MapReduce overcomes these issues.

How MapReduce works

MapReduce takes a list of objects, runs some operation over each object in the list (map), to either produce a new list or calculate a single value (reduce).

The map job reads a block of data and processes it to produce key/value pairs as intermediate outputs. The output of the mapper is input to the reducer. Each reduce job receives all the key/value pairs for one specific key (this may come from multiple map jobs). The reducer then aggregates those intermediate data key/value pair into either (a) a single value for each key, or (b) a smaller list of key/value pairs, which is the final output.

Example 1: Word Count

Let’s take an example of how MapReduce works with word count. Given a large number of documents, we need to calculate the frequency (number of occurrences) of each word in all documents combined.

This example is quite realistic, and many real-world techniques such as search engine ranking, topic modeling, etc depend on it. Let’s see how we can compute the above using the MapReduce framework.

The input consists of a list of documents. Each document represents a single map job. One by one, work is distributed among the among all the map nodes. The map function receives a document, tokenizes it, and counts the number of occurrences of each word in the specific document (see example figure above). It then outputs a list of key/value pairs, where the key is the word and the value is the number of times it appears in the document.

After mapper, a partition process takes place which sorts and shuffles the key/value pair so that the pairs with the same key are sent to the corresponding reducer (this is handled by the MapReduce framework). Each reducer job receives the list of values (counts) for a specific key (word). It sums all the counts and outputs the total (see example figure above). The final key/value pairs are collected from all reduce jobs (by the MapReduce framework) and written as the output.

(input) <k1, v1>
-> map          -> list <k2, v2>
-> combine      -> <k2, list (v2)>
-> reduce       -> <k3, v3> (output)

Overall, the entire implementation would look something like the following. The implementation would be submitted to a MapReduce framework, which would handle the rest of the details.

input location := location where list of documents are stored
define map(document):
tokenize the document
return list of <word, count>, i.e. number of times each word appears
define reduce(word, counts):
return word, sum(counts)
output location := location where to store list of key values

Example 2: Matrix Multiplication

Lets take a look at a more complicated example - matrix multiplication. Large scale matrix multiplication is also a common real-world task. For example, the PageRank algorithm in Google search requires multiple matrix multiplication operations on a dataset of trillions of web pages.

Suppose matrix M is denoted as mij and matrix N is denoted as njk. The product P = MN will be matrix P denoted as pik, where

$$P_{(i,k)} = \sum_{j=1}m_{ij}*n_{jk}$$

We represent matrix M with tuples (i, j, mij) and matrix N with tuples (j, k, njk). Most large matrices are sparse, i.e., a large number of cells have value zero. When we represent matrices in this form, we do not need to keep entries for the cells that have values of zero. This saves large amount of disk space.

Broad strategy: Our overall strategy is going to be to use (i, k) as the intermediate keys. And to provide each reduce job all the information it needs to compute pik, i.e. the i-th row of M and the k-th column of N.

Step 1: Map function

Map function receives the individual tuples from matrices M and N.

1. for each element mij of M, produce (key/value) pairs as ((i, k), (M, j, mij)), for k = 1, 2, 3, .. upto the number of columns of N
2. for each element njk of N, produce (key/value) pairs as ((i, k), (N, j, njk)), for i = 1, 2, 3, .. upto the number of rows of M

Step 2: Reduce function

Reduce function receives set of (key/value) pairs. Each key (i, k), has a list with values (M, j, mij) and (N, j, njk) for all possible values of j. We need to compute j=1 mij * njk.

For each key (i, k), the reduce job can compute this as follows

1. sort values begin with M by j in listM
2. sort values begin with N by j in listN
3. multiply mij and njk for jth value of each list
4. return the sum of all values

Step-by-Step illustration of example 2

Suppose we have two matrices, M, 2×3 matrix, and N, 3×2 matrix with the desired product as follows:

$$\begin{bmatrix} 1&2&3\\4&5&6\end {bmatrix}*\begin{bmatrix} a&b\\ c&d\\e&f \end{bmatrix} = \begin{bmatrix} 1a+2c+3e&1b+2d+3f\\ 4a+5c+6e&4b+5d+6f \end{bmatrix}$$

Map function

For matrix M, map will produce key/value pairs as follows:

(i,k),(M,j,mij)
m11 = 1
(1,1),(M,1,1) ... (k = 1)
(1,2),(M,1,1) ... (k = 2)
m12 = 2
(1,1),(M,2,2) ... (k = 1)
(1,2),(M,2,2) ... (k = 2)
...
...
...
m23 = 6
(2,1),(M,3,6) ... (k = 1)
(2,2),(M,3,6) ... (k = 2)

For matrix N, map task will produce key/value pairs as follows:

(i,k),(N,j,njk)
n11 = a
(1,1),(N,1,a) ... (i = 1)
(2,1),(N,2,a) ... (i = 2)
n21 = c
(1,1),(N,1,a) ... (i = 1)
(2,1),(N,2,c) ... (i = 2)
n31 = e
(1,1),(N,3,e) ... (i = 1)
(2,1),(N,3,e) ... (i = 2)
...
...
...
n32 = f
(1,2),(N,3,f) ... (i = 1)
(2,2),(N,3,f) ... (i = 2)

After combine operation the key/value pairs will look as follows:

((i,k),[(M,j,mij),(M,j,mij),........,(N,j,njk),(N,j,njk),......])
(1,1),[(M,1,1),(M,2,2),(M,3,3),(N,1,a),(N,2,c),(N,3,e)]
(1,2),[(M,1,1),(M,2,2),(M,3,3),(N,1,b),(N,2,d),(N,3,f)]
(2,1),[(M,1,4),(M,2,5),(M,3,6),(N,1,a),(N,2,c),(N,3,e)]
(2,2),[(M,1,4),(M,2,5),(M,3,6),(N,1,b),(N,2,d),(N,3,f)]

Note that the entries for the same key are grouped in the same list.

key (1,1); value [(M,1,1), (M,2,2), (M,3,3), (N,1,a, (N,2,c), (N,3,e)] 

Reduce function

Reduce task takes the key/value pairs as the input and process one key at a time. For each key it divides the values into two separate lists for M and N.

Reduce task sorts values beginning with M in one list and values beginning with N in another list as follows:

listM = [(M,1,1),(M,2,2),(M,3,3)]
listN = [(N,1,a),(N,2,c),(N,3,e)]

then sums up the multiplication of mij and njk for each j as follows:

P(1, 1) = 1a + 2c + 3e

The same computation applied to all input entries of reduce task.

P(1,1) = 1a + 2c + 3e
P(1,2) = 1b + 2d + 3f
P(2,1) = 4a + 5c + 6e
P(2,2) = 4b + 5d + 6f

The product matrix P of MN is as desired.

An Aside: Combiners

A Combiner is also known as a semi-reducer. A combiner accepts the inputs from the mapper and passes the output key/value pairs to the reducer. The primary function of combiner is to reduce workload of reducer. In a MapReduce program, 20% of the work is done in the Map Stage, which is also known as the data preparation stage, which works in parallel. 80% of the work is done in Reduce stage which is known as the calculation stage, and it is not parallel and is slower than the Map phase. To reduce time, some work in the Reduce phase can be done in the combiner phase.

Conclusion

In this tutorial, we got introduced to the MapReduce framework. The framework allows us to perform parallel processing on large data sets in a distributed environment, while only requiring the programmer to define the code logic of the map function and the reduce function. We took a detailed look at how the MapReduce framework works with some illustrative examples - word counts, and matrix multiplication.