MapReduce: Simplified Data Processing on Large Clusters

Reference: Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113.
This paper is part of the suggested reading list of MIT 6.824 Distributed Systems.
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.
Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day.
- Motivation: addressing the issues of how to parallelize the computation, distribute the data, and handle failures
- The abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages
- Major Contributions: (1) a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations; (2) an implementation of this interface that achieves high performance on large clusters of commodity PCs.
2. Programming Model
The model itself is familiar: a Map function emits intermediate key/value pairs, and a Reduce function merges all values that share an intermediate key; word count is the canonical example.
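The paper's canonical word count example, sketched in Python. The real C++ interface emits pairs through a callback rather than returning lists, so `map_fn`, `reduce_fn`, and `run` here are simplified illustrations:

```python
# Word count in the MapReduce style, sketched in plain Python.
from collections import defaultdict

def map_fn(key, value):
    """key: document name, value: document contents."""
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    """key: a word, values: all counts emitted for that word."""
    return sum(values)

def run(documents):
    # Shuffle step: group intermediate values by intermediate key.
    groups = defaultdict(list)
    for name, contents in documents.items():
        for k, v in map_fn(name, contents):
            groups[k].append(v)
    # Reduce step: one call per unique intermediate key.
    return {k: reduce_fn(k, vs) for k, vs in sorted(groups.items())}
```

For instance, `run({"d1": "a b a", "d2": "b"})` yields `{"a": 2, "b": 2}`.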
- Environment: large clusters of commodity PCs connected together with switched Ethernet
3.1 Execution Overview
When the user program calls the MapReduce function, the following sequence of actions occurs:
- Split the input files into M pieces and start up many copies of the program on a cluster of machines;
- There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task;
- A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory;
- Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.
- When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.
- The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user’s Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.
- When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.
After successful completion, the output of the MapReduce execution is available in the R output files (one per reduce task, with file names specified by the user).
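The sequence above can be walked through in a single process. This sketch uses hash partitioning (`hash(key) mod R`, in the style of the paper's default partitioning function) and in-memory dicts in place of local-disk files and RPCs; all names are illustrative:

```python
# Single-process walk-through of the execution overview:
# M map tasks, R hash partitions, per-partition sort, R reduce tasks.
from collections import defaultdict

def execute(inputs, map_fn, reduce_fn, M, R):
    # 1. Split the input into M pieces (here: round-robin over records).
    splits = [inputs[i::M] for i in range(M)]

    # 2-4. Each "map worker" runs Map and partitions its output into
    # R regions; the dicts stand in for files on the worker's local disk.
    regions = [defaultdict(list) for _ in range(R)]
    for split in splits:
        for key, value in split:
            for ik, iv in map_fn(key, value):
                regions[hash(ik) % R][ik].append(iv)

    # 5-6. Each "reduce worker" sorts its region by intermediate key,
    # then runs Reduce once per unique key.
    outputs = []
    for r in range(R):
        part = {}
        for ik in sorted(regions[r]):
            part[ik] = reduce_fn(ik, regions[r][ik])
        outputs.append(part)  # one output "file" per reduce partition
    return outputs
```

Because each key hashes to exactly one region, merging the R outputs reproduces what a sequential execution would have computed.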
3.2 Master Data Structures
- For each map task and reduce task, the master stores the state (idle, in-progress, or completed) and the identity of the worker machine (for non-idle tasks).
- For each completed map task, the master stores the locations and sizes of the R intermediate file regions produced by the map task. The information is pushed incrementally to workers that have in-progress reduce tasks.
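A minimal sketch of this bookkeeping, assuming one record per task; the class and field names are illustrative, not from the paper:

```python
# Master-side task state, sketched with dataclasses.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Task:
    kind: str                      # "map" or "reduce"
    state: str = "idle"            # idle | in-progress | completed
    worker: Optional[str] = None   # worker identity (non-idle tasks only)
    # For completed map tasks: locations/sizes of the R intermediate
    # file regions, to be pushed to in-progress reduce tasks.
    regions: List[tuple] = field(default_factory=list)

class Master:
    def __init__(self, M, R):
        self.map_tasks = [Task("map") for _ in range(M)]
        self.reduce_tasks = [Task("reduce") for _ in range(R)]

    def assign(self, worker_id):
        """Pick an idle task (maps first) for an idle worker."""
        for t in self.map_tasks + self.reduce_tasks:
            if t.state == "idle":
                t.state, t.worker = "in-progress", worker_id
                return t
        return None
```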
3.3 Fault Tolerance
- The master pings every worker periodically. If no response is received within a certain amount of time, the master marks the worker as failed, and any task in progress on it is reset to idle so that it becomes eligible for rescheduling on other workers.
- Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system.
- Single point of failure: the master. The master could periodically checkpoint its data structures, but given that there is only one master, its failure is unlikely; the current implementation simply aborts the MapReduce computation if the master fails, and clients can retry.
Semantics in the Presence of Failures
- Deterministic Functions: the same output as would have been produced by a non-faulting sequential execution of the entire program, by employing atomic commits of map and reduce task outputs
- Non-deterministic Functions: the output of a particular reduce task R1 is equivalent to the output for R1 produced by a sequential execution of the non-deterministic program
3.4 Locality
- Motivation: network bandwidth is a relatively scarce resource in our computing environment
- Mechanism: the master attempts to schedule a map task on a machine that holds a replica of the corresponding input data; failing that, on a machine on the same rack as a replica; otherwise, anywhere in the cluster
3.5 Task Granularity
- Having each worker perform many different tasks improves dynamic load balancing, and also speeds up recovery: the many map tasks a failed worker has completed can be spread out across all the other workers
- In practice, M is chosen so that each individual task handles roughly 16 MB to 64 MB of input data (which makes the locality optimization most effective), and R is a small multiple of the number of worker machines we expect to use
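These rules can be expressed as a back-of-the-envelope helper; the default split size and the multiple of the worker count are illustrative assumptions within the ranges the paper gives:

```python
# Rough choice of M and R: ~64 MB of input per map split,
# R a small multiple of the expected worker count.
def choose_granularity(total_input_bytes, n_workers, split_mb=64, r_multiple=2):
    M = max(1, total_input_bytes // (split_mb * 2**20))
    R = r_multiple * n_workers
    return M, R
```

For example, 10 TB of input with 2,000 workers and the defaults gives M = 163,840 map tasks and R = 4,000 reduce tasks.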
3.6 Backup Tasks
- Straggler: a machine that takes an unusually long time to complete one of the last few map or reduce tasks in the computation.
- Speculative Execution: when a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks; a task is marked completed whenever either the primary or the backup execution finishes
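A sketch of the backup-task decision, assuming the master knows each task's state; the paper only says backups are scheduled when the operation is "close to completion", so the 5% threshold here is an illustrative assumption:

```python
# Decide which in-progress tasks to duplicate as backup executions.
def backup_candidates(tasks, threshold=0.05):
    """tasks: dict of task_id -> state. Returns ids to re-execute."""
    remaining = [tid for tid, s in tasks.items() if s == "in-progress"]
    done = sum(1 for s in tasks.values() if s == "completed")
    # Only near the end: duplicating tasks earlier would waste resources.
    if tasks and done / len(tasks) >= 1 - threshold:
        return remaining
    return []
```

Whichever copy of a duplicated task finishes first wins; the master simply marks the task completed and ignores the other execution's output.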
4. Refinements
Input and Output Types: support for reading input in several different formats (e.g., text lines, sequences of key/value pairs) and for user-defined reader and writer implementations
Skipping Bad Records: each worker installs a signal handler that reports the record being processed when a crash occurs; if the master sees more than one failure on a particular record, it indicates that the record should be skipped in the next re-execution of the corresponding task
Counters: The counter values from individual worker machines are periodically propagated to the master (piggybacked on the ping response); the master aggregates them, eliminating the effects of duplicate executions, and returns the totals to user code when the operation completes
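A sketch of counter aggregation on the master, assuming each ping response carries the worker's current cumulative counter snapshot. The real implementation deduplicates counts from backup and re-executed tasks; this sketch approximates that by keeping only the latest report per worker, which is an assumption:

```python
# Master-side counter aggregation from piggybacked ping responses.
from collections import Counter

class CounterMaster:
    def __init__(self):
        self.per_worker = {}

    def on_ping_response(self, worker_id, counters):
        # Counters arrive piggybacked on the ping response; each report
        # is a cumulative snapshot, so it supersedes earlier ones.
        self.per_worker[worker_id] = Counter(counters)

    def totals(self):
        total = Counter()
        for c in self.per_worker.values():
            total += c
        return total
```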