What Is MapReduce?

MapReduce refers to a framework that runs on a computational cluster to mine large datasets. The name derives from the application of map() and reduce() functions repurposed from functional programming languages.

  • “Map” applies to all the members of the dataset and returns a list of results
  • “Reduce” collates and resolves the results from one or more mapping operations executed in parallel
  • Very large datasets are split into large subsets called splits 
  • A parallelized operation performed on all splits yields the same results as if it were executed against the larger dataset before turning it into splits
  • Implementations separate business logic from multiprocessing logic
  • MapReduce framework developers focus on process dispatching, locking, and logic flow
  • App developers focus on implementing the business logic without worrying about infrastructure or scalability issues 

Implementation patterns

The Map(k1, v1) -> list(k2, v2) function is applied to every item in the split. It produces a list of (k2, v2) pairs for each call. The framework groups all the results with the same key

together in a new split. 

The Reduce(k2, list(v2)) -> list(v3) function is applied to each intermediate results split to produce a collection of values v3 in the same domain. This collection may have zero or more values. The desired result consists of all the v3 collections, often aggregated into one result file.

Tags