In the MapReduce framework, a crucial step in performing a parallel computation takes as input a set of n key-value pairs, (k, v), and must collect each subset of key-value pairs that have the same key, k, into a single file. Describe an efficient external-memory algorithm for constructing all such files. How many disk transfers does your algorithm perform?
P.S. This has to do with one of these topics: B-trees, external-memory sorting, or online caching algorithms.
Solution
The new parallel-computing architecture, sometimes called cluster computing, is organized as follows. Compute nodes are stored on racks, perhaps 8–64 to a rack. The nodes on a single rack are connected by a network, typically gigabit Ethernet. There can be many racks of compute nodes, and racks are connected by another level of network or a switch. The bandwidth of inter-rack communication is somewhat greater than the intrarack Ethernet, but given the number of pairs of nodes that might need to communicate between racks, this bandwidth may be essential. The more components a system has, such as compute nodes and interconnection networks, the more frequently something in the system will not be working at any given time. For systems of this kind, the principal failure modes are the loss of a single node (e.g., the disk at that node crashes) and the loss of an entire rack.
MapReduce is a style of computing that has been implemented in several systems, including Google's internal implementation (simply called MapReduce) and the popular open-source implementation Hadoop, which can be obtained, along with the HDFS file system, from the Apache Foundation. You can use an implementation of MapReduce to manage many large-scale computations in a way that is tolerant of hardware faults. All you need to write are two functions, called Map and Reduce, while the system manages the parallel execution, the coordination of tasks that execute Map or Reduce, and the possibility that one of these tasks will fail to execute. In brief, a MapReduce computation executes as follows. Some number of Map tasks are each given one or more chunks from a distributed file system, and each Map task turns its chunk into a sequence of key-value pairs. The system then groups these pairs by key, so that all pairs with the same key are delivered to the same Reduce task. The Reduce tasks work on one key at a time and combine all the values associated with that key in some way; the manner of combination is determined by the code written by the user for the Reduce function.
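To make the division of labor concrete, here is a minimal single-machine sketch in Python; it is not the API of Hadoop or Google's MapReduce, and the names map_fn, reduce_fn, run_mapreduce, and the word-count example are illustrative assumptions.

```python
from collections import defaultdict

def map_fn(element):
    # User-written Map: turns one input element into zero or more
    # key-value pairs; here, illustratively, each word of a line of
    # text becomes the pair (word, 1).
    for word in element.split():
        yield (word, 1)

def reduce_fn(key, values):
    # User-written Reduce: combines all values sharing one key; here
    # the combination is a sum, giving a word count.
    return (key, sum(values))

def run_mapreduce(elements):
    # Toy single-machine driver: apply Map, group the pairs by key,
    # then apply Reduce to each group. A real system distributes the
    # grouping across many nodes and tolerates task failures.
    groups = defaultdict(list)
    for element in elements:
        for key, value in map_fn(element):
            groups[key].append(value)
    return [reduce_fn(k, vs) for k, vs in sorted(groups.items())]

print(run_mapreduce(["the cat sat", "the mat"]))
# [('cat', 1), ('mat', 1), ('sat', 1), ('the', 2)]
```

The grouping step inside the toy driver is exactly the operation the exercise asks about; at scale it cannot be done with an in-memory dictionary and must instead happen in external memory.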
A chunk is a collection of elements, and no element is stored across two chunks. Technically, all inputs to Map tasks and outputs from Reduce tasks are of the key-value-pair form, but normally the keys of input elements are not relevant, and we shall tend to ignore them. Insisting on this form for inputs and outputs is motivated by the desire to allow composition of several MapReduce processes. The Map function takes an input element as its argument and produces zero or more key-value pairs. The types of keys and values are each arbitrary. Further, keys are not "keys" in the usual database sense: they need not be unique, and a Map task can produce several key-value pairs with the same key, even from the same element.
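The exercise itself asks how to perform this grouping on a single machine whose data does not fit in internal memory, and external-memory merge sort is the tool the hint points at: sort the n pairs by key, then one sequential scan of the sorted output splits it into per-key files. With internal memory of size M and block size B, the sort takes O((n/B) log_{M/B}(n/B)) disk transfers and the final scan O(n/B) more, so the sort term dominates. The Python sketch below illustrates the idea under stated assumptions: run_size stands in for M, keys are filesystem-safe strings, and a single merge pass suffices (i.e., the number of runs is at most M/B); it is a sketch, not code from the source.

```python
import heapq, os, pickle, tempfile

def group_by_key_external(pairs, run_size=100_000, out_dir="groups"):
    # Hypothetical sketch of sort-based grouping: run_size plays the
    # role of internal memory M; each key's pairs end up in one file.
    runs = []

    def spill(buffer):
        # Sort one memory-sized run by key and write it to disk.
        buffer.sort(key=lambda kv: kv[0])
        f = tempfile.TemporaryFile()
        for kv in buffer:
            pickle.dump(kv, f)
        f.seek(0)
        runs.append(f)

    buffer = []
    for pair in pairs:              # one sequential read of the input
        buffer.append(pair)
        if len(buffer) == run_size:
            spill(buffer)
            buffer = []
    if buffer:
        spill(buffer)

    def read_run(f):
        # Stream a sorted run back one record at a time; buffered
        # file I/O stands in for reading B-sized blocks.
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return

    # Single k-way merge pass (assumes at most M/B runs; otherwise a
    # real external sort repeats the merge over several passes). The
    # merged stream is ordered by key, so all pairs with the same key
    # are adjacent and go to that key's file in one sequential pass.
    os.makedirs(out_dir, exist_ok=True)
    current_key, out = None, None
    for key, value in heapq.merge(*(read_run(f) for f in runs),
                                  key=lambda kv: kv[0]):
        if key != current_key:
            if out:
                out.close()
            out = open(os.path.join(out_dir, f"{key}.txt"), "w")
            current_key = key
        out.write(f"{value}\n")
    if out:
        out.close()
```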
