
Big Data Analytics (2171607) : Chapter - 1 MapReduce

This document discusses MapReduce and provides examples of its core functionality and processes. It makes three key points: 1. MapReduce is a programming framework that allows distributed, parallel processing of large datasets. It has two main steps: the map step, where data is divided and processed in parallel, and the reduce step, where the results are combined. 2. The map step takes input data and produces intermediate key-value pairs; the reduce step combines these intermediate pairs to produce the final output. 3. An example word-count process using MapReduce is described, where input text is divided and each word is mapped to a key-value pair with count 1; the pairs are then shuffled and sorted, with reducers counting the occurrences of each word to produce the final totals.

BIG DATA ANALYTICS (2171607)

CHAPTER – 1: MapReduce

Prof. M Dhanalakshmi, Asst. Prof., IT Dept., SCET, Surat.

Chapter - 1 BIG DATA ANALYTICS


HADOOP COMPONENTS

[Diagram: Hadoop components.]



MapReduce Core Functionality

 Code is usually written in Java.
 Two fundamental pieces:
 1. Map step:
 The master node takes the large problem input, slices it into smaller sub-problems, and distributes these to worker nodes.
 A worker node may do this again, leading to a multi-level tree structure.
 Each worker processes its smaller problem and hands the result back to the master.
MapReduce Core Functionality

 Code is usually written in Java, though it can be written in other languages with the Hadoop Streaming API.
 Two fundamental pieces:
 2. Reduce step:
 The master node takes the answers to all the sub-problems and combines them in a predefined way to produce the output/answer to the original problem.


MapReduce Core Functionality

 MapReduce is a programming framework that allows us to perform parallel processing of large data sets in a distributed environment.


MapReduce Core Functionality

 Map job:
 A block of data is read and processed to produce key-value pairs as intermediate outputs.
 The output of a Mapper, or map job (key-value pairs), is the input to the Reducer.

 Reduce job:
 The reducer receives the key-value pairs from multiple map jobs.
 The reducer aggregates those intermediate data tuples (intermediate key-value pairs) into a smaller set of tuples or key-value pairs, which is the final output.
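The map-job/reduce-job flow above can be sketched in plain Java. This is a simulation of the data flow, not the Hadoop API; the class and method names are illustrative only.

```java
import java.util.*;

// Plain-Java sketch of the map/reduce data flow (not the Hadoop API).
// map() plays the map job: it emits one intermediate (key, 1) pair per record.
// reduce() plays the reduce job: it aggregates the intermediate tuples into a
// smaller set of key-value pairs, which is the final output.
public class MapReduceFlow {

    // Map job: produce intermediate key-value pairs from a block of data.
    static List<Map.Entry<String, Integer>> map(List<String> block) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String record : block) {
            pairs.add(Map.entry(record, 1));
        }
        return pairs;
    }

    // Reduce job: merge all values that share a key into one final pair.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> intermediate) {
        Map<String, Integer> finalOutput = new HashMap<>();
        for (Map.Entry<String, Integer> pair : intermediate) {
            finalOutput.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return finalOutput;
    }
}
```

Note how the reducer's output is smaller than the mapper's: duplicate keys in the intermediate pairs collapse into a single final pair per key.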



MapReduce Classes

 MapReduce has three major classes:
 1. Mapper Class
 2. Reducer Class
 3. Driver Class


MapReduce Classes

 1. Mapper Class:
 The first stage in data processing using MapReduce is the Mapper class.
 Here, the RecordReader processes each input record and generates the corresponding key-value pair.
 Hadoop stores this intermediate Mapper data on the local disk.


MapReduce Classes

 1. Mapper Class:
 Input Split:
 The logical representation of data. It represents a block of work that contains a single map task in the MapReduce program.
 RecordReader:
 It interacts with the input split and converts the obtained data into key-value pairs.


MapReduce Classes

 2. Reducer Class:
 The intermediate output generated by the mapper is fed to the reducer, which processes it and generates the final output, which is then saved in HDFS.

 3. Driver Class:
 The Driver class is a major component of a MapReduce job.
 It is responsible for setting up a MapReduce job to run in Hadoop.
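The Driver's role can be pictured with a plain-Java analogy: it wires a pluggable mapper and reducer together and runs the job. This is an illustrative sketch, not the real org.apache.hadoop Job/Driver API; all names here are invented for the example.

```java
import java.util.*;
import java.util.function.*;

// Illustrative plain-Java "driver" (not the real Hadoop Driver/Job API):
// it wires a pluggable mapper and reducer together and runs the whole job,
// which is the Driver class's role in a real MapReduce program.
public class MiniDriver {

    static Map<String, Integer> runJob(List<String> input,
                                       Function<String, List<String>> mapper,
                                       Function<List<Integer>, Integer> reducer) {
        // Shuffle: group the mapper's (token, 1) outputs by key, keys sorted.
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (String record : input) {
            for (String token : mapper.apply(record)) {
                grouped.computeIfAbsent(token, k -> new ArrayList<>()).add(1);
            }
        }
        // Reduce: call the reducer once per key, as Hadoop does.
        Map<String, Integer> result = new LinkedHashMap<>();
        grouped.forEach((key, values) -> result.put(key, reducer.apply(values)));
        return result;
    }
}
```

A word-count job is then just a particular choice of mapper (tokenize the line) and reducer (count the values) plugged into the driver.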



MapReduce Steps

 The Map function performs the following steps:
 Splitting
 Mapping

 The Splitting step takes the input data set from the source and breaks it up into smaller data sets.
 The Mapping step takes these sub-data sets and performs the required computation on each of them.
 The output of the Map function is a set of key-value pairs, written <Key,Value>.


MapReduce Steps

[Diagram: splitting and mapping of the input data set.]


MapReduce Steps

 Output of the first step of MapReduce:
 Map Function Output = List of <Key,Value> pairs.

 Shuffle Function:
 Also loosely called the Combine function (in Hadoop proper, the Combiner is a separate, optional mini-reduce step).
 It performs the following sub-steps:
 Merging
 Sorting
 It takes the output of the Map function as input and performs these sub-steps on every key-value pair.




MapReduce Steps

 The Merging step merges all key-value pairs that have the same key.
 It gives <Key,List<Value>> pairs as output.
 The output of the Merging step is given as input to the Sorting step, which sorts all key-value pairs by key.
 It returns <Key,List<Value>> pairs, sorted by key, as output.
 The Shuffle function passes the sorted <Key,List<Value>> pairs to the next step.
 Output of the second step of MapReduce:
 Shuffle Function Output = List of <Key,List<Value>> pairs.
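The merge and sort sub-steps can be sketched together in plain Java: a TreeMap both groups values under their key (merging) and keeps the keys in sorted order (sorting). This is a simulation of the Shuffle function's behaviour, not Hadoop's internal implementation.

```java
import java.util.*;

// Sketch of the Shuffle function: Merging groups every value under its key
// as <Key, List<Value>>, and using a TreeMap keeps the keys sorted, so the
// result is the sorted list of <Key, List<Value>> pairs the Reduce step expects.
public class Shuffle {

    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> mapOutput) {
        SortedMap<String, List<Integer>> merged = new TreeMap<>(); // sorted by key
        for (Map.Entry<String, Integer> pair : mapOutput) {
            merged.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                  .add(pair.getValue());
        }
        return merged;
    }
}
```

For map output [(River,1), (Bear,1), (Bear,1)] this yields the sorted groups Bear → [1,1] and River → [1].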



MapReduce Steps

 Reduce Function:
 Takes the list of sorted <Key,List<Value>> pairs as input from the Shuffle function and performs the reduce operation.

 Output of the final step of MapReduce:
 Reduce Function Output = List of <Key,Value> pairs.
 The first-step <Key,Value> pairs are different from the final-step <Key,Value> pairs.
 The final <Key,Value> pairs are sorted and computed.

[Slides 17–23: diagrams illustrating the Map, Shuffle, and Reduce steps.]


Example MapReduce WordCount Process

[Diagram: word-count data flow through map, shuffle/sort, and reduce.]


Example MapReduce WordCount Process

 First, divide the input into three splits, as shown in the figure.
 This distributes the work among all the map nodes.
 Then, tokenize the words in each of the mappers and give a hardcoded value (1) to each of the tokens or words.
 The reason for giving a hardcoded value equal to 1 is that every word, in itself, occurs once.
 A list of key-value pairs is created, where the key is an individual word and the value is one.


Example MapReduce WordCount Process

 For the first line (Dear Bear River), the mapper emits three key-value pairs – (Dear, 1); (Bear, 1); (River, 1). The mapping process is the same on all the nodes.
 After the mapper phase, a partition process takes place in which sorting and shuffling happen, so that all the tuples with the same key are sent to the corresponding reducer.
 After the sorting and shuffling phase, each reducer has a unique key and a list of values corresponding to that key – for example, (Bear, [1,1]); (Car, [1,1,1]); etc.


Example MapReduce WordCount Process

 Each reducer counts the values present in its list of values.
 The reducer for the key Bear gets the list of values [1,1].
 It then counts the number of ones in that list and gives the final output – (Bear, 2).
 Finally, all the output key-value pairs are collected and written to the output file.
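The whole walk-through can be condensed into a plain-Java simulation (a real job would use Hadoop's Mapper and Reducer classes). The three input splits below are an assumption reconstructed from the pairs quoted above.

```java
import java.util.*;

// End-to-end plain-Java simulation of the WordCount walk-through (not the
// Hadoop API): map each split to (word, 1) pairs, shuffle and sort them
// into <word, [1,1,...]> groups, then reduce by counting the ones.
public class WordCount {

    static Map<String, Integer> wordCount(List<String> splits) {
        // Map phase: tokenize each split and emit (word, 1).
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String split : splits) {
            for (String word : split.split("\\s+")) {
                pairs.add(Map.entry(word, 1));
            }
        }
        // Shuffle and sort: group the ones under each word, keys sorted.
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        // Reduce phase: each key's count is the number of ones in its list.
        Map<String, Integer> result = new LinkedHashMap<>();
        grouped.forEach((word, ones) -> result.put(word, ones.size()));
        return result;
    }
}
```

Running it on the splits "Dear Bear River", "Car Car River", "Dear Car Bear" reproduces the groups from the walk-through: Bear → [1,1] counts to 2, Car → [1,1,1] counts to 3.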



MapReduce - InputFormat

 How the input files are split up and read is defined by the InputFormat.
 InputFormat is a class that does the following:
 Selects the files that should be used for input.
 Defines the InputSplits that break up a file.
 Provides a factory for RecordReader objects that read the file.


MapReduce – Input Splits

 An input split describes a unit of work that comprises a single map task in a MapReduce program.
 Dividing the file into splits allows several map tasks to operate on a single file in parallel.
 If the file is very large, this can improve performance through parallelism.
 Each map task corresponds to a single input split.


MapReduce – RecordReader

 The input split defines a slice of work but does not describe how to access it.
 The RecordReader class actually loads the data from its source and converts it into (k, v) pairs suitable for reading by the Mapper.
 The RecordReader is invoked repeatedly on the input until the entire split has been consumed.
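For Hadoop's default TextInputFormat, the RecordReader hands each line of the split to the Mapper as a (byte offset, line text) pair. The sketch below simulates that behaviour in plain Java; it is an illustration, not the Hadoop LineRecordReader class itself.

```java
import java.util.*;

// Simulation of what a line-oriented RecordReader does for one input split:
// walk the split and produce one (byte offset, line text) key-value pair per
// line, the way Hadoop's default TextInputFormat does for its Mappers.
public class LineRecords {

    static List<Map.Entry<Long, String>> readRecords(String split) {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        long offset = 0;
        for (String line : split.split("\n")) {
            records.add(Map.entry(offset, line)); // the (k, v) pair for the Mapper
            offset += line.length() + 1;          // +1 for the newline consumed
        }
        return records;
    }
}
```

Each call yields the next record until the split is exhausted, which is exactly the "invoked repeatedly until the entire split is consumed" behaviour described above.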



MapReduce – Reducer

 The mapper performs the first phase of the MapReduce program.
 A new instance of Mapper is created for each split.
 The reducer performs the second phase of the MapReduce program.
 A new instance of Reducer is created for each partition.
 For each key in the partition assigned to a reducer, the reducer is called once.


Summary of Data Flow in MapReduce

[Diagram: overall data flow from input splits through map, shuffle/sort, and reduce to the final output.]
