BDA Unit 2 Notes

The document provides a comprehensive overview of the MapReduce programming model within the Hadoop ecosystem, detailing its core components such as the Map and Reduce functions, Job Tracker, and Task Tracker. It explains the phases involved in both mapping and reducing tasks, including Record Reader, Mapper, Combiner, Partitioner, Shuffle, Sort, and Output format, while also discussing the benefits of compression in MapReduce. Additionally, it includes a sample WordCount program and outlines various InputFormats used in MapReduce applications.


Developing MapReduce Program

MapReduce is a core component of the Hadoop ecosystem because it provides the processing logic. Put simply, MapReduce is a software framework that enables us to write applications that process large data sets using distributed and parallel algorithms in a Hadoop environment.
The parallel-processing capability of MapReduce plays a crucial role in the Hadoop ecosystem: it allows Big Data analysis to be performed across multiple machines in the same cluster.
A MapReduce program has two functions: Map and Reduce.

Map function: It converts one set of data into another, in which individual elements
are broken down into tuples (key/value pairs).

Reduce function: It takes the output of the Map function as its input and
aggregates and summarizes the results produced by the Map function.
A MapReduce job is mainly divided into two phases, the Map phase and the Reduce
phase.
1. Map: As the name suggests, its main use is to map the input data into key-value
pairs. The input to the map may itself be a key-value pair, where the key can be an
identifier such as an address and the value is the actual data it holds. The Map()
function is executed on each of these input key-value pairs and generates
intermediate key-value pairs, which serve as the input for the Reducer (the
Reduce() function).
2. Reduce: The intermediate key-value pairs that serve as input for the Reducer are
shuffled, sorted, and sent to the Reduce() function. The Reducer aggregates or
groups the data based on the key, according to the reduce algorithm written by the
developer.
The Job Tracker and the Task Tracker deal with MapReduce:
1. Job Tracker: The Job Tracker manages all the resources and all the jobs across
the cluster, and schedules each map task on a Task Tracker running on the data
node that holds the data, since there can be hundreds of data nodes available in the
cluster.
2. Task Tracker: The Task Trackers are the workers that act on the instructions
given by the Job Tracker. A Task Tracker is deployed on each node in the cluster
and executes the Map and Reduce tasks as instructed by the Job Tracker.
There is also one important component of the MapReduce architecture known as
the Job History Server. The Job History Server is a daemon process that saves and
stores historical information about tasks and applications, such as the logs
generated during or after job execution.
MapReduce:
MapReduce is a programming model for data processing. Hadoop can run
MapReduce programs written in Java, Ruby and Python.
MapReduce programs are inherently parallel, so very large-scale data analysis can
be done quickly.
In MapReduce programming, jobs (applications) are split into a set of map tasks
and reduce tasks.
The map task takes care of loading, parsing, transforming, and filtering. The
reduce task is responsible for grouping and aggregating the data produced by the
map tasks to generate the final output.
Each map task is broken down into the following phases:
1. Record Reader
2. Mapper
3. Combiner
4. Partitioner
The output produced by the map task is known as intermediate <key, value> pairs.
These intermediate <key, value> pairs are sent to the reducer.
The reduce tasks are broken down into the following phases:
1. Shuffle
2. Sort
3. Reducer
4. Output format.
Hadoop assigns map tasks to the DataNode where the actual data to be processed
resides. This way, Hadoop ensures data locality. Data locality means that data is
not moved over the network; only the computational code is moved to the data,
which saves network bandwidth.
Mapper Phases:
The Mapper maps the input <key, value> pairs into a set of intermediate
<key, value> pairs.
Each map task is broken into following phases:
1. RecordReader: converts the byte-oriented view of the input into a record-oriented
view and presents it to the Mapper task as keys and values.
i) InputFormat: It reads the given input file and splits it using the method getSplits().
ii) It then defines a RecordReader using createRecordReader(), which is responsible
for generating <key, value> pairs.
2. Mapper: The map function works on the <key, value> pairs produced by the
RecordReader and generates intermediate (key, value) pairs.
Methods:
- protected void cleanup(Context context): called once at the end of the task.
- protected void map(KEYIN key, VALUEIN value, Context context): called once
for each key-value pair in the input split.
- void run(Context context): the user can override this method for complete control
over the execution of the Mapper.
- protected void setup(Context context): called once at the beginning of the task to
perform the activities required to initialize the map() method.
3. Combiner: It takes the intermediate <key, value> pairs produced by a mapper and
applies a user-specified aggregate function to the output of that single mapper. It is
also known as a local Reducer.
We can optionally specify a combiner using
Job.setCombinerClass(ReducerClass) to perform local aggregation on the
intermediate outputs.
4. Partitioner: It takes the intermediate <key, value> pairs produced by the mapper
and splits them into partitions using a user-defined condition. The default
behavior is to hash the key to determine the reducer. The user can control this by
overriding the method:
int getPartition(KEY key, VALUE value, int numPartitions)
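As a minimal sketch, a custom partitioner can be written as below. It assumes Text keys and IntWritable values (as in the WordCount example later in these notes) and simply mirrors the default hash-partitioning behavior; the class name WordPartitioner is illustrative only.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Hash the key and map it onto one of the reducer partitions.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
The driver would register it with job.setPartitionerClass(WordPartitioner.class).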
Reducer Phases:
1. Shuffle & Sort:
 Downloads the grouped key-value pairs onto the local machine, where the
Reducer is running.
 The individual <key, value> pairs are sorted by key into a larger data list.
 The data list groups the equivalent keys together so that their values can be
iterated easily in the Reducer task.
2. Reducer:
 The Reducer takes the grouped key-value paired data as input and runs a
Reducer function on each one of them.
 Here, the data can be aggregated, filtered, and combined in a number of ways;
a wide range of processing can happen here.
 Once the execution is over, it gives zero or more key-value pairs to the final
step.
Methods:
- protected void cleanup(Context context): called once at the end of the task.
- protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context):
called once for each key.
- void run(Context context): the user can override this method for complete control
over the execution of the Reducer.
- protected void setup(Context context): called once at the beginning of the task to
perform the activities required to initialize the reduce() method.
3. Output format:
 In the output phase, we have an output formatter that translates the final key-
value pairs from the Reducer function and writes them onto a file using a record
writer.

Compression: In MapReduce programming we can compress the output file.


Compression provides two benefits as follows:
 Reduces the space to store files.
 Speeds up data transfer across the network.
We can specify compression format in the Driver program as below:
conf.setBoolean("mapred.output.compress", true);
conf.setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);
Here, a codec is the implementation of a compression-decompression algorithm;
GzipCodec is the codec for the gzip algorithm.
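With the newer MapReduce API, the same settings can be made through FileOutputFormat in the driver. This is a minimal sketch, assuming a Job object such as the one created in the WordCount driver shown later in these notes:
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

FileOutputFormat.setCompressOutput(job, true);                   // compress the job output
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class); // use the gzip codec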

Anatomy of MapReduce Code

Hadoop launches a MapReduce job by first splitting (logically) the input dataset
into multiple data splits. Each map task is then scheduled to a TaskTracker node
where its data split resides. A Task Scheduler is responsible for scheduling the
execution of the tasks in a data-local manner as far as possible.
In a typical MapReduce job, input files are read from the Hadoop Distributed File
System (HDFS). Data is usually compressed to reduce file sizes.
After decompression, serialized bytes are transformed into Java objects before
being passed to a user-defined map() function. Conversely, output records are
serialized, compressed, and eventually pushed back to HDFS. However, behind
this apparent simplicity, the processing is broken down into many steps and has
hundreds of different tunable parameters to fine-tune the job’s running
characteristics.
Hadoop MapReduce jobs are divided into a set of map tasks and reduce tasks that
run in a distributed fashion on a cluster of computers. Each task works on a small
subset of the data it has been assigned, so that the load is spread across the cluster.
The input to a MapReduce job is a set of files in the data store that are spread out
over HDFS. In Hadoop, these files are split with an input format, which defines
how to separate a file into input splits. You can think of an input split as a byte-
oriented view of a chunk of the file to be loaded by a map task.
The map task generally performs loading, parsing, transformation and filtering
operations, whereas reduce task is responsible for grouping and aggregating the
data produced by map tasks to generate final output. This is the way a wide range
of problems can be solved with such a straightforward paradigm, from simple
numerical aggregation to complex join operations and cartesian products.
Each map task in Hadoop is broken into the following phases: record reader,
mapper, combiner, partitioner. The output of the map phase, called the intermediate
keys and values, is sent to the reducers. The reduce tasks are broken into the
following phases: shuffle, sort, reducer, and output format. The map tasks are
assigned by the Hadoop framework to those Data Nodes where the actual data to be
processed resides. This ensures that the data typically does not have to move over
the network, which saves network bandwidth; the data is computed on the local
machine itself, so the map task is said to be data-local.
Map Phase:
Record Reader:
The record reader translates an input split generated by the input format into
records. Its purpose is to parse the data into records, but not to parse the records
themselves. It passes the data to the mapper in the form of key/value pairs. Usually
the key in this context is positional information and the value is the chunk of data
that composes a record.
Map:
The map function is the heart of the mapper task. It is executed on each key/value
pair from the record reader to produce zero or more key/value pairs, called
intermediate pairs. The choice of key/value pair depends on what the MapReduce
job is accomplishing. The data is grouped on the key, and the value is the
information pertinent to the analysis in the reducer.
Combiner:
It is an optional component, but it is highly useful and can provide a significant
performance gain for a MapReduce job without any downside. A combiner is not
applicable to every MapReduce algorithm, but wherever it can be applied, its use is
recommended. It takes the intermediate keys from the mapper and applies a
user-provided method to aggregate values in the small scope of that one mapper.
For example, sending (hadoop, 3) requires fewer bytes than sending (hadoop, 1)
three times over the network.
Partitioner:
The Hadoop partitioner takes the intermediate key/value pairs from the mapper and
splits them into shards, one shard per reducer. This distributes the keyspace
roughly evenly over the reducers, while still ensuring that the same key emitted by
different mappers ends up at the same reducer. The partitioned data is written to the
local filesystem for each map task and waits to be pulled by its respective reducer.

Reducer
Shuffle and Sort:
The reduce task starts with the shuffle and sort step. This step takes the output files
written by all of the Hadoop partitioners and downloads them to the local machine
on which the reducer is running. These individual data pipes are then sorted by key
into one larger data list. The purpose of this sort is to group equivalent keys
together so that their values can be iterated over easily in the reduce task.
Reduce:
The reducer takes the grouped data as input and runs a reduce function once per
key grouping. The function is passed the key and an iterator over all the values
associated with that key. A wide range of processing can happen in this function;
the data can be aggregated, filtered, and combined in a number of ways. Once it is
done, it sends zero or more key/value pairs to the final step, the output format.
Output Format:
The output format translates the final key/value pairs from the reduce function and
writes them out to a file using a record writer. By default, it separates the key and
value with a tab and separates records with a newline character.
MapReduce program for WordCount problem

Write a MapReduce program for WordCount problem.


import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount
{
public static class WCMapper extends Mapper<Object, Text, Text, IntWritable> {
final static IntWritable one = new IntWritable(1);
Text word = new Text();
public void map(Object key, Text value, Context context) throws
IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}

public static class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context ) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}

public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(WCMapper.class);
job.setReducerClass(WCReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

MapReduce paradigm for WordCount

Types of InputFormat in MapReduce


In Hadoop, there are various MapReduce types for InputFormat that are used for
various purposes. Let us now look at the MapReduce types of InputFormat:
FileInputFormat
It serves as the foundation for all file-based InputFormats. FileInputFormat also
provides the input directory, which contains the location of the data files. When we
start a MapReduce job, FileInputFormat returns the paths of the files to read. This
InputFormat reads all of those files and then divides them into one or more
InputSplits.
TextInputFormat
It is the standard InputFormat. Each line of each input file is treated as a separate
record by this InputFormat. It does not parse anything. TextInputFormat is suitable
for raw data or line-based records, such as log files. Hence:
 Key: the byte offset within the file of the beginning of the line (not the offset
within the split). As a result, when combined with the file name, it is unique.
 Value: the contents of the line, excluding any line terminators.

KeyValueTextInputFormat
It is comparable to TextInputFormat. Each line of input is also treated as a separate
record by this InputFormat. While TextInputFormat treats the entire line as the
value, KeyValueTextInputFormat divides the line into a key and a value at a tab
character ('\t'). Hence:
 Key: everything up to (but not including) the tab character.
 Value: the remaining part of the line after the tab character.
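As a minimal driver sketch (assuming a Job object named job), KeyValueTextInputFormat can be enabled and its separator changed from the default tab; the property name below is the one used by the new (mapreduce) API, and the comma separator is just an illustrative choice:
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

// Split each line at the first comma instead of the default tab character.
job.getConfiguration().set(
        "mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
job.setInputFormatClass(KeyValueTextInputFormat.class);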

SequenceFileInputFormat
It is an input format for reading sequence files. Sequence files are binary files that
store sequences of binary key-value pairs. They can be block-compressed and
support direct serialization and deserialization of a variety of data types. Hence the
key and the value are both user-defined.
SequenceFileAsTextInputFormat
It is a subtype of SequenceFileInputFormat. This format converts the sequence
file's keys and values to Text objects by calling toString() on them. In this way,
SequenceFileAsTextInputFormat turns sequence files into text-based input suitable
for streaming.
NLineInputFormat
It is a variant of TextInputFormat in which, as before, the keys are the byte offsets
of the lines and the values are the lines' contents. With TextInputFormat and
KeyValueTextInputFormat, each mapper receives a variable number of lines of
input, determined by the size of the split and the length of the lines. If we want
each mapper to receive a fixed number of lines of input, we use NLineInputFormat.
N is the number of lines of input received by each mapper.
By default, each mapper receives exactly one line of input (N=1).
Assuming N=2, each split has two lines. As a result, the first two key-value pairs
go to one mapper and the next two key-value pairs go to another mapper.
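A minimal driver sketch (assuming a Job object named job) that matches the N=2 example above; NLineInputFormat and its setNumLinesPerSplit() helper come from the mapreduce input library:
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.setNumLinesPerSplit(job, 2);   // N = 2 lines per mapper split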
DBInputFormat
Using JDBC, this InputFormat reads data from a relational database. It is best
suited for loading relatively small datasets, which can then be joined with large
datasets from HDFS using multiple inputs. Hence:
 Key: LongWritables

 Value: DBWritables.
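The sketch below shows how a table might be read with DBInputFormat. The table name customers, its columns, the JDBC driver, the connection URL, and the credentials are all assumptions for illustration:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

// Record class mapping one row of the assumed "customers" table.
public class CustomerRecord implements Writable, DBWritable {
    private int id;
    private String name;

    public void readFields(ResultSet rs) throws SQLException {     // from the database
        id = rs.getInt(1);
        name = rs.getString(2);
    }
    public void write(PreparedStatement ps) throws SQLException {  // to the database
        ps.setInt(1, id);
        ps.setString(2, name);
    }
    public void readFields(DataInput in) throws IOException {      // Hadoop serialization
        id = in.readInt();
        name = in.readUTF();
    }
    public void write(DataOutput out) throws IOException {
        out.writeInt(id);
        out.writeUTF(name);
    }

    // Driver fragment: configure the connection and the columns to read.
    public static void configureInput(Job job) {
        DBConfiguration.configureDB(job.getConfiguration(),
                "com.mysql.jdbc.Driver",             // assumed JDBC driver
                "jdbc:mysql://localhost/bank",       // assumed connection URL
                "user", "password");                 // assumed credentials
        DBInputFormat.setInput(job, CustomerRecord.class,
                "customers", null /* conditions */, "id" /* order by */, "id", "name");
        job.setInputFormatClass(DBInputFormat.class);
    }
}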

Output Format in MapReduce


The output format classes work in the opposite direction to their corresponding
input format classes. TextOutputFormat, for example, is the default output format.
It writes records as plain text files; the keys and values may be of any type, since
they are converted to strings using the toString() method. The tab character
separates the key and value by default, but this can be changed through the
separator property of the text output format.
SequenceFileOutputFormat is used to write a sequence of binary key-value pairs to
a file. Binary outputs are especially valuable if they are used as input to another
MapReduce job.
DBOutputFormat handles the output formats for relational databases and HBase. It
writes the reduce output to a SQL table.
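A minimal driver sketch (assuming a Job object named job) for switching the output format, or for keeping TextOutputFormat and changing its separator; the property name is the one used by the new (mapreduce) API and the comma is an illustrative choice:
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

job.setOutputFormatClass(SequenceFileOutputFormat.class);   // binary key-value output
// Or keep the default TextOutputFormat and change its key-value separator:
// job.getConfiguration().set("mapreduce.output.textoutputformat.separator", ",");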

MapReduce Features:
Scalability
MapReduce can scale to process vast amounts of data by distributing tasks across a
large number of nodes in a cluster. This allows it to handle massive datasets,
making it suitable for Big Data applications.
Fault Tolerance
MapReduce incorporates built-in fault tolerance to ensure the reliable processing
of data. It automatically detects and handles node failures, rerunning tasks on
available nodes as needed.
Data Locality
MapReduce takes advantage of data locality by processing data on the same node
where it is stored, minimizing data movement across the network and improving
overall performance.
Simplicity
The MapReduce programming model abstracts away many complexities associated
with distributed computing, allowing developers to focus on their data processing
logic rather than low-level details.
Cost-Effective Solution
Hadoop's scalable architecture and MapReduce programming framework make
storing and processing extensive data sets very economical.
Parallel Programming
Tasks are divided into programming models to allow for the simultaneous
execution of independent operations. As a result, programs run faster due to
parallel processing, making it easier for a process to handle each job. Thanks to
parallel processing, these distributed tasks can be performed by multiple
processors. Therefore, all software runs faster.

Combiner Optimization:
Combiner always works in between Mapper and Reducer.
The output produced by the Mapper is the intermediate output in the form of
key-value pairs, and it can be massive in size.
If we feed this huge output directly to the Reducer, it will increase network
congestion. To minimize this network congestion we put a Combiner between the
Mapper and the Reducer.
The Combiner is also a class in our Java program, like the Map and Reduce
classes, that is used between them. The Combiner helps us produce an abstract or
summary of very large datasets.
// Key Value pairs generated for data Geeks For Geeks For
(Geeks,1)
(For,1)
(Geeks,1)
(For,1)
We have 4 key-value pairs generated by the Mapper. Since these intermediate
key-value pairs are not ready to be fed directly to the Reducer (doing so can
increase network congestion), the Combiner will combine them before sending
them to the Reducer.
The Combiner combines these intermediate key-value pairs according to their key.
For the above example, for the data "Geeks For Geeks For", the Combiner will
partially reduce them by merging pairs with the same key and generate new
key-value pairs as shown below.
// Partially reduced key-value pairs with combiner
(Geeks,2)
(For,2)
With the help of the Combiner, the Mapper output is partially reduced in size
(fewer key-value pairs) and can then be handed to the Reducer for better
performance.
Here is a brief summary of how the MapReduce Combiner works:
• A combiner does not have a predefined interface and it must implement the
Reducer interface’s reduce() method.
• A combiner operates on each map output key. It must have the same output
key-value types as the Reducer class.
• A combiner can produce summary information from a large dataset because
it replaces the original Map output.
Although the Combiner is optional, it helps segregate data into multiple groups for
the Reduce phase, which makes processing easier.
MapReduce Combiner Implementation
The following example provides a theoretical idea about combiners. Let us assume
we have the following input text file named input.txt for MapReduce.
What do you mean by Object
What do you know about Java
What is Java Virtual Machine
How Java enabled High Performance

The important phases of a MapReduce program with a Combiner are the same as
those described above; a driver sketch that enables a combiner for the WordCount
job is shown below.
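As a minimal sketch (reusing the WordCount WCMapper and WCReducer classes shown earlier), the combiner is enabled in the driver simply by registering the reducer class as the combiner; this is valid here because summing counts is associative and commutative:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountWithCombiner {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count with combiner");
        job.setJarByClass(WordCountWithCombiner.class);
        job.setMapperClass(WordCount.WCMapper.class);
        // Reuse the reducer as a local combiner on each mapper's output.
        job.setCombinerClass(WordCount.WCReducer.class);
        job.setReducerClass(WordCount.WCReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}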
Advantage of combiners
• Reduces the time taken for transferring the data from Mapper to Reducer.
• Reduces the size of the intermediate output generated by the Mapper.
• Improves performance by minimizing Network congestion.
• Reduces the workload on the Reducer.
• Improves fault tolerance.
• Improves scalability.
• Helps optimize MapReduce jobs: combiners can be used to optimize MapReduce
jobs by performing some preliminary data processing before the data is sent to
the Reducer. This can reduce the amount of processing required by the Reducer,
which helps improve performance and reduce overall processing time.
Disadvantage of combiners
• The intermediate key-value pairs generated by Mappers are stored on Local
Disk and combiners will run later on to partially reduce the output which
results in expensive Disk Input-Output.
• A MapReduce job cannot depend on the combiner function being executed,
because there is no guarantee of its execution (it may run zero, one, or more times).
• Increased resource usage
• Combiners may not always be effective
• Combiners can introduce data inconsistencies
• Increased complexity

Joins in MapReduce
The join operation is used to combine two or more database tables based on
foreign keys. In general, companies maintain separate tables for the customer and
the transaction records in their database. And, many times these companies need to
generate analytic reports using the data present in such separate tables. Therefore,
they perform a join operation on these separate tables using a common column
(foreign key), like customer id, etc., to generate a combined table. Then, they
analyze this combined table to get the desired analytic reports.
Just like SQL join, we can also perform join operations in MapReduce on different
data sets. There are two types of join operations in MapReduce:
Map Side Join:
As the name implies, the join operation is performed in the map phase itself.
Therefore, in the map side join, the mapper performs the join and it is mandatory
that the input to each map is partitioned and sorted according to the keys.
When we apply a join operation, the job is assigned to a MapReduce task which
consists of two stages: a 'Map stage' and a 'Reduce stage'. A mapper's job during
the Map stage is to read the data from the join tables and to return the 'join key'
and 'join value' pair into an intermediate file. Further, in the shuffle stage, this
intermediate file is sorted and merged. The reducer's job during the Reduce stage
is to take this sorted result as input and complete the task of the join.

• Map-side Join is similar to a join but all the task will be performed by the
mapper alone.
• The Map-side Join will be mostly suitable for small tables to optimize the task.
Assume that we have two tables of which one of them is a small table. When we
submit a map reduce task, a Map Reduce local task will be created before the
original join Map Reduce task which will read data of the small table from HDFS
and store it into an in-memory hash table. After reading, it serializes the in-
memory hash table into a hash table file. In the next stage, when the original join
Map Reduce task is running, it moves the data in the hash table file to the Hadoop
distributed cache, which populates these files to each mapper’s local disk. So all
the mappers can load this persistent hash table file back into the memory and do
the join work as before. The execution flow of the optimized map join is shown in
the figure below. After optimization, the small table needs to be read just once.
Also, if multiple mappers are running on the same machine, the distributed cache
only needs to push one copy of the hash table file to that machine.
Advantages of using map side join:
• Map-side join helps in minimizing the cost that is incurred for sorting and
merging in the shuffle and reduce stages.
• Map-side join also helps in improving the performance of the task by
decreasing the time to finish the task.
Disadvantages of Map-side join:
• Map-side join is adequate only when one of the tables on which you perform the
join is small enough to fit into memory. Hence it is not suitable when both tables
contain huge amounts of data.
We can perform the Map-side Join on the two tables to extract the list of
departments in which each employee works.
Here, the second table, dept, is a small table; the number of departments will
always be less than the number of employees in an organization.

We could perform the same task with the help of a normal reduce-side join. The
map-side join completes its job without the help of any reducer, whereas the
normal join executes this job with the help of one reducer.
Hence, map-side join is your best bet when one of the tables is small enough to fit
in memory, completing the job in a short span of time. A sketch of a map-side join
mapper is given below.
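The sketch below illustrates the pattern described above, under these assumptions: the small dept table has been added to the distributed cache in the driver (for example with job.addCacheFile(new URI("/input/dept.txt#dept")) together with job.setNumReduceTasks(0)), it is a plain text file with lines of the form deptId<TAB>deptName, and each employee record arrives as empId<TAB>empName<TAB>deptId. The file names and layouts are illustrative only.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> deptById = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Files added via job.addCacheFile(...) are localized on each mapper node.
        URI[] cacheFiles = context.getCacheFiles();
        if (cacheFiles != null && cacheFiles.length > 0) {
            // Read through the "dept" symlink created by the "#dept" URI fragment.
            try (BufferedReader reader = new BufferedReader(new FileReader("dept"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] parts = line.split("\t");
                    deptById.put(parts[0], parts[1]);    // deptId -> deptName
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t"); // empId, empName, deptId
        String deptName = deptById.get(fields[2]);
        if (deptName != null) {                         // inner join on deptId
            context.write(new Text(fields[1]), new Text(deptName));
        }
    }
}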
Reduce Side Join:
As the name suggests, in the reduce side join, the reducer is responsible for
performing the join operation. It is comparatively simple and easier to
implement than the map side join as the sorting and shuffling phase sends the
values having identical keys to the same reducer and therefore, by default, the
data is organized for us.
Let’s make two datasets: One with Bank customer data and the other with their
credit transaction data both having the customer id as a common key. The
Attributes in the datasets are

 Customer Details: Customer ID, Name, Age

 Credit transaction Details: Transaction ID, Date, Customer ID, Transaction


amount
I would like to calculate the number of credit transactions made by each customer
along with the total amount. So, the output after joining is Customer name,
transactions count, and total transaction amount.
Reduce side Join

Each of the two datasets will have its own mapper, with one for customer details
input and the other for transaction details input. Also, we have one reducer to
produce our desired output.

1. Customer details mapper:

 It will read the input, taking one tuple at a time.

 Then, it will tokenize each word in that tuple and fetch the customer ID along
with the name of the person.

 Make customer ID as the key.

 Make customer's name as value.


 We are appending the “customers” string before the customer name to indicate
it is from customer details.

 Therefore, my mapper for customer details will produce the following


intermediate key-value pair: [customer ID, customers name]

 Example: [1, customers Jhon], [2, customers Ravi], etc.

2. Credit transaction details mapper:

 It will read the input, taking one tuple at a time.

 Then, it will tokenize each word in that tuple and fetch the customer ID along
with the transaction amount of the customer.

 Make customer ID as the key

 Make credit amount as value.

 We are appending the “transaction” string before the credit amount to indicate
it is from transaction details.

 Therefore, my mapper for transaction details will produce the following


intermediate key-value pair: [customer ID, transaction Credit amount]

 Example: [1, transaction 50], [2, transaction 150], etc.


3. Sort and shuffle: The sorting and shuffling phase will generate an array list
of values corresponding to each key.

4. Reduce join reducer:

 It will read the input from the sort and shuffle phase as a key & list of values
where the key is nothing but the customer ID. The list of values will have the
input from both datasets.

 Now, it will loop through the values present in the list of values in the reducer.

 Then, it will split the list of values and check whether the value is of
transaction details type or customer details type.

 If it is of the credit transaction details type, then perform the following steps:
 1. Increase the counter value by one to calculate the number of transactions
made by each customer.
2. Also, add all the transaction amounts to get the total transaction amount of
a particular customer.
 On the other hand, if the value is of customer details type, then store it in a
string variable.

 After that, assign the name as the key and the number of transactions along with
the total amount as the value in the output key-value pair.

 Finally, write the output key-value pair in the output folder in HDFS.

 The output will then contain, for each customer, the name, the transaction count,
and the total transaction amount. A sketch of the two mappers and the join reducer
is given below.
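The sketch below follows the steps above. Both mappers key their output by customer ID and tag each value so the reducer can tell which dataset it came from; the comma-separated field layouts (custId,name,age and txnId,date,custId,amount) are assumptions for illustration.
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ReduceSideJoin {

    // Customer records assumed as: custId,name,age
    public static class CustomerMapper extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split(",");
            context.write(new Text(parts[0]), new Text("customers\t" + parts[1]));
        }
    }

    // Transaction records assumed as: txnId,date,custId,amount
    public static class TransactionMapper extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split(",");
            context.write(new Text(parts[2]), new Text("transaction\t" + parts[3]));
        }
    }

    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String name = "";
            int count = 0;
            double total = 0.0;
            for (Text value : values) {
                String[] parts = value.toString().split("\t");
                if (parts[0].equals("transaction")) {
                    count++;                               // one more transaction
                    total += Double.parseDouble(parts[1]); // accumulate amount
                } else {
                    name = parts[1];                       // "customers" record
                }
            }
            context.write(new Text(name), new Text(count + "\t" + total));
        }
    }
}
In the driver, the two mappers could be wired to their respective input files with MultipleInputs.addInputPath(...) and the reducer set with job.setReducerClass(JoinReducer.class).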

Secondary Sorting:
The ability to sort data is one of the main features of MapReduce capabilities.
MapReduce can be sorted in the following ways:
Partial Sort: This is the default sort where MapReduce sorts the data by keys.
Individual output files are sorted but no globally sorted file is combined and
produced.
Total Sort: Produces globally sorted output files. It does this by using a partitioner
that respects the total order of the output and the partition sizes must be fairly even.
Secondary Sort: Sorts the values within each key group rather than just sorting by
key.
Secondary sort is a technique that allows the MapReduce programmer to control
the order in which the values show up within a reduce function call.
Let’s also assume that our secondary sorting is on a composite key made out of
Last Name and First Name.

The partitioner and the group comparator use only the natural key; the partitioner
uses it to channel all records with the same natural key to a single reducer. This
partitioning happens in the map phase; data from the various map tasks is received
by the reducers, where it is grouped and then sent to the reduce method. This
grouping is where the group comparator comes into the picture: if we did not
specify a custom group comparator, Hadoop would use the default implementation,
which considers the entire composite key and would lead to incorrect results.
Finally, reviewing the steps involved in a MapReduce job and relating them to
secondary sorting should help clear up any lingering doubts; a sketch of the
natural-key partitioner and group comparator is given below.
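This is a minimal sketch of the plumbing described above, assuming a hypothetical composite key carried as Text in the form "lastName<TAB>firstName". The partitioner and the grouping comparator look only at the natural key (the last name), while the full composite key drives the sort order; the class names are illustrative only.
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

public class NaturalKeyPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text compositeKey, Text value, int numPartitions) {
        // Partition on the last name only, so every record for one last name
        // reaches the same reducer regardless of the first name.
        String lastName = compositeKey.toString().split("\t")[0];
        return (lastName.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

class NaturalKeyGroupingComparator extends WritableComparator {
    protected NaturalKeyGroupingComparator() {
        super(Text.class, true);
    }
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // Group reduce() calls by last name only; the first name affects
        // only the order in which values are seen inside the group.
        String lastA = a.toString().split("\t")[0];
        String lastB = b.toString().split("\t")[0];
        return lastA.compareTo(lastB);
    }
}
The driver would register these with job.setPartitionerClass(NaturalKeyPartitioner.class) and job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class).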

Pipelining MapReduce Jobs


A MapReduce job executes through various phases: Input Files, Input Format,
InputSplit, RecordReader, Mapper, Combiner, Partitioner, Shuffling and Sorting,
Reducer, RecordWriter, Output Format and Output Files.

Input Files: The data for a Map Reduce task is stored in input files and these input
files are generally stored in HDFS.
Input Format: Input Format defines how the input files are split and read. It
selects the files or other objects that are used for input. Input Format creates
InputSplit.
Java Class: org.apache.hadoop.mapreduce.InputFormat<K,V>
Methods: getSplits() (returns List<InputSplit>), createRecordReader() (returns RecordReader<K,V>)
Input Split: It is the logical representation of data. It represents the data which is
processed by an individual Mapper. When you save any file in Hadoop, the file is
broken down into blocks of 128 MB (default configuration).
One map task is created for each Input Split. The split is divided into records and
each record will be processed by the mapper. It is always beneficial to have
multiple splits, because the time taken to process a split is small as compared to the
time taken for processing of the whole input. When the splits are smaller, the
processing is better load balanced since it will be processing the splits in parallel.
Record Reader: It communicates with the InputSplit and converts the data into
key-value pairs suitable for reading by the mapper. By default, it uses
TextInputFormat for converting data into key-value pairs. The Record Reader
communicates with the InputSplit until the file reading is completed. It assigns a
byte offset (a unique number) to each line present in the file. Then these key-value
pairs are sent to the mapper for further processing.
Mapper: Mapper processes each input record and generates an intermediate key-
value pair. These <key, value> pairs can be completely different from the input
pair. In mapper task, the output is full collection of all these <key, value> pairs.
The intermediate output is stored on the local disk as this is temporary data and
writing on HDFS will create unnecessary copies. In the event of node failure
before the map output is consumed by the reduce task, Hadoop reruns the map task
on another node and re-creates the map output.
No. of Mappers = (total data size) / (input split size)
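For example, under the default configuration, a 1 GB (1024 MB) input with a 128 MB split size yields 1024 / 128 = 8 map tasks.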
The mapper's output is passed to the combiner for further processing.
Java Class: org.apache.hadoop.mapreduce.Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
Combiner: The combiner is also known as ‘Mini-Reducer’. Combiner is
optional and performs local aggregation on the mappers output, which helps to
minimize the data transfer between Mapper and Reducer, thereby improving the
overall performance of the Reducer. The output of Combiner is then passed to the
Partitioner.
Java Class: a Reducer class is used as the combiner; it is set and retrieved through
the job configuration, e.g. JobConf.getCombinerClass() in the old API.
Partitioner: Partitioner comes into picture if we are working on more than one
reducer. Partitioner takes the output from Combiners and performs partitioning.
Partitioning of output takes place on the basis of the key and then sorted. Hash
Partitioner is the default Partitioner in MapReduce; it computes a hash value for
the key and assigns the partition based on this result. The total number of
partitions is equal to the number of reducers, which is set by the
JobConf.setNumReduceTasks() method. The hash function uses the key to derive
the partition. According to the key, each mapper's output is partitioned, and
records having the same key go into the same partition (within each mapper);
each partition is then sent to a reducer. This partitioning guarantees that all the
values for a given key are grouped together and go to the same reducer, while
distributing the map output evenly over the reducers.
Java Class: org.apache.hadoop.mapreduce.Partitioner<KEY,VALUE>
Shuffling and Sorting: The shuffling is the physical movement of the data which
is done over the network. Since shuffling can start even before the map phase has
finished, it saves some time and completes the tasks in less time. The keys
generated by the mapper are automatically sorted by MapReduce. Values passed
to each reducer are not sorted and can be in any order. Sorting helps the reducer
easily distinguish when a new reduce call (a new key) should start.
Reducer: It takes the set of intermediate key-value pairs produced by the mappers
as the input and then runs a reducer function on each of them to generate the
output. The output of the reducer is the final output, which is stored in HDFS.
Reducers run in parallel as they are independent of one another. The user decides
the number of reducers; by default, it is 1. Increasing the number of reducers
increases framework overhead, but it improves load balancing and lowers the cost
of failures.
Java Class: Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
Record Writer: It writes the output key-value pairs from the Reducer phase to the
output files. The implementation used to write the job's output files is defined by
the Output Format.
Output Format: The way these output key-value pairs are written in output files
by RecordWriter is determined by the Output Format. The final output of reducer
is written on HDFS by Output Format. Output Files are stored in a File System.
Java Class: org.apache.hadoop.mapreduce.OutputFormat<K,V>
Method: getRecordWriter() (returns RecordWriter<K,V>)
Output Files: The output is stored in these Output Files and these Output Files are
generally stored in HDFS.
