Big Data BCA Unit 4
HADOOP MAPREDUCE
MapReduce is a framework with which we can write applications to process huge amounts of
data, in parallel, on large clusters of commodity hardware in a reliable manner. It divides a task
into small parts and assigns them to many computers. Later, the results are collected at one place
and integrated to form the result dataset.
Fig. 1. MapReduce framework: a centralized user system distributing work across commodity hardware nodes.
Example:
Suppose we have a text file and we want to find how many times each word appears in the file.
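For the three-line input file shown under the Record Reader section below (Dog Cat Mouse / Dog Dog Cat / Dog Cat Duck), the final result of such a word-count job would be Dog 4, Cat 3, Mouse 1, Duck 1.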
Each map task in Hadoop is broken into the following phases: record reader, mapper, combiner, and
partitioner. The output of the map phase, called intermediate keys and values, is sent to the reducers.
Each reduce task is broken into the following phases: shuffle, sort, reducer, and output format. The
Hadoop framework assigns map tasks to the DataNodes where the data to be processed actually
resides. Because of this, the data typically does not have to move over the network, which saves
network bandwidth; the computation happens on the local machine itself, which is why the map
task is said to be data local.
Record Reader:
The record reader translates an input split generated by the input format into records. Its purpose
is to parse the data into records, but it does not parse the records themselves. It passes the data
to the mapper in the form of key/value pairs. Usually the key in this context is positional information
and the value is the chunk of data that makes up a record. For the example above, the input file
would be split like:
1, Dog Cat Mouse    (where 1 is the 'key' and 'Dog Cat Mouse' is the 'value')
2, Dog Dog Cat
3, Dog Cat Duck
Map:
The map function is the heart of the mapper task; it is executed on each key/value pair from the
record reader to produce zero or more key/value pairs, called intermediate pairs. What constitutes
the key and value depends on what the MapReduce job is accomplishing. The data is grouped on
the key, and the value is the information pertinent to the analysis in the reducer. In the example
above, three mappers are used to count the occurrences of each word.
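As a concrete illustration, here is a minimal Java sketch of a word-count mapper written against the standard org.apache.hadoop.mapreduce API; the class name and the whitespace tokenization are illustrative choices, not fixed by these notes.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = position of the record in the file, value = one line of text
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit the intermediate pair (word, 1)
            }
        }
    }
}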
Combiner:
It is an optional component, but it is highly useful and can provide a significant performance gain
for a MapReduce job without any downside. A combiner is not applicable to every MapReduce
algorithm, but wherever it can be applied it is recommended. It takes the
intermediate keys from the mapper and applies a user-provided method to aggregate values within
the small scope of that one mapper. For example, sending (hadoop, 3) requires fewer bytes than
sending (hadoop, 1) three times over the network.
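In the word-count job, the reduce operation (summing the counts) is associative and commutative, so the reducer class sketched later in this unit can also serve as the combiner; the driver sketch at the end of the Job Submission section registers it with job.setCombinerClass, which is a choice of that sketch rather than something fixed by these notes.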
Partitioner:
The partitioner takes the intermediate key/value pairs from the mapper and splits them into shards,
one shard per reducer. This distributes the keyspace roughly evenly over the reducers, while still
ensuring that pairs with the same key produced by different mappers end up at the same reducer.
The partitioned data is written to the local filesystem of each map task and waits to be pulled by its
respective reducer.
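The behaviour described above is essentially what Hadoop's default hash partitioner does; the following Java sketch (with an illustrative class name) shows the idea for the word-count pairs.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // The same key always hashes to the same shard, so every mapper's
        // ("Dog", 1) pairs end up at the same reducer.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}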
Reducer
Shuffle and Sort:
The reduce task starts with the shuffle and sort step. This step takes the output files written by all
of the partitioners and downloads them to the local machine on which the reducer is running.
These individual pieces of data are then sorted by key into one larger list. The purpose of this
sort is to group equivalent keys together so that their values can be iterated over easily in the
reduce task.
Reduce:
The reducer takes the grouped data as input and runs a reduce function once per key grouping.
The function is passed the key and an iterator over all the values associated with that key. A wide
range of processing can happen in this function: the data can be aggregated, filtered, and
combined in a number of ways. Once it is done, it sends zero or more key/value pairs to the final
step, the output format.
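Continuing the word-count illustration, a minimal Java reducer sketch (class name illustrative) that sums the values grouped under each key might look like this:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {   // iterate over all values grouped under this key
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));   // e.g. ("Dog", 4)
    }
}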
Output Format:
The output format takes the final key/value pairs from the reduce function and writes them out to
a file via a record writer. By default, it separates the key and value with a tab and separates
records with a newline character.
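For the word-count example, the default behaviour would therefore produce output lines such as Cat<TAB>3 and Dog<TAB>4, one record per line.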
The below tasks occur when the user submits a MapReduce job to Hadoop -
The local Job Client prepares the job for submission and hands it off to the Job Tracker.
The Job Tracker schedules the job and distributes the map work among the Task Trackers
for parallel processing.
The Job Tracker receives progress information from the Task Trackers.
Once the mapping phase results are available, the Job Tracker distributes the reduce work
among the Task Trackers for parallel processing.
The Job Tracker receives progress information from the Task Trackers.
The following are the main phases in map reduce job execution flow -
1) Job Submission
2) Job Initialization
3) Task Assignment
4) Task Execution
5) Job/Task Progress
6) Job Completion
1) Job Submission: -
The waitForCompletion method samples the job's progress once a second after the job is submitted (a driver sketch using this method follows this list).
Checks the output specification: an error is thrown if the output directory was not specified or already exists.
Computes the input splits and throws an error if this fails because the input paths don't exist.
Copies the job resources to the Job Tracker's filesystem, in a directory named after the job ID.
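Tying the pieces together, here is a minimal Java driver sketch for the word-count job; it assumes the mapper, reducer, and combiner classes sketched earlier, and the input/output paths taken from the command line are illustrative. The waitForCompletion call at the end is the method referred to above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);   // optional combiner (see the Combiner section)
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // error if the input path doesn't exist
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // must not already exist
        // waitForCompletion submits the job and polls its progress until it finishes
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}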
2) Job Initialization: -
A job initialization (setup) task and a job cleanup task are created; these are run by the task trackers.
The job cleanup task deletes the temporary working directory after the job is completed.
3) Task Assignment:
The heartbeat is a communication channel between a Task Tracker and the Job Tracker; it also indicates whether the Task Tracker is ready to run a new task.
To assign a task, the Job Tracker first selects a job, based on its job scheduling algorithm.
The default scheduler fills empty map task slots before reduce task slots.
The number of slots a Task Tracker has depends on its number of cores.
4) Task Execution: -
The Task Tracker copies the job JAR file from the shared filesystem (HDFS).
The Task Tracker creates a local working directory and un-jars the JAR file into the local
filesystem.
The Task Tracker starts a TaskRunner in a new JVM to run the map or reduce task.
Each task can perform setup and cleanup actions based on its OutputCommitter.
In the case of Streaming or Pipes, where the map or reduce task communicates with an external
user process, input is provided to that process via stdin (or a socket for Pipes) and its output is
collected back via stdout (or the socket).
5) Job/Task Progress: -
When a task reports progress, it sets a flag to indicate that its status has changed and should be
sent to the Task Tracker.
A separate thread checks this flag every 3 seconds and, if it is set, notifies the Task Tracker of the
current task status.
The Task Tracker sends its progress to the Job Tracker over the heartbeat every five seconds.
The Job Tracker consolidates the task progress reports from all Task Trackers and keeps a holistic
view of the job.
The Job client receives the latest status by polling the Job Tracker every second.
6) Job Completion: -
Once the job is completed, the clean-up proceeds as below -
Once its tasks are completed, the Task Tracker sends the job completion status to the Job
Tracker.
The Job Tracker cleans up its working state for the job, instructs the Task Trackers to do the
same, and also cleans up all the temporary directories.
HADOOP DAEMONS
1. NameNode: The NameNode holds the metadata (information about the location and size of
files/blocks) for HDFS. The metadata is held in RAM and also persisted to hard disk. There is
always only one NameNode in a cluster, which makes the NameNode a single point of failure in Hadoop.
2. Secondary NameNode: It periodically checkpoints the NameNode's metadata and therefore holds
much the same information as the NameNode. If the NameNode fails, its metadata can be recovered
from this copy (it is a checkpointing helper rather than an automatic hot standby).
3. DataNode: While the NameNode stores metadata, the actual data is stored on the DataNodes. The
number of DataNodes depends on your data size, and you can add more as needed. Each DataNode
communicates with the NameNode on a frequent basis (every 3 seconds).
4. Job Tracker: The NameNode and DataNodes store the details and the actual data in HDFS. This data
also needs to be processed as per the client's requirements. A developer writes code to process the
data, and the processing is done using MapReduce. The MapReduce engine sends the code
across the DataNodes, creating jobs. These jobs need to be monitored continuously, and the Job Tracker
manages them.
5. Task Tracker: The jobs assigned by the Job Tracker are actually performed by Task Trackers. Each
DataNode has one Task Tracker. Task Trackers communicate with the Job Tracker to send the
statuses of the jobs.
1. Standalone Mode
Standalone mode is the default mode in which Hadoop runs. It is mainly used for debugging, and
HDFS is not really used in it.
No custom configuration is needed in the files mapred-site.xml, core-site.xml, and hdfs-site.xml.
Standalone mode is usually the fastest Hadoop mode, as it uses the local file system for all the
input and output. Here is the summarized view of standalone mode -
• Default mode of Hadoop
• Mainly used for debugging; HDFS is not used
• No custom configuration required in mapred-site.xml, core-site.xml, hdfs-site.xml
• Usually the fastest mode, since it uses the local file system for all input and output
2. Pseudo-distributed Mode
Pseudo-distributed mode is also known as a single-node cluster, where both the NameNode and the
DataNode reside on the same machine.
In pseudo-distributed mode, all the Hadoop daemons run on a single node. Such a
configuration is mainly used for testing, when we don't need to think about resources and
other users sharing them.
In this architecture, a separate JVM is spawned for every Hadoop component, and the components
communicate with each other across network sockets, effectively producing a fully functioning and
optimized mini-cluster on a single host.
• A single-node Hadoop deployment is considered pseudo-distributed mode
• All the master and slave daemons run on the same node
• Mainly used for testing purposes
• The replication factor for blocks is ONE
• Changes in configuration files will be required for all three files: mapred-site.xml, core-site.xml, hdfs-site.xml
3. Fully-Distributed Mode
This is the production mode of Hadoop, where multiple nodes are running. Here data is
distributed across several nodes and processing is done on each node.
Master and slave services run on separate nodes in fully-distributed Hadoop mode.