Unit 3
MapReduce is a framework with which we can write applications that process huge amounts
of data, in parallel, on large clusters of commodity hardware in a reliable manner.
To take advantage of Hadoop's parallel processing, a query must be expressed as a MapReduce job.
MapReduce is a processing technique and a programming model for distributed computing based
on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce.
Map takes a set of data and converts it into another set of data, where individual elements are
broken down into tuples (key/value pairs). The Reduce task then takes the output
of a Map as its input and combines those data tuples into a smaller set of tuples.
The Algorithm
● Generally, the MapReduce paradigm is based on sending the computation to where the data
resides.
● A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and
the reduce stage (a WordCount sketch illustrating the map and reduce stages follows this list).
➢ Map stage: The map or mapper’s job is to process the input data. Generally,
the input data is in the form of a file or directory and is stored in the Hadoop
Distributed File System (HDFS). The input file is passed to the mapper function line by line.
The mapper processes the data and creates several small chunks of data.
➢ Reduce stage: This stage is the combination of the Shuffle stage and the
Reduce stage. The Reducer’s job is to process the data that comes from the
mapper. After processing, it produces a new set of output, which is stored
in HDFS.
● During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate
servers in the cluster.
● The framework manages all the details of data-passing such as issuing tasks, verifying
task completion, and copying data around the cluster between the nodes.
● Most of the computing takes place on nodes with the data on local disks, which reduces
network traffic.
● After completion of the given tasks, the cluster collects and reduces the data to form
an appropriate result, and sends it back to the Hadoop server.
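As a minimal sketch of these two stages, the classic WordCount example below (the class names are illustrative, not part of this unit's material) uses the standard org.apache.hadoop.mapreduce API: the map stage breaks each input line into (word, 1) tuples, and the reduce stage combines the tuples for each word into a single (word, count) pair.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map stage: every input line is broken down into (word, 1) tuples.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);                  // emit (word, 1)
        }
    }
}

// Reduce stage: the tuples for each word are combined into a single (word, count) pair.
// (In a real project this class would live in its own file.)
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));      // emit (word, total count)
    }
}

The shuffle stage in between is handled by the framework itself, which groups all values emitted for the same key before handing them to the reducer.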
RecordReader
Apache Hadoop can process arbitrary data such as log files, text files, structured data, etc. We
know that the actual data is stored in HDFS, while an InputSplit is a logical partition of that data.
An InputSplit is the chunk of data processed by a single map task (i.e. one Mapper processes one
InputSplit at a time). Each split must then be divided into records. Note that
InputSplits do not contain the actual data; rather, they hold references to the actual data
(HDFS blocks).
InputFormat is responsible for validating the input data, creating the InputSplits, and dividing
them into records.
RecordReader reads the data from an InputSplit, one record at a time, and converts it into the
key-value pairs that are given as input to the Mapper class.
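As a brief, hedged illustration: the default TextInputFormat supplies a line-based RecordReader, so each record handed to the Mapper is one line of the split, with the byte offset of the line as the key (LongWritable) and the line itself as the value (Text). A driver can select it explicitly as follows (the class and job names are placeholders):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Minimal sketch: TextInputFormat validates the input, creates the InputSplits,
// and provides a RecordReader that turns each line into an (offset, line) key-value pair.
class InputFormatSketch {
    public static void main(String[] args) throws IOException {
        Job job = Job.getInstance(new Configuration(), "record-reader-sketch");
        job.setInputFormatClass(TextInputFormat.class);  // the default, shown here explicitly
    }
}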
Combiner Phase
The Combiner class is used between the Map class and the Reduce class to reduce the
volume of data transferred between Map and Reduce. Usually, the output of the map task is
large, so the amount of data transferred to the reduce task would otherwise be high.
(Figure: MapReduce task diagram showing the Combiner phase.)
The Combiner phase takes each key-value pair from the Map phase, processes it, and
produces its output as key-value-collection pairs: it reads the key-value pairs and groups the
values for each common key (for example, each word) into a collection. Usually, the code and
operation of a Combiner are similar to those of a Reducer. A code snippet for the Mapper,
Combiner, and Reducer class declarations follows.
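As a minimal sketch (reusing the illustrative WordCountMapper and WordCountReducer classes from the earlier example), the Mapper, Combiner, and Reducer can be declared on the job as follows. These lines sit inside a job driver's main() method; a full driver sketch appears at the end of this MapReduce discussion. Because the combiner performs the same aggregation as the reducer, the reducer class is registered for both roles.

// Declaring the Mapper, Combiner and Reducer classes on the job
// (WordCountMapper / WordCountReducer are the illustrative classes shown earlier).
job.setMapperClass(WordCountMapper.class);
job.setCombinerClass(WordCountReducer.class);   // the Combiner runs the same logic as the Reducer
job.setReducerClass(WordCountReducer.class);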
Reducer Phase
The Reducer phase takes each key-value-collection pair from the Combiner phase, processes
it, and passes its output on as key-value pairs. Note that the Combiner's functionality is the same
as the Reducer's.
A Hadoop cluster can comprise a single node (a single-node cluster) or thousands of nodes.
Once you have installed Hadoop, you can try out the following basic commands to work
with HDFS (a short usage example follows the list):
▪ hadoop fs -ls
▪ hadoop fs -put <path_of_local> <path_in_hdfs>
▪ hadoop fs -get <path_in_hdfs> <path_of_local>
▪ hadoop fs -cat <path_of_file_in_hdfs>
▪ hadoop fs -rmr <path_in_hdfs>
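For example, assuming a local file named sample.txt and an HDFS directory /user/hadoop (both names are hypothetical), a file can be copied into HDFS and read back with:
▪ hadoop fs -put sample.txt /user/hadoop/sample.txt
▪ hadoop fs -cat /user/hadoop/sample.txt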
With the help of the following diagram, let us try to understand the different components of
a Hadoop cluster.
(Figure: a 4-node Hadoop cluster.)
In the diagram, the Name Node, Secondary Name Node and the Job Tracker are running on a
single machine. Usually, in production clusters with more than 20-30 nodes, these daemons run
on separate nodes.
Hadoop follows a Master-Slave architecture. As mentioned earlier, a file in HDFS is split into
blocks and replicated across Data Nodes in a Hadoop cluster. In the diagram, the three files
A, B and C have been split into blocks and spread, with a replication factor of 3, across the
different Data Nodes.
Name Node
The Name Node in Hadoop is the node where Hadoop stores all the location information of
the files in HDFS. In other words, it holds the metadata for HDFS. Whenever a file is placed
in the cluster, a corresponding entry of its location is maintained by the Name Node. So, for the
files A, B and C, we would have something like the following in the Name Node:
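Purely for illustration (the block and node names below are hypothetical, and the real metadata format is internal to Hadoop), the mapping could look like this for a replication factor of 3:
File A -> blocks A1, A2; each block stored on three Data Nodes (e.g. Data Nodes 1, 2, 3)
File B -> blocks B1, B2; each block stored on three Data Nodes
File C -> block C1; stored on three Data Nodes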
This information is required when retrieving data from the cluster as the data is spread across
multiple machines. The Name Node is a Single Point of Failure for the Hadoop Cluster.
IMPORTANT - The Secondary Name Node is not a failover node for the Name Node.
The Secondary Name Node is responsible for performing periodic housekeeping functions for
the Name Node. It only creates checkpoints of the file system metadata held in the Name Node.
Data Node
The Data Node is responsible for storing the files in HDFS. It manages the file blocks within
the node. It sends information to the Name Node about the files and blocks stored in that
node and responds to the Name Node for all file system operations.
Job Tracker
The Job Tracker is responsible for taking in requests from a client and assigning Task
Trackers the tasks to be performed. The Job Tracker tries to assign each task to the Task
Tracker on the Data Node where the data is locally present (data locality). If that is not
possible, it will at least try to assign the task to a Task Tracker within the same rack. If for some
reason a node fails, the Job Tracker assigns the task to another Task Tracker where a
replica of the data exists, since the data blocks are replicated across the Data Nodes. This
ensures that the job does not fail even if a node fails within the cluster.
Task Tracker
The Task Tracker is a daemon that accepts tasks (Map, Reduce and Shuffle) from the Job
Tracker. The Task Tracker keeps sending a heartbeat message to the Job Tracker to notify it
that it is alive. Along with the heartbeat, it also reports the number of free slots it has available to
process tasks. The Task Tracker starts and monitors the Map and Reduce tasks and sends
progress/status information back to the Job Tracker.
1. A Client (usually a MapReduce driver program, like the sketch shown after this list) submits a job to the Job Tracker.
2. The Job Tracker gets information from the Name Node about the location of the data
within the Data Nodes. The Job Tracker places the client program (usually a jar file
along with its configuration file) in HDFS. Once placed, the Job Tracker tries to
assign tasks to Task Trackers on the Data Nodes based on data locality.
3. The Task Tracker takes care of starting the Map tasks on the Data Nodes by picking
up the client program from the shared location on the HDFS.
4. The progress of the operation is relayed back to the Job Tracker by the Task Tracker.
5. On completion of a Map task, an intermediate file is created on the local file system
of the Task Tracker.
6. Results from Map tasks are then passed on to the Reduce task.
7. The Reduce task works on all the data received from the Map tasks and writes the final
output to HDFS.
8. After the job completes, the intermediate data generated by the Task Trackers is
deleted.
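To make the workflow above concrete, the following is a minimal sketch of a client driver program (the class name and command-line input/output paths are illustrative) that configures the WordCount job sketched earlier, points it at input and output locations in HDFS, and submits it to the cluster as in step 1.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative driver: submits the WordCount job sketched earlier in this unit.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");       // step 1: the client creates the job
        job.setJarByClass(WordCountDriver.class);            // the jar is shipped to the cluster via HDFS
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // final output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);        // submit and wait for completion
    }
}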
A very important feature of Hadoop to note here is that the program goes to where the data is,
and not the other way around, resulting in efficient processing of the data.
Google App Engine (GAE) is a Platform as a Service (PaaS) cloud-based Web hosting
service on Google's infrastructure. For an application to run on GAE, it must comply with
Google's platform standards, which narrows the range of applications that can be run and
severely limits those applications' portability.
Google App Engine lets you run web applications on Google's infrastructure.
● Easy to build.
● Easy to maintain.
● Easy to scale as traffic and storage needs grow.
Java
● App Engine runs Java apps on a Java 7 virtual machine (currently Java 6 is supported
as well).
● Uses the Java Servlet standard for web applications (a minimal servlet sketch follows this list):
● WAR (Web Application Archive) directory structure.
● Servlet classes
● Java Server Pages (JSP)
● Static and data files
● Deployment descriptor (web.xml)
● Other configuration files
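As a brief, hedged illustration of the servlet standard listed above (the class name and URL pattern are hypothetical), a Java App Engine application is essentially an ordinary servlet packaged in a WAR together with its web.xml deployment descriptor:

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Minimal servlet: App Engine routes requests to it according to web.xml.
public class HelloServlet extends HttpServlet {
    @Override
    public void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        resp.setContentType("text/plain");
        resp.getWriter().println("Hello from Google App Engine");
    }
}

Inside the WAR, the deployment descriptor (web.xml) would map a URL pattern such as /hello to this servlet class.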
Python
● Local development servers are available to anyone for developing and testing local
applications.
● Only whitelisted applications can be deployed on Google App Engine.
Google’s Go