Big Data Notes Unit-3
MAP-REDUCE
• MapReduce is a big data analysis model that processes data sets using a parallel
algorithm on computer clusters, typically Apache Hadoop clusters or cloud systems like
Amazon Elastic MapReduce (EMR) clusters.
• MapReduce and HDFS are the two major components of Hadoop that make it so
powerful and efficient to use. MapReduce is a programming model used for efficient
parallel processing of large data sets in a distributed manner.
• The data is first split and then combined to produce the final result. MapReduce
libraries are available in many programming languages, each with different
optimizations.
• The purpose of MapReduce in Hadoop is to map each job and then reduce it into
equivalent tasks, which lowers the overhead on the cluster network and the processing
power required. A MapReduce task is mainly divided into two phases: the Map phase
and the Reduce phase.
• MapReduce is a core component of the Hadoop framework and essential to its operation.
Map tasks deal with splitting and mapping the data, while reduce tasks shuffle and
reduce it.
• MapReduce makes concurrent processing easier by dividing petabytes of data into
smaller chunks and processing them in parallel on Hadoop commodity servers. In the
end, it collects all the information from several servers and gives the application a
consolidated output.
• For example, consider a Hadoop cluster consisting of 20,000 affordable commodity
servers, each working on a 256 MB data block. The cluster can process around five
terabytes of data simultaneously. Compared to the sequential processing of such a big
data set, MapReduce cuts down the amount of time needed for processing.
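As a quick back-of-envelope check of the figure above, the following Python snippet (plain arithmetic, not Hadoop code) reproduces the estimate:

servers = 20_000          # commodity servers in the example cluster
block_mb = 256            # one data block processed per server
total_mb = servers * block_mb
print(f"{total_mb:,} MB is about {total_mb / 1_000_000:.2f} TB in one parallel pass")
# 5,120,000 MB is about 5.12 TB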
Components of MapReduce Architecture:
1. Client: The MapReduce client is the one who brings the Job to the MapReduce for
processing. There can be multiple clients available that continuously send jobs for
processing to the Hadoop MapReduce Manager.
2. Job: The MapReduce job is the actual work that the client wants done; it is composed
of many smaller tasks that the client wants to process or execute.
3. Hadoop MapReduce Master: It divides the particular job into subsequent job-parts.
4. Job-Parts: The tasks or sub-jobs obtained after dividing the main job. The results
of all the job-parts are combined to produce the final output (see the sketch after this list).
5. Input Data: The data set that is fed to the MapReduce for processing.
6. Output Data: The final result obtained after processing.
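The sketch below is a minimal, Hadoop-free illustration of the flow above: a job is divided into job-parts, the parts run in parallel, and their results are combined. Names such as run_job() and process_part() are hypothetical and only stand in for the Hadoop MapReduce Master and its tasks.

from concurrent.futures import ProcessPoolExecutor

def process_part(part):
    # One job-part works on its own slice of the input data.
    return sum(part)

def run_job(input_data, num_parts=4):
    # Divide the job into roughly equal job-parts ...
    size = max(1, len(input_data) // num_parts)
    parts = [input_data[i:i + size] for i in range(0, len(input_data), size)]
    # ... execute them in parallel, then combine the partial results.
    with ProcessPoolExecutor() as pool:
        return sum(pool.map(process_part, parts))

if __name__ == "__main__":
    print(run_job(list(range(1_000_000))))  # 499999500000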
The MapReduce task is mainly divided into 2 phases i.e. Map phase and Reduce phase.
1. Map: As the name suggests, its main use is to map the input data into key-value pairs.
The input to a map may itself be a key-value pair, where the key can be an id or address
of some kind and the value is the actual data it holds. The Map() function is executed on
each of these input key-value pairs and generates intermediate key-value pairs that serve
as input for the Reducer, i.e. the Reduce() function.
2. Reduce: The intermediate key-value pairs that work as input for the Reducer are
shuffled, sorted, and sent to the Reduce() function. The Reducer aggregates or groups the
data by key according to the reduce algorithm written by the developer (a minimal
word-count sketch follows this list).
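The following is a minimal, single-machine word-count sketch of the two phases described above. It is illustrative only; real Hadoop jobs typically implement Mapper and Reducer classes in Java and run distributed across the cluster.

from collections import defaultdict

def map_phase(key, value):
    # key: e.g. a line id; value: the line of text.
    # Emits intermediate (word, 1) key-value pairs.
    for word in value.split():
        yield (word, 1)

def reduce_phase(key, values):
    # Aggregates all counts that share the same key.
    return (key, sum(values))

lines = {0: "deer bear river", 1: "car car river", 2: "deer car bear"}

# Shuffle and sort: group the intermediate pairs by key.
groups = defaultdict(list)
for k, v in lines.items():
    for word, count in map_phase(k, v):
        groups[word].append(count)

results = [reduce_phase(word, counts) for word, counts in sorted(groups.items())]
print(results)  # [('bear', 2), ('car', 3), ('deer', 2), ('river', 2)]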
YARN
YARN (Yet Another Resource Negotiator) allows different data processing engines, such as
graph processing, interactive processing, stream processing as well as batch processing, to
run and process data stored in HDFS (Hadoop Distributed File System), making the system
much more efficient. Through its various components, it can dynamically allocate resources
and schedule the application processing. For large-volume data processing, it is necessary to
manage the available resources properly so that every application can leverage them.
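The sketch below is a conceptual illustration of dynamic resource allocation, not YARN's actual API: a hypothetical Cluster object grants container requests only while enough memory and vcores remain.

class Cluster:
    def __init__(self, memory_mb, vcores):
        self.memory_mb = memory_mb
        self.vcores = vcores

    def request_container(self, app, memory_mb, vcores):
        # Grant the request only if enough resources remain in the cluster.
        if memory_mb <= self.memory_mb and vcores <= self.vcores:
            self.memory_mb -= memory_mb
            self.vcores -= vcores
            print(f"granted {memory_mb} MB / {vcores} vcores to {app}")
            return True
        print(f"denied {app}: insufficient resources")
        return False

cluster = Cluster(memory_mb=8192, vcores=8)
cluster.request_container("batch-job", 4096, 4)    # granted
cluster.request_container("stream-job", 2048, 2)   # granted
cluster.request_container("graph-job", 4096, 4)    # denied, only 2048 MB left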
YARN Features: YARN gained popularity because of features such as scalability,
compatibility with existing MapReduce applications, improved cluster utilization, and
multi-tenancy.
JOB SCHEDULING
• Job scheduling is the process by which an operating system (OS) allocates system
resources to many different tasks. The system maintains prioritized job queues awaiting
CPU time and must determine which job to take from which queue and how much time
to allocate to it (a minimal priority-queue sketch follows this list). This type of
scheduling makes sure that all jobs are carried out fairly and on time.
• Job scheduling is performed using job schedulers. Job schedulers are programs that
enable scheduling and, at times, track computer “batch” jobs, or units of work like the
operation of a payroll program. Job schedulers have the ability to start and control
jobs automatically by running prepared job-control-language statements or by means
of similar communication with a human operator.
• Most OSs like Unix, Windows, etc., include standard job-scheduling abilities. A
number of programs including database management systems (DBMS), backup,
enterprise resource planning (ERP) and business process management (BPM) feature
specific job-scheduling capabilities as well.
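A minimal sketch of prioritized job queues in Python (illustrative only; this is not how an OS or Hadoop scheduler is actually implemented). Lower priority numbers run first, and ties run in submission order.

import heapq

class JobScheduler:
    def __init__(self):
        self._queue = []
        self._counter = 0  # preserves submission order among equal priorities

    def submit(self, name, priority, func):
        heapq.heappush(self._queue, (priority, self._counter, name, func))
        self._counter += 1

    def run_all(self):
        while self._queue:
            priority, _, name, func = heapq.heappop(self._queue)
            print(f"running {name} (priority {priority})")
            func()

scheduler = JobScheduler()
scheduler.submit("payroll batch", priority=1, func=lambda: None)
scheduler.submit("log cleanup", priority=5, func=lambda: None)
scheduler.submit("nightly backup", priority=2, func=lambda: None)
scheduler.run_all()  # payroll batch, then nightly backup, then log cleanup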
TASK EXECUTION