Unit IV Notes
Introduction:
• In MapReduce, developers supply map and reduce functions and the framework runs them inside a managed container. This is much like Java application servers that invoke servlets upon receiving an HTTP request; the container is responsible for setup and teardown as well as providing a runtime environment for user-supplied code.
This model gives MapReduce several notable properties:
1. Simplicity of development
2. Scale
Since tasks do not communicate with one another explicitly and do not share state, they can execute in parallel and on separate machines. Additional machines can be added to the cluster and applications immediately take advantage of the additional hardware with no change at all. MapReduce is designed to be a share-nothing system.
4. Fault tolerance
Failure is not an exception; it’s the norm. MapReduce treats failure as a
first-class citizen and supports reexecution of failed tasks on healthy
worker nodes in the cluster. Should a worker node fail, all tasks are
assumed to be lost, in which case they are simply rescheduled elsewhere.
The unit of work is always the task, and it either completes successfully or it
fails completely.
• In MapReduce, users write a client application that submits one or more jobs
that contain user-supplied map and reduce code and a job configuration file
to a cluster of machines.
• The job contains a map function and a reduce function, along with job configuration information that controls various aspects of its execution.
• The framework handles breaking the job into tasks, scheduling tasks to run
on machines, monitoring each task’s health, and performing any necessary
retries of failed tasks.
• A job processes an input dataset specified by the user and usually outputs
one as well. Commonly, the input and output datasets are one or more files
on a distributed filesystem.
Stages of MapReduce:
A MapReduce job is made up of four distinct stages, executed in order: client job
submission, map task execution, shuffle and sort, and reduce task execution.
Client applications can really be any type of application the developer desires,
from command line tools to services. The MapReduce framework provides a set of
APIs for submitting jobs and interacting with the cluster. The job itself is made up
of code written by a developer against the MapReduce APIs and the configuration
which specifies things such as the input and output datasets.
The client application submits a job to the cluster using the framework APIs. A
master process, called the jobtracker in Hadoop MapReduce, is responsible for
accepting these submissions (more on the role of the jobtracker later). Job
submission occurs over the network, so clients may be running on one of the
cluster nodes or not; it doesn’t matter. The framework gets to decide how to
split the input dataset into chunks, or input splits, of data that can be processed in
parallel. In Hadoop MapReduce, the component that does this is called an input
format, and Hadoop comes with a small library of them for common file formats.
To better illustrate how MapReduce works, consider a simple application log processing example in which we count all events of each severity within a window of time. Let’s assume there are 100 GB of logs in a directory in HDFS. A sample of log records might look something like this:
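(The original sample is not reproduced in these notes. An illustrative set of records, invented here but consistent with the record format used in the map function below and with the counts used later in this example (one DEBUG, three INFO, and one WARN event on 2012-02-13), might be:)
2012-02-13 00:23:54-0800 [INFO - com.company.app1.Main] Application started!
2012-02-13 00:23:55-0800 [WARN - com.company.app1.Main] Cache initialization took longer than expected
2012-02-13 00:23:57-0800 [INFO - com.company.app1.Main] Loading configuration
2012-02-13 00:24:01-0800 [DEBUG - com.company.app1.Main] Opening connection pool
2012-02-13 00:24:12-0800 [INFO - com.company.app1.Main] Ready to accept requests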
For each input split, a map task is created that runs the user-supplied map
function on each record in the split. Map tasks are executed in parallel. This means
each chunk of the input dataset is being processed at the same time by various
machines that make up the cluster. It’s fine if there are more map tasks to execute
than the cluster can handle. They’re simply queued and executed in whatever order
the framework deems best. The map function takes a key-value pair as input and
produces zero or more intermediate key-value pairs.
The input format is responsible for turning each record into its key-value pair
representation. For now, trust that one of the built-in input formats will turn each
line of the file into a value with the byte offset into the file provided as the key.
Getting back to our example, we need to write a map function that will filter records for those within a specific timeframe and then count all events of each severity. The map phase is where we’ll perform the filtering. We’ll output the severity and the number 1 for each record that we see with that severity.
function map(key, value) {
  // Example key: 12345 - the byte offset in the file (not really interesting).
  // Example value: 2012-02-13 00:23:54-0800 [INFO - com.company.app1.Main] Application started!
  // Do the nasty record parsing to get dateTime, severity, className, and message.
  (dateTime, severity, className, message) = parseRecord(value);
  // If the date is today...
  if (dateTime.date() == '2012-02-13') {
    // Emit the severity and the number 1 to say we saw one of these records.
    emit(severity, 1);
  }
}
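For readers who want to see what this looks like against the real Hadoop Java API, here is a minimal sketch of an equivalent mapper. The class name and the string parsing are assumptions made for illustration; the notes above use pseudocode only.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SeverityMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text severity = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Example value:
    // 2012-02-13 00:23:54-0800 [INFO - com.company.app1.Main] Application started!
    String line = value.toString();
    // Keep only records for the date we care about (a simple illustrative filter).
    if (!line.startsWith("2012-02-13")) {
      return;
    }
    // Pull the severity out of the bracketed section, e.g. "[INFO - ...]".
    int open = line.indexOf('[');
    int dash = line.indexOf(" -", open);
    if (open >= 0 && dash > open) {
      severity.set(line.substring(open + 1, dash).trim());
      // Emit (severity, 1) for each matching record.
      context.write(severity, ONE);
    }
  }
}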
Given the sample records earlier, our intermediate data would look as follows:
DEBUG, 1
INFO, 1
INFO, 1
INFO, 1
WARN, 1
The key INFO repeats, which makes sense because our sample contained three
INFO records that would have matched the date 2012-02-13. It’s perfectly legal to
output the same key or value multiple times. The other notable effect is that the output records are not in the order you might expect. In the original data, the first record was an INFO record, followed by WARN, but that’s clearly not the case here. This is
because the framework sorts the output of each map task by its key. Just like
outputting the value 1 for each record, the rationale behind sorting the data will
become clear in a moment.
For the reduce phase to work correctly, the framework guarantees two properties of the intermediate data:
• If a reducer sees a key, it will see all values for that key. For example, if a reducer receives the INFO key, it will always receive the three number 1 values.
• A key will be processed by exactly one reducer. This makes sense given the preceding requirement.
The next phase of processing, called the shuffle and sort, is responsible for
enforcing these guarantees. The shuffle and sort phase is actually performed by
the reduce tasks before they run the user’s reduce function. When started, each
reducer is assigned one of the partitions on which it should work. First, they copy
the intermediate key-value data from each worker for their assigned partition.
It’s possible that tens of thousands of map tasks have run on various machines throughout the cluster, each having output key-value pairs for each partition. The reducer assigned partition 1, for example, would need to fetch each piece of its partition data from potentially every other worker in the cluster. A logical view of the intermediate data across all machines in the cluster might look like this:
worker 1, partition 2, DEBUG, 1
worker 1, partition 1, INFO, 1
worker 2, partition 1, INFO, 1
worker 2, partition 1, INFO, 1
worker 3, partition 2, WARN, 1
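The partition number attached to each record above is assigned by a partitioner applied to the map output key. As a minimal illustration in Java, the following mirrors the behavior of Hadoop's default HashPartitioner (the class name here is just an example):
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Every occurrence of a given key hashes to the same partition number,
// which is why exactly one reducer ends up with all values for that key.
public class SeverityPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}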
With the partition data now combined into a complete sorted list, the user’s
reducer code can now be executed:
# Logical data input to the reducer assigned partition 1:
INFO, [ 1, 1, 1 ]
# Logical data input to the reducer assigned partition 2:
DEBUG, [ 1 ]
WARN, [ 1 ]
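The reduce function for our example simply sums the 1s for each severity. A minimal sketch using the Hadoop Java API (the class name is an assumption, matching the mapper sketched earlier):
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SeverityReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable total = new IntWritable();

  @Override
  protected void reduce(Text severity, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    // Sum the 1s emitted by the map tasks for this severity.
    int sum = 0;
    for (IntWritable count : counts) {
      sum += count.get();
    }
    total.set(sum);
    // For example, (INFO, 3) given the sample data above.
    context.write(severity, total);
  }
}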
Each reducer produces a separate output file, usually in HDFS.
Separate files are written so that reducers do not have to coordinate access to a
shared file. This greatly reduces complexity and lets each reducer run at whatever
speed it can. The format of the file depends on the output format specified by the
author of the MapReduce job in the job configuration. Unless the job does
something special (and most don’t) each reducer output file is named part-
<XXXXX>, where <XXXXX> is the number of the reduce task within the job, starting
from zero. Sample reducer output for our example job would look as follows:
# Reducer for partition 1:
INFO, 3
# Reducer for partition 2:
DEBUG, 1
WARN, 1
To produce the same output, you could use the following SQL statement. In the interest of readability, we'll ignore the fact that this doesn't yield identically formatted output; the data is the same.
SELECT SEVERITY, COUNT(*)
FROM logs
WHERE EVENT_DATE = '2012-02-13'
GROUP BY SEVERITY
ORDER BY SEVERITY;
As exciting as all of this is, MapReduce is not a silver bullet. It is just as important to know how MapReduce works and what it’s good for, as it is to understand why MapReduce is not going to end world hunger or serve you breakfast in bed.
Test Data and Local Tests:
In the context of MapReduce, "test data" refers to the data that is used for testing and validating the correctness and efficiency of MapReduce jobs before running them on a production cluster. Local tests, on the other hand, involve running MapReduce jobs on a single machine or a small local cluster to check the logic, functionality, and performance of the job without utilizing the full-scale resources of a distributed cluster.
Here's how test data and local tests are typically used in the MapReduce
development process:
Test Data Preparation:
• The test data should cover different input data patterns, including corner cases, to ensure the job's robustness and correctness.
Local Testing:
• Developers write unit tests and integration tests for the MapReduce job's individual components, such as mapper and reducer functions, to validate their correctness (see the sketch after this bullet).
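As a sketch of what such a unit test might look like, the following uses the Apache MRUnit library to drive the hypothetical SeverityMapper from earlier with a single record. The class and record contents are assumptions for illustration, and MRUnit plus JUnit would need to be on the test classpath.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class SeverityMapperTest {
  @Test
  public void emitsOneForMatchingInfoRecord() throws Exception {
    Text line = new Text(
        "2012-02-13 00:23:54-0800 [INFO - com.company.app1.Main] Application started!");
    // Feed one record to the mapper and assert on the expected (key, value) output.
    MapDriver.newMapDriver(new SeverityMapper())
        .withInput(new LongWritable(0), line)
        .withOutput(new Text("INFO"), new IntWritable(1))
        .runTest();
  }
}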
Performance Testing:
• During local tests, developers can also assess the performance of their
MapReduce jobs on smaller datasets, identify potential bottlenecks, and
optimize the job accordingly.
• Once the MapReduce job passes local tests and meets the desired
performance requirements, it can be submitted to a production Hadoop
cluster for processing larger datasets. The data and resources available in
the production cluster are typically much more substantial, so testing on
the local environment is essential for detecting and fixing issues early in
the development process and ensuring the job performs efficiently at scale.
Anatomy of a MapReduce Job Run:
The anatomy of a MapReduce job run involves several phases and components that work together to process and analyze large-scale data in a distributed computing environment. Here is a step-by-step overview of how a MapReduce job is executed (a minimal job driver sketch follows the steps):
1. Job Submission:
The process begins when a client submits a MapReduce job to the Hadoop cluster.
The client typically uses Hadoop's command-line interface or an API to submit
the job.
2. Job Initialization:
The input data for the job is divided into smaller splits called "input splits." Each
input split is processed independently by a mapper task. The splitting is
performed by the InputFormat associated with the job, which determines how to
read and divide the input data.
Shuffle and Sort:
After the map tasks complete, the intermediate key-value pairs are transferred to the reducers. The shuffle and sort phase groups the values with the same key together and sorts them based on the keys. This ensures that the reducers process all values for each key in a sorted order.
9. Output Writing:
The output generated by the reduce tasks is written to the specified output
directory in the Hadoop Distributed File System (HDFS) or another storage
system.
Once all map and reduce tasks are successfully executed, the entire MapReduce
job is considered complete, and the final output is available for further analysis or
use.
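Putting the steps above together from the client's point of view, a minimal job driver might look like the following sketch. The class names and paths are assumptions; it wires up the hypothetical SeverityMapper and SeverityReducer from earlier and submits the job.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SeverityCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "severity count");

    job.setJarByClass(SeverityCountDriver.class);
    job.setMapperClass(SeverityMapper.class);
    // The reducer can double as a combiner because summing 1s is associative.
    job.setCombinerClass(SeverityReducer.class);
    job.setReducerClass(SeverityReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Input and output locations, e.g. an HDFS log directory and a results directory.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Submit the job and block until it completes.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}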
JobTracker:
The JobTracker was the central coordinator responsible for managing job
submissions, task scheduling, and resource allocation in the Hadoop cluster. It
received job submissions from clients, divided the jobs into smaller tasks (map and
reduce tasks), and scheduled these tasks across available TaskTrackers in the cluster.
TaskTracker:
Each node in the Hadoop cluster had a TaskTracker, responsible for executing tasks
assigned to it by the JobTracker. A TaskTracker managed the execution of map and
reduce tasks, monitored their progress, and reported the status back to the
JobTracker.
In classic (MRv1) MapReduce, a job ran as follows:
• The JobTracker divided the job into map tasks and reduce tasks and scheduled them across available TaskTrackers.
• TaskTrackers executed the map and reduce tasks in parallel across the cluster nodes.
• The map tasks processed the input data and generated intermediate key-value
pairs.
• The JobTracker managed the shuffle and sort phase, where intermediate data
was shuffled and sorted based on keys before being passed to the reducers.
• The reduce tasks processed the shuffled data, aggregated values for each key,
and produced the final output.
YARN (Yet Another Resource Negotiator):
Before YARN, Hadoop MapReduce was the only processing framework in Hadoop. MapReduce was tightly integrated with Hadoop's resource management, making it difficult to run other types of applications. YARN was introduced as a major architectural change in Hadoop 2.x to address these limitations and enable multi-tenancy, flexibility, and scalability. YARN's main components are:
1. ResourceManager:
The ResourceManager is the cluster-wide authority that arbitrates resources among all running applications and schedules containers onto the cluster's nodes.
2. NodeManager:
A NodeManager runs on every worker node. It launches and monitors containers on that node and reports resource usage and node health back to the ResourceManager.
3. ApplicationMaster:
For each application running on the cluster (e.g., a MapReduce job or another
application using YARN), there is an ApplicationMaster. The
ApplicationMaster is responsible for negotiating resources with the
ResourceManager and coordinating the tasks and containers needed for the
application to run.
4. Container:
A container is an allocation of resources (such as memory and CPU) on a specific node, granted by the ResourceManager, in which an application's tasks (or its ApplicationMaster) actually run.
With YARN, Hadoop becomes more versatile, allowing various processing engines
and applications, such as Apache Spark, Apache Flink, Apache HBase, and others,
to coexist and run simultaneously on the same Hadoop cluster, each getting its
share of resources. This flexibility and multi-tenancy support have significantly
improved the capabilities and efficiency of the Hadoop ecosystem.
To address some of these issues and improve overall efficiency, newer technologies
like Apache Tez, Apache Spark, and Apache Flink have emerged. These frameworks
aim to overcome the limitations of classic MapReduce and provide better resource
management, fault tolerance, and performance optimizations. Moreover,
advancements in Hadoop's ecosystem and the continuous development of YARN
help in addressing and minimizing these failure points to make large-scale data
processing more reliable and efficient.
Job Scheduling:
In MapReduce, job scheduling refers to the process of determining when and how MapReduce jobs are executed on a Hadoop cluster. The job scheduler is responsible for allocating resources to different jobs and ensuring that they run efficiently while considering various factors like data locality, available cluster resources, job priorities, and fairness.
The MapReduce job scheduling process typically involves the following steps:
1. Job Submission: Users or applications submit MapReduce jobs to the Hadoop
cluster through the client interface.
2. Job Splitting: The input data is divided into smaller splits, known as input
splits, which are processed independently by different mapper tasks. The
number of input splits determines the number of map tasks in the job.
3. Task Scheduling: The job scheduler is responsible for allocating resources (i.e.,
containers) to run mapper and reducer tasks. The tasks are scheduled on the
nodes in the cluster based on the data locality, available resources, and other
cluster conditions.
4. Data Locality: The job scheduler tries to assign tasks to nodes where the data
is already present (data locality) to minimize data transfer over the network.
This helps reduce network traffic and improves overall job performance.
5. Task Execution: Once the tasks are scheduled and assigned to nodes, the
TaskTrackers (in older Hadoop versions) or NodeManagers (in YARN-based
Hadoop versions) execute the tasks.
6. Task Monitoring and Failure Handling: The TaskTrackers or NodeManagers
monitor the task execution and report the status (success, failure, or in-
progress) back to the JobTracker (in older Hadoop versions) or
ResourceManager (in YARN-based Hadoop versions). In case of task failures,
the job scheduler may reschedule the failed tasks on other nodes to ensure
fault tolerance and job completion.
7. Job Completion: After all the map and reduce tasks are successfully executed,
the MapReduce job is considered complete, and the results are typically
written to the Hadoop Distributed File System (HDFS) or other storage
systems.
8. Job schedulers in Hadoop can be configured to use different scheduling algorithms, such as First-Come-First-Served (FCFS), the Fair Scheduler, the Capacity Scheduler, or other custom schedulers. The choice of scheduling algorithm depends on factors like the nature of the workload, resource requirements, and performance objectives of the cluster (see the sketch after this list).
9. In YARN-based Hadoop clusters, the ResourceManager and the
ApplicationMaster play critical roles in job scheduling and resource
management, providing a more flexible and scalable environment for running
various distributed applications beyond just MapReduce.
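From the client side, a job can give the scheduler hints such as a target queue and a priority. A minimal sketch, assuming a queue named "analytics" has been defined in the cluster's Fair or Capacity Scheduler configuration (the queue name and class name are hypothetical):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SchedulingHints {
  public static Job newJob() throws Exception {
    Configuration conf = new Configuration();
    // Ask the scheduler to place this job in a specific queue.
    conf.set("mapreduce.job.queuename", "analytics");
    // Request a higher priority; only honored by schedulers that support priorities.
    conf.set("mapreduce.job.priority", "HIGH");
    return Job.getInstance(conf, "scheduled severity count");
  }
}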
Input Formats:
Hadoop MapReduce uses an input format to decide how input data is read, split, and turned into key-value records for the mappers. Commonly used input formats include:
1. TextInputFormat: This is the default input format for processing text data. It reads data line-by-line, where each line becomes a separate input record for the mappers. The key represents the byte offset of the line, and the value represents the actual text of the line.
2. KeyValueTextInputFormat: This input format is used when data is in a key-value
format, with each line containing a key and its associated value separated by a
delimiter (e.g., tab or comma). It allows mappers to process key-value pairs as
separate input records.
3. SequenceFileInputFormat: This input format is used to read data stored in
Hadoop's SequenceFile format. SequenceFiles are binary files that allow efficient
storage of key-value pairs. SequenceFileInputFormat reads the keys and values
from SequenceFiles and passes them to the mappers.
4. NLineInputFormat: This input format ensures that each mapper receives a fixed number of input lines (N lines per split); each line is still a separate record, as with TextInputFormat. It is useful when each mapper should process a predictable amount of input, such as a fixed batch of lines from log files.
5. CombineTextInputFormat: This input format combines small input files into
larger splits to reduce the number of mappers and improve efficiency for
processing small files.
6. DBInputFormat: This input format allows mappers to read data from relational
databases using JDBC. It enables MapReduce jobs to process data directly from
database tables.
7. AvroKeyInputFormat and AvroKeyValueInputFormat: These input formats are
used for reading Avro data files, which are a compact and efficient binary data
format.
8. Custom Input Formats: Hadoop allows users to define custom input formats to
process data in formats not covered by the built-in input formats. By
implementing the InputFormat interface, users can customize how data is read
and passed to the mappers.
The choice of input format depends on the nature of the input data and the specific requirements of the MapReduce job. Proper selection of the input format can significantly impact job performance and data processing efficiency (see the configuration sketch at the end of this section).
Output Formats:
Analogously, an output format (such as the default TextOutputFormat) controls how a job's final key-value pairs are written. The choice of output format depends on the desired format of the output data and the storage system used for data persistence. Proper selection of the output format ensures that the final output is organized efficiently and can be easily processed by other systems or applications.
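As a brief illustration of both choices, the following sketch configures an input format and an output format on a Job; the specific formats, paths, and split size here are examples only, not requirements.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class FormatConfiguration {
  public static void configure(Job job) throws Exception {
    // Read plain text, handing each mapper splits of 1000 lines;
    // each line is still one record.
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.setNumLinesPerSplit(job, 1000);
    FileInputFormat.addInputPath(job, new Path("/logs/app1"));

    // Write the reducer output as a binary SequenceFile rather than
    // the default TextOutputFormat.
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path("/output/severity-counts"));
  }
}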