

Big Data Analytics


UNIT IV : Map Reduce- Anatomy of a Map Reduce Job Run, Failures, Job Scheduling, Shuffle and Sort,
Task Execution, Map Reduce Types and Formats.
Map Reduce
MapReduce is a programming framework that allows us to perform distributed and parallel processing on
large data sets in a distributed environment.
 MapReduce consists of two distinct tasks — Map and Reduce.
 As the name MapReduce suggests, the reducer phase takes place after the mapper phase has been completed.
 So, the first is the map job, where a block of data is read and processed to produce key-value
pairs as intermediate outputs.
 The output of a Mapper or map job (key-value pairs) is input to the Reducer.
 The reducer receives the key-value pair from multiple map jobs.
 Then, the reducer aggregates those intermediate data tuples (intermediate key-value pair) into
a smaller set of tuples or key-value pairs which is the final output.

Advantages of MapReduce:

 Scalability. Businesses can process petabytes of data stored in the Hadoop Distributed File System (HDFS).
 Flexibility. Hadoop enables easier access to multiple sources of data and multiple types of data.
 Speed. With parallel processing and minimal data movement, Hadoop offers fast processing of massive amounts of data.
 Simplicity. Developers can write code in a choice of languages, including Java, C++ and Python.
 Cost reduction. Because MapReduce is highly scalable, it reduces storage and processing costs as data requirements grow.
 Fault tolerance. Because of its distributed nature, MapReduce is highly fail-safe; the distributed file systems that underpin MapReduce, together with the framework itself, allow jobs to recover from hardware failures.
Architecture Of MapReduce

MapReduce has two main phases, Map and Reduce; the five stages that a job run passes through are described in detail below.

Developing a MapReduce Application


MapReduce is a software framework for processing large data sets in a distributed fashion over several
machines.
 The Map Reduce Framework works in two main phases to process the data, which are the
"map" phase and the "reduce" phase.

 MapReduce application is based on Hadoop Distributed FileSystem, HDFS.


 Shuffle and Sort are the heart of MapReduce and make Big Data Analytics powerful.
 The Sort phase guarantees that the input to every reducer is sorted by key.
 The Shuffle phase transfers the map output to the reducers as input.
The core idea is <key, value> pairs: almost all data can be mapped into key-value pairs, and keys and values may be of any type.
 Developing a MapReduce Application consists of
1. The Configuration API (see the sketch below)
2. Setting Up the Development Environment
3. Writing a Unit Test with MR Unit
4. Running Locally on Test Data
5. Running on a Cluster
6. Tuning a Job
7. MapReduce Workflows

Figure 1 Developing a MapReduce application
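The first of these steps, the Configuration API, is worth a brief illustration. The following is a minimal sketch, assuming a hypothetical resource file configuration-1.xml on the classpath; org.apache.hadoop.conf.Configuration is the real Hadoop class, but the property names used here are placeholders for illustration only.

import org.apache.hadoop.conf.Configuration;

public class ConfigurationExample {
    public static void main(String[] args) {
        // Create an empty configuration and load properties from an XML resource.
        // "configuration-1.xml" is a hypothetical file used only for illustration.
        Configuration conf = new Configuration();
        conf.addResource("configuration-1.xml");

        // Read typed values; the second argument is the default used if the property is unset.
        String colour = conf.get("color", "unknown");
        int size = conf.getInt("size", 0);

        // Properties can also be set programmatically, overriding the resource files.
        conf.set("weight", "heavy");

        System.out.println(colour + " " + size + " " + conf.get("weight"));
    }
}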

Anatomy of MapReduce Job Run


You can run a MapReduce job with a single method call: submit() on a Job object (note that you can also
call waitForCompletion(), which will submit the job if it hasn’t been submitted already, then wait for it
to finish).
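As a sketch of what such a driver looks like, the following self-contained example submits a job with waitForCompletion(). The identity Mapper and Reducer base classes are used only to keep the sketch compilable; a real job would set its own mapper and reducer classes, and the input/output paths are supplied by the user.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "submit example");
        job.setJarByClass(SubmitExample.class);

        // Identity Mapper/Reducer keep this sketch self-contained; a real job sets its own.
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input path supplied by the user
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory must not exist

        // waitForCompletion() submits the job if it hasn't been submitted already,
        // then polls its progress once per second and reports it to the console.
        boolean success = job.waitForCompletion(true);
        System.exit(success ? 0 : 1);
    }
}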
Once we submit a MapReduce job, the system enters a series of life-cycle phases:
1. Job Submission Phase
2. Job Initialization Phase
3. Task Assignment Phase
4. Task Execution Phase
5. Progress and Status update Phase

The whole process is illustrated in figure. At the highest level, there are five independent entities:
 The client, which submits the MapReduce job.
 The YARN resource manager, which coordinates the allocation of compute resources on the cluster.
 The YARN node managers, which launch and monitor the compute containers on machines in
the cluster.
 The MapReduce application master, which coordinates the tasks running the MapReduce job. The application master and the MapReduce tasks run in containers that are scheduled by the resource manager and managed by the node managers.
 The distributed filesystem, which is used for sharing job files between the other entities.

1. Job Submission:
 The submit() method on Job creates an internal JobSubmitter instance
and calls submitJobInternal() on it.
 Having submitted the job, waitForCompletion polls the job’s progress once per second and
reports the progress to the console if it has changed since the last report.
 When the job completes successfully, the job counters are displayed. Otherwise, the error that caused the job to fail is logged to the console.
The job submission process implemented by JobSubmitter does the following:
 Asks the resource manager for a new application ID, used for the MapReduce job ID.

 Checks the output specification of the job. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program.
 Computes the input splits for the job. If the splits cannot be computed (because the input paths don't exist, for example), the job is not submitted and an error is thrown to the MapReduce program.
 Copies the resources needed to run the job, including the job JAR file, the configuration file, and
the computed input splits, to the shared filesystem in a directory named after the job ID.
 Submits the job by calling submitApplication() on the resource manager.
2. Job Initialization :
 When the resource manager receives a call to its submitApplication() method, it hands off
the request to the YARN scheduler.
 The scheduler allocates a container, and the resource manager then launches the
application master’s process there, under the node manager’s management.
 The application master for MapReduce jobs is a Java application whose main class is MRAppMaster .
 It initializes the job by creating a number of bookkeeping objects to keep track of the job’s
progress, as it will receive progress and completion reports from the tasks.
 It retrieves the input splits computed in the client from the shared filesystem.
 It then creates a map task object for each split, as well as a number of reduce task objects
determined by the mapreduce.job.reduces property (set by the setNumReduceTasks() method on
Job).
3. Task Assignment:
 If the job does not qualify for running as an uber task, then the application master requests containers for all the map and reduce tasks in the job from the resource manager.
 Requests for map tasks are made first and with a higher priority than those for reduce tasks,
since all the map tasks must complete before the sort phase of the reduce can start.
 Requests for reduce tasks are not made until 5% of map tasks have completed.
4. Task Execution:
 Once a task has been assigned resources for a container on a particular node by the resource
manager’s scheduler, the application master starts the container by contacting the node
manager.
 The task is executed by a Java application whose main class is YarnChild. Before it can run the task,
it localizes the resources that the task needs, including the job configuration and JAR file, and
any files from the distributed cache.
 Finally, it runs the map or reduce task.
Streaming:

 Streaming runs special map and reduce tasks for the purpose of launching the user-supplied executable and communicating with it.
 The Streaming task communicates with the process (which may be written in any language) using standard input and output streams.
 During execution of the task, the Java process passes input key-value pairs to the external process, which runs them through the user-defined map or reduce function and passes the output key-value pairs back to the Java process.
 From the node manager’s point of view, it is as if the child process ran the map or reduce code itself.
5. Progress and status updates :
 MapReduce jobs are long running batch jobs, taking anything from tens of seconds to hours to run.
 A job and each of its tasks have a status, which includes such things as the state of the job or task (e.g., running, successfully completed, failed), the progress of maps and reduces, the values of the job's counters, and a status message or description (which may be set by user code).
 When a task is running, it keeps track of its progress (i.e., the proportion of the task that is completed).
 For map tasks, this is the proportion of the input that has been processed.
 For reduce tasks, it’s a little more complex, but the system can still estimate the proportion of
the reduce input processed.
It does this by dividing the total progress into three parts, corresponding to the three phases of the shuffle.
 As the map or reduce task runs, the child process communicates with its parent application
master through the umbilical interface.
 The task reports its progress and status (including counters) back to its application master,
which has an aggregate view of the job, every three seconds over the umbilical interface.
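From the user-code side, a task can make its progress visible through its Context object. The following is a minimal sketch: context.setStatus() and user-defined counters are standard parts of the MapReduce API, but the counter group ("Quality") and counter name ("RECORDS_SEEN") used here are hypothetical names chosen for illustration.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class StatusReportingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Set a human-readable status message (visible in the job status and web UI).
        context.setStatus("processing offset " + key.get());

        // Increment a user-defined counter; counters are aggregated by the application master.
        context.getCounter("Quality", "RECORDS_SEEN").increment(1);

        context.write(new Text(value), new LongWritable(1));
    }
}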

How status updates are propagated through the MapReduce System


 The resource manager web UI displays all the running applications with links to the web UIs of
their respective application masters, each of which displays further details on the MapReduce job,
including its progress.
 During the course of the job, the client receives the latest status by polling the application
master every second (the interval is set via mapreduce.client.progressmonitor.pollinterval).
6. Job Completion:
 When the application master receives a notification that the last task for a job is complete,
it changes the status for the job to Successful.
 Then, when the Job polls for status, it learns that the job has completed successfully, so it prints a message to tell the user and then returns from waitForCompletion().
 Finally, on job completion, the application master and the task containers clean up their working state and the OutputCommitter's commitJob() method is called.
 Job information is archived by the job history server to enable later interrogation by users if desired.

MapReduce Example
The most famous example of MapReduce is the word count example.

 First, we divide the input into three splits as shown in the figure. This will distribute the work
among all the map nodes.
 Then, we tokenize the words in each of the mappers and give a hardcoded value (1) to each of the tokens or words. The rationale behind giving a hardcoded value equal to 1 is that every word, in itself, will occur once.
 Now, a list of key-value pairs will be created where the key is the individual word and the value is one. So, for the first line (Dear Bear River) we have 3 key-value pairs: (Dear, 1), (Bear, 1), (River, 1). The mapping process remains the same on all the nodes.
 After the mapper phase, a partition process takes place where sorting and shuffling happen so that all the tuples with the same key are sent to the corresponding reducer.
 So, after the sorting and shuffling phase, each reducer will have a unique key and a list of values corresponding to that key, for example (Bear, [1,1]), (Car, [1,1,1]), etc.
 Now, each reducer counts the values present in its list of values. As shown in the figure, the reducer for the key Bear gets the list [1,1]; it counts the number of ones in the list and gives the final output (Bear, 2).
 Finally, all the output key-value pairs are collected and written to the output file. A minimal mapper and reducer for this example are sketched below.
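The sketch below is one possible implementation of the word count mapper and reducer described above; the class names are illustrative, and the reducer can also be reused as a combiner since its input and output types match.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: emit (word, 1) for every token in the input line, e.g. (Bear, 1).
    public static class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the list of 1s for each word, e.g. (Bear, [1,1]) -> (Bear, 2).
    public static class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}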
Failures in MapReduce
***[Note: the failures listed below are the classical MapReduce failures. If the question asks about MapReduce failures on YARN, write the following instead: for MapReduce programs running on YARN, we need to consider the failure of any of the following entities: the task, the application master, the node manager, and the resource manager (see the "Failures in YARN" section).]***
In the real world, user code is buggy, processes crash, and machines fail. One of the major benefits of using
Hadoop is its ability to handle such failures and allow your job to complete. There are generally 3 types of
failures in MapReduce.
 Task Failure
 TaskTracker Failure
 JobTracker Failure

1. Task Failure
In Hadoop, task failure is similar to an employee making a mistake while doing a task. Consider you are
working on a large project that has been broken down into smaller jobs and assigned to different
employees in your team. If one of the team members fails to do their task correctly, the entire project may be
compromised. Similarly, in Hadoop, if a job fails due to a mistake or issue, it could affect overall data
processing, causing delays or faults in the final result.
Reasons for Task Failure
1. Limited memory
2. Failures of disk
3. Issues with software or hardware
How to Overcome Task Failure
1. Increase memory allocation
2. Implement fault tolerance mechanisms
3. Regularly update software and hardware
2. TaskTracker Failure
A TaskTracker in Hadoop is similar to an employee responsible for executing certain tasks in a large project. If
a TaskTracker fails, it signifies a problem occurred while an employee worked on their assignment. This
can interrupt the entire project, much as when a team member makes a mistake or encounters difficulties
with their task, producing delays or problems with the overall project's completion. To avoid TaskTracker
failures, ensure the TaskTracker's hardware and software are in excellent working order and have the
resources they need to do their jobs successfully.
Reasons for TaskTracker Failure
1. Hardware issues
2. Software problems or errors
3. Overload or resource exhaustion
How to Overcome TaskTracker Failure
1. Update software and hardware on a regular basis
2. Upgrade or replace hardware
3. Restart or reinstall the program

3. JobTracker Failure
A JobTracker in Hadoop is similar to a supervisor or manager that oversees the entire project and assigns
tasks to TaskTrackers (employees). If a JobTracker fails, it signifies the supervisor is experiencing a
problem or has stopped working properly. This can interrupt the overall project's coordination and
development, much as when a supervisor is unable to assign assignments or oversee their completion. To
avoid
JobTracker failures, it is critical to maintain the JobTracker's hardware and software, ensure adequate
resources, and fix any issues or malfunctions as soon as possible to keep the project going smoothly.
Reasons for JobTracker Failure
1. Database connectivity
2. Security problems
How to Overcome JobTracker Failure
1. Ensure reliable database connectivity
2. Address security-related problems promptly

Failures in YARN
For MapReduce programs running on YARN, we need to consider the failure of any of the
following entities: the task, the application master, the node manager, and the resource
manager.

1. Task Failure
Failure of the running task is similar to the classic case. Runtime exceptions and sudden exits of the
JVM are propagated back to the application master and the task attempt is marked as failed.
Likewise, hanging tasks are noticed by the application master by the absence of a ping over the umbilical channel (the timeout is set by mapreduce.task.timeout), and again the task attempt is marked as failed.

The configuration properties for determining when a task is considered to be failed are the same as
the classic case: a task is marked as failed after four attempts (set by mapreduce.map.maxattempts
for map tasks and mapreduce.reduce.maxattempts for reducer tasks). A job will be failed if more
than mapreduce.map.failures.maxpercent percent of the map tasks in the job fail, or more than
mapreduce.reduce.failures.maxpercent percent of the reduce tasks fail.
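As a hedged illustration, these thresholds can be adjusted through the job configuration. The property names below are the ones quoted above; the values chosen are arbitrary examples, not recommendations.

import org.apache.hadoop.conf.Configuration;

public class TaskFailureTuning {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Allow each map/reduce task up to 6 attempts instead of the default 4.
        conf.setInt("mapreduce.map.maxattempts", 6);
        conf.setInt("mapreduce.reduce.maxattempts", 6);

        // Tolerate up to 5% of map or reduce tasks failing without failing the whole job.
        conf.setInt("mapreduce.map.failures.maxpercent", 5);
        conf.setInt("mapreduce.reduce.failures.maxpercent", 5);

        // Consider a task hung if it reports no progress for 10 minutes (milliseconds).
        conf.setLong("mapreduce.task.timeout", 600000L);

        System.out.println("maxattempts = " + conf.get("mapreduce.map.maxattempts"));
    }
}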

2. Application Master Failure

Just like MapReduce tasks are given several attempts to succeed (in the face of hardware or
network failures) applications in YARN are tried multiple times in the event of failure. By default,
applications are marked as failed if they fail once, but this can be increased by setting the
property yarn.resourcemanager.am.max-retries.

An application master sends periodic heartbeats to the resource manager, and in the event of
application master failure, the resource manager will detect the failure and start a new instance of
the master running in a new container (managed by a node manager). In the case of the
MapReduce application master, it can recover the state of the tasks that had already been run by
the (failed) application so they don't have to be rerun. By default, recovery is not enabled, so a failed application master will rerun all of its tasks, but you can turn recovery on by setting yarn.app.mapreduce.am.job.recovery.enable to true.

The client polls the application master for progress reports, so if its application master fails the client
needs to locate the new instance. During job initialization the client asks the resource manager for the
application master's address, and then caches it, so it doesn't overload the resource manager with
a request every time it needs to poll the application master. If the application master fails, however,
the client will experience a timeout when it issues a status update, at which point the client will go
back to the resource manager to ask for the new application master's address.

3. Node Manager Failure


If a node manager fails, then it will stop sending heartbeats to the resource manager, and the node
manager will be removed from the resource manager's pool of available nodes. The property
yarn.resourcemanager.nm.liveness-monitor.expiry-interval-ms, which defaults to 600000 (10 minutes), determines the minimum time the resource manager waits before considering a node manager that has sent no heartbeat in that time as failed.

Any task or application master running on the failed node manager will be recovered using the
mechanisms described in the previous two sections.
Node managers may be blacklisted if the number of failures for the application is high. Blacklisting is
done by the application master, and for MapReduce the application master will try to reschedule
tasks on different nodes if more than three tasks fail on a node manager. The threshold may be set with mapreduce.job.maxtaskfailures.per.tracker.

4. Resource Manager Failure


Failure of the resource manager is serious, since without it neither jobs nor task containers can
be launched. The resource manager was designed from the outset to be able to recover from
crashes, by using a checkpointing mechanism to save its state to persistent storage, although at
the time of writing the latest release did not have a complete implementation.
After a crash, a new resource manager instance is brought up (by an administrator) and it recovers from the saved state. The state consists of the node managers in the system as well as the running applications. (Note that tasks are not part of the resource manager's state, since they are managed by the application. Thus the amount of state to be stored is much more manageable than that of the jobtracker.)
The storage used by the resource manager is configurable via the yarn.resourcemanager.store.class property. The default is org.apache.hadoop.yarn.server.resourcemanager.recovery.MemStore, which keeps the store in memory and is therefore not highly available. However, there is a ZooKeeper-based store in the works that will support reliable recovery from resource manager failures in the future.

Job Scheduling
Job scheduling is an important part of MapReduce, as it determines the order in which jobs are executed and
the resources that are allocated to them. There are a number of different job scheduling algorithms that
can be used in MapReduce, each with its own advantages and disadvantages.
There are mainly 3 types of Schedulers in Hadoop:
1. FIFO (First In First Out) Scheduler.
2. Capacity Scheduler.
3. Fair Scheduler.
1. FIFO Scheduler

As the name suggests FIFO i.e. First In First Out, so the tasks or application that comes first will be served first.
This is the default Scheduler we use in Hadoop. The tasks are placed in a queue and the tasks are
performed in their submission order. In this method, once the job is scheduled, no intervention is allowed. So
sometimes the high-priority process has to wait for a long time since the priority of the task does not
matter in this method.
Advantage:
 No need for configuration
 First Come First Serve
 Simple to execute
Disadvantage:
 Priority of task doesn’t matter, so high priority jobs need to wait
 Not suitable for shared cluster

2. Capacity Scheduler
In the Capacity Scheduler we have multiple job queues for scheduling our tasks. The Capacity Scheduler allows multiple tenants to share a large Hadoop cluster. For each job queue we provide some slots or cluster resources for performing job operations, and each queue has its own slots to perform its tasks. If only one queue has tasks to perform, its tasks can also use the free slots of the other queues; when new tasks arrive in another queue, the borrowed slots are handed back so that the queue can run its own jobs in its own slots.
The Capacity Scheduler also provides visibility into which tenant is utilizing more cluster resources or slots, so that a single user or application does not take a disproportionate or unnecessary share of slots in the cluster. The Capacity Scheduler mainly contains three types of queue: root, parent, and leaf, which represent the cluster, an organization or subgroup, and the point of application submission, respectively.
Advantage:
 Best for working with Multiple clients or priority jobs in a Hadoop cluster
 Maximizes throughput in the Hadoop cluster
Disadvantage:
 More complex
 Not easy to configure for everyone

3. Fair Scheduler
The Fair Scheduler is quite similar to the Capacity Scheduler, and the priority of a job is taken into consideration. With the Fair Scheduler, YARN applications can share the resources of a large Hadoop cluster, and these resources are allocated dynamically, so there is no need for prior capacity planning. Resources are distributed so that all applications within a cluster get, on average, an equal share over time. The Fair Scheduler makes scheduling decisions on the basis of memory by default, but it can be configured to schedule on CPU as well.

As noted, it is similar to the Capacity Scheduler, but the major thing to notice is that in the Fair Scheduler, whenever a high-priority job arrives in the same queue, it is processed in parallel by reclaiming some portion of the slots already dedicated to running jobs.
Advantages:
 Resources assigned to each application depend upon its priority.
 It can limit the number of concurrently running tasks in a particular pool or queue.
Disadvantage: Configuration is required.

Shuffle and Sort


The shuffle and sort phase is the core and heart of every MapReduce job; every MapReduce job goes through it. MapReduce makes the guarantee that the input to every reducer is sorted by key. The process by which the system performs the sort, and transfers the map outputs to the reducers as inputs, is known as the shuffle.

The Map Side


 When the map function produces output, it is written to a circular memory buffer.
 The buffer has a default size of 100 MB.
 When the buffer reaches a certain threshold (80% by default), a background thread starts spilling
its contents to disk.
 Before writing to disk, the data is divided into partitions corresponding to the reducers they will
be sent to.
 Within each partition, an in-memory sort by key is performed, and if there is a combiner function,
it is applied to reduce the data size.
 Spills are written to job-specific subdirectories in the directories specified by the
mapred.local.dir property.
 After the map task completes, the spill files are merged into a single partitioned and sorted
output file.
The Reduce Side
 The map outputs are needed by the reduce tasks for their particular partitions.
 The reduce task starts copying the map outputs as soon as each map task completes (copy phase).
 The copies are stored in the reduce task's JVM memory if they are small enough; otherwise,
they are copied to disk.
 The copied map outputs are merged and spilled to disk in rounds, maintaining their sort ordering.
 The merge is done by merging a certain number of files into one in each round, reducing
the number of intermediate files.
 The final merge directly feeds the reduce function, eliminating the need for an additional disk write.
 The reduce function is invoked for each key in the sorted output, and the final output is written
to the output filesystem.
Example: Efficiently Merging 40 file segments with a merge factor of 10.
The number of files merged in each round is actually more subtle than this example suggests. The goal is to merge the minimum number of files to get to the merge factor for the final round. So, if there were 40 files, the merge would not merge 10 files in each of the four rounds to get 4 files. Instead, the first round would merge only 4 files, and the subsequent three rounds would merge the full 10 files. The 4 merged files and the 6 (as yet unmerged) files make a total of 10 files for the final round. The process is illustrated in Figure 7-5.
Note that this does not change the number of rounds, it’s just an optimization to minimize the amount of
data that is written to disk, since the final round always merges directly into the reduce.
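The arithmetic can be sketched as follows. This assumes the planner's rule is to make the first round merge just (n - 1) mod (factor - 1) + 1 files so that every later round merges a full factor's worth of files; that rule is an assumption inferred from the 40-file example above, not a statement about the exact Hadoop source.

public class MergePlanSketch {
    // Number of files merged in the first round, under the assumed rule.
    static int firstRound(int files, int factor) {
        int remainder = (files - 1) % (factor - 1);
        return remainder == 0 ? factor : remainder + 1;
    }

    public static void main(String[] args) {
        int files = 40, factor = 10;
        int first = firstRound(files, factor);                    // 4 for the example above
        int afterFirst = files - first + 1;                       // 37 segments remain
        int fullRounds = (afterFirst - factor) / (factor - 1);    // 3 full rounds of 10
        int finalRound = afterFirst - fullRounds * (factor - 1);  // 10 files feed the reduce

        System.out.printf("first round merges %d, then %d rounds of %d, final round of %d%n",
                first, fullRounds, factor, finalRound);
    }
}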

Configuration Tuning
 Various configuration settings can be adjusted to optimize the shuffle and sort phase for improved performance.

Map-side tuning properties
1) io.sort.mb: Each map task has a memory buffer to which it writes its output. The buffer size can be changed with the io.sort.mb property; the default value is 100 MB.
2) io.sort.spill.percent: Defines the usage threshold for the buffer. When the contents of the buffer reach the threshold set by this property, a background thread starts to spill the contents to disk; the default value is 0.80 (80%). Output continues to be written to the buffer while the spill runs, but if the buffer fills up completely, the map blocks until the spill completes. One task may therefore produce multiple spill files, which are merged into a single partitioned and sorted file.
3) io.sort.factor: Controls the maximum number of streams merged at one time; the default value is 10.
4) mapred.local.dir: Spills are written, round-robin, to the directories specified by this property. The output is divided into partitions, a sort happens within each partition, and finally the combiner (if any) runs on the sorted output.
5) min.num.spills.for.combine: The minimum number of spill files required for the combiner to run before the output file is written.
6) mapred.compress.map.output: If set to true, the map output is compressed; by default it is not. Different compression libraries can be used.
7) tasktracker.http.threads: The maximum number of threads used to serve the map output file partitions to reducers. This setting is per tasktracker, not per map task slot.

Reduce-side tuning properties
1) mapred.reduce.parallel.copies: A reduce task starts copying map output as soon as each map task completes. To fetch the map output in parallel, the reduce task uses a number of copier threads, set by this property; the default value is 5.
2) mapred.job.shuffle.input.buffer.percent: Map output is copied into the reduce task's JVM memory; this property controls the proportion of the maximum heap size allocated to storing map outputs during the shuffle. If a copy does not fit, it is written to disk. The default value is 0.70 (70%).
3) mapred.job.shuffle.merge.percent: When the in-memory buffer reaches this usage threshold, an in-memory merge is initiated and the merged data is spilled to disk. The default value is 0.66 (66%).
4) mapred.inmem.merge.threshold: The threshold number of map outputs at which they are merged and spilled to disk.
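A hedged sketch of applying a few of these settings through the job configuration follows. The property names are the legacy ones used in the list above (newer Hadoop releases have renamed several of them, e.g. io.sort.mb became mapreduce.task.io.sort.mb), and the values are examples only.

import org.apache.hadoop.conf.Configuration;

public class ShuffleTuning {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Map side: larger sort buffer, higher spill threshold, compressed map output.
        conf.setInt("io.sort.mb", 200);                     // default 100 MB
        conf.setFloat("io.sort.spill.percent", 0.90f);      // default 0.80
        conf.setInt("io.sort.factor", 20);                  // default 10 streams per merge
        conf.setBoolean("mapred.compress.map.output", true);

        // Reduce side: more copier threads and a bigger in-memory shuffle buffer.
        conf.setInt("mapred.reduce.parallel.copies", 10);                 // default 5
        conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.80f);  // default 0.70
        conf.setFloat("mapred.job.shuffle.merge.percent", 0.66f);         // default 0.66

        System.out.println("io.sort.mb = " + conf.get("io.sort.mb"));
    }
}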

Task Execution
Task Execution Environment:
 Hadoop provides information to map or reduce tasks about their execution environment.
 Properties like mapred.job.id, mapred.tip.id, mapred.task.id, mapred.task.partition,
and mapred.task.is.map can be accessed from the job's configuration.
 Streaming programs can retrieve these properties as environment variables, where non-alphanumeric characters are replaced with underscores.
 Environment variables can also be set for Streaming processes using the -cmdenv option.
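A minimal sketch of reading the properties listed above from inside a task follows; in a Streaming script they would appear as environment variables such as mapred_task_id. The class name is illustrative, and the pass-through map function exists only to keep the sketch complete.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class EnvironmentAwareMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        // Legacy property names from the list above; newer releases also expose
        // equivalents such as mapreduce.task.attempt.id.
        String jobId = conf.get("mapred.job.id");
        String taskAttemptId = conf.get("mapred.task.id");
        boolean isMap = conf.getBoolean("mapred.task.is.map", true);

        System.err.println("job=" + jobId + " attempt=" + taskAttemptId + " map=" + isMap);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(new Text(key.toString()), value);  // pass-through, for illustration only
    }
}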
Speculative Execution:
 Speculative execution is a feature in MapReduce that launches backup tasks for slow-running tasks.
 It aims to reduce job execution time by running redundant tasks in parallel.
 Speculative execution does not launch two duplicate tasks at the same time so that they race each other; a speculative task is launched only after all the tasks for a job have been launched.
 Speculative tasks are only launched for tasks that have been running for some time and have
made slower progress compared to other tasks.
 When a task completes successfully, any duplicate tasks running are killed.
 Speculative execution can be enabled or disabled independently for map tasks and reduce
tasks, either on a cluster-wide or per-job basis.
 It is turned on by default, but it can be disabled to improve cluster efficiency or for non-idempotent tasks (see the configuration sketch below).
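A sketch of toggling speculative execution per job through the Job API follows; setMapSpeculativeExecution() and setReduceSpeculativeExecution() exist on org.apache.hadoop.mapreduce.Job, and the choice to disable only reduce-side speculation here is just an example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculationSettings {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "speculation example");

        // Keep speculative execution for map tasks (the default) ...
        job.setMapSpeculativeExecution(true);

        // ... but disable it for reduce tasks, e.g. when the reducer writes to an
        // external system and is not idempotent.
        job.setReduceSpeculativeExecution(false);

        System.out.println("configured: " + job.getJobName());
    }
}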
Output Committers:
 Hadoop MapReduce uses OutputCommitters to ensure jobs and tasks either succeed or fail cleanly.
 The OutputCommitter API provides methods for setup, commit, and abort operations.
 The setupJob() method is called before the job runs and is typically used for initialization.
 The commitJob() method is called if the job succeeds, and it performs cleanup operations
like deleting temporary working spaces and creating a _SUCCESS marker file.
 The abortJob() method is called if the job fails or is killed, and it performs cleanup operations
as well.
 The setupTask() method is called before each task runs and is used for any necessary setup.
 The commitTask() and abortTask() methods are called for each task to commit or abort the
task's outputs.
 The needsTaskCommit() method determines if the commit phase for tasks is necessary, and it
can be disabled to save resources.
OutputCommitters allow customization and can be overridden or implemented differently for specific requirements, such as special setup or cleanup operations; a minimal skeleton is sketched below.
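The following is a minimal skeleton of a custom committer, assuming the newer org.apache.hadoop.mapreduce API; the class name is illustrative and the method bodies are placeholders showing where the operations described above would go.

import java.io.IOException;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.JobStatus;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class SkeletonOutputCommitter extends OutputCommitter {

    @Override
    public void setupJob(JobContext jobContext) throws IOException {
        // Called before the job runs: create temporary/working directories, etc.
    }

    @Override
    public void commitJob(JobContext jobContext) throws IOException {
        // Called on success: clean up working state, write a _SUCCESS marker, etc.
    }

    @Override
    public void abortJob(JobContext jobContext, JobStatus.State state) throws IOException {
        // Called on failure or kill: clean up the job's working state.
    }

    @Override
    public void setupTask(TaskAttemptContext taskContext) throws IOException {
        // Per-task setup; often a no-op because task outputs are written lazily.
    }

    @Override
    public boolean needsTaskCommit(TaskAttemptContext taskContext) throws IOException {
        // Return false to skip the commit phase for tasks that wrote no output.
        return true;
    }

    @Override
    public void commitTask(TaskAttemptContext taskContext) throws IOException {
        // Promote the task attempt's output to its final location.
    }

    @Override
    public void abortTask(TaskAttemptContext taskContext) throws IOException {
        // Discard the task attempt's output.
    }
}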

Map Reduce Types


The map and reduce functions in Hadoop MapReduce have the following general form:
map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
Here the map input key and value types (K1, V1) can differ from the map output types (K2, V2). The reduce input types must match the map output types, but the reduce output types (K3, V3) may be different again. The Java API mirrors this general form:

public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {


public class Context extends MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
// ...
}
protected void map(KEYIN key, VALUEIN value,
Context context) throws IOException, InterruptedException {
// ...
}
}
public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
public class Context extends ReducerContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
// ...
}
protected void reduce(KEYIN key, Iterable<VALUEIN> values,
Context context) throws IOException, InterruptedException {
// ...
}
}

The context objects are used for emitting key-value pairs, so they are parameterized by the output types,
so that the signature of the write() method is:
public void write(KEYOUT key, VALUEOUT value)
throws IOException, InterruptedException
If a combine function is used, then it is the same form as the reduce function (and is an implementation of
Reducer), except its output types are the intermediate key and value types (K2 and V2), so they can feed
the reduce function:

map: (K1, V1) → list(K2, V2)
combine: (K2, list(V2)) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
Often the combine and reduce functions are the same, in which case, K3 is the same as K2, and V3 is the
same as V2. The partition function operates on the intermediate key and value types (K2 and V2), and
returns the partition index. In practice, the partition is determined solely by the key (the value is ignored):
partition: (K2, V2) → integer
Or in Java:
public abstract class Partitioner<KEY, VALUE> {
public abstract int getPartition(KEY key, VALUE value, int numPartitions);
}
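For example, a hash-based partitioner over the intermediate key might look like the sketch below. This mirrors what Hadoop's default HashPartitioner does; the class name and the (Text, IntWritable) intermediate types are illustrative.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Partitions intermediate (K2, V2) = (Text, IntWritable) pairs by key hash only;
// the value is ignored, as described above.
public class KeyHashPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask off the sign bit so the result is always in [0, numPartitions).
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}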

MapReduce formats
Hadoop can process many different types of data formats, from flat text files to databases. In this section,
we explore the different formats available.
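Which format a job uses is set on the Job object. The following is a minimal sketch using the built-in text formats; TextInputFormat and TextOutputFormat are real Hadoop classes, while the "input" and "output" paths are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class FormatExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "format example");
        job.setJarByClass(FormatExample.class);

        // TextInputFormat (the default) presents each line as a (LongWritable offset, Text line) record.
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("input"));     // placeholder path

        // TextOutputFormat writes key TAB value lines to the output directory.
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("output"));  // placeholder path

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}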
