
SYLLABUS

History of Hadoop - The Hadoop Distributed File System - Components of Hadoop -
Analyzing the Data with Hadoop - Scaling Out - Hadoop Streaming - Design of HDFS - Java
Interfaces to HDFS - How MapReduce Works - Anatomy of a MapReduce Job Run -
Failures - Job Scheduling - Shuffle and Sort - Task Execution - MapReduce Features

1. History of Hadoop
Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used
text search library. Hadoop has its origins in Apache Nutch, an open source web search engine,
itself a part of the Lucene project.
In January 2008, Hadoop was made its own top-level project at Apache, confirming its
success and its diverse, active community. By this time, Hadoop was being used by many other
companies besides Yahoo!, such as Last.fm, Facebook, and the New York Times. In one well-
publicized feat, the New York Times used Amazon's EC2 compute cloud to crunch through four
terabytes of scanned archives from the paper, converting them to PDFs for the Web. The
processing took less than 24 hours to run using 100 machines, and the project probably would
not have been embarked upon without the combination of Amazon's pay-by-the-hour
model (which allowed the NYT to access a large number of machines for a short period) and
Hadoop's easy-to-use parallel programming model.
In April 2008, Hadoop broke a world record to become the fastest system to sort an
entire terabyte of data. Running on a 910-node cluster, Hadoop sorted 1 terabyte in 209 seconds
(just under 3.5 minutes). In November of
the same year, Google reported that its MapReduce implementation sorted 1 terabyte in 68
seconds. Then, in April 2009, it was announced that a team at Yahoo! had used Hadoop to sort
1 terabyte in 62 seconds. The trend since then has been to sort even larger volumes of data at
ever faster rates. In the 2014 competition, a team from Databricks were joint winners of the
Gray Sort benchmark. They used a 207-node Spark cluster to sort 100 terabytes of data in 1,406
seconds, a rate of 4.27 terabytes per minute.
Today, Hadoop is widely used in mainstream enterprises. Hadoop's role as a general-
purpose storage and analysis platform for big data has been recognized by the industry, and
this fact is reflected in the number of products that use or incorporate Hadoop in some way.
Commercial Hadoop support is available from large, established enterprise vendors, including
EMC, IBM, Microsoft, and Oracle, as well as from specialist Hadoop companies such as
Cloudera, Hortonworks, and MapR.
HADOOP BASICS:
Need to process huge datasets on large clusters of computers
Very expensive to build reliability into each application
Nodes fail every day
Failure is expected, rather than exceptional
The number of nodes in a cluster is not constant
Need a common infrastructure
Efficient, reliable, easy to use
Open Source, Apache Licence
Key Benefit & Flexibility

Client-server Concept

A client sends a request to a server, and the server returns
the requested information to the client.
The server listens on a known port number for requests.
Examples:
Server - web server
Client - web browser
Very Large Distributed File System
10K nodes, 100 million files, 10PB
Assumes Commodity Hardware
Files are replicated to handle hardware failure
Detect failures and recover from them
Optimized for Batch Processing
Data locations exposed so that computations can move to where data resides
Provides very high aggregate bandwidth

2. The Hadoop Distributed File System (HDFS)


An HDFS cluster has two types of nodes operating in a master-worker pattern:
i. A namenode (the master)
ii. A number of datanodes (the workers)

Fig. 3.1 HDFS Architecture


Namenode manages the filesystem namespace
It maintains the file system tree and the metadata for all the files and directories in the
tree
The information is stored persistently on the local disk in the form of two files: the
namespace image and the edit log
The namenode also knows the datanodes on which all the blocks for a given file are
located
However, it does not store block locations persistently, because this information is
reconstructed from datanodes when the system starts

Datanodes are the workhorses of the filesystem
They store and retrieve blocks when they are told to (by clients or namenode)
And they report back to the namenode periodically with lists of blocks that they are
storing
Without the namenode, the filesystem cannot be used
If the machine running the namenode were obliterated, all the files on the filesystem
would be lost since there would be no way of knowing how to reconstruct the files from
the blocks on the datanodes.
For this reason, it is important to make the namenode resilient to failure

Fig. 3.2 HDFS Data Blocks

Fig. 3.3 DataNode Failure

It is important to make the namenode resilient to failure;
Hadoop provides two mechanisms for this
1. The first way is to back up the files that make up the persistent state of the filesystem
metadata. Hadoop can be configured so that the namenode writes its persistent state to
multiple filesystems.
2. Run a secondary namenode, which despite its name does not act as a namenode. Its main
role is to periodically merge the namespace image with the edit log, to prevent the edit
log from becoming too large.
Block Caching
Normally, a datanode reads blocks from disk, but for frequently accessed files the
blocks may be explicitly cached in the datanode's memory, in an off-heap block cache.
Job schedulers (for MapReduce, Spark and other frameworks) can take advantage of
cached blocks by running tasks on the datanode where a block is cached, for increased
read performance
Users or applications instruct the namenode which files to cache (and for how long)
by adding a cache directive to a cache pool
Cache pools are an administrative grouping for managing cache permissions and
resource usage
HDFS Federation
The namenode keeps a reference to every file and block in the filesystem in memory
which means that on very large clusters with many files, memory becomes the limiting
factor for scaling
HDFS Federation, introduced in the 2.x release series, allows a cluster to scale by
adding namenodes, each of which manages a portion of the filesystem namespace
For e.g., one namenode might manage all the files rooted under /user and a second
namenode might handle files under /share
Under federation, each namenode manages a namespace volume,
which is made up of the metadata for the namespace and
a block pool containing all the blocks for the files in the namespace
Namespace volumes are independent of each other, which means namenodes do not
communicate with one another
Failure of one namenode does not affect the availability of the namespace managed by
other namenodes
Block pool storage is not partitioned, however, so datanodes register with each
namenode in the cluster and store blocks from multiple block pools
HDFS High Availability
The combination of replicating namenode metadata on multiple filesystems and using
the secondary namenode to create checkpoints protects against data loss, but it does not
provide high availability of the filesystem
The namenode is still a single point of failure (SPOF)
If it fails, all clients including MapReduce jobs would be unable to read, write or list
files,
because the namenode is the sole repository of the metadata and the file-to-
block mapping
In such an event, the whole Hadoop system would effectively be out of service until a
new namenode could be brought online

To recover from a failed namenode in this situation, an administrator starts a new
primary namenode with one of the filesystem metadata replicas and configures
datanodes and clients to use this new namenode
The new namenode is not able to serve requests until it has
Loaded its namespace image into memory
Replayed its edit log
Received enough block reports from the datanodes to leave safe mode
On large clusters with many files and blocks , the time it takes for a namenode to start
can be 30 minutes or more
The long recovery time is a problem for routine maintenance
Hadoop remedied this situation by adding support for HDFS high availability
In this implementation, there are a pair of namenodes in an active-standby configuration
In the event of the failure of the active namenode, the standby takes over its duties to
continue servicing client requests without a significant interruption
A few architectural changes are needed to allow this to happen:
The namenodes must use highly available shared storage to share the edit log
Datanodes must send block reports to both namenodes, because the block mappings
are stored in a namenode's memory, not on disk
Clients must be configured to handle namenode failover, using a mechanism
that is transparent to users
The secondary namenode's role is subsumed by the standby, which takes
periodic checkpoints of the active namenode's namespace

3. Components of Hadoop

Fig. 3.4 Components of Hadoop

Fig. 3.5 HDFS Architecture

Fig. 3.6 MapReduce

Fig. 3.7 MapReduce

Fig. 3.8 Components of Hadoop

Fig. 3.9 YARN

Fig. 3.10 YARN

YARN is short for Yet Another Resource Negotiator.


It is like the operating system of Hadoop as it monitors and manages the resources.
Yarn came into the picture with the launch of Hadoop 2.x in order to allow different
workloads.
It handles the workloads like stream processing, interactive processing, and batch
processing over a single platform.
YARN has two main components: the Node Manager and the Resource Manager.
Node Manager
It is a per-node agent and takes care of the individual compute nodes in a Hadoop
cluster.
It monitors the resource usage like CPU, memory etc. of the local node and intimates
the same to Resource Manager.
Resource Manager
It is responsible for tracking the resources in the cluster and scheduling tasks like map-
reduce jobs.
In addition, there are an Application Master and a Scheduler in YARN.

Fig. 3.11 YARN


The Application Master has two functions:
Negotiating resources from Resource Manager
Working with NodeManager to monitor and execute the sub-task.
The functions of the Resource Scheduler are:
It allocates resources to various running applications
But it does not monitor the status of the application.
So in the event of failure of the task, it does not restart the same.
Another concept is the Container.
It is a fraction of the NodeManager's capacity (CPU, memory, disk, network,
etc.).

Fig. 3.12 Need for YARN

Fig. 3.13 YARN Advantages

Fig. 3.14 YARN Infrastructure

Fig. 3.15 YARN Resource Manager

Fig. 3.16 Application workflow in Hadoop YARN

1. Client submits an application


2. The Resource Manager allocates a container to start the Application Master
3. The Application Master registers itself with the Resource Manager
4. The Application Master negotiates containers from the Resource Manager
5. The Application Master notifies the Node Manager to launch containers
6. Application code is executed in the container
7. The client contacts the Resource Manager/Application Master to monitor the application's status
8. Once the processing is complete, the Application Master un-registers with the
Resource Manager
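
To make the workflow concrete, the following is a minimal, hedged sketch of how a client might talk to the Resource Manager through the YARN client API (org.apache.hadoop.yarn.client.api.YarnClient). The application name, the resources requested for the Application Master container, and the launch command are hypothetical placeholders, not values taken from the text above.

import java.util.Collections;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnSubmitSketch {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    // Steps 1-2: ask the Resource Manager for a new application
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
    appContext.setApplicationName("demo-app");              // hypothetical name
    appContext.setResource(Resource.newInstance(1024, 1));  // 1 GB, 1 vcore for the AM container

    // Command that launches the Application Master (placeholder class name)
    ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
        null, null,
        Collections.singletonList("java com.example.MyAppMaster 1>stdout 2>stderr"),
        null, null, null);
    appContext.setAMContainerSpec(amContainer);

    // Submit and then check the application's state
    ApplicationId appId = yarnClient.submitApplication(appContext);
    ApplicationReport report = yarnClient.getApplicationReport(appId);
    System.out.println(appId + " is in state " + report.getYarnApplicationState());
    yarnClient.stop();
  }
}

A real application would also package local resources (the job JAR, configuration files) into the ContainerLaunchContext before submitting.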
This section gives an overview of the different components of the Hadoop ecosystem that
make Hadoop so powerful and due to which several Hadoop job roles are available now. We will
also learn about Hadoop ecosystem components such as HDFS and its components, MapReduce,
YARN, Hive, Apache Pig, Apache HBase and its components, HCatalog, Avro, Thrift, Drill,
Apache Mahout, Sqoop, Apache Flume, Ambari, ZooKeeper and Apache Oozie, to deep dive into
Big Data Hadoop and to acquire master-level knowledge of the Hadoop ecosystem.

Fig. 3.17 Hadoop Ecosystem and Their Components

The list of Hadoop Components is discussed in this section one by one in detail.
Hadoop Distributed File System
It is the most important component of Hadoop Ecosystem. HDFS is the primary storage
system of Hadoop. The Hadoop Distributed File System (HDFS) is a Java-based file system that
provides scalable, fault-tolerant, reliable and cost-efficient data storage for big data. HDFS is
a distributed filesystem that runs on commodity hardware. HDFS comes with a default
configuration that suits many installations; large clusters usually require further tuning. Users
interact directly with HDFS through shell-like commands.
HDFS Components:
There are two major components of Hadoop HDFS: NameNode and DataNode. Let us
now discuss these HDFS components.
i. NameNode
It is also known as Master node. NameNode does not store actual data or dataset.
NameNode stores Metadata i.e. number of blocks, their location, on which Rack, which
Datanode the data is stored and other details. It consists of files and directories.
Tasks of HDFS NameNode
Manage file system namespace.

Executes file system operations such as naming, closing, and opening files and directories.
ii. DataNode
It is also known as Slave. HDFS Datanode is responsible for storing actual data in
HDFS. Datanode performs read and write operation as per the request of the clients. Replica
block of a Datanode consists of two files on the file system: the first file is for the data and the
second file is for recording the block's metadata. At startup, each Datanode connects to its
corresponding Namenode and does handshaking. Verification of the namespace ID and the
software version of the DataNode takes place during handshaking. If a mismatch is found, the
DataNode goes down automatically.
Tasks of HDFS DataNode
DataNode performs operations like block replica creation, deletion, and replication
according to the instruction of NameNode.
DataNode manages data storage of the system.
This was all about HDFS as a Hadoop Ecosystem component.
MapReduce
Hadoop MapReduce is the core Hadoop ecosystem component which provides data
processing. MapReduce is a software framework for easily writing applications that process
the vast amount of structured and unstructured data stored in the Hadoop Distributed File
system.
MapReduce programs are parallel in nature, thus are very useful for performing large-
scale data analysis using multiple machines in the cluster. This parallel processing improves the
speed and reliability of the cluster.

Fig. 3.18 Hadoop MapReduce


Working of MapReduce
MapReduce works by breaking the processing into two phases:
Map phase
Reduce phase
Each phase has key-value pairs as input and output. In addition, programmer also specifies
two functions: map function and reduce function. Map function takes a set of data and converts
it into another set of data, where individual elements are broken down into tuples (key/value
pairs). Reduce function takes the output from the Map as an input and combines those data
tuples based on the key and accordingly modifies the value of the key.

Features of MapReduce
Simplicity - MapReduce jobs are easy to run. Applications can be written in any language
such as Java, C++, and Python.
Scalability - MapReduce can process petabytes of data.
Speed - By means of parallel processing, problems that take days to solve can be solved in
hours or minutes by MapReduce.
Fault Tolerance - MapReduce takes care of failures. If one copy of the data is unavailable,
another machine has a copy of the same key pair which can be used for solving the same
subtask.
YARN
` Hadoop YARN (Yet Another Resource Negotiator) is a Hadoop ecosystem component
that provides the resource management. Yarn is also one the most important component of
Hadoop Ecosystem. YARN is called as the operating system of Hadoop as it is responsible for
managing and monitoring workloads. It allows multiple data processing engines such as real-
time streaming and batch processing to handle data stored on a single platform.

Fig. 3.19 Hadoop Yarn Diagram


YARN has been projected as a data operating system for Hadoop2. Main features of YARN
are:
Flexibility - Enables other purpose-built data processing models beyond MapReduce
(batch), such as interactive and streaming. Due to this feature of YARN, other applications
can also be run along with MapReduce programs in Hadoop 2.
Efficiency - As many applications run on the same cluster, the efficiency of Hadoop
increases without much effect on the quality of service.
Shared - Provides a stable, reliable, secure foundation and shared operational services
across multiple workloads. Additional programming models such as graph processing and
iterative modelling are now possible for data processing.

Hive
The Hadoop ecosystem component, Apache Hive, is an open source data warehouse
system for querying and analyzing large datasets stored in Hadoop files. Hive performs three main
functions: data summarization, query, and analysis. Hive uses a language called HiveQL (HQL),
which is similar to SQL. HiveQL automatically translates SQL-like queries into MapReduce
jobs which will execute on Hadoop.

Fig. 3.20 Hive Diagram


Main parts of Hive are:
Metastore - It stores the metadata.
Driver - Manages the lifecycle of a HiveQL statement.
Query compiler - Compiles HiveQL into a Directed Acyclic Graph (DAG).
Hive server - Provides a Thrift interface and a JDBC/ODBC server.
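
As a rough illustration of the Hive server's JDBC interface, the sketch below submits a HiveQL query through the standard JDBC API; the HiveServer2 address, the credentials, and the weather table are hypothetical, and the Hive JDBC driver JAR is assumed to be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");        // HiveServer2 JDBC driver
    Connection conn = DriverManager.getConnection(
        "jdbc:hive2://localhost:10000/default", "hive", ""); // hypothetical host/port/user
    Statement stmt = conn.createStatement();
    // The query compiler translates this SQL-like statement into MapReduce (or other) jobs
    ResultSet rs = stmt.executeQuery(
        "SELECT year, MAX(temperature) FROM weather GROUP BY year");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getInt(2));
    }
    rs.close();
    stmt.close();
    conn.close();
  }
}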
Pig
Apache Pig is a high-level language platform for analyzing and querying huge datasets
that are stored in HDFS. Pig, as a component of the Hadoop ecosystem, uses the Pig Latin language,
which is very similar to SQL. It loads the data, applies the required filters and dumps the data in the
required format. For program execution, Pig requires a Java runtime environment.

Fig. 3.21 Pig Diagram

Features of Apache Pig:


Extensibility - For carrying out special-purpose processing, users can create their own
functions.
Optimization opportunities - Pig allows the system to optimize execution automatically. This
allows the user to pay attention to semantics instead of efficiency.
Handles all kinds of data - Pig analyzes both structured as well as unstructured data.
HBase
Apache HBase is a Hadoop ecosystem component which is a distributed database that
was designed to store structured data in tables that could have billions of rows and millions of
columns. HBase is a scalable, distributed NoSQL database that is built on top of HDFS.
HBase provides real-time access to read or write data in HDFS.
Components of Hbase
There are two HBase Components namely- HBase Master and RegionServer.
i. HBase Master
It is not part of the actual data storage but negotiates load balancing across all
RegionServers.
Maintains and monitors the Hadoop cluster.
Performs administration (an interface for creating, updating and deleting tables).
Controls the failover.
HMaster handles DDL operations.
ii. RegionServer
It is the worker node which handles read, write, update and delete requests from
clients. The region server process runs on every node in the Hadoop cluster. The region server
runs on the HDFS DataNode.
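
As a hedged sketch of how a client reads and writes an HBase table through the Java client API; the table name weather, the column family d, and the stored values are hypothetical placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWriteSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("weather"))) { // hypothetical table
      // Write one cell: row key "1950", column family "d", qualifier "max"
      Put put = new Put(Bytes.toBytes("1950"));
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("max"), Bytes.toBytes("22"));
      table.put(put);
      // Read it back
      Result result = table.get(new Get(Bytes.toBytes("1950")));
      byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("max"));
      System.out.println("max temperature for 1950 = " + Bytes.toString(value));
    }
  }
}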

Fig. 3.22 HBase Diagram
HCatalog
It is a table and storage management layer for Hadoop. HCatalog supports different
components available in Hadoop ecosystems like MapReduce, Hive, and Pig to easily read and
write data from the cluster. HCatalog is a key component of Hive that enables the user to store
their data in any format and structure.
By default, HCatalog supports RCFile, CSV, JSON, sequenceFile and ORC file formats.
Benefits of HCatalog:
Enables notifications of data availability.
With the table abstraction, HCatalog frees the user from the overhead of data storage.
Provides visibility for data cleaning and archiving tools.
Avro
Avro is a part of the Hadoop ecosystem and is a popular data serialization
system. Avro is an open source project that provides data serialization and data exchange
services for Hadoop. These services can be used together or independently. Using Avro, big
data can be exchanged between programs written in different languages. Using the serialization
service, programs can serialize data into files or messages. Avro stores the data definition and the
data together in one message or file, making it easy for programs to dynamically understand the
information stored in an Avro file or message.

Avro schema - Avro relies on schemas for serialization/deserialization. Avro requires the schema
for data writes/reads. When Avro data is stored in a file, its schema is stored with it, so that files
may be processed later by any program.
Dynamic typing - It refers to serialization and deserialization without code generation. It
complements the code generation which is available in Avro for statically typed languages as
an optional optimization.
Features provided by Avro:
Rich data structures.
Remote procedure call.
Compact, fast, binary data format.
Container file, to store persistent data.
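
A minimal sketch of Avro's generic Java API, using a hypothetical Weather record schema, shows how the schema travels with the data inside a container file.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical schema: a record with a year and a temperature
    String schemaJson = "{\"type\":\"record\",\"name\":\"Weather\","
        + "\"fields\":[{\"name\":\"year\",\"type\":\"string\"},"
        + "{\"name\":\"temperature\",\"type\":\"int\"}]}";
    Schema schema = new Schema.Parser().parse(schemaJson);

    GenericRecord record = new GenericData.Record(schema);
    record.put("year", "1950");
    record.put("temperature", 22);

    // The schema is stored in the container file along with the data
    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
    writer.create(schema, new File("weather.avro"));
    writer.append(record);
    writer.close();
  }
}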
Thrift
It is a software framework for scalable cross-language services development. Thrift is
an interface definition language for RPC (Remote Procedure Call) communication. Hadoop does
a lot of RPC calls, so there is a possibility of using the Hadoop ecosystem component Apache Thrift
for performance or other reasons.

Fig. 3.23 Thrift Diagram


Apache Drill
The main purpose of this Hadoop ecosystem component is large-scale data processing,
including structured and semi-structured data. Drill is a low-latency distributed query engine that
is designed to scale to several thousands of nodes and query petabytes of data. Drill is the
first distributed SQL query engine that has a schema-free model.
Application of Apache drill
Drill has become an invaluable tool at Cardlytics, a company that provides consumer
purchase data for mobile and internet banking. Cardlytics is using Drill to quickly process
trillions of records and execute queries.
Features of Apache Drill:
Drill has a specialized memory management system that eliminates garbage collection
and optimizes memory allocation and usage. Drill plays well with Hive by allowing developers
to reuse their existing Hive deployment.
Extensibility - Drill provides an extensible architecture at all layers, including the query layer,
query optimization, and the client API. We can extend any layer for the specific needs of an
organization.
Flexibility - Drill provides a hierarchical columnar data model that can represent complex,
highly dynamic data and allow efficient processing.
Dynamic schema discovery - Apache Drill does not require schema or type specification
for data in order to start the query execution process. Instead, Drill starts processing the
data in units called record batches and discovers the schema on the fly during processing.
Decentralized metadata - Unlike other SQL-on-Hadoop technologies, Drill does not
have a centralized metadata requirement. Drill users do not need to create and manage tables
in metadata in order to query data.
Apache Mahout
Mahout is an open source framework for creating scalable machine learning algorithms and a
data mining library. Once data is stored in Hadoop HDFS, Mahout provides the data science
tools to automatically find meaningful patterns in those big data sets.
Algorithms of Mahout are:
Clustering - It takes items in a particular class and organizes them into naturally
occurring groups, such that items belonging to the same group are similar to each other.
Collaborative filtering - It mines user behaviour and makes product recommendations
(e.g. Amazon recommendations).
Classification - It learns from existing categorizations and then assigns unclassified items
to the best category.
Frequent pattern mining - It analyzes items in a group (e.g. items in a shopping cart or
terms in a query session) and then identifies which items typically appear together.
Apache Sqoop
Sqoop imports data from external sources into related Hadoop ecosystem components
like HDFS, HBase or Hive. It also exports data from Hadoop to other external sources. Sqoop
works with relational databases such as Teradata, Netezza, Oracle, and MySQL.
Features of Apache Sqoop:
Import sequential datasets from the mainframe - Sqoop satisfies the growing need to move
data from the mainframe to HDFS.
Import directly to ORC files - Improves compression and lightweight indexing and improves
query performance.
Parallel data transfer - For faster performance and optimal system utilization.
Efficient data analysis - Improves the efficiency of data analysis by combining structured data
and unstructured data in a schema-on-read data lake.
Fast data copies - from an external system into Hadoop.

Fig. 3.24 Apache Sqoop Diagram


Apache Flume
Flume efficiently collects, aggregates and moves a large amount of data from its origin
and sends it to HDFS. It is a fault-tolerant and reliable mechanism. This Hadoop
ecosystem component allows the data to flow from the source into the Hadoop environment. It uses
a simple extensible data model that allows for online analytic applications. Using Flume, we
can get the data from multiple servers immediately into Hadoop.

Fig. 3.25 Apache Flume

Ambari
Ambari, another Hadoop ecosystem component, is a management platform for
provisioning, managing, monitoring and securing Apache Hadoop clusters. Hadoop management
gets simpler as Ambari provides a consistent, secure platform for operational control.

Fig. 3.26 Ambari Diagram


Features of Ambari:
Simplified installation, configuration, and management - Ambari easily and efficiently
creates and manages clusters at scale.
Centralized security setup - Ambari reduces the complexity of administering and configuring
cluster security across the entire platform.
Highly extensible and customizable - Ambari is highly extensible for bringing custom
services under management.
Full visibility into cluster health - Ambari ensures that the cluster is healthy and available
with a holistic approach to monitoring.
Zookeeper
Apache Zookeeper is a centralized service and a Hadoop Ecosystem component for
maintaining configuration information, naming, providing distributed synchronization and
providing group services. Zookeeper manages and coordinates a large cluster of machines.

Fig. 3.27 ZooKeeper Diagram
Features of Zookeeper:
Fast - ZooKeeper is fast with workloads where reads of the data are more common than writes.
The ideal read/write ratio is 10:1.
Ordered - ZooKeeper maintains a record of all transactions.
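
A minimal sketch of the ZooKeeper Java client, assuming an ensemble reachable at localhost:2181; the znode path and the stored value are hypothetical placeholders.

import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperSketch {
  public static void main(String[] args) throws Exception {
    final CountDownLatch connected = new CountDownLatch(1);
    // The watcher releases the latch once the session is established
    ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, new Watcher() {
      public void process(WatchedEvent event) {
        if (event.getState() == KeeperState.SyncConnected) {
          connected.countDown();
        }
      }
    });
    connected.await();
    // Store a small piece of shared configuration under a znode
    byte[] data = "replication=3".getBytes(StandardCharsets.UTF_8);
    zk.create("/app-config", data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    // Any client connected to the same ensemble can read (and watch) the same value
    byte[] read = zk.getData("/app-config", false, null);
    System.out.println(new String(read, StandardCharsets.UTF_8));
    zk.close();
  }
}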
Oozie
It is a workflow scheduler system for managing Apache Hadoop jobs. Oozie combines
multiple jobs sequentially into one logical unit of work. The Oozie framework is fully integrated
with the Apache Hadoop stack, with YARN as its architectural center, and supports Hadoop jobs for
Apache MapReduce, Pig, Hive, and Sqoop.

Fig. 3.28 Oozie Diagram


In Oozie, users can create a Directed Acyclic Graph of workflows, which can run in
parallel and sequentially in Hadoop. Oozie is scalable and can manage the timely execution of
thousands of workflows in a Hadoop cluster. Oozie is very flexible as well. One can easily
start, stop, suspend and rerun jobs. It is even possible to skip a specific failed node or rerun it
in Oozie.
There are two basic types of Oozie jobs:

Oozie workflow - Stores and runs workflows composed of Hadoop jobs (e.g.,
MapReduce, Pig, Hive).
Oozie Coordinator - Runs workflow jobs based on predefined schedules and the availability
of data.
This was all about the components of the Hadoop ecosystem.

4. Analyzing the Data with Hadoop


Map and Reduce
MapReduce works by breaking the processing into two phases:
the map phase and the reduce phase.
Each phase has key-value pairs as input and output, the types of which may be chosen
by the programmer.
The programmer also specifies two functions: the map function and the reduce function.
The input to our map phase is the raw NCDC data.
A text input format is chosen that gives each line in the dataset as a text value.
The key is the offset of the beginning of the line from the beginning of the file
The map function is simple. The year and the air temperature are pulled out, since these
are the only fields we are interested in.
In this case, the map function is just a data preparation phase, setting up the data in such
a way that the reducer function can do its work on it: finding the maximum temperature
for each year.
The map function is also a good place to drop bad records: here, temperatures that are
missing, suspect, or erroneous are filtered out.

Fig. 3.29 MapReduce Logical Dataflow


Java MapReduce
Having run through how the MapReduce program works, the next step is to express it
in code.
Need three things: a map function, a reduce function, and some code to run the job.
Example Mapper for maximum temperature example
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      output.collect(new Text(year), new IntWritable(airTemperature));
    }
  }
}
Example Reducer for maximum temperature example
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
public class MaxTemperatureReducer extends MapReduceBase
implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException {
int maxValue = Integer.MIN_VALUE;
while (values.hasNext()) {
maxValue = Math.max(maxValue, values.next().get());
}
output.collect(key, new IntWritable(maxValue));
}
}
Example Application to find the maximum temperature in the weather dataset
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MaxTemperature {
  public static void main(String[] args) throws IOException {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }
    JobConf conf = new JobConf(MaxTemperature.class);
    conf.setJobName("Max temperature");
    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    conf.setMapperClass(MaxTemperatureMapper.class);
    conf.setReducerClass(MaxTemperatureReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    JobClient.runJob(conf);
  }
}
5. Scaling Out
This section looks at the MapReduce data flow for large inputs.
For simplicity, the examples so far have used files on the local filesystem.
However, to scale out,
data should be stored in a distributed filesystem, typically HDFS,
to allow Hadoop to move the MapReduce computation
to each machine hosting a part of the data.
Having many splits means the time taken to process each split is small compared to the
time to process the whole input.
So if we are processing the splits in parallel, the processing is better load-balanced if
the splits are small, since a faster machine will be able to process proportionally more
splits over the course of the job than a slower machine.
On the other hand,
if splits are too small, then the overhead of managing the splits and of map task creation
begins to dominate the total job execution time.
For most jobs, a good split size tends to be the size of an HDFS block, 64 MB by default,
although this can be changed for the cluster
Hadoop does its best to run the map task on a node where the input data resides in
HDFS. This is called the data locality optimization.
It should now be clear why the optimal split size is the same as the block size: it is the
largest size of input that can be guaranteed to be stored on a single node.
Map tasks write their output to the local disk, not to HDFS.
Why is this?
Map output is intermediate output: it is processed by reduce tasks to produce the final
output, and once the job is complete the map output can be thrown away.
So storing it in HDFS, with replication, would be overkill.
If the node running the map task fails before the map output has been consumed by the
reduce task, then Hadoop will automatically rerun the map task on another node to re-
create the map output.

Fig. 3.30 MapReduce Data flow with the single reduce task

Fig. 3.31 MapReduce Data flow with the multiple reduce tasks

Fig. 3.32 MapReduce Data flow with the no reduce task
Combiner Functions
Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays
to minimize the data transferred between map and reduce tasks.
Hadoop allows the user to specify a combiner function to be run on the map output; the
combiner function's output forms the input to the reduce function.
Since the combiner function is an optimization, Hadoop does not provide a guarantee
of how many times it will call it for a particular map output record, if at all.
In other words, calling the combiner function zero, one, or many times should produce
the same output from the reducer
The contract for the combiner function constrains the type of function that may be used.
This is best illustrated with an example.
Suppose that for the maximum temperature example, readings for the year 1950 were
processed by two maps (because they were in different splits).
Imagine the first map produced the output:
(1950, 0)
(1950, 20)
(1950, 10)
And the second produced:
(1950, 25)
(1950, 15)
The reduce function would be called with a list of all the values:
(1950, [0, 20, 10, 25, 15])
with output:
(1950, 25)
since 25 is the maximum value in the list.
A combiner function can be used that, just like the reduce function, finds the maximum
temperature for each map output.
The reduce would then be called with:
(1950, [20, 25])

and the reduce would produce the same output as before.
More succinctly, we may express the function calls on the temperature values in this
case as follows:
max (0, 20, 10, 25, 15) = max (max (0, 20, 10), max (25, 15)) = max (20, 25) = 25
If we were calculating mean temperatures, we could not use the mean as our combiner
function, since:
mean (0, 20, 10, 25, 15) = 14
but:
mean (mean(0, 20, 10), mean(25, 15)) = mean(10, 20) = 15

The combiner function does not replace the reduce function.
But it can help cut down the amount of data shuffled between the maps and the reduces,
and for this reason alone it is always worth considering whether a combiner function
can be used in the MapReduce job.
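
As a sketch that reuses the classes from the earlier maximum-temperature example (valid because taking a maximum is commutative and associative), the driver only needs one extra call, setCombinerClass(), to enable the combiner:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MaxTemperatureWithCombiner {
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(MaxTemperatureWithCombiner.class);
    conf.setJobName("Max temperature with combiner");
    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    conf.setMapperClass(MaxTemperatureMapper.class);
    conf.setCombinerClass(MaxTemperatureReducer.class); // combiner runs on each map's output
    conf.setReducerClass(MaxTemperatureReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    JobClient.runJob(conf);
  }
}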

6. Hadoop Streaming
Hadoop provides an API to MapReduce that allows writing map and reducing functions
in languages other than Java.
Hadoop Streaming uses Unix standard streams as the interface between Hadoop and
the program, so any language can be used that can read standard input and write to
standard output to write the MapReduce program.
Streaming is naturally suited for text processing and when used in text mode, it has a
line-oriented view of data.
Map input data is passed over standard input to the map function, which processes it
line by line and writes lines to standard output.
A map output key-value pair is written as a single tab-delimited line.
Input to the reduce function is in the same format: a tab-separated key-value pair
passed over standard input.
The reduce function reads lines from standard input, which the framework guarantees
are sorted by key, and writes its results to standard output.
Streaming map and reduce functions can be written in scripting languages such as Ruby and Python.

7. Design of HDFS
Distributed File System
File Systems that manage the storage across a network of machines
Since they are network based, all the complications of network programming occur
This makes DFS more complex than regular disk file systems
For example, one of the biggest challenges is making the filesystem tolerate
node failure without suffering data loss
HDFS is a filesystem designed for storing very large files with streaming data access
patterns, running on clusters of commodity hardware
Very Large files
Hundreds of megabytes, gigabytes or terabytes in size
There are Hadoop clusters running today that store petabytes of data
Streaming Data Access
HDFS is built around the idea that the most efficient data processing pattern is
a write-once, read-many-times pattern

A dataset is typically generated or copied from source, and then various analyses
are performed on that dataset over time
Commodity Hardware

Hadoop is designed to run on clusters of commodity hardware (commonly available
hardware that can be obtained from multiple vendors)
The chance of node failure is high, at least for large clusters
HDFS is designed to carry on working without interruption to the user in the
face of such failure
Areas where HDFS is not good fit today
Low-latency data access
Applications that require low-latency access to data, in the tens of milliseconds
range, will not work well with HDFS
HDFS is designed for delivering a high throughput of data and this may be at
the expense of latency
HBase is a better choice for low-latency access
Lots of small files
Because the namenode holds filesystem metadata in memory, the limit to the
number of files in a filesystem is governed by the amount of memory on the
namenode
As a rule of thumb, each file, directory and block takes about 150 bytes of namenode memory.
So, for example, one million files, each taking one block, would need at least 300 MB
of memory (1,000,000 files × 2 objects each × 150 bytes).
Multiple writers, arbitrary file modifications
Files in HDFS may be written to by a single writer
Writes are always made at the end of the file, in append-only fashion
There is no support for multiple writers or for modifications at arbitrary offsets
in the file
8. Java interfaces to HDFS
Reading Data from a Hadoop URL
One of the simplest ways to read a file from Hadoop filesystem is by using a
java.net.URL object to open a stream to read the data from
General idiom is:
InputStream in = null;
try {
in = new URL("hdfs://host/path").openStream();
// process in
} finally {
IOUtils.closeStream(in);
}
IOUtils (org.apache.hadoop.io) provides generic I/O code for reading and writing data to HDFS.
It is a utility class (handy tool) for I/O-related functionality on HDFS.

There is a little more work required to make Java recognize Hadoop's hdfs URL
scheme. This is achieved by calling the
setURLStreamHandlerFactory() method on URL with an
instance of FsUrlStreamHandlerFactory.
This method can be called only once per JVM, so it is typically executed in a static
block.
This limitation means that if some other part of your program (perhaps a third-party
component outside our control) sets a URLStreamHandlerFactory, we will not be able to
use this approach for reading data from Hadoop.
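
Putting the pieces above together, a program along the following lines reads a file identified by a Hadoop URL and copies it to standard output; the URL passed on the command line is a placeholder.

import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {
  static {
    // May be called only once per JVM, hence the static block
    URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
  }

  public static void main(String[] args) throws Exception {
    InputStream in = null;
    try {
      in = new URL(args[0]).openStream(); // e.g. an hdfs:// URL (placeholder)
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}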

Reading Data Using the FileSystem API


Sometimes, it is impossible to set a URLStreamHandlerFactory for the application
In this case, use the FileSystem API to open an input stream for a file.
A file in a Hadoop filesystem is represented by a Hadoop Path object (and not a
java.io.File object, since its semantics are too closely tied to the local filesystem).
Path - a Hadoop filesystem URI, such as hdfs://localhost/user/tom/quangle.txt.
FileSystem is a general filesystem API, so the first step is to retrieve an instance for the
filesystem we want to use (HDFS, in this case).
There are three static factory methods for getting a FileSystem instance:
public static FileSystem get(Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf, String user) throws
IOException

A Configuration object encapsulates a client or server's configuration, which is set
using configuration files read from the classpath, such as conf/core-site.xml.
The first method returns the default filesystem (as specified in the file conf/core-
site.xml, or the default local filesystem if not specified there).
The second uses the given URI's scheme and authority to determine the filesystem to
use, falling back to the default filesystem if no scheme is specified in the given URI.
The third retrieves the filesystem as the given user, which is important in the context of
security
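
A minimal sketch using the second factory method to open and print a file (the URI on the command line is a placeholder):

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {
  public static void main(String[] args) throws Exception {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    InputStream in = null;
    try {
      in = fs.open(new Path(uri));                 // open an input stream for the file
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}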
Writing Data
The FileSystem class has a number of methods for creating a file.
The simplest is the method that takes a Path object for the file to be created and returns
an output stream to write to:
public FSDataOutputStream create(Path f) throws IOException
There are overloaded versions of this method that allow you to specify whether to
forcibly overwrite existing files, the replication factor of the file, the buffer size to use
when writing the file, the block size for the file, and file permissions.
There is also an overloaded method for passing a callback interface, Progressable, so
the application can be notified of the progress of the data being written to the datanodes:
package org.apache.hadoop.util;
public interface Progressable {
public void progress();
}
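
For example, a small sketch that copies a local file to a Hadoop filesystem and prints a dot each time progress() is called (both file paths are placeholders supplied on the command line):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

public class FileCopyWithProgress {
  public static void main(String[] args) throws Exception {
    String localSrc = args[0];
    String dst = args[1];
    InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(dst), conf);
    OutputStream out = fs.create(new Path(dst), new Progressable() {
      public void progress() {
        System.out.print("."); // called periodically as data is written to the datanodes
      }
    });
    IOUtils.copyBytes(in, out, 4096, true);
  }
}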
As an alternative to creating a new file, you can append to an existing file using the
append() method (there are also some other overloaded versions):
public FSDataOutputStream append(Path f) throws IOException

Directories
FileSystem provides a method to create a directory:
public boolean mkdirs(Path f) throws IOException

It returns true if the directory (and all parent directories) was (were) successfully
created.
Often, you do not need to explicitly create a directory, since writing a file by calling
create() will automatically create any parent directories.
Querying the Filesystem
File metadata: FileStatus
An important feature of any filesystem is the ability to navigate its directory structure
and retrieve information about the files and directories that it stores.
The FileStatus class encapsulates filesystem metadata for files and directories,
including file length, block size, replication, modification time, ownership, and
permission information.
The method getFileStatus() on FileSystem provides a way of getting a FileStatus object
for a single file or directory.
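
A small sketch that prints some of the FileStatus fields for a path supplied on the command line (the path itself is a placeholder):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowFileStatus {
  public static void main(String[] args) throws Exception {
    String uri = args[0];
    FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());
    FileStatus stat = fs.getFileStatus(new Path(uri));
    System.out.println("path        = " + stat.getPath());
    System.out.println("isDirectory = " + stat.isDirectory());
    System.out.println("length      = " + stat.getLen());
    System.out.println("block size  = " + stat.getBlockSize());
    System.out.println("replication = " + stat.getReplication());
    System.out.println("owner       = " + stat.getOwner());
    System.out.println("permission  = " + stat.getPermission());
  }
}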
Deleting Data
Use the delete() method on FileSystem to permanently remove files or directories:
public boolean delete(Path f, boolean recursive) throws IOException
If f is a file or an empty directory, then the value of recursive is ignored.
A nonempty directory is only deleted, along with its contents, if recursive is true
(otherwise an IOException is thrown).

9. How MapReduce Works


10. Anatomy of a MapReduce Job run
You can run a MapReduce job with a single method call: submit() on a Job object.
You can also call waitForCompletion(), which submits the job if it has not been
submitted already and then waits for it to finish.
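
A minimal driver sketch using the newer org.apache.hadoop.mapreduce API shows this entry point; the job name is arbitrary and the mapper/reducer classes are deliberately left unset, so this is an illustration of the call sequence rather than a complete job.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobSubmitSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "submit sketch");
    job.setJarByClass(JobSubmitSketch.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // A real job would also call setMapperClass(), setReducerClass() and the
    // output key/value class setters here; with nothing set, identity map/reduce runs.
    // waitForCompletion() calls submit() and then polls the job's progress.
    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);
  }
}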
At the highest level, there are five independent entities:
The client, which submits the MapReduce job
The YARN resource manager, which coordinates the allocation of compute resources
on the cluster
The YARN node managers, which launch and monitor the compute containers on
machines in the cluster
The MapReduce application master, which coordinates the tasks running the
MapReduce job. The application master and the MapReduce tasks run in containers
that are scheduled by the resource manager and managed by the node managers.
The DFS (normally HDFS), which is used for sharing job files between the other
entities.

Fig. 3.33 Anatomy of MapReduce Job Run
Job Submission
The submit() method on job creates an internal JobSubmitter instance and calls
submitJobInternal() on it (step 1 in figure)
Having submitted the job, waitForCompletion() polls the job's progress once per
second and reports the progress to the console if it has changed since the last report.
When the job completes successfully, the job counters are displayed.
Otherwise, the error that caused the job to fail is logged to the console.
The job submission process implemented by JobSubmitter does the following
Asks the resource manager for a new application ID, used for the MapReduce job
ID(step 2)
Checks the output specification of the job. For example, if the output directory has not
been specified or it already exists, the job is not submitted and an error is thrown to the
MapReduce program
Computes the input splits for the job. If the splits cannot be computed (because the
input paths do not exist, for example), the job is not submitted and an error is thrown to
the MapReduce program.


Copies the resources needed to run the job, including the job JAR file, the configuration
file and the computed input splits, to the shared filesystem in a directory named after
the job ID (step 3). The job JAR is copied with a high replication factor so that there
are lots of copies across the cluster for the node managers to access when they run tasks
for the job.
Submits the job by calling submitApplication() on the resource manager (step 4)

Job Initialization
When the resource manager receives a call to its submitApplication() method, it
handsoff the request to the YARN scheduler.
The scheduler allocates a container, and the resource manager then launches the
application master's process there, under the node manager's management (steps 5a
and 5b)
The application master for MapReduce jobs is a Java application whose main class
isMRAppMaster.
It initializes the job by creating a number of bookkeeping objects to keep track of the
job's progress, as it will receive progress and completion reports from the tasks (step 6)
Next, it retrieves the input splits computed in the client from the shared filesystem (step
7)
It then creates a map task object for each split, as well as a number of reduce task objects
determined by the mapreduce.job.reduces property (set by the setNumReduceTasks()
method on Job). Tasks are given IDs at this point
The application master must decide how to run the tasks that make up the MapReduce
job
If the job is small, the application master may choose to run the tasks in the same JVM
as itself.
This happens when it judges that the overhead of allocating and running tasks in new
containers outweighs the gain to be had in running them in parallel, compared to
running them sequentially on one node. Such a job is said to be uberized or run as an
uber task.
What qualifies as a small job?
By default,
a small job is one that has
less than 10 mappers,
only one reducer and
an input size that is less than the size of one HDFS block
Before it can run the task, it localizes the resources that the task needs, including the
job configuration and JAR file and any files from the distributed cache (step 10).
Finally it runs the map or reduce task (step 11)
The YarnChild runs in a dedicated JVM, so that any bugs in the user-defined map and
reduce functions (or even in YarnChild) do not affect the node manager by causing it to
crash or hang.
Streaming
Streaming runs special map and reduce tasks for the purpose of launching the user-
supplied executable and communicating with it
Job Completion
When the application master receives a notification that the last task for a job is
complete, it changes the status for the job to successful.
Finally, on job completion, the application master and the task containers clean up their
working state
11.Failures
In the real world, user code is buggy, processes crash and machines fail
One of the major benefits of using Hadoop is
its ability to handle such failures and allow the job to complete successfully

Failure of any of the following entities considered
The task
The application master
The node manager
The resource manager
Task Failure
The most common occurrence of this failure is when user code in the map or reduce
task throws a runtime exception
If this happens, the task JVM reports the error back to its parent application master
before it exits
The error ultimately makes it into the user logs
The application master marks the task attempt as failed, and frees up the container so
its resources are available for another task
Another failure mode is the sudden exit of the task JVM
In this case, the node manager notices that the process has exited and informs the
application master so it can mark the attempt as failed
When the application master is notified of a task attempt that has failed, it will
reschedule the execution of the task
Application Master Failure
Just like MapReduce tasks are given several attempts to succeed, applications in YARN
are retried in the event of failure
The maximum number of attempts to run a MapReduce application master is controlled
by the mapreduce.am.max-attempts property
The default value is 2, so if a MapReduce application master fails twice it will not be
tried again and the job will fail
YARN imposes a limit for the maximum number of attempts for any YARN application
The limit is set by yarn.resourcemanager.am.max-attempts and defaults to 2
Node Manager Failure
If a node manager fails by crashing or running very slowly, it will stop sending
heartbeats to the resource manager
The resource manager will notice a node manager that has stopped sending heartbeats
and remove it from its pool of nodes to schedule
containers on
Any task or application master running on the failed node manager will be recovered
Node managers may be blacklisted if the number of failures for the application is high
Resource Manager Failure
Failure of the resource manager is serious because without it, neither jobs nor task
containers can be launched
In the default configuration, the resource manager is a single point of failure, since in
the event of machine failure, all running jobs fail
To achieve High Availability (HA), it is necessary to run a pair of resource managers
in an active-standby configuration
If the active resource manager fails, then the standby can take over without a significant
interruption to the client
Information about all the running applications is stored in a highly available state store,
so that the standby can recover the core state of the failed active resource manager.

Node manager information is not stored in the state store since it can be reconstructed
relatively quickly by the new resource manager as the node managers send their first
heartbeats
When the new resource manager starts, it reads the application information from the
state store, then restarts the application masters for all applications running on the
cluster
The transition of a resource manager from standby to active is handled by a failover
controller
Clients and node managers must be configured to handle resource manager failover,
since there are now two possible resource managers to communicate with
They try connecting to each resource manager in a round-robin fashion until they find
the active one
If the active fails, then they will retry until the standby becomes active

12. Job Scheduling

ran in order of submission, using a FIFO scheduler.


Each job would use the whole cluster - so jobs had to wait for their turn.
The problem of sharing resources fairly between users requires a better scheduler.
Production jobs need to complete in a timely manner, while allowing users who are
making smaller ad hoc queries to get results back in a reasonable time.

Later on, the ability to set a job's priority was added, via the
mapred.job.priority property or the setJobPriority() method on JobClient (both of
which take one of the values VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW).
When the job scheduler is choosing the next job to run, it selects one with the highest
priority.
However, with the FIFO scheduler, priorities do not support preemption
so a high-priority job can still be blocked by a long-running low priority job that started
before the high-priority job was scheduled.
MapReduce in Hadoop comes with a choice of schedulers.
The default is the original FIFO queue-based scheduler, and
there are also multiuser schedulers called
the Fair Scheduler and
the Capacity Scheduler.

The Fair Scheduler


It aims to give every user a fair share of the cluster capacity over time.
If a single job is running, it gets all of the cluster.
As more jobs are submitted, free task slots are given to the jobs in such a way as to give
each user a fair share of the cluster.
A short job belonging to one user will complete in a reasonable time even while another
user's long job is running, and the long job will still make progress.
Jobs are placed in pools, and by default, each user gets their own pool.
It is also possible to define custom pools with guaranteed minimum capacities defined
in terms of the number of map and reduce slots, and to set weightings for each pool.

The Fair Scheduler supports preemption, so if a pool has not received its fair share for
a certain period of time, then the scheduler will kill tasks in pools running over capacity
in order to give the slots to the pool running under capacity.
The Fair Scheduler is a contrib module. To enable it, place its JAR file on the classpath
by copying it from Hadoop's contrib/fairscheduler directory to the lib directory.


Then set the mapred.jobtracker.taskScheduler property to:
org.apache.hadoop.mapred.FairScheduler

The Capacity Scheduler


The Capacity Scheduler takes a slightly different approach to multiuser scheduling.
A cluster is made up of a number of queues (like the Fair Scheduler's pools), which
may be hierarchical (so a queue may be the child of another queue), and each queue has
an allocated capacity.
This is like the Fair Scheduler, except that within each queue, jobs are scheduled using
FIFO scheduling (with priorities).
In effect, the Capacity Scheduler allows users or organizations (defined using queues)
to simulate a separate MapReduce cluster with FIFO scheduling for each user or
organization.
The Fair Scheduler, by contrast, enforces fair sharing within each pool, so running jobs
share the pool's resources.
13. Shuffle & Sort


Shuffle and Sort
MapReduce makes the guarantee that the input to every reducer is sorted by key.
The process by which the system performs the sort and transfers the map outputs to
the reducers as inputs is known as the shuffle.
The shuffle is an area of the codebase where refinements and improvements are
continually being made

Fig. 3.34 MapReduce

14. Task Execution
Speculative Execution
The MapReduce model is to break jobs into tasks and run the tasks in parallel to make
the overall job execution time smaller than it would otherwise be if the tasks ran
sequentially.
This makes job execution time sensitive to slow-running tasks, as it takes only one slow
task to make the whole job take significantly longer than it would have done otherwise.
When a job consists of hundreds or thousands of tasks, the possibility of a few
straggling tasks is very real.
Tasks may be slow for various reasons, including hardware degradation or software
mis-configuration, but the causes may be hard to detect since the tasks still complete
successfully.
Hadoop does not try to diagnose and fix slow-running tasks; instead, it tries to detect
when a task is running slower than expected and launches another, equivalent, task as
a backup.
This is termed speculative execution of tasks.

It is important to understand that speculative execution does not work by launching two
duplicate tasks at about the same time so they can race each other.
This would be wasteful of cluster resources.
When a task completes successfully, any duplicate tasks that are running are killed since
they are no longer needed.
So if the original task completes before the speculative task, then the speculative task
is killed; on the other hand, if the speculative task finishes first, then the original is
killed.
Speculative execution is an optimization, not a feature to make jobs run more reliably.
If there are bugs that sometimes cause a task to hang or slow down, then relying on
speculative execution to avoid these problems is unwise and will not work reliably, since
the same bugs are likely to affect the speculative task.


Speculative execution is turned on by default.
It can be enabled or disabled independently for map tasks and reduce tasks, on a cluster-
wide basis, or on a per-job basis.
15. MapReduce Features
Hadoop MapReduce
MapReduce is a programming model suitable for processing of huge data. Hadoop is
capable of running MapReduce programs written in various languages: Java, Ruby, Python,
and C++. MapReduce programs are parallel in nature, thus are very useful for performing large-
scale data analysis using multiple machines in the cluster.
MapReduce programs work in two phases:
1. Map phase
2. Reduce phase.
An input to each phase is key-value pairs. In addition, every programmer needs to specify two
functions: map function and reduce function.
The whole process goes through four phases of execution namely, splitting, mapping, shuffling,
and reducing.

Consider you have following input data for your Map Reduce Program
Welcome to Hadoop Class
Hadoop is good
Hadoop is bad

Fig. 3.35 MapReduce Architecture


The final output of the MapReduce task is:
bad 1
Class 1
good 1
Hadoop 3
is 2
to 1
Welcome 1

The data goes through the following phases
Input Splits:
An input to a MapReduce job is divided into fixed-size pieces called input splits. An input
split is a chunk of the input that is consumed by a single map.
Mapping
This is the very first phase in the execution of map-reduce program. In this phase data
in each split is passed to a mapping function to produce output values. In our example, the job of
the mapping phase is to count the number of occurrences of each word in the input splits (more
details about input splits are given below) and prepare a list in the form of <word, frequency>
Shuffling
This phase consumes the output of Mapping phase. Its task is to consolidate the relevant
records from Mapping phase output. In our example, the same words are clubbed together
along with their respective frequency.
Reducing
In this phase, output values from the Shuffling phase are aggregated. This phase
combines values from Shuffling phase and returns a single output value. In short, this phase
summarizes the complete dataset. In the example, this phase aggregates the values from
Shuffling phase i.e., calculates total occurrences of each word.
MapReduce Architecture explained -
One map task is created for each split which then executes map function for each record
in the split.
It is always beneficial to have multiple splits, because the time taken to process a
split is small compared to the time taken to process the whole input. When the splits
are smaller, the processing is better load-balanced, since we are processing the splits
in parallel.
However, it is also not desirable to have splits that are too small. When splits are too
small, the overhead of managing the splits and of map task creation begins to dominate
the total job execution time.
For most jobs, it is better to make the split size equal to the size of an HDFS block
(which is 64 MB by default); see the split-size configuration sketch at the end of this
subsection.
Map tasks write their output to the local disk of the node they run on, not to HDFS.
The local disk is chosen over HDFS to avoid the replication that takes place with an
HDFS store operation.
Map output is intermediate output which is processed by reduce tasks to produce the
final output.
Once the job is complete, the map output can be thrown away. So, storing it in HDFS
with replication becomes overkill.
In the event of node failure, before the map output is consumed by the reduce task,
Hadoop reruns the map task on another node and re-creates the map output.
Reduce tasks don't have the benefit of data locality. The output of every map task is
fed to the reduce task, so map output is transferred to the machine where the reduce
task is running.
On this machine, the outputs are merged and then passed to the user-defined reduce
function.

Unlike the map output, the reduce output is stored in HDFS (the first replica is stored
on the local node and the other replicas are stored on off-rack nodes). So, writing the
reduce output does consume network bandwidth, but only as much as a normal HDFS write
pipeline consumes.
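The split-size sketch referred to earlier is given below. It is a minimal sketch assuming the static helpers FileInputFormat.setMinInputSplitSize and FileInputFormat.setMaxInputSplitSize of the new (org.apache.hadoop.mapreduce) API; the 64 MB figure is only an assumption matching the default quoted above, and newer clusters commonly use 128 MB blocks.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-size-demo");
        long blockSize = 64L * 1024 * 1024; // assumed HDFS block size (64 MB)
        // Splits will not be made smaller than this lower bound...
        FileInputFormat.setMinInputSplitSize(job, blockSize);
        // ...and will not be made larger than this upper bound,
        // so each split lines up with roughly one block.
        FileInputFormat.setMaxInputSplitSize(job, blockSize);
    }
}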
How MapReduce Organizes Work?
Hadoop divides the job into tasks. There are two types of tasks:
1. Map tasks (Splits & Mapping)
2. Reduce tasks (Shuffling, Reducing)
as mentioned above.
The complete execution process (execution of both Map and Reduce tasks) is controlled by
two types of entities:
1. Jobtracker: Acts like a master (responsible for complete execution of submitted job)
2. Multiple Task Trackers: Acts like slaves, each of them performing the job
For every job submitted for execution in the system, there is one JobTracker, which
resides on the NameNode, and there are multiple TaskTrackers, which reside on the DataNodes.

Fig. 3.34 HDFS Architecture


A job is divided into multiple tasks, which are then run on multiple data nodes in the
cluster.
It is the responsibility of the job tracker to coordinate this activity by scheduling
tasks to run on different data nodes.
Execution of each individual task is then looked after by the task tracker, which resides
on every data node executing part of the job.
The task tracker's responsibility is to send progress reports to the job tracker.

In addition, the task tracker periodically sends a 'heartbeat' signal to the JobTracker
to notify it of the current state of the system.
Thus the job tracker keeps track of the overall progress of each job. In the event of a
task failure, the job tracker can reschedule the task on a different task tracker.
Word Count Program with MapReduce and Java
In Hadoop, MapReduce is a computation framework that decomposes large data-manipulation
jobs into individual tasks that can be executed in parallel across a cluster of servers.
The results of the tasks can be joined together to compute the final result.
MapReduce consists of 2 steps:
Map Function: takes a set of data and converts it into another set of data, where
individual elements are broken down into tuples (key-value pairs).
Reduce Function: takes the output from Map as input and combines those data tuples
into a smaller set of tuples.

Example (Map function in Word Count)

Input (set of data):
Bus, Car, bus, car, train, car, bus, car, train, bus, TRAIN, BUS, buS, caR, CAR, car, BUS, TRAIN

Output (converted into another set of (key, value) pairs):
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1),
(TRAIN,1), (BUS,1), (buS,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1)

Example (Reduce function in Word Count)

Input (set of tuples, the output of the Map function):
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1),
(TRAIN,1), (BUS,1), (buS,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1)

Output (converted into a smaller set of tuples):
(BUS,7), (CAR,7), (TRAIN,4)

Work Flow of the Program

Fig. 3.35 Work Flow of the Program

The MapReduce workflow consists of 5 steps:

1. Splitting: the splitting parameter can be anything, e.g. splitting by space, comma,
semicolon, or even by a new line ('\n').
2. Mapping: as explained above.
3. Intermediate splitting: the entire process runs in parallel on different machines in
the cluster. In order to group the records in the Reduce phase, all records with the
same key must end up on the same machine (see the partitioner sketch after this list).
4. Reduce: this is essentially a group-by phase that aggregates the grouped records.
5. Combining: the last phase, where all the data (the individual result sets from each
machine) is combined together to form the final result.
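The partitioner sketch referred to in step 3: in Hadoop, routing all records with the same key to the same reduce task is the job of a partitioner. The class below simply mirrors the behaviour of the default HashPartitioner and is shown only to illustrate the idea; the word count program later in this section relies on the default and does not need it.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Equal words hash to the same value, so every (word, 1) pair for a
        // given word is sent to the same reduce task before reducing starts.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

If a custom partitioner were needed, it would be registered on the job with j.setPartitionerClass(WordPartitioner.class).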
Steps
1. Open Eclipse> File > New > Java Project >( Name it MRProgramsDemo) > Finish.
2. Right Click > New > Package ( Name it - PackageDemo) > Finish.
3. Right Click on Package > New > Class (Name it - WordCount).
4. Add Following Reference Libraries:
1. Right Click on Project > Build Path> Add External
1. /usr/lib/hadoop-0.20/hadoop-core.jar
2. /usr/lib/hadoop-0.20/lib/commons-cli-1.2.jar
5. Type the following code:
package PackageDemo;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  // Driver: configures the job and submits it to the cluster.
  public static void main(String[] args) throws Exception {
    Configuration c = new Configuration();
    String[] files = new GenericOptionsParser(c, args).getRemainingArgs();
    Path input = new Path(files[0]);
    Path output = new Path(files[1]);
    Job j = new Job(c, "wordcount");
    j.setJarByClass(WordCount.class);
    j.setMapperClass(MapForWordCount.class);
    j.setReducerClass(ReduceForWordCount.class);
    j.setOutputKeyClass(Text.class);
    j.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(j, input);
    FileOutputFormat.setOutputPath(j, output);
    System.exit(j.waitForCompletion(true) ? 0 : 1);
  }

  // Mapper: splits each line on commas and emits (WORD, 1) for every word.
  public static class MapForWordCount extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, Context con)
        throws IOException, InterruptedException {
      String line = value.toString();
      String[] words = line.split(",");
      for (String word : words) {
        Text outputKey = new Text(word.toUpperCase().trim());
        IntWritable outputValue = new IntWritable(1);
        con.write(outputKey, outputValue);
      }
    }
  }

  // Reducer: sums the 1s for each word and emits (WORD, total count).
  public static class ReduceForWordCount extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text word, Iterable<IntWritable> values, Context con)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      con.write(word, new IntWritable(sum));
    }
  }
}
The above program consists of three classes:

Driver class (contains the public static void main method, which is the entry point; a note on a newer way to construct the Job follows step 9).
The Map class which extends the public class
Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT> and implements
the Map function.
The Reduce class which extends the public class
Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> and implements
the Reduce function.
6. Make a jar file
Right Click on Project> Export> Select export destination as Jar File > next> Finish.

7. Take a text file and move it into HDFS:

To move the file into HDFS directly, open the terminal and enter the following command:
[training@localhost ~]$ hadoop fs -put wordcountFile wordCountFile

8. Run the jar file:
(hadoop jar jarfilename.jar packageName.ClassName PathToInputTextFile PathToOutputDirectory)
[training@localhost ~]$ hadoop jar MRProgramsDemo.jar PackageDemo.WordCount wordCountFile MRDir1
9. Open the result:
[training@localhost ~]$ hadoop fs -ls MRDir1
Found 3 items
-rw-r--r--   1 training supergroup          0 2016-02-23 03:36 /user/training/MRDir1/_SUCCESS
drwxr-xr-x   - training supergroup          0 2016-02-23 03:36 /user/training/MRDir1/_logs
-rw-r--r--   1 training supergroup         20 2016-02-23 03:36 /user/training/MRDir1/part-r-00000
[training@localhost ~]$ hadoop fs -cat MRDir1/part-r-00000
BUS 7
CAR 4
TRAIN 6
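One optional refinement to the driver shown in step 5, offered as a hedged sketch rather than part of the original listing: Hadoop 2.x deprecates the Job(Configuration, String) constructor, and the factory method Job.getInstance is the usual replacement. Assuming the same configuration object c, the driver could create the job like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class DriverVariant {
    public static Job createJob(Configuration c) throws Exception {
        // Equivalent to: Job j = new Job(c, "wordcount"); but without the deprecation warning.
        return Job.getInstance(c, "wordcount");
    }
}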
Basic Command and Syntax for MapReduce
Java Program:

$HADOOP_HOME/bin/hadoop jar / org.myorg.WordCount /input-directory /output-directory

Other Scripting languages:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar -input myInputDirs -output myOutputDir -mapper /bin/cat -reducer /bin/wc

The -input parameter specifies the input files and the -output parameter specifies the
output directory. The -mapper and -reducer parameters specify the programs implementing
the Map function and the Reduce function, respectively. These have to be specified when
the Hadoop Streaming API is used, i.e., when the mapper and reducer are written in a
scripting language; the rest of the command line is the same as for any other Hadoop job.
Submitted jobs can also be monitored and managed with the hadoop job command. The
following commands are useful once a job has been submitted:

hadoop job -list : displays all the ongoing jobs
hadoop job -status <job-id> : prints the map and reduce completion percentages
hadoop job -kill <job-id> : kills the corresponding job
hadoop job -set-priority <job-id> <priority> : sets the priority of a queued job
