
Unit - 3

1. History of Hadoop
Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used
text search library. Hadoop has its origins in Apache Nutch, an open source web search engine,
itself a part of the Lucene project.
In January 2008, Hadoop was made its own top-level project at Apache, confirming its
success and its diverse, active community. By this time, Hadoop was being used by many other
companies besides Yahoo!, such as Last.fm, Facebook, and the New York Times. In one well-
publicized feat, the New York Times used Amazon’s EC2 compute cloud to crunch through 4
terabytes of scanned archives from the paper, converting them to PDFs for the Web. The
processing took less than 24 hours to run using 100 machines, and the project probably
wouldn’t have been embarked upon without the combination of Amazon’s pay-by-the-hour
model (which allowed the NYT to access a large number of machines for a short period) and
Hadoop’s easy-to-use parallel programming model.
In April 2008, Hadoop broke a world record to become the fastest system to sort an
entire terabyte of data. Running on a 910-node cluster, Hadoop sorted 1 terabyte in 209 seconds
(just under 3.5 minutes), beating the previous year’s winner of 297 seconds. In November of
the same year, Google reported that its MapReduce implementation sorted 1 terabyte in 68
seconds. Then, in April 2009, it was announced that a team at Yahoo! had used Hadoop to sort
1 terabyte in 62 seconds. The trend since then has been to sort even larger volumes of data at
ever faster rates. In the 2014 competition, a team from Databricks were joint winners of the
Gray Sort benchmark. They used a 207-node Spark cluster to sort 100 terabytes of data in 1,406
seconds, a rate of 4.27 terabytes per minute.
Today, Hadoop is widely used in mainstream enterprises. Hadoop’s role as a general
purpose storage and analysis platform for big data has been recognized by the industry, and
this fact is reflected in the number of products that use or incorporate Hadoop in some way.
Commercial Hadoop support is available from large, established enterprise vendors, including
EMC, IBM, Microsoft, and Oracle, as well as from specialist Hadoop companies such as
Cloudera, Hortonworks, and MapR.
HADOOP BASICS:
• Need to process huge datasets on large clusters of computers
• Very expensive to build reliability into each application
• Nodes fail every day
– Failure is expected, rather than exceptional
– The number of nodes in a cluster is not constant
• Need a common infrastructure
– Efficient, reliable, easy to use
– Open source, under the Apache Licence
– Key benefit: flexibility

Client-Server Concept
• A client sends requests to one or more servers, which in turn accept and process them and return
the requested information to the client.
• A server runs software that listens on a particular IP address and port number for requests.
Examples:
– Server: web server
– Client: web browser
HDFS in brief:
• Very large distributed file system
– 10K nodes, 100 million files, 10 PB
• Assumes commodity hardware
– Files are replicated to handle hardware failure
– Detects failures and recovers from them
• Optimized for batch processing
– Data locations are exposed so that computations can move to where the data resides
– Provides very high aggregate bandwidth

2. The Hadoop Distributed File System (HDFS)


HDFS cluster has two types of nodes operating in a master-worker pattern:
i. A namenode – the master
ii. A number of datanodes – workers

Fig. 3.1 HDFS Architecture


Namenode – manages the filesystem namespace
• It maintains the file system tree and the metadata for all the files and directories in the
tree
• The information is stored persistently on the local disk in the form of two files: the
namespace image and the edit log
• The namenode also knows the datanodes on which all the blocks for a given file are
located
• However, it does not store block locations persistently, because this information is
reconstructed from datanodes when the system starts

Datanodes – workhorses of the filesystem
• They store and retrieve blocks when they are told to (by clients or namenode)
• And they report back to the namenode periodically with lists of blocks that they are
storing
• Without the namenode, the filesystem cannot be used
• If the machine running the namenode were obliterated, all the files on the filesystem
would be lost since there would be no way of knowing how to reconstruct the files from
the blocks on the datanodes.
• For this reason, it is important to make the namenode resilient to failure

Fig. 3.2 HDFS Data Blocks

Fig. 3.3 DataNode Failure

• It is important to make the namenode resilient to failure;
Hadoop provides two mechanisms for this
1. The first way is to back up the files – Hadoop can be configured so that the namenode
writes its persistent state to multiple filesystems
2. Secondary namenode – it does not act as a namenode. Its main role is to periodically
merge the namespace image with the edit log, to prevent the edit log from becoming too large
Block Caching
• Normally, a datanode reads blocks from disk, but for frequently accessed files the
blocks may be explicitly cached in the datanode’s memory – block cache
• By default, a block is cached in only one datanode’s memory
• Job schedulers (for MapReduce, Spark and other frameworks) can take advantage of
cached blocks by running tasks on the datanode where a block is cached, for increased
read performance
• Users or applications instruct the namenode which files to cache (and for how long)
by adding a cache directive to a cache pool
• Cache pools are an administrative grouping for managing cache permissions and
resource usage

3. Components of Hadoop

Fig. 3.4 Components of Hadoop

Fig. 3.5 HDFS Architecture

Fig. 3.6 MapReduce

Fig. 3.7 MapReduce

Fig. 3.8 Components of Hadoop

Fig. 3.9 YARN

Fig. 3.10 YARN

• YARN is short for Yet Another Resource Negotiator.


• It is like the operating system of Hadoop as it monitors and manages the resources.
• Yarn came into the picture with the launch of Hadoop 2.x in order to allow different
workloads.
• It handles the workloads like stream processing, interactive processing, and batch
processing over a single platform.
• Yarn has two main components – Node Manager and Resource Manager.
Node Manager
• It is Yarn’s per-node agent and takes care of the individual compute nodes in a Hadoop
cluster.
• It monitors the resource usage like CPU, memory etc. of the local node and intimates
the same to Resource Manager.
Resource Manager
• It is responsible for tracking the resources in the cluster and scheduling tasks like map-
reduce jobs.
• In addition to these, YARN also has an Application Master and a Scheduler.

Fig. 3.11 YARN


The Application Master has two functions:
• Negotiating resources from Resource Manager
• Working with NodeManager to monitor and execute the sub-task.
The Resource Scheduler has the following functions:
• It allocates resources to various running applications
• But it does not monitor the status of the application.
• So in the event of failure of the task, it does not restart the same.
• Another important concept is the container: a fraction of the NodeManager's capacity
(CPU, memory, disk, network, etc.) in which a task runs.

Fig. 3.12 Need for YARN

Fig. 3.16 Application workflow in Hadoop YARN

1. Client submits an application


2. The Resource Manager allocates a container to start the Application Master
3. The Application Master registers itself with the Resource Manager
4. The Application Master negotiates containers from the Resource Manager
5. The Application Master notifies the Node Manager to launch containers
6. Application code is executed in the container
7. The client contacts the Resource Manager/Application Master to monitor the application's status
8. Once the processing is complete, the Application Master unregisters with the
Resource Manager
This section gives an overview of the different components of the Hadoop ecosystem that
make Hadoop so powerful and that have given rise to several Hadoop job roles. It covers the
Hadoop ecosystem components HDFS (and its sub-components), MapReduce, YARN, Hive, Apache
Pig, Apache HBase (and its sub-components), HCatalog, Avro, Thrift, Drill, Apache Mahout,
Sqoop, Apache Flume, Ambari, ZooKeeper and Apache Oozie, providing the background needed
to work with the Hadoop ecosystem in depth.

Fig. 3.17 Hadoop Ecosystem and Their Components

The list of Hadoop Components is discussed in this section one by one in detail.
Hadoop Distributed File System
It is the most important component of the Hadoop ecosystem and the primary storage
system of Hadoop. The Hadoop Distributed File System (HDFS) is a Java-based file system that
provides scalable, fault-tolerant, reliable and cost-efficient data storage for big data, and it
runs on commodity hardware. The default configuration is adequate for many installations, but
large clusters usually need tuning. Users interact with HDFS directly through shell-like
commands (or programmatically, as sketched below).
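As an illustration of programmatic access, the hedged sketch below uses Hadoop's Java FileSystem API; the directory and file names are hypothetical, and the roughly equivalent shell commands are noted in comments.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path dir = new Path("/user/hadoop/demo");           // hypothetical path
    fs.mkdirs(dir);                                      // like: hdfs dfs -mkdir -p /user/hadoop/demo
    fs.copyFromLocalFile(new Path("weather.txt"),        // like: hdfs dfs -put weather.txt ...
                         new Path(dir, "weather.txt"));

    for (FileStatus status : fs.listStatus(dir)) {       // like: hdfs dfs -ls /user/hadoop/demo
      System.out.println(status.getPath() + "\t" + status.getLen());
    }

    try (FSDataInputStream in = fs.open(new Path(dir, "weather.txt"))) { // like: hdfs dfs -cat
      IOUtils.copyBytes(in, System.out, 4096, false);
    }
  }
}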
HDFS Components:
There are two major components of Hadoop HDFS- NameNode and DataNode. Let’s
now discuss these Hadoop HDFS Components-
i. NameNode
It is also known as the master node. The NameNode does not store the actual data; it stores
metadata such as the number of blocks, their locations, the rack and DataNode on which the
data is stored, and other details. The namespace it manages consists of files and directories.
Tasks of HDFS NameNode
 Manage file system namespace.
 Regulates client’s access to files.
 Executes file system operations such as naming, opening and closing files and directories.
ii. DataNode
It is also known as the slave node. The HDFS DataNode is responsible for storing the actual
data and performs read and write operations as requested by clients. Each block replica on a
DataNode consists of two files on the local file system: the first holds the data itself and the
second records the block's metadata, which includes checksums for the data. At startup, each
DataNode connects to its NameNode and performs a handshake that verifies the namespace ID
and the DataNode's software version; if a mismatch is found, the DataNode shuts down
automatically.
Tasks of HDFS DataNode
 DataNode performs operations such as block replica creation, deletion, and replication
according to the instructions of the NameNode.
 DataNode manages the data storage of the system.
This was all about HDFS as a Hadoop Ecosystem component.
MapReduce
Hadoop MapReduce is the core Hadoop ecosystem component which provides data
processing. MapReduce is a software framework for easily writing applications that process
the vast amount of structured and unstructured data stored in the Hadoop Distributed File
system.
MapReduce programs are parallel in nature and are therefore very useful for performing large-
scale data analysis using multiple machines in the cluster. This parallel processing improves
the speed and reliability of the cluster.

Fig. 3.18 Hadoop MapReduce


Working of MapReduce
Hadoop Ecosystem component ‘MapReduce’ works by breaking the processing into two
phases:
 Map phase
 Reduce phase
Each phase has key-value pairs as input and output. In addition, the programmer specifies
two functions: a map function and a reduce function. The map function takes a set of data and
converts it into another set of data, where individual elements are broken down into tuples
(key/value pairs). The reduce function takes the output of the map as its input, combines those
data tuples based on the key, and accordingly modifies the value of the key.
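To make the two functions concrete, here is a minimal word-count sketch written against the newer org.apache.hadoop.mapreduce API (the max-temperature example later in this unit uses the classic mapred API); the class and field names are illustrative only.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map: one input line -> (word, 1) pairs
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce: (word, [1, 1, ...]) -> (word, total count)
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}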

Features of MapReduce
 Simplicity – MapReduce jobs are easy to run. Applications can be written in languages
such as Java, C++, and Python.
 Scalability – MapReduce can process petabytes of data.
 Speed – Through parallel processing, problems that take days to solve can be solved in
hours or minutes by MapReduce.
 Fault Tolerance – MapReduce takes care of failures. If one copy of the data is unavailable,
another machine holds a copy of the same data that can be used to solve the same
subtask.
YARN
Hadoop YARN (Yet Another Resource Negotiator) is the Hadoop ecosystem component
that provides resource management, and it is one of the most important components of the
Hadoop ecosystem. YARN is called the operating system of Hadoop because it is responsible for
managing and monitoring workloads. It allows multiple data processing engines, such as real-
time streaming and batch processing, to handle data stored on a single platform.

Fig. 3.19 Hadoop Yarn Diagram


YARN has been projected as a data operating system for Hadoop 2. The main features of YARN
are:
 Flexibility – Enables other purpose-built data processing models beyond MapReduce
(batch), such as interactive and streaming processing. Thanks to this feature of YARN, other
applications can also be run alongside MapReduce programs in Hadoop 2.
 Efficiency – As many applications run on the same cluster, hence, efficiency of Hadoop
increases without much effect on quality of service.
 Shared – Provides a stable, reliable, secure foundation and shared operational services
across multiple workloads. Additional programming models such as graph processing and
iterative modelling are now possible for data processing.

Hive
The Hadoop ecosystem component Apache Hive is an open source data warehouse
system for querying and analyzing large datasets stored in Hadoop files. Hive performs three
main functions: data summarization, query, and analysis. Hive uses a language called HiveQL
(HQL), which is similar to SQL; HiveQL queries are automatically translated into MapReduce
jobs that execute on Hadoop.

Fig. 3.20 Hive Diagram


The main parts of Hive are:
 Metastore – Stores the metadata.
 Driver – Manages the lifecycle of a HiveQL statement.
 Query compiler – Compiles HiveQL into a Directed Acyclic Graph (DAG).
 Hive server – Provides a Thrift interface and a JDBC/ODBC server.
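As an illustration of the JDBC/ODBC server mentioned above, the following hedged sketch submits a HiveQL query through the HiveServer2 JDBC driver; the host, port, table and column names are assumptions, not part of the original notes.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
  public static void main(String[] args) throws Exception {
    // Register the HiveServer2 JDBC driver (the hive-jdbc jar must be on the classpath).
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // Hypothetical HiveServer2 endpoint and hypothetical "weather" table.
    try (Connection con = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "", "");
         Statement stmt = con.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT year, MAX(temperature) FROM weather GROUP BY year")) {
      while (rs.next()) {
        // HiveQL is compiled into MapReduce jobs behind this call.
        System.out.println(rs.getString(1) + "\t" + rs.getInt(2));
      }
    }
  }
}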
Pig
Apache Pig is a high-level language platform for analyzing and querying huge datasets
that are stored in HDFS. As a component of the Hadoop ecosystem, Pig uses the Pig Latin
language, which is broadly similar to SQL. It loads the data, applies the required filters and
dumps the data in the required format. To execute programs, Pig requires a Java runtime environment.

Fig. 3.21 Pig Diagram

Features of Apache Pig:


 Extensibility – For carrying out special-purpose processing, users can create their own
functions.
 Optimization opportunities – Pig allows the system to optimize execution automatically,
letting the user focus on semantics instead of efficiency.
 Handles all kinds of data – Pig analyzes both structured and unstructured data.
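To give a feel for Pig Latin when driven from Java, the sketch below uses Pig's PigServer class to run a small max-temperature script in local mode; the file paths, field names and the script itself are hypothetical assumptions.
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigLatinExample {
  public static void main(String[] args) throws Exception {
    // LOCAL mode for experimentation; ExecType.MAPREDUCE would run on a cluster.
    PigServer pig = new PigServer(ExecType.LOCAL);
    // Load comma-separated (year, temperature) records from a hypothetical file.
    pig.registerQuery(
        "records = LOAD 'input/sample.txt' USING PigStorage(',') "
      + "AS (year:chararray, temperature:int);");
    pig.registerQuery("filtered = FILTER records BY temperature != 9999;");
    pig.registerQuery("grouped = GROUP filtered BY year;");
    pig.registerQuery(
        "max_temp = FOREACH grouped GENERATE group, MAX(filtered.temperature);");
    // Write the per-year maxima to a hypothetical output directory.
    pig.store("max_temp", "output/max_temp");
  }
}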
HBase
Apache HBase is a Hadoop ecosystem component: a distributed database designed to
store structured data in tables that can have billions of rows and millions of columns. HBase is
a scalable, distributed NoSQL database built on top of HDFS, and it provides real-time read
and write access to data in HDFS.
Components of Hbase
There are two HBase Components namely- HBase Master and RegionServer.
i. HBase Master
It is not part of the actual data storage but negotiates load balancing across all
RegionServers.
 Maintain and monitor the Hadoop cluster.
 Performs administration (interface for creating, updating and deleting tables.)
 Controls the failover.
 HMaster handles DDL operation.
ii. RegionServer
It is the worker node which handles read, writes, updates and delete requests from
clients. Region server process runs on every node in Hadoop cluster. Region server runs on
HDFS DateNode.

Fig. 3.22 HBase Diagram
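To illustrate the real-time read/write access described above, here is a minimal sketch using the standard HBase Java client API; the table name, column family, qualifier and row key are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("weather"))) { // hypothetical table
      // Write one cell: row key = station id, column family "d", qualifier "temp".
      Put put = new Put(Bytes.toBytes("station-011990"));
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("temp"), Bytes.toBytes("-11"));
      table.put(put);
      // Read the same cell back immediately (real-time access).
      Result result = table.get(new Get(Bytes.toBytes("station-011990")));
      System.out.println(
          Bytes.toString(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("temp"))));
    }
  }
}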
HCatalog
It is a table and storage management layer for Hadoop. HCatalog supports different
components available in Hadoop ecosystems like MapReduce, Hive, and Pig to easily read and
write data from the cluster. HCatalog is a key component of Hive that enables the user to store
their data in any format and structure.
By default, HCatalog supports the RCFile, CSV, JSON, SequenceFile and ORC file formats.
Benefits of HCatalog:
 Enables notifications of data availability.
 With the table abstraction, HCatalog frees the user from the overhead of data storage.
 Provides visibility for data cleaning and archiving tools.
Avro
Avro is a part of the Hadoop ecosystem and a popular data serialization system. It is an
open source project that provides data serialization and data exchange services for Hadoop;
these services can be used together or independently. With Avro, programs written in different
languages can exchange big data. Using the serialization service, programs can serialize data
into files or messages. Avro stores the data definition together with the data in one message or
file, making it easy for programs to dynamically understand the information stored in an Avro
file or message.

Avro schema – Avro relies on schemas for serialization and deserialization, and requires the
schema for both data reads and writes. When Avro data is stored in a file, its schema is stored
with it, so that files may be processed later by any program.
Dynamic typing – It refers to serialization and deserialization without code generation. It
complements the code generation which is available in Avro for statically typed language as
an optional optimization.
Features provided by Avro:
 Rich data structures.
 Remote procedure call.
 Compact, fast, binary data format.
 Container file, to store persistent data.
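The following minimal sketch shows two of the ideas above in Avro's Java API: a schema (defined inline here for brevity) and a container file that stores the schema together with the data. The record type and field names are illustrative assumptions.
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical two-field record schema; in practice schemas often live in .avsc files.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Reading\",\"fields\":["
      + "{\"name\":\"year\",\"type\":\"string\"},"
      + "{\"name\":\"temperature\",\"type\":\"int\"}]}");

    GenericRecord record = new GenericData.Record(schema);
    record.put("year", "1950");
    record.put("temperature", 22);

    // The container file stores the schema together with the data it describes.
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.create(schema, new File("readings.avro"));
      writer.append(record);
    }
  }
}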
Thrift
It is a software framework for scalable cross-language service development. Thrift is
an interface definition language for RPC (Remote Procedure Call) communication. Hadoop makes
a lot of RPC calls, so there is a possibility of using the Hadoop ecosystem component Apache Thrift
for performance or other reasons.

Fig. 3.23 Thrift Diagram


Apache Drill
The main purpose of this Hadoop ecosystem component is large-scale data processing,
including structured and semi-structured data. Drill is a low-latency distributed query engine
that is designed to scale to several thousands of nodes and query petabytes of data, and it is the
first distributed SQL query engine with a schema-free model.
Application of Apache Drill
Drill has become an invaluable tool at Cardlytics, a company that provides consumer
purchase data for mobile and internet banking. Cardlytics uses Drill to quickly process
trillions of records and execute queries.
Features of Apache Drill:
Drill has a specialized memory management system that eliminates garbage collection
and optimizes memory allocation and usage. Drill plays well with Hive by allowing developers
to reuse their existing Hive deployments.
 Extensibility – Drill provides an extensible architecture at all layers, including the query layer,
query optimization, and the client API. Any layer can be extended for the specific needs of an
organization.
 Flexibility – Drill provides a hierarchical columnar data model that can represent complex,
highly dynamic data and allows efficient processing.
 Dynamic schema discovery – Apache Drill does not require a schema or type specification
for the data in order to start query execution. Instead, Drill starts processing the
data in units called record batches and discovers the schema on the fly during processing.
 Decentralized metadata – Unlike other SQL-on-Hadoop technologies, Drill does not
have a centralized metadata requirement. Drill users do not need to create and manage tables
in metadata in order to query data.
Apache Mahout
Mahout is an open source framework for creating scalable machine learning algorithms and
a data mining library. Once data is stored in HDFS, Mahout provides data science
tools to automatically find meaningful patterns in those big data sets.
Algorithms of Mahout are:
 Clustering – Takes items in a particular class and organizes them into naturally
occurring groups, such that items belonging to the same group are similar to each other.
 Collaborative filtering – It mines user behaviour and makes product recommendations
(e.g. Amazon recommendations)
 Classifications – It learns from existing categorization and then assigns unclassified items
to the best category.
 Frequent pattern mining – It analyzes items in a group (e.g. items in a shopping cart or
terms in query session) and then identifies which items typically appear together.
Apache Sqoop
Sqoop imports data from external sources into related Hadoop ecosystem components
like HDFS, HBase or Hive. It also exports data from Hadoop to other external sources. Sqoop
works with relational databases such as Teradata, Netezza, Oracle and MySQL.
Features of Apache Sqoop:
 Import sequential datasets from mainframe – Sqoop satisfies the growing need to move
data from the mainframe to HDFS.
 Import directly to ORC files – Improves compression, provides lightweight indexing and
improves query performance.
 Parallel data transfer – For faster performance and optimal system utilization.
 Efficient data analysis – Improves the efficiency of data analysis by combining structured
and unstructured data in a schema-on-read data lake.
 Fast data copies – From an external system into Hadoop.

Fig. 3.24 Apache Sqoop Diagram


Apache Flume
Flume efficiently collects, aggregates and moves large amounts of data from their origin
into HDFS. It is a fault-tolerant and reliable mechanism. This Hadoop ecosystem component
allows data to flow from the source into the Hadoop environment, and it uses a simple,
extensible data model that allows for online analytic applications. Using Flume, we
can get data from multiple servers into Hadoop immediately.

Fig. 3.25 Apache Flume

Ambari
Ambari, another Hadoop ecosystem component, is a management platform for
provisioning, managing, monitoring and securing Apache Hadoop clusters. Hadoop management
gets simpler as Ambari provides a consistent, secure platform for operational control.

Fig. 3.26 Ambari Diagram


Features of Ambari:
 Simplified installation, configuration, and management – Ambari easily and efficiently
creates and manages clusters at scale.
 Centralized security setup – Ambari reduces the complexity of administering and configuring
cluster security across the entire platform.
 Highly extensible and customizable – Ambari is highly extensible for bringing custom
services under management.
 Full visibility into cluster health – Ambari ensures that the cluster is healthy and available
with a holistic approach to monitoring.
Zookeeper
Apache Zookeeper is a centralized service and a Hadoop Ecosystem component for
maintaining configuration information, naming, providing distributed synchronization and
providing group services. Zookeeper manages and coordinates a large cluster of machines.

Fig. 3.27 ZooKeeper Diagram
Features of Zookeeper:
 Fast – Zookeeper is fast with workloads where reads to data are more common than writes.
The ideal read/write ratio is 10:1.
 Ordered – Zookeeper maintains a record of all transactions.
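As a small illustration of ZooKeeper's configuration-storage role, the sketch below uses the standard ZooKeeper Java client to create and read a znode; the ensemble address, znode path and data are assumptions.
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
  public static void main(String[] args) throws Exception {
    CountDownLatch connected = new CountDownLatch(1);
    // Hypothetical ensemble address; 5-second session timeout.
    ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {
      if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
        connected.countDown(); // wait until the session is established
      }
    });
    connected.await();

    // Store a small piece of configuration under a znode.
    zk.create("/app-config", "replication=3".getBytes(StandardCharsets.UTF_8),
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

    // Read it back (znodes are typically read far more often than they are written).
    byte[] data = zk.getData("/app-config", false, null);
    System.out.println(new String(data, StandardCharsets.UTF_8));
    zk.close();
  }
}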
Oozie
It is a workflow scheduler system for managing Apache Hadoop jobs. Oozie combines
multiple jobs sequentially into one logical unit of work. The Oozie framework is fully integrated
with the Apache Hadoop stack, with YARN as its architectural center, and supports Hadoop jobs for
Apache MapReduce, Pig, Hive, and Sqoop.

Fig. 3.28 Oozie Diagram


In Oozie, users can create a Directed Acyclic Graph (DAG) of workflow tasks, which can run in
parallel and sequentially in Hadoop. Oozie is scalable and can manage the timely execution of
thousands of workflows in a Hadoop cluster. Oozie is also very flexible: one can easily
start, stop, suspend and rerun jobs, and it is even possible to skip a specific failed node or rerun it
in Oozie.
There are two basic types of Oozie jobs:

 Oozie workflow – Stores and runs workflows composed of Hadoop jobs, e.g.,
MapReduce, Pig, Hive.
 Oozie Coordinator – It runs workflow jobs based on predefined schedules and availability
of data.
This was all about the components of the Hadoop ecosystem.

4. Analyzing the Data with Hadoop


Map and Reduce
• MapReduce works by breaking the processing into two phases:
– the map phase and the reduce phase.
• Each phase has key-value pairs as input and output, the types of which may be chosen
by the programmer.
• The programmer also specifies two functions: the map function and the reduce function.
• The input to our map phase is the raw NCDC data.
• A text input format is chosen that gives each line in the dataset as a text value.
• The key is the offset of the beginning of the line from the beginning of the file
The map function is simple:
• The year and the air temperature are pulled out, since these are the only fields we are
interested in.
• In this case, the map function is just a data preparation phase, setting up the data in such
a way that the reduce function can do its work on it: finding the maximum temperature
for each year.
• The map function is also a good place to drop bad records: here, temperatures that are
missing, suspect, or erroneous are filtered out.

Fig. 3.29 MapReduce Logical Dataflow


Java MapReduce
• Having run through how the MapReduce program works, the next step is to express it
in code.
• Need three things: a map function, a reduce function, and some code to run the job.
Example Mapper for maximum temperature example
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      output.collect(new Text(year), new IntWritable(airTemperature));
    }
  }
}
Example Reducer for maximum temperature example
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int maxValue = Integer.MIN_VALUE;
    while (values.hasNext()) {
      maxValue = Math.max(maxValue, values.next().get());
    }
    output.collect(key, new IntWritable(maxValue));
  }
}
Example Application to find the maximum temperature in the weather dataset
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MaxTemperature {

  public static void main(String[] args) throws IOException {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }
    JobConf conf = new JobConf(MaxTemperature.class);
    conf.setJobName("Max temperature");
    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    conf.setMapperClass(MaxTemperatureMapper.class);
    conf.setReducerClass(MaxTemperatureReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    JobClient.runJob(conf);
  }
}
5. Scaling Out
• This section looks at the data flow for large inputs.
• For simplicity, the examples so far have used files on the local filesystem.
• However, to scale out,
– data should be stored in a distributed filesystem, typically HDFS,
– to allow Hadoop to move the MapReduce computation
– to each machine hosting a part of the data.
• Having many splits means the time taken to process each split is small compared to the
time to process the whole input.
• So if we are processing the splits in parallel, the processing is better load-balanced if
the splits are small, since a faster machine will be able to process proportionally more
splits over the course of the job than a slower machine.
• On the other hand,
• if splits are too small, then the overhead of managing the splits and of map task creation
begins to dominate the total job execution time.
• For most jobs, a good split size tends to be the size of an HDFS block (64 MB by default
in older Hadoop releases, 128 MB in Hadoop 2 and later), although this can be changed for the cluster
• Hadoop does its best to run the map task on a node where the input data resides in
HDFS. This is called the data locality optimization.
• It should now be clear why the optimal split size is the same as the block size: it is the
largest size of input that can be guaranteed to be stored on a single node.
• Map tasks write their output to the local disk, not to HDFS.
Why is this?
• Map output is intermediate output: it’s processed by reduce tasks to produce the final
output, and once the job is complete the map output can be thrown away.
• So storing it in HDFS, with replication, would be overkill.
• If the node running the map task fails before the map output has been consumed by the
reduce task, then Hadoop will automatically rerun the map task on another node to re-
create the map output.

Fig. 3.30 MapReduce Data flow with the single reduce task

Fig. 3.31 MapReduce Data flow with the multiple reduce tasks

Fig. 3.32 MapReduce Data flow with the no reduce task
Combiner Functions
• Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays
to minimize the data transferred between map and reduce tasks.
• Hadoop allows the user to specify a combiner function to be run on the map output—
the combiner function’s output forms the input to the reduce function.
• Since the combiner function is an optimization, Hadoop does not provide a guarantee
of how many times it will call it for a particular map output record, if at all.
• In other words, calling the combiner function zero, one, or many times should produce
the same output from the reducer
• The contract for the combiner function constrains the type of function that may be used.
• This is best illustrated with an example.
• Suppose that for the maximum temperature example, readings for the year 1950 were
processed by two maps (because they were in different splits).
• Imagine the first map produced the output:
(1950, 0)
(1950, 20)
(1950, 10)
• And the second produced:
(1950, 25)
(1950, 15)
• The reduce function would be called with a list of all the values:
(1950, [0, 20, 10, 25, 15])
• with output:
(1950, 25)
• since 25 is the maximum value in the list.
A combiner function can be used that, just like the reduce function, finds the maximum temperature
for each map output.
• The reduce would then be called with:
(1950, [20, 25])

• and the reduce would produce the same output as before.
• More succinctly, we may express the function calls on the temperature values in this
case as follows:
max (0, 20, 10, 25, 15) = max (max (0, 20, 10), max (25, 15)) = max (20, 25) = 25
• If calculating mean temperatures, then we couldn’t use the mean as our combiner
function, since:
mean (0, 20, 10, 25, 15) = 14
• but:
mean (mean(0, 20, 10), mean(25, 15)) = mean(10, 20) = 15
• The combiner function doesn’t replace the reduce function.
• But it can help cut down the amount of data shuffled between the maps and the reduces,
and for this reason alone it is always worth considering whether a combiner function
can be used in the MapReduce job.
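In the classic JobConf API used by the MaxTemperature driver earlier in this unit, enabling such a combiner is a one-line addition. The following is an illustrative sketch of the driver body (imports and class skeleton as in the MaxTemperature listing), not part of the original program:
// Same driver as in Section 4, with one extra line to register the combiner.
JobConf conf = new JobConf(MaxTemperature.class);
conf.setJobName("Max temperature");
FileInputFormat.addInputPath(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
conf.setMapperClass(MaxTemperatureMapper.class);
conf.setCombinerClass(MaxTemperatureReducer.class); // per-map maximum before the shuffle
conf.setReducerClass(MaxTemperatureReducer.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
JobClient.runJob(conf);
This works because max is associative and commutative; as shown above, mean is not, so a mean-temperature job could not simply reuse its reducer as a combiner.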

6. Hadoop Streaming
• Hadoop provides an API to MapReduce that allows you to write your map and reduce functions
in languages other than Java.
• Hadoop Streaming uses Unix standard streams as the interface between Hadoop and
your program, so you can use any language that can read standard input and write to
standard output to write your MapReduce program.
• Streaming is naturally suited for text processing and when used in text mode, it has a
line-oriented view of data.
• Map input data is passed over standard input to the map function, which processes it
line by line and writes lines to standard output.
• A map output key-value pair is written as a single tab-delimited line.
• Input to the reduce function is in the same format—a tab-separated key-value pair—
passed over standard input.
• The reduce function reads lines from standard input, which the framework guarantees
are sorted by key, and writes its results to standard output.
• Streaming map and reduce programs are commonly written in scripting languages such as Ruby and Python.

7. Design of HDFS
Distributed File System
• File Systems that manage the storage across a network of machines
• Since they are network based, all the complications of network programming occur
• This makes DFS more complex than regular disk file systems
– For example, one of the biggest challenges is making the filesystem tolerate
node failure without suffering data loss
• HDFS – a file system designed for storing very large files with streaming data access
patterns, running on clusters of commodity hardware
• Very Large files
– Hundreds of megabytes, gigabytes or terabytes in size
– There are Hadoop clusters running today that store petabytes of data
• Streaming Data Access
– HDFS is built around the idea that the most efficient data processing pattern is
a write-once, read-many-times pattern

– A dataset is typically generated or copied from source, and then various analyses
are performed on that dataset over time
• Commodity Hardware
– Hadoop doesn’t require expensive, highly reliable hardware.
– It is designed to run on clusters of commodity hardware (commonly available
hardware that can be obtained from multiple vendors)
– The chance of node failure is high, at least for large clusters
– HDFS is designed to carry on working without interruption to the user in the
face of such failure
Areas where HDFS is not good fit today
• Low-latency data access
– Applications that require low-latency access to data, in the tens of milliseconds
range, will not work well with HDFS
– HDFS is designed for delivering a high throughput of data and this may be at
the expense of latency
– HBase is a better choice for low-latency access
• Lots of small files
– Because the namenode holds filesystem metadata in memory, the limit to the
number of files in a filesystem is governed by the amount of memory on the
namenode
– As a rule of thumb, each file, directory and block takes about 150 bytes of namenode memory.
– So, for example, one million files, each occupying one block, amount to about two million
objects (one file object and one block object each) and would need at least
2,000,000 × 150 bytes ≈ 300 MB of memory
• Multiple writers, arbitrary file modifications
– Files in HDFS may be written to by a single writer
– Writes are always made at the end of the file, in append-only fashion
– There is no support for multiple writers or for modifications at arbitrary offsets
in the file
9. How MapReduce Works
10. Anatomy of a MapReduce Job Run
• You can run a MapReduce job with a single method call: submit() on a Job object
• You can also call waitForCompletion(), which submits the job if it hasn't been submitted
already and then waits for it to finish (a minimal driver sketch follows below)
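For reference, a minimal driver built on the newer org.apache.hadoop.mapreduce.Job API might look like the sketch below; the mapper and reducer registrations are left commented out because the classic-API classes shown in Section 4 are not directly compatible with this API, and the class name is illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "Max temperature");
    job.setJarByClass(MaxTemperatureDriver.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Mapper/Reducer classes set here must be written against the new
    // org.apache.hadoop.mapreduce API, unlike the classic-API classes in Section 4.
    // job.setMapperClass(...);
    // job.setReducerClass(...);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // waitForCompletion() submits the job and polls its progress until it finishes.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}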
At the highest level, there are five independent entities:
• The client, which submits the MapReduce job
• The YARN resource manager, which coordinates the allocation of compute resources
on the cluster
• The YARN node managers, which launch and monitor the compute containers on
machines in the cluster
• The MapReduce application master, which coordinates the tasks running the
MapReduce job. The application master and the MapReduce tasks run in containers
that are scheduled by the resource manager and managed by the node managers.
• The DFS (normally HDFS), which is used for sharing job files between the other
entities.

Fig. 3.33 Anatomy of MapReduce Job Run
Job Submission
• The submit() method on Job creates an internal JobSubmitter instance and calls
submitJobInternal() on it (step 1 in the figure)
• Having submitted the job, waitForCompletion() polls the job's progress once per
second and reports the progress to the console if it has changed since the last report
• When the job completes successfully, the job counters are displayed.
• Otherwise, the error that caused the job to fail is logged to the console.
• The job submission process implemented by JobSubmitter does the following:
• Asks the resource manager for a new application ID, used for the MapReduce job
ID (step 2)
• Checks the output specification of the job. For example, if the output directory has not
been specified or it already exists, the job is not submitted and an error is thrown to the
MapReduce program
• Computes the input splits for the job. If the splits cannot be computed (because the
input paths don't exist, for example), the job is not submitted and an error is thrown to
the MapReduce program
• Copies the resources needed to run the job, including the job JAR file, the configuration
file and the computed input splits, to the shared filesystem in a directory named after
the job ID (step 3). The job JAR is copied with a high replication factor so that there
are lots of copies across the cluster for the node managers to access when they run tasks
for the job.
• Submits the job by calling submitApplication() on the resource manager (step 4)

Job Initialization
• When the resource manager receives a call to its submitApplication() method, it
hands off the request to the YARN scheduler.
• The scheduler allocates a container, and the resource manager then launches the
application master’s process there, under the node manager’s management (steps 5a
and 5b)
• The application master for MapReduce jobs is a Java application whose main class
is MRAppMaster.
• It initializes the job by creating a number of bookkeeping objects to keep track of the
job's progress, as it will receive progress and completion reports from the tasks (step 6)
• Next, it retrieves the input splits computed in the client from the shared filesystem (step
7)
• It then creates a map task object for each split, as well as a number of reduce task objects
determined by the mapreduce.job.reduces property (set by the setNumReduceTasks()
method on Job). Tasks are given IDs at this point
• The application master must decide how to run the tasks that make up the MapReduce
job
• If the job is small, the application master may choose to run the tasks in the same JVM
as itself.
• This happens when it judges that the overhead of allocating and running tasks in new
containers outweighs the gain to be had in running them in parallel, compared to
running them sequentially on one node. Such a job is said to be uberized or run as an
uber task.
What qualifies as a small job?
• By default,
• a small job is one that has
– less than 10 mappers,
– only one reducer and
– an input size that is less than the size of one HDFS block
• If the job does not qualify to run as an uber task, the application master requests
containers for the map and reduce tasks from the resource manager (step 8) and then
starts each task's container by contacting the node manager (steps 9a and 9b); the task
is executed by a Java application whose main class is YarnChild.
• Before it can run the task, YarnChild localizes the resources that the task needs, including the
job configuration and JAR file and any files from the distributed cache (step 10).
• Finally, it runs the map or reduce task (step 11).
• The YarnChild runs in a dedicated JVM, so that any bugs in the user-defined map and
reduce functions don't affect the node manager by causing it to crash or hang.
Streaming
• Streaming runs special map and reduce tasks for the purpose of launching the user-
supplied executable and communicating with it
Job Completion
• When the application master receives a notification that the last task for a job is
complete, it changes the status for the job to “successful”
• Finally, on job completion, the application master and the task containers clean up their
working state

