BD Sec B
History: Hadoop was created by Doug Cutting and Mike Cafarella in 2005. It was originally developed to support distribution for the Nutch search engine project. Doug Cutting, who was working at Yahoo! at the time and is now Chief Architect of Cloudera, named the project after his son's toy elephant. Cutting's son was 2 years old at the time and just beginning to talk. He called his favorite stuffed yellow elephant "Hadoop" (with the stress on the first syllable). Now 12, Doug's son often exclaims, "Why don't you say my name, and why don't I get royalties? I deserve to be famous for this!"
Hadoop is an open-source framework from the ASF (Apache Software Foundation). Being an open-source project means it is freely available and we can even change its source code as per our requirements. If certain functionality does not fulfill your need, then you can change it according to your need. Most of the Hadoop code has been written by Yahoo!, IBM, Facebook and Cloudera.
The Apache Hadoop framework is written in Java. The basic Hadoop programming language is Java, but this does not mean you can code only in Java. You can code in C, C++, Perl, Python, Ruby, etc. You can code for the Hadoop framework in any language, but it is better to code in Java, as you will have lower-level control of the code.
The Hadoop framework allows distributed processing of large datasets across clusters (i.e. groups of systems connected via LAN) of computers using simple programming models. A Hadoop-based application works in an environment that provides distributed storage and computation across clusters of commodity hardware (i.e. low-end, inexpensive devices; hence Hadoop is very economical). Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
It is at the center of a growing ecosystem of big data technologies that are primarily used to support
advanced analytics initiatives, including predictive analytics, data mining and learning applications. Hadoop
can handle various forms of structured and unstructured data, giving users more flexibility for collecting,
processing and analyzing data than relational databases and data warehouses provide.
Hadoop Common: These are Java libraries and utilities required by other Hadoop modules. These libraries provide filesystem and OS level abstractions and contain the necessary Java files and scripts required to start Hadoop.
Hadoop Distributed File System (HDFS): The Hadoop Distributed File System (HDFS) is the primary
data storage system used by Hadoop applications. It employs a NameNode and DataNode architecture to
implement a distributed file system (DFS) that provides high-performance access to data across highly
scalable Hadoop clusters. HDFS creates multiple replicas of data blocks and distributes them on compute
nodes in a cluster.
Hadoop YARN: YARN is the resource management and job scheduling technology in the open-source Apache Hadoop distributed processing framework. One of Apache Hadoop's core components, YARN is responsible for allocating system resources to the various applications running in a Hadoop cluster and scheduling tasks to be executed on different cluster nodes.
[Figure: Hadoop cluster with multiple DataNodes]
Apache Hadoop offers a scalable, flexible and reliable distributed computing big data framework for a cluster of systems with storage capacity and local computing power, by leveraging commodity hardware. Hadoop follows a master-slave architecture for the transformation and analysis of large datasets using the Hadoop MapReduce paradigm. The 3 important Hadoop components that play a vital role in the Hadoop architecture are:
1) Hadoop Distributed File System (HDFS)- Patterned after the UNIX file system
2) Hadoop MapReduce
3) Yet Another Resource Negotiator (YARN)
Hadoop follows a master-slave architecture design for data storage and distributed data processing using HDFS and MapReduce respectively. The master node for data storage in Hadoop HDFS is the NameNode, and the master node for parallel processing of data using Hadoop MapReduce is the JobTracker. The slave nodes in the Hadoop architecture are the other machines in the Hadoop cluster which store data and perform complex computations. Every slave node has a TaskTracker daemon and a DataNode that synchronize the processes with the JobTracker and NameNode respectively. In a Hadoop architectural implementation the master or slave systems can be set up in the cloud or on-premise.
Above the file systems comes the MapReduce engine, which consists of one JobTracker, to which client applications submit MapReduce jobs. The JobTracker pushes work out to available TaskTracker nodes in the cluster. With a rack-aware file system, the JobTracker knows which node contains the data, and which other machines are nearby. If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack. This reduces network traffic on the main backbone network.
If a TaskTracker fails or times out, that part of the job is rescheduled. The TaskTracker on each node initiates a separate Java Virtual Machine process to prevent the TaskTracker itself from failing if the running job crashes the JVM. A heartbeat is sent from the TaskTracker to the JobTracker every few minutes to check its status. The JobTracker and TaskTracker status and information is exposed and can be viewed from a web browser.
If the JobTracker failed, all ongoing work was lost. Hadoop added some checkpointing to this process: the JobTracker records what it is up to in the file system. When a JobTracker starts up, it looks for any such data, so that it can restart work from where it left off.
1. Hadoop Distributed File System
HDFS is the primary storage system of Hadoop. It is a Java-based file system that provides scalable, fault-tolerant, reliable and cost-efficient data storage for Big Data. HDFS is a distributed filesystem that runs on commodity hardware. Users can interact directly with HDFS through shell-like commands.
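Besides the shell commands, applications usually talk to HDFS through the Java FileSystem API. The sketch below is a minimal, hedged example of writing and reading a small file; the path /user/demo/hello.txt is purely illustrative, and the Configuration is assumed to pick up the cluster's core-site.xml/hdfs-site.xml from the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // assumed to load the cluster's *-site.xml files
        FileSystem fs = FileSystem.get(conf);            // connects to the default file system (HDFS)

        Path file = new Path("/user/demo/hello.txt");    // hypothetical path, for illustration only

        // Write a small file into HDFS
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }

        // Read it back
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }

        fs.close();
    }
}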
HDFS Architecture
The Hadoop HDFS architecture is a master/slave architecture in which the master is the NameNode, which stores metadata, and the slaves are the DataNodes, which store the actual data. The HDFS architecture consists of a single NameNode, and all the other nodes are DataNodes. Nodes are arranged in racks, and replicas of data blocks are stored on different racks in the cluster to provide fault tolerance.
[Figure: HDFS Architecture — the NameNode holds the metadata (name, replicas, ...; e.g. /home/foo/data, 3, ...) and serves metadata operations; DataNodes store the data blocks and are spread across Rack 1 and Rack 2]
HDFS Components:
There are two major components of Hadoop HDFS- NameNode and DataNode.
i. NameNode
It is also known as the Master node. The NameNode does not store the actual data or dataset. The NameNode stores metadata, i.e. the number of blocks, their locations, on which rack and on which DataNode the data is stored, and other details. It consists of files and directories.
Tasks of HDFS NameNode
. Manages the file system namespace.
. Regulates clients' access to files.
. Executes file system operations such as naming, closing and opening files and directories.
All DataNodes send a Heartbeat and a block report to the NameNode in the Hadoop cluster. This ensures that the DataNodes are alive. A block report contains a list of all blocks on a DataNode. The NameNode is also responsible for taking care of the Replication Factor of all the blocks.
FsImage
FsImage in the NameNode is an "image file". FsImage contains the entire filesystem namespace and is stored as a file in the NameNode's local file system. It also contains a serialized form of all the directory and file inodes in the filesystem. Each inode is an internal representation of a file's or directory's metadata.
EditLogs
EditLogs contains all the recent modifications made to the file system since the most recent FsImage. The NameNode receives a create/update/delete request from the client; this request is first recorded in the edits file.
ii. DataNode
It is also known as the Slave. The HDFS DataNode is responsible for storing the actual data in HDFS. The DataNode performs read and write operations as per the requests of the clients. Each replica block on a DataNode consists of 2 files on the file system: the first file is for the data and the second file is for recording the block's metadata. HDFS metadata includes checksums for the data. At startup, each DataNode connects to its corresponding NameNode and does handshaking. Verification of the namespace ID and the software version of the DataNode takes place during handshaking. If a mismatch is found, the DataNode goes down automatically.
Secondary NameNode?
In HDFS, when the NameNode starts, it first reads the HDFS state from an image file, FsImage. After that, it applies edits from the edits log file. The NameNode then writes the new HDFS state to the FsImage and starts normal operation with an empty edits file. Since the NameNode merges FsImage and edits files only at start-up, the edits log file could get very large over time. A side effect of a larger edits file is that the next restart of the NameNode takes longer.
The Secondary NameNode solves this issue. The Secondary NameNode downloads the FsImage and EditLogs from the NameNode and then merges the EditLogs with the FsImage (FileSystem Image). It keeps the edits log size within a limit. It stores the modified FsImage in persistent storage, which we can use in the case of NameNode failure. The Secondary NameNode thus performs a regular checkpoint in HDFS.
Checkpoint Node?
The Checkpoint node is a node which periodically creates checkpoints of the namespace. The Checkpoint node in Hadoop first downloads the FsImage and edits from the active NameNode. Then it merges them (FsImage and edits) locally, and at last it uploads the new image back to the active NameNode. It stores the latest checkpoint in a directory that has the same structure as the NameNode's directory. This permits the checkpointed image to be always available for reading by the NameNode if necessary.
Backup Node?
A Backup node provides the same checkpointing functionality as the Checkpoint node. In Hadoop, the Backup node keeps an in-memory, up-to-date copy of the file system namespace. It is always synchronized with the active NameNode state. The Backup node in the HDFS architecture does not need to download FsImage and edits files from the active NameNode to create a checkpoint, as it already has an up-to-date state of the namespace in memory. The Backup node checkpoint process is more efficient because it only needs to save the namespace into the local FsImage file and reset edits. The NameNode supports one Backup node at a time.
[Figure: HDFS read/write — a client node reads blocks from DataNodes, and writes go through a pipeline of DataNodes]
Blocks in HDFS Architecture?
HDFS in Apache Hadoop splits huge files into small chunks known as blocks. These are the smallest units of data in the filesystem. We (client and admin) do not have any control over the blocks, such as block location; the NameNode decides all such things. The default size of an HDFS block is 128 MB, which we can configure as per our need. All blocks of a file are of the same size except the last block, which can be the same size or smaller.
If the data size is less than the block size, then the block size will be equal to the data size. For example, if the file size is 129 MB, then 2 blocks will be created. One block will be of the default size of 128 MB, and the other will be 1 MB only, not 128 MB, as that would waste space. Hadoop is intelligent enough not to waste the remaining 127 MB, so it allocates a 1 MB block for the 1 MB of data. The major advantage of storing data in such a block size is that it saves disk seek time.
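A minimal sketch (plain Java arithmetic, not a Hadoop API call) of how the 129 MB example above is cut into blocks with the default 128 MB block size:

public class BlockSplitSketch {
    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;   // default HDFS block size (configurable)
        long fileSize  = 129L * 1024 * 1024;   // the 129 MB file from the example

        long fullBlocks  = fileSize / blockSize;                    // 1 full block of 128 MB
        long lastBlockMb = (fileSize % blockSize) / (1024 * 1024);  // final block of 1 MB, not padded to 128 MB

        System.out.println(fullBlocks + " full block(s) + a final block of " + lastBlockMb + " MB");
    }
}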
Replication Management
Block replication provides fault tolerance: if one copy is not accessible or is corrupted, we can read the data from another copy. The number of copies or replicas of each block of a file in the HDFS architecture is the replication factor. The default replication factor is 3, which is again configurable. So each block is replicated three times and stored on different DataNodes.
If we store a file of 128 MB in HDFS using the default configuration, we will end up occupying a space of 384 MB (3 * 128 MB).
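The replication factor can also be set per file. Below is a hedged sketch using the HDFS FileSystem API; the path is hypothetical and the Configuration is assumed to point at a running cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // assumed to load the cluster's *-site.xml files
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/hello.txt");  // hypothetical file, for illustration only
            fs.setReplication(file, (short) 3);            // ask the NameNode to keep 3 replicas of each block
        }
    }
}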
The NameNode receives block reports from the DataNodes periodically to maintain the replication factor. When a block is over-replicated or under-replicated, the NameNode adds or deletes replicas as needed.
In a large Hadoop cluster, in order to reduce network traffic while reading/writing an HDFS file, the NameNode chooses a DataNode that is on the same rack or a nearby rack to serve the read/write request. The NameNode obtains rack information by maintaining the rack IDs of the DataNodes. Rack Awareness in Hadoop is the concept of choosing DataNodes based on this rack information.
In the HDFS architecture, the NameNode makes sure that all the replicas are not stored on the same rack or a single rack. It follows the Rack Awareness Algorithm to reduce latency as well as to provide fault tolerance. We know that the default replication factor is 3. According to the Rack Awareness Algorithm, the first replica of a block is stored on a local rack, the next replica is stored on another DataNode within the same rack, and the third replica is stored on a different rack. In Hadoop, "No more than one replica is placed on one node. And no more than two replicas are placed on the same rack. This has a constraint that the number of racks used for block replication should be less than the total number of block replicas".
2. InputFormat
It defines how the input files are split and read. It selects the files or other objects that are used for input.
InputFormat creates InputSplit.
3. InputSplits
It is created by the InputFormat and logically represents the data which will be processed by an individual Mapper.
One map task is created for each split; thus the number of map tasks will be equal to the number of InputSplits. The split is divided into records, and each record will be processed by the mapper.
4. RecordReader
It communicates with the InputSplit in Hadoop MapReduce and converts the data into key-value pairs suitable for reading by the mapper. By default, it uses TextInputFormat for converting data into key-value pairs. The RecordReader communicates with the InputSplit until the file reading is completed. It assigns a byte offset (a unique number) to each line present in the file. These key-value pairs are then sent to the mapper for further processing.
5. Mapper
It processes each input record (from the RecordReader) and generates a new key-value pair; this key-value pair generated by the Mapper can be completely different from the input pair. The output of the Mapper, also known as the intermediate output, is written to the local disk. The output of the Mapper is not stored on HDFS, as this is temporary data and writing it to HDFS would create unnecessary copies (also, HDFS is a high-latency system). The mapper's output is passed to the combiner for further processing, as sketched below.
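As an illustration, a hedged word-count mapper in Java; the class name WordCountMapper is just a placeholder, and the input key is the byte offset supplied by the RecordReader:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of the line, value = the line of text from the RecordReader
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                context.write(new Text(token), new IntWritable(1));   // intermediate <word, 1> pair
            }
        }
    }
}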
6. Combiner
The combiner is also known as a 'Mini-reducer'. The Hadoop MapReduce Combiner performs local aggregation on the mappers' output, which helps to minimize the data transfer between mapper and reducer. Once the combiner functionality is executed, the output is then passed to the partitioner for further work.
7. Partitioner
It comes into the picture only if we are working with more than one reducer (for one reducer the partitioner is not used). It takes the output from the combiners and performs partitioning. Partitioning of the output takes place on the basis of the key and is then sorted. A hash function applied to the key is used to derive the partition.
According to the key value in MapReduce, each combiner output is partitioned, and records having the same key value go into the same partition; each partition is then sent to a reducer. Partitioning allows an even distribution of the map output over the reducers, as sketched below.
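A minimal sketch of a custom partitioner that mirrors the hash-based logic described above (Hadoop's default HashPartitioner works the same way); the class name is illustrative:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Hash the key, mask off the sign bit so the result is non-negative,
        // then take the remainder so each key always lands in the same partition.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}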
8. Shuffling and Sorting
Shuffling is the physical movement of the data over the network. Once all the mappers have finished and their output has been shuffled to the reducer nodes, this intermediate output is merged and sorted, and then provided as input to the reduce phase.
9. Reducer
It takes the set of intermediate key-value pairs produced by the mappers as the input and then runs a reducer
function on each of them to generate the output. The output of the reducer is the final output, which is stored
in HDFS.
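A hedged word-count reducer to pair with the mapper sketched earlier; again, the class name is a placeholder:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();        // add up all the 1s emitted for this word
        }
        total.set(sum);
        context.write(key, total);     // final <word, count> pair, written to HDFS by the RecordWriter
    }
}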
10. RecordWriter
It writes the output key-value pairs from the Reducer phase to the output files.
11. OutputFormat
The way these output key-value pairs are written to output files by the RecordWriter is determined by the OutputFormat. OutputFormat instances provided by Hadoop are used to write files to HDFS or the local disk. Thus the final output of the reducer is written to HDFS by an OutputFormat instance.
Hence, in this manner, Hadoop MapReduce works over the cluster.
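Tying the pieces together, here is a hedged driver sketch that wires the mapper, combiner, reducer and input/output paths into a Job; the class names refer to the sketches above, and the input/output paths come from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);      // mapper sketched earlier
        job.setCombinerClass(WordCountReducer.class);   // combiner: local aggregation of <word, 1> pairs
        job.setReducerClass(WordCountReducer.class);    // reducer sketched earlier

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));     // the InputFormat splits this input
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // the OutputFormat/RecordWriter write here

        System.exit(job.waitForCompletion(true) ? 0 : 1);         // submit to the cluster and wait
    }
}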
Hadoop Mapper
In a mapper task, the output is the full collection of all these <key, value> pairs. Before writing the output for each mapper task, partitioning of the output takes place on the basis of the key and then sorting is done. This partitioning specifies that all the values for each key are grouped together. The MapReduce framework generates one map task for each InputSplit generated by the InputFormat for the job. The Mapper only understands <key, value> pairs of data, so before passing data to the mapper, the data should first be converted into <key, value> pairs.
[Figure: Mapper/Reducer word-count example — an input line such as "Lion Tiger River Bus ..." is mapped to <word, 1> pairs (Lion 1, Tiger 1, River 1, Bus 1, ...) and then reduced to per-word counts such as Tiger 2 and Bus 3]
MapReduce applications are used at:
Google:
. To create the index used by the Google search engine for retrieving search results.
. To compute the PageRank of web pages.
. Statistical machine translation for translating between different languages.
Facebook:
. Data mining.
. Ad optimization.
Yahoo:
. Spam detection for Yahoo! Mail.
. Yahoo! Search.
In order to enable LZO compression, set mapred.compress.map.output to true. This is one of the most important Hadoop optimization techniques.
2.5. Usage of the most appropriate and compact writable type for data
New big data users, or users switching from Hadoop Streaming to Java MapReduce, often use the Text writable type unnecessarily. Although Text can be convenient, it is inefficient to convert numeric data to and from UTF-8 strings, and this can actually make up a significant portion of CPU time. Whenever dealing with non-textual data, consider using the binary Writables like IntWritable, FloatWritable, etc.
One of the common mistakes that many MapReduce users make is to allocate a new Writable object for
every output from a mapper or reducer. For example, to implement a word-count mapper:
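The code that originally followed this sentence is missing from these notes; below is a hedged reconstruction of the usual pattern, in which the mapper reuses a single Text and a single IntWritable instance instead of allocating new Writable objects for every output. Class and field names are illustrative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ReusingWordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Allocated once per task and reused for every output,
    // instead of new Text()/new IntWritable() per record.
    private final Text word = new Text();
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);           // overwrite the reused Text rather than allocating a new one
                context.write(word, ONE);
            }
        }
    }
}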
Simplicity - MapReduce jobs are easy to run. Applications can be written in any language such as Java, C++ and Python.
Scalability - MapReduce can process petabytes of data.
Speed - By means of parallel processing, problems that would take days to solve are solved in hours or minutes by MapReduce.
Fault Tolerance - MapReduce takes care of failures.
Data Locality - First, it tries to put the compute on the same node where the data resides. If that cannot be done (due to reasons like the node being down, the compute on that node performing some other computation, etc.), then it tries to put the compute on the node nearest to the respective data node(s) which contain the data to be processed. This feature of MapReduce is called "Data Locality".
MapReduce - Word Count Example Flow
[Figure: word-count example flow — input → split → map → combine → shuffle → reduce → output; the map/combine stage emits a list of (key, value) pairs to an intermediate file, and the shuffle stage groups them into (key, list of values) for the reducers]
YARN (Yet Another Resource Negotiator)
YARN was introduced in Hadoop 2.0. In Hadoop 1.0, a map-reduce job is run through a job tracker and multiple task trackers. The job of the job tracker is to monitor the progress of the map-reduce job, handle resource allocation, scheduling, etc. As a single process is handling all these things, Hadoop 1.0 does not scale well; it also makes the job tracker a single point of failure. In 1.0 you can run only map-reduce jobs with Hadoop, but with YARN support in 2.0 you can run other jobs like streaming and graph processing. In 1.0 slots are fixed for map and reduce tasks, so while a map is running you cannot use reduce slots for map tasks, and because of that slots go to waste. In 2.0 there is the concept of a container, which has resources like memory and CPU cores, and any task can be run in it.
[Figure: How YARN runs an application — 1: the client on the client node submits the application to the ResourceManager on the resource manager node; 2a/2b: the ResourceManager asks a NodeManager to start and launch a container for the application process (the application master); 3: the application master asks the ResourceManager to allocate resources (heartbeat); 4a/4b: a NodeManager on another node manager node starts and launches further containers for the application's processes]
Resource Manager:
It has two main components: the Job Scheduler and the Application Manager. The job of the scheduler is to allocate resources according to the given scheduling method, and the job of the Application Manager is to monitor the progress of submitted applications such as map-reduce jobs. The Resource Manager has all the information about available resources.
Node Manager:
For each node there is a Node Manager running. It maintains the available resources on that particular node and notifies the Resource Manager about the available resources when it starts. It launches the containers by providing the needed resources (memory, CPU, etc.); these resources are allocated to a container by the Resource Manager. It manages the containers during their lifetime. It sends a heartbeat to the Resource Manager to let it know that it is alive. In case the Resource Manager doesn't receive a heartbeat from a Node Manager, it marks that node as failed.
Application Master:
It carries out the execution of the job using the different components of YARN. It is spawned under a Node Manager under the instructions of the Resource Manager. One Application Master is launched for each job. For resource allocation it talks to the Resource Manager; for launching or stopping a container it talks to the Node Manager. It aggregates the status of tasks from different nodes and notifies the status of the job to the client as the client polls on it. It also sends a periodic heartbeat to the Resource Manager so that the Resource Manager can launch a new Application Master in case of failure.
Container:
It is started by the Node Manager. It consists of resources like memory, CPU cores, etc. For running a map or reduce task, the Application Master asks the Resource Manager for resources using which a container can be run.
Apache HBase
Since 1970, the RDBMS has been the solution for data storage and maintenance related problems. After the advent of big data, companies realized the benefit of processing big data and started opting for solutions like Hadoop. Hadoop uses a distributed file system for storing big data and MapReduce to process it. Hadoop excels in storing and processing huge data of various formats such as arbitrary, semi-structured, or even unstructured data.
Limitations of Hadoop
Hadoop can perform only batch processing, and data will be accessed only in a sequential manner. That means one has to search the entire dataset even for the simplest of jobs.
What is HBase
HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an open-source project and is horizontally scalable. Its programming language is Java. HBase is a data model similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop File System (HDFS). HBase is well suited for sparse data sets, which are very common in big data use cases. HBase provides APIs enabling development in practically any programming language. It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the Hadoop File System. One can store the data in HDFS either directly or through HBase. Data consumers read/access the data in HDFS randomly using HBase. HBase sits on top of the Hadoop File System and provides read and write access.
[Figure: a Data Producer and a Data Consumer accessing data in HDFS through HBase]
Why HBase
Limitations of RDBMS:
o RDBMS get exponentially slow as the data becomes large.
o Expects data to be highly structured, i.e. the ability to fit in a well-defined schema.
o Any change in schema might require a downtime.
o For sparse datasets, too much overhead of maintaining NULL values.
Advantages of HBase:
o It can easily work on extremely large scale data.
o We can use it for both structured and semi-structured data types.
o Apache HBase has a completely distributed architecture, which results in unprecedented high write throughput.
o HBase offers high security and easy management.
o Moreover, MapReduce jobs can be backed with HBase tables.
Where to Use HBase
Apache HBase is used to have random, real-time read/write access to Big Data. It hosts very large tables on top of clusters of commodity hardware. Apache HBase is a non-relational database modeled after Google's Bigtable. Bigtable acts up on the Google File System; likewise, Apache HBase works on top of Hadoop and HDFS.
HBase History
Year       Event
Nov 2006   Google released the paper on BigTable.
Feb 2007   Initial HBase prototype was created as a Hadoop contribution.
Oct 2007   The first usable HBase, along with Hadoop 0.15.0, was released.
Jan 2008   HBase became a sub-project of Hadoop.
Oct 2008   HBase 0.18.1 was released.
Jan 2009   HBase 0.19.0 was released.
Sept 2009  HBase 0.20.0 was released.
May 2010   HBase became an Apache top-level project.
HBase - Architecture
In HBase, tables are split into regions and are served by the region servers. Regions are vertically divided by column families into "Stores". Stores are saved as files in HDFS. Shown below is the architecture of HBase.
Note: The term 'store' is used for regions to explain the storage structure.
[Figure: HBase Architecture — clients talk to the master server and region servers, which run on top of Hadoop HDFS]
HBase Architecture is basically a column-oriented key-value data store, and it is a natural fit for deployment as a top layer on HDFS because it works extremely well with the kind of data that Hadoop processes. Moreover, when it comes to both read and write operations it is extremely fast, and it does not lose this important quality even with humongous datasets.
There are 3 major components of HBase Architecture:
Master server
Region servers (can be added or removed as per requirement)
Zookeeper (client library)
MasterServer
The master server -
. Assigns regions to the region servers and takes the help of Apache ZooKeeper for this task.
. Handles load balancing of the regions across region servers. It unloads the busy servers and shifts the regions to less occupied servers.
A column family is a collection of columns. A column is a collection of key-value pairs. Given below is an example schema of a table in HBase.
Rowid | Column Family (col1, col2, col3) | Column Family (col1, col2, col3) | Column Family (col1, col2, col3) | Column Family (col1, col2, col3)

COLUMN FAMILIES (example with a "personal" family):

empid | name   | city      | designation | salary
      | raju   | hyderabad | manager     | 50,000
      | ravi   | chennai   | sr.engineer | 30,000
      | rajesh | delhi     | jr.engineer | 25,000
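To make the model concrete, here is a hedged sketch using the HBase Java client to write and read one row. The table name "emp" and the column family names "personal" and "professional" are assumptions for illustration only, and the table is assumed to already exist with those families.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseEmployeeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("emp"))) {   // assumed existing table

            // Write one row: row key "1", columns in two column families
            Put put = new Put(Bytes.toBytes("1"));
            put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("raju"));
            put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("city"), Bytes.toBytes("hyderabad"));
            put.addColumn(Bytes.toBytes("professional"), Bytes.toBytes("designation"), Bytes.toBytes("manager"));
            table.put(put);

            // Random read of a single row by its row key
            Result result = table.get(new Get(Bytes.toBytes("1")));
            String name = Bytes.toString(result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name")));
            System.out.println("name = " + name);
        }
    }
}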
HDFS                                                    | HBase
HDFS is a distributed file system suitable for storing | HBase is a database built on top of HDFS.
large files.                                            |
HDFS does not support fast individual record lookups.  | HBase provides fast lookups for larger tables.
It provides high-latency batch processing.             | It provides low-latency access to single rows from
                                                        | billions of records (random access).
It provides only sequential access to data.            | HBase internally uses hash tables and provides random
                                                        | access, and it stores the data in indexed HDFS files
                                                        | for faster lookups.
RDBMS                                                   | HBase
An RDBMS is governed by its schema, which describes    | HBase is schema-less; it doesn't have the concept of a
the whole structure of the tables.                      | fixed columns schema; it defines only column families.
It is thin and built for small tables. Hard to scale.   | It is built for wide tables. HBase is horizontally scalable.
RDBMS is transactional.                                 | No transactions are there in HBase.
It will have normalized data.                           | It has de-normalized data.
It is good for structured data.                         | It is good for semi-structured as well as structured data.
Features of HBase
. HBase is linearly scalable.
. It has automatic failure support.
. It provides consistent reads and writes.
. It integrates with Hadoop, both as a source and a destination.
. It has an easy Java API for clients.
. It provides data replication across clusters.
. Horizontally scalable: You can add any number of columns anytime.
. Automatic Failover: Automatic failover is a resource that allows a system administrator to automatically switch data handling to a standby system in the event of system compromise.
. Integration with the MapReduce framework: All the commands and Java codes internally implement MapReduce to do the task, and it is built over the Hadoop Distributed File System.
. Sparse, distributed, persistent, multidimensional sorted map, which is indexed by row key, column key and timestamp.
. Often referred to as a key-value store or column-family-oriented database, or as storing versioned maps of maps.
. Fundamentally, it is a platform for storing and retrieving data with random access.
. It doesn't care about data types (you can store an integer in one row and a string in another for the same column).
. It doesn't enforce relationships within your data.
. It is designed to run on a cluster of computers, built using commodity hardware.
Applications of HBase
. It is used whenever there is a need for write-heavy applications.
. HBase is used whenever we need to provide fast random access to available data.
. Companies such as Facebook, Twitter, Yahoo and Adobe use HBase internally.