Session 1 - Introduction to Hadoop
MapReduce Programming - Session 1
Course: Data Engineering - I
Lecture On: Introduction to Hadoop
Instructor: Vishwa Mohan
Segment - 01
Module Introduction
Session Overview
Segment Overview
Introduction to distributed systems
Introduction to Distributed Systems
A distributed system is a system in which several independent computers are connected to each other over a network, with middleware that makes them look like a single machine.
● Performance - Affected by challenges such as delays caused by communication faults or computation faults.
● Fault tolerance - The system must tolerate faults and continue to function normally.
● Scalability - The system must remain effective even under heavy load.
● Security - Probable threats include information leakage, integrity violation, denial of service and illegitimate use.
● Concurrency - Shared access to resources must be made available to the processes that require it.
● Migration - Tasks should be able to move within the system without affecting other operations.
● Load balancing - Load must be distributed among the available resources for better performance.
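To make the load balancing point concrete, here is a minimal Python sketch that sends each task to whichever node currently has the least load. It is only an illustration: the function names `assign_tasks` and `node_loads`, the node names and the task costs are all hypothetical, and real distributed systems use more elaborate policies.

```python
import heapq


def assign_tasks(task_costs, node_names):
    """Greedy least-loaded assignment (illustrative policy, not from the slides).

    Each task is given to the node whose accumulated load is currently smallest.
    """
    # Min-heap of (current_load, node_name); ties are broken by node name.
    heap = [(0, name) for name in node_names]
    heapq.heapify(heap)
    assignment = {name: [] for name in node_names}
    for task_id, cost in enumerate(task_costs):
        load, name = heapq.heappop(heap)
        assignment[name].append(task_id)
        heapq.heappush(heap, (load + cost, name))
    return assignment


def node_loads(task_costs, assignment):
    """Total cost of the tasks assigned to each node."""
    return {name: sum(task_costs[t] for t in tasks)
            for name, tasks in assignment.items()}
```

Calling `assign_tasks([5, 3, 2, 4], ["node-a", "node-b"])` spreads the four hypothetical tasks over the two nodes instead of sending them all to one machine.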
Segment Summary
Segment - 04
Introduction to GFS and MapReduce
Segment Overview
Introduction to GFS and MapReduce
GFS (Google File System) is a proprietary distributed file system developed by Google for storing and processing large amounts of data across multiple commodity machines in large clusters. It is designed to ensure reliability and scalability while maintaining efficiency.
Segment Summary
Segment - 05
Introduction to Hadoop
Segment Overview
Introduction to Hadoop
Hadoop is an open-source framework used for storing and processing big data on clusters of commodity hardware.
Hadoop contains two main components: the Hadoop Distributed File System (HDFS) and MapReduce.
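The MapReduce half of this pairing is easier to picture with a small example. The sketch below imitates the map, shuffle and reduce steps for a word count in plain, single-process Python. It does not use the Hadoop API and does not model HDFS storage; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on a list of text lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

In a real cluster the map and reduce calls would run on different machines over data blocks held in the distributed file system. Here everything runs in one process so that the data flow is easy to follow.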
Segment Summary
Segment - 06
Hadoop 2.x
Segment Overview
Hadoop 2.x
Main motivation for developing Hadoop 2.x
Segment Summary
Segment - 07
YARN
Segment Overview
Introduction to YARN
Components of YARN
YARN
YARN was introduced in Hadoop 2.x mainly to split the responsibilities of the JobTracker in Hadoop 1.x into separate components, thus improving the scalability and reliability of the whole system.
Components of YARN
● Resource Manager
● Node Manager
● Application Master
● Containers
Containers
Resource Manager
Node Manager
Application Master
Segment Summary
Introduced YARN
Segment - 08
Task Processing in Hadoop
Segment Overview
Task Processing in Hadoop
Resource Manager
● One per cluster
● Runs on the master node
Application Master
● Runs on a slave node, one per application
● Monitors its job and requests containers from the Scheduler
Node Manager
● Runs on each slave node
● Launches and monitors containers
Containers
● Represent the processor and memory resources allocated on a node for processing tasks
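The division of roles listed above can be sketched as a toy simulation. This is a heavily simplified illustration and not the YARN API: the class names mirror the components, but every attribute and method (`can_fit`, `launch`, `allocate`, `request_containers`) is hypothetical. The first-fit scheduling policy is also an assumption, and for brevity the Resource Manager here hands containers to nodes directly.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    """A slice of one node's memory and CPU reserved for a task."""
    container_id: int
    node: str
    memory_mb: int
    vcores: int


@dataclass
class NodeManager:
    """Tracks free resources on one node and the containers running there."""
    name: str
    free_memory_mb: int
    free_vcores: int
    containers: list = field(default_factory=list)

    def can_fit(self, memory_mb, vcores):
        return self.free_memory_mb >= memory_mb and self.free_vcores >= vcores

    def launch(self, container):
        self.free_memory_mb -= container.memory_mb
        self.free_vcores -= container.vcores
        self.containers.append(container)


class ResourceManager:
    """One per cluster; its scheduler decides where containers are placed."""

    def __init__(self, node_managers):
        self.node_managers = node_managers
        self._next_id = 1

    def allocate(self, memory_mb, vcores):
        # Toy first-fit scheduler (an assumption made for this sketch).
        for nm in self.node_managers:
            if nm.can_fit(memory_mb, vcores):
                container = Container(self._next_id, nm.name, memory_mb, vcores)
                self._next_id += 1
                nm.launch(container)
                return container
        return None  # no node has enough free resources


class ApplicationMaster:
    """Per-application component that requests containers for its tasks."""

    def __init__(self, resource_manager):
        self.rm = resource_manager
        self.containers = []

    def request_containers(self, count, memory_mb, vcores):
        for _ in range(count):
            container = self.rm.allocate(memory_mb, vcores)
            if container is None:
                break  # cluster is full; stop asking
            self.containers.append(container)
        return self.containers
```

For example, with two hypothetical nodes `NodeManager("node-1", 2048, 2)` and `NodeManager("node-2", 4096, 4)`, an Application Master asking for four containers of 1024 MB and 1 vcore each fills node-1 first and then places the remaining containers on node-2.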
Segment Summary
Segment - 09
Tools for Hadoop
Segment Overview
Tools for Hadoop
Segment Summary
Session Summary