Session 1 - Introduction to Hadoop

The document provides an introduction to Hadoop and MapReduce programming, covering distributed systems, the Google File System (GFS), and the components of Hadoop including HDFS and YARN. It discusses the motivation behind Hadoop 2.x and its improvements, as well as the tools available in the Hadoop ecosystem. The sessions are structured to progressively build knowledge on the architecture and functionalities of Hadoop and its related technologies.

Uploaded by

Gaurav Kumar

Introduction to Hadoop and MapReduce Programming - Session 1

1
Course: Data Engineering - I
Lecture On: Introduction to Hadoop
Instructor: Vishwa Mohan

2
Segment - 01
Module Introduction

3
Module Introduction

Session 1
• Introduction to distributed systems
• GFS and MapReduce
• Introduction to Hadoop and components of HDFS
• Hadoop 2.x
• YARN and task processing
• Tools for Hadoop

Session 2
• File storage in HDFS
• Demonstration of basic commands in HDFS
• Write operation and rack awareness
• Read operation in HDFS
• Features and limitations of HDFS

Session 3
• Introduction to the MapReduce Framework
• Basic implementation of MapReduce using Python
• Hadoop streaming with demonstration
• The Combiner
• The Partitioner
• Job scheduling and fault tolerance in MapReduce

Segment - 02
Introduction: Introduction to Hadoop

5
Session Overview

1 Introduction to Distributed Systems

2 Introduction to GFS and MapReduce

3 Introduction to Hadoop and Hadoop 2.x Along With Its Features

4 YARN and Task Processing

5 Tools for Hadoop


6
Segment - 03
Introduction to Distributed Systems

7
Segment Overview

Introduction to distributed systems

The need for distributed systems and their features

8
Introduction to Distributed Systems

Why are distributed systems needed?

● A single Coffee-house store may take hundreds of orders each day.
● If there are 10 stores in a small city, the data grows more or less proportionately.
● At the global level, if Coffee-house has 10,000 stores across the world, the amount of data generated and processed can't be handled by a single machine.

9
Introduction to Distributed Systems

A distributed system is a system in which several independent computers are connected to each other over a network, with middleware that makes them look like a single machine.

10
Introduction to Distributed Systems

Primary Challenges faced by Distributed Systems

● Performance - Includes challenges such as delays caused by communication faults or computation faults.
● Fault tolerance - The system must tolerate faults and continue to function normally.
● Scalability - The system must remain effective even under heavy load.
● Security - Probable threats include information leakage, integrity violation, denial of service and illegitimate use.
● Concurrency - Shared access to resources must be made available to the processes that require it.
● Migration - Tasks should be allowed to move within the system without affecting other operations.
● Load balancing - Load must be distributed among the available resources for better performance.
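
To make the load-balancing challenge concrete, here is a minimal Python sketch. It is not taken from the lecture: the function names and the greedy "least-loaded worker" strategy are assumptions chosen for illustration, and real distributed systems use more elaborate policies.

```python
import heapq


def assign_tasks(task_costs, num_workers):
    """Greedy least-loaded assignment (illustrative assumption, not a real scheduler).

    Each task is given to the worker whose total load so far is smallest.
    Returns a dict mapping worker id -> list of task ids.
    """
    # Heap of (current_load, worker_id); the least-loaded worker is popped first.
    heap = [(0, worker) for worker in range(num_workers)]
    heapq.heapify(heap)
    assignment = {worker: [] for worker in range(num_workers)}
    for task_id, cost in enumerate(task_costs):
        load, worker = heapq.heappop(heap)
        assignment[worker].append(task_id)
        heapq.heappush(heap, (load + cost, worker))
    return assignment


def worker_loads(task_costs, assignment):
    """Total cost carried by each worker under the given assignment."""
    return {w: sum(task_costs[t] for t in tasks) for w, tasks in assignment.items()}
```

Calling `assign_tasks([5, 3, 2, 4], 2)` spreads four tasks over two hypothetical workers; `worker_loads` then shows how evenly the cost was divided.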

12
Segment Summary

Introduced the concept of distributed systems

Discussed why they are needed and what their features are

13
Segment - 04
Introduction to GFS and MapReduce

14
Segment Overview

GFS and its master/slave architecture

Considerations for GFS and the origin of MapReduce

15
Introduction to GFS and MapReduce

Today, on average, Google handles...

16
Introduction to GFS and MapReduce

GFS is a proprietary distributed file system developed by Google for storing and processing large amounts of data across many commodity machines in large clusters, while ensuring reliability, scalability and efficiency.

Google File System (GFS)

17
Introduction to GFS and MapReduce

Considerations made while developing GFS and their effects

● Uses commodity hardware: can be easily scaled horizontally
● Commodity hardware always fails: fault tolerance and automatic error recovery are crucial
18
Introduction to GFS and MapReduce

How did GFS solve the problems of traditional distributed systems?

● Very high availability and fault tolerance are maintained through replication
● Automatic and efficient data recovery
● High aggregate throughput
● Data integrity is ensured, as each chunkserver verifies the integrity of its own copies using checksums
● Each chunkserver is constantly monitored by the GFS master
● Modularity allows GFS to expand easily to account for increased loads
● The master node ensures that load balancing is maintained
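
The checksum-based integrity check mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names are hypothetical, CRC-32 from the standard `zlib` module stands in for whatever checksum a chunkserver actually uses, and the 64 KB block size is a parameter of the sketch rather than a claim about any particular deployment.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # assumed checksum granularity for this sketch


def checksum_blocks(data, block_size=BLOCK_SIZE):
    """Compute one CRC-32 checksum per fixed-size block of a chunk."""
    return [
        zlib.crc32(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    ]


def find_corrupt_blocks(data, expected, block_size=BLOCK_SIZE):
    """Return the indices of blocks whose checksum no longer matches.

    `expected` is the checksum list recorded when the chunk was written.
    Assumes the chunk length is unchanged since then.
    """
    actual = checksum_blocks(data, block_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]
```

A server following this idea would record `checksum_blocks(chunk)` at write time and call `find_corrupt_blocks` before serving a read; a non-empty result means the copy should be restored from another replica.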

19
Introduction to GFS and MapReduce

In their paper “MapReduce: Simplified Data Processing on Large Clusters”, Google first introduced MapReduce, a programming model and an associated implementation for processing and generating large data sets.

MapReduce was then used in the Apache Nutch project. It was highly scalable and supported unstructured data. MapReduce jobs were able to withstand hardware failures, thus ensuring fault tolerance.
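
The MapReduce programming model can be illustrated with the classic word-count example. The following is a single-process Python sketch under simplifying assumptions: the function names are chosen for illustration, everything runs in memory on one machine, and the distribution, scheduling and fault tolerance that a real MapReduce framework provides are left out.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield (word, 1)


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return (key, sum(values))


def word_count(lines):
    """Run map, shuffle and reduce in sequence on an in-memory input."""
    pairs = (pair for line in lines for pair in map_phase(line))
    groups = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())
```

For example, `word_count(["big data", "big cluster"])` counts "big" twice and the other two words once each. In a cluster, many map and reduce tasks would run in parallel on different machines, with the framework handling the shuffle between them.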

21
Introduction to GFS and MapReduce

Google described GFS in their white paper titled “The Google File System” and defined it as a scalable distributed file system for large distributed data-intensive applications.

Later on, the development of Hadoop started as a sub-project of Apache Nutch, when the Apache community realised that the implementation of MapReduce and Nutch DFS could be used for other tasks as well.

22
Segment Summary

Discussed GFS and its master/slave architecture

Discussed the considerations made while using GFS and the origin of MapReduce

23
Segment - 05
Introduction to Hadoop

24
Segment Overview

Introduction to Hadoop and its components

HDFS, its components and basic architecture

25
Introduction to Hadoop

Hadoop is an open-source framework that is used for storing and processing big data on clusters of commodity hardware.

26
Introduction to Hadoop

Hadoop contains two main components: the Hadoop Distributed File System (HDFS) and MapReduce

27
Introduction to Hadoop

HDFS has three main components: the NameNode, the Secondary NameNode and DataNodes
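
As a rough illustration of how the NameNode and DataNodes divide the work, the Python sketch below splits a file into blocks and records which DataNodes hold each replica, which is the kind of metadata the NameNode keeps. The function name and the round-robin placement are assumptions made for this sketch; real HDFS placement is rack-aware. The 128 MB block size and replication factor of 3 are the usual Hadoop 2.x defaults, and both are configurable.

```python
import math

BLOCK_SIZE_MB = 128  # usual HDFS default block size in Hadoop 2.x (configurable)
REPLICATION = 3      # usual HDFS default replication factor (configurable)


def plan_blocks(file_size_mb, datanodes,
                block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return a block index -> list of DataNode names holding a replica.

    Placement here is plain round-robin, a simplification for illustration;
    it is not the policy HDFS actually uses.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    block_map = {}
    for block in range(num_blocks):
        block_map[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(replication)
        ]
    return block_map
```

With four hypothetical DataNodes, `plan_blocks(300, ["dn1", "dn2", "dn3", "dn4"])` yields three blocks, each stored on three different nodes. The DataNodes store the block contents, while the NameNode only needs the mapping.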

30
Segment Summary

Introduced Hadoop and its components

Discussed HDFS, its components and architecture

34
Segment - 06
Hadoop 2.x

35
Segment Overview

Motivation behind the development of Hadoop 2.x

Introduction to Hadoop 2.x and its improvements

36
Hadoop 2.x
Main motivation for developing Hadoop 2.x

37
Hadoop 2.x

Improvements in Hadoop 3.x

● Minimum required Java runtime version raised to Java 8
● Erasure coding support
● Improved YARN Timeline Service v.2
● Support for more than two NameNodes
● Support for GPUs in YARN
● Intra-DataNode disk balancing
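
One way to see why erasure coding matters is to compare storage overhead with plain replication. The Python sketch below is a back-of-the-envelope calculation, not part of the lecture; the function names are hypothetical, and the Reed-Solomon layout with 6 data units and 3 parity units is used as an assumed example configuration.

```python
def replication_overhead(replicas):
    """Extra storage, as a percentage of the original data, for N full copies."""
    return (replicas - 1) * 100.0


def erasure_coding_overhead(data_units, parity_units):
    """Extra storage, as a percentage, when parity units protect the data units."""
    return parity_units / data_units * 100.0


if __name__ == "__main__":
    print("3x replication overhead (%):", replication_overhead(3))
    print("RS(6, 3) overhead (%):", erasure_coding_overhead(6, 3))
```

Under these assumptions, three-way replication costs 200% extra storage, while the 6+3 erasure-coded layout costs 50%, at the price of more computation when data has to be reconstructed.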

40
Segment Summary

Discussed the motivation for Hadoop 2.x

Introduced Hadoop 2.x and its improvements

41
Segment - 07
YARN

42
Segment Overview

Introduction to YARN

Components of YARN

43
YARN

YARN was introduced in Hadoop 2.x mainly to split the responsibilities of the JobTracker in Hadoop 1.x into separate components, thus improving the scalability and reliability of the whole system.

Components of YARN
● Resource Manager
● Node Manager
● Application Master
● Containers

44
YARN
Containers

45
YARN
Resource Manager

46
YARN
Node Manager

47
YARN
Application Master

48
Segment Summary

Introduced YARN

Discussed the different components of YARN

49
Segment - 08
Task Processing in Hadoop

50
Segment Overview

A recap of YARN components and their functions

Task processing in Hadoop

51
Task Processing in Hadoop

A quick recap of the components of YARN

Resource Manager
● One per cluster
● Runs on the master node
Application Master
● Runs on slave nodes
● Monitors jobs and requests containers from the Scheduler
Node Manager
● Runs on slave nodes
● Launches and monitors containers
Containers
● Consist of the processor and memory resources required for processing tasks
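
The interaction between these components can be sketched as a toy simulation in Python. This is not the YARN API: the class and method names are hypothetical, and the first-fit allocation is an assumption that stands in for YARN's real schedulers. It only shows the division of roles, where an Application Master asks the Resource Manager for containers and Node Managers launch them.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Tracks the free resources of one slave node (toy model)."""
    name: str
    free_memory_mb: int
    free_vcores: int
    containers: list = field(default_factory=list)

    def launch(self, container_id, memory_mb, vcores):
        """Reserve resources on this node and record the running container."""
        self.free_memory_mb -= memory_mb
        self.free_vcores -= vcores
        self.containers.append(container_id)


class ResourceManager:
    """One per cluster; decides which node hosts each requested container."""

    def __init__(self, nodes):
        self.nodes = nodes
        self._next_id = 0

    def allocate(self, memory_mb, vcores):
        """First-fit allocation: return (container_id, node_name) or None."""
        for node in self.nodes:
            if node.free_memory_mb >= memory_mb and node.free_vcores >= vcores:
                container_id = f"container_{self._next_id}"
                self._next_id += 1
                node.launch(container_id, memory_mb, vcores)
                return container_id, node.name
        return None  # no node currently has enough free resources


def run_application(rm, num_tasks, memory_mb, vcores):
    """Plays the Application Master's role: request one container per task."""
    return [rm.allocate(memory_mb, vcores) for _ in range(num_tasks)]
```

With two hypothetical nodes of 2048 MB and 1024 MB, requesting four 1024 MB containers fills both nodes and leaves the fourth request unsatisfied, which is the situation in which a real scheduler would make the application wait.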

52
Task Processing in Hadoop

53
Segment Summary

Recap of the components of YARN

Discussed the task processing steps of YARN

57
Segment - 09
Tools for Hadoop

58
Segment Overview

Discuss the various tools used in the Hadoop Ecosystem

59
Tools for Hadoop

60
Segment Summary

Discussed the various tools used in the Hadoop Ecosystem

67
Session Summary

1 Introduced distributed systems and the need for them

2 Introduced GFS and MapReduce, on which Hadoop is based

3 Introduced Hadoop and Hadoop 2.x along with their features and the basic architecture

4 Introduced the concept of YARN and task processing in Hadoop

5 Discussed the various tools used in the Hadoop Ecosystem
68
Thank you
