Session 1 - Introduction To Hadoop

The document provides an introduction to Hadoop and MapReduce programming, covering distributed systems, the Google File System (GFS), and the components of Hadoop including HDFS and YARN. It discusses the motivation behind Hadoop 2.x and its improvements, as well as the tools available in the Hadoop ecosystem. The sessions are structured to progressively build knowledge on the architecture and functionalities of Hadoop and its related technologies.

Uploaded by

Gaurav Kumar

Introduction to Hadoop and MapReduce Programming - Session 1

Course: Data Engineering - I
Lecture On: Introduction to Hadoop
Instructor: Vishwa Mohan

Segment - 01
Module Introduction

Session 1
• Introduction to distributed systems
• GFS and MapReduce
• Introduction to Hadoop and components of HDFS
• Hadoop 2.x
• YARN and task processing
• Tools for Hadoop

Session 2
• File storage in HDFS
• Demonstration of basic commands in HDFS
• Write operation and rack awareness
• Read operation in HDFS
• Features and limitations of HDFS

Session 3
• Introduction to the MapReduce Framework
• Basic implementation of MapReduce using Python
• Hadoop streaming with demonstration
• The Combiner
• The Partitioner
• Job scheduling and fault tolerance in MapReduce

Segment - 02
Introduction: Introduction to Hadoop

Session Overview

1 Introduction to Distributed Systems
2 Introduction to GFS and MapReduce
3 Introduction to Hadoop and Hadoop 2.x Along With Its Features
4 YARN and Task Processing
5 Tools for Hadoop

Segment - 03
Introduction to Distributed Systems

Segment Overview

Introduction to distributed systems
The need for distributed systems and their features

Introduction to Distributed Systems

Why are distributed systems needed?

● A single Coffee-house store may take hundreds of orders each day.
● If there are 10 stores in a small city, the data grows more or less proportionately.
● At the global level, if Coffee-house has 10,000 stores across the world, the amount of data generated and processed can't be handled by a single machine.

Introduction to Distributed Systems

A distributed system is a system in which several independent computers are connected to each other over a network, with middleware that makes them appear as a single machine.

Introduction to Distributed Systems

Primary challenges faced by distributed systems

● Performance - Also includes challenges such as delays caused by communication faults or computation faults.
● Fault tolerance - The system must tolerate faults and continue to function normally.
● Scalability - The system must remain effective even under heavy load.
● Security - Probable threats include information leakage, integrity violation, denial of service and illegitimate use.
● Concurrency - Shared access to resources must be made available to the processes that require it.
● Migration - Tasks should be allowed to move within the system without affecting other operations.
● Load balancing - Load must be distributed among the available resources for better performance.
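
To make the load balancing challenge concrete, here is a minimal sketch of one simple policy: give each new task to the worker that currently has the least total load. The function names `assign_tasks` and `total_loads`, the task costs and the worker names are all hypothetical and chosen only for illustration. Real systems use more elaborate strategies.

```python
import heapq


def assign_tasks(task_costs, worker_names):
    """Greedy least-loaded assignment (illustrative assumption, not a real scheduler).

    Each task goes to the worker whose accumulated load is currently smallest.
    """
    # Heap entries are (current_load, worker_name), so the least-loaded worker pops first.
    heap = [(0, name) for name in worker_names]
    heapq.heapify(heap)
    assignment = {name: [] for name in worker_names}
    for task_id, cost in enumerate(task_costs):
        load, name = heapq.heappop(heap)
        assignment[name].append(task_id)
        heapq.heappush(heap, (load + cost, name))
    return assignment


def total_loads(task_costs, assignment):
    """Sum the cost of the tasks given to each worker."""
    return {name: sum(task_costs[t] for t in tasks) for name, tasks in assignment.items()}


if __name__ == "__main__":
    costs = [5, 3, 2, 4]
    plan = assign_tasks(costs, ["a", "b"])
    print(plan)
    print(total_loads(costs, plan))
```

A heap keeps the lookup of the least-loaded worker cheap even when there are many workers.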

Segment Summary

Introduced the concept of distributed systems
Discussed why they are needed and what their features are

Segment - 04
Introduction to GFS and MapReduce

Segment Overview

GFS and its master/slave architecture
Considerations for GFS and the origin of MapReduce

Introduction to GFS and MapReduce

Today, on average, Google handles...

Introduction to GFS and MapReduce

Google File System (GFS)

GFS is a proprietary distributed file system made by Google for storing and processing large amounts of data across many commodity machines in large clusters. It ensures reliability and scalability while maintaining efficiency.

Introduction to GFS and MapReduce

Considerations made while developing GFS and their effects

● Uses commodity hardware: can be easily scaled horizontally
● Commodity hardware fails routinely: fault tolerance and automatic error recovery are crucial

Introduction to GFS and MapReduce

How did GFS solve the problems of traditional distributed systems?

● Very high availability and fault tolerance are maintained through replication
● Automatic and efficient data recovery
● High aggregate throughput
● Data integrity is ensured, as each chunkserver verifies the integrity of its own copies using checksums
● Each chunkserver is constantly monitored by the GFS master
● Modularity allows GFS to expand easily to account for increased loads
● The master node ensures that load balancing is maintained
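
The checksum idea in the list above can be sketched as follows. This is a minimal illustration and not GFS code: the helper names `block_checksums` and `find_corrupt_blocks` are hypothetical, the block size is deliberately tiny, and CRC32 from Python's standard `zlib` module stands in for whatever checksum a real chunkserver uses.

```python
import zlib

BLOCK_SIZE = 4  # bytes; deliberately tiny for illustration only


def block_checksums(data, block_size=BLOCK_SIZE):
    """Compute one CRC32 checksum per fixed-size block of the data."""
    return [zlib.crc32(data[i:i + block_size]) for i in range(0, len(data), block_size)]


def find_corrupt_blocks(data, expected, block_size=BLOCK_SIZE):
    """Return the indices of blocks whose checksum no longer matches the stored one."""
    actual = block_checksums(data, block_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


if __name__ == "__main__":
    original = b"hello world!"
    stored = block_checksums(original)   # kept alongside the data when it is written
    damaged = b"hello wOrld!"            # one byte changed on disk
    print(find_corrupt_blocks(original, stored))
    print(find_corrupt_blocks(damaged, stored))
```

Once a server detects a mismatch in its own copy, the data can be restored from another replica, which is how checksums and replication work together.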

Introduction to GFS and MapReduce

In the paper “MapReduce: Simplified Data Processing on Large Clusters”, Google first introduced MapReduce, a programming model and an associated implementation for processing and generating large data sets.

MapReduce was then used in the Apache Nutch project. It was highly scalable and supported unstructured data. MapReduce jobs were able to withstand hardware failures, thus ensuring fault tolerance.
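
The MapReduce programming model can be sketched on a single machine with a word count. This is an illustration of the model only, not the Hadoop or Google implementation. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical, and everything runs in one process, whereas a real framework distributes these phases across a cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group all emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of text records."""
    pairs = []
    for doc in documents:
        pairs.extend(map_phase(doc))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big clusters"]))
```

Because each call to `map_phase` and each call to `reduce_phase` depends only on its own input, the calls can run on different machines and be re-run after a failure.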

Introduction to GFS and MapReduce

Google described GFS in its white paper titled “The Google File System” and defined it as a scalable distributed file system for large distributed data-intensive applications.

Later on, the development of Hadoop started as a sub-project of Apache Nutch, when the Apache community realised that the implementation of MapReduce and the Nutch DFS could be used for other tasks as well.

Segment Summary

Discussed GFS and its master/slave architecture
Discussed the considerations made while developing GFS and the origin of MapReduce

Segment - 05
Introduction to Hadoop

Segment Overview

Introduction to Hadoop and its components
HDFS, its components and basic architecture

Introduction to Hadoop

Hadoop is an open-source framework that is used for storing and processing big data on clusters of commodity hardware.

Introduction to Hadoop

Hadoop contains two main components: the Hadoop Distributed File System (HDFS) and MapReduce.

Introduction to Hadoop

HDFS has three main components: the NameNode, the Secondary NameNode and DataNodes.
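
As a rough illustration of how these roles divide the work, the sketch below builds the kind of metadata a NameNode keeps: which DataNodes hold a replica of each block of a file. It is a toy model under stated assumptions. The function name `plan_blocks`, the DataNode names and the round-robin placement are hypothetical, and real HDFS placement is rack-aware and more involved.

```python
import math


def plan_blocks(file_size, block_size, datanodes, replication):
    """Toy NameNode-style metadata: block index -> DataNodes holding a replica."""
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of DataNodes")
    num_blocks = math.ceil(file_size / block_size)
    block_map = {}
    for block in range(num_blocks):
        # Round-robin placement for illustration; this is an assumption, not the HDFS policy.
        block_map[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(replication)
        ]
    return block_map


if __name__ == "__main__":
    # A 300 MB file with 128 MB blocks and three replicas per block.
    for block, holders in plan_blocks(300, 128, ["dn1", "dn2", "dn3", "dn4"], 3).items():
        print(block, holders)
```

In this picture the NameNode stores only the mapping, while the DataNodes store the block contents themselves.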

Segment Summary

Introduced Hadoop and its components
Discussed HDFS, its components and architecture

Segment - 06
Hadoop 2.x

Segment Overview

Motivation behind the development of Hadoop 2.x
Introduction to Hadoop 2.x and its improvements

Hadoop 2.x
Main motivation for developing Hadoop 2.x

Hadoop 2.x

Improvements in Hadoop 3.x

● The minimum required Java runtime version is Java 8
● Erasure coding support
● Improved YARN Timeline Service v.2
● Support for more than two NameNodes
● Support for GPUs in YARN
● Intra-DataNode disk balancing

Segment Summary

Discussed the motivation for Hadoop 2.x
Introduced Hadoop 2.x and its improvements

Segment - 07
YARN

Segment Overview

Introduction to YARN
Components of YARN

YARN

YARN was introduced in Hadoop 2.x mainly to split the responsibilities of the JobTracker in Hadoop 1.x into separate components, thus improving the scalability and reliability of the whole system.

Components of YARN
● Resource Manager
● Node Manager
● Application Master
● Containers

YARN
Containers

YARN
Resource Manager

YARN
Node Manager

YARN
Application Master

Segment Summary

Introduced YARN
Discussed the different components of YARN

Segment - 08
Task Processing in Hadoop

Segment Overview

A recap of YARN components and their functions
Task processing in Hadoop

Task Processing in Hadoop

A quick recap of the components of YARN

Resource Manager
● One per cluster
● Runs in the master node
Application Master
● Runs on a slave node, one per application
● Monitors jobs and requests containers from the Scheduler
Node Manager
● Installed in slave nodes
● Launches and monitors containers
Containers
● Bundles of resources on a node, such as CPU cores and memory, allocated for processing tasks
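
The interaction between these components can be sketched as a small simulation. This is a simplified model, not the YARN API. The class and function names (`NodeManager`, `ResourceManager`, `run_application`) and the first-fit policy are assumptions made for illustration. In real YARN the Application Master asks the Node Manager to start a container after the Resource Manager grants it, whereas this sketch folds both steps into `allocate`.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Tracks the free resources and running containers of one slave node."""
    name: str
    free_memory_mb: int
    free_vcores: int
    containers: list = field(default_factory=list)

    def launch(self, container_id, memory_mb, vcores):
        # Reserve the resources and record the running container.
        self.free_memory_mb -= memory_mb
        self.free_vcores -= vcores
        self.containers.append(container_id)


class ResourceManager:
    """One per cluster: decides which node hosts each requested container."""

    def __init__(self, nodes):
        self.nodes = nodes
        self._next_id = 0

    def allocate(self, memory_mb, vcores):
        """Return (container_id, node_name), or None if no node has enough free resources."""
        for node in self.nodes:  # first-fit policy, chosen only for simplicity
            if node.free_memory_mb >= memory_mb and node.free_vcores >= vcores:
                container_id = f"container_{self._next_id}"
                self._next_id += 1
                node.launch(container_id, memory_mb, vcores)
                return container_id, node.name
        return None


def run_application(rm, num_tasks, memory_mb, vcores):
    """Plays the Application Master role: request one container per task."""
    placements = []
    for _ in range(num_tasks):
        grant = rm.allocate(memory_mb, vcores)
        if grant is None:
            break  # the cluster is full; a real Application Master would wait and retry
        placements.append(grant)
    return placements


if __name__ == "__main__":
    cluster = [NodeManager("node1", 2048, 2), NodeManager("node2", 1024, 1)]
    print(run_application(ResourceManager(cluster), 4, 1024, 1))
```

The sketch shows why a container is best thought of as a slice of a node's resources: once the slices are used up, further requests cannot be granted until a container finishes.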

Segment Summary

Recap of the components of YARN
Discussed the task processing steps of YARN

Segment - 09
Tools for Hadoop

Segment Overview

Discuss the various tools used in the Hadoop Ecosystem

Tools for Hadoop

Segment Summary

Discussed the various tools used in the Hadoop Ecosystem

Session Summary

1 Introduced distributed systems and the need for them
2 Introduced GFS and MapReduce, on which Hadoop is based
3 Introduced Hadoop and Hadoop 2.x along with their features and the basic architecture
4 Introduced the concept of YARN and task processing in Hadoop
5 Discussed the various tools used in the Hadoop Ecosystem

Thank you
