
Chapter 5

Introduction to Hadoop
Learning Objectives and Learning Outcomes

Learning Objectives:
1. To study the features of Hadoop.
2. To learn the basic concepts of HDFS and MapReduce programming.
3. To study HDFS architecture.
4. To study the MapReduce programming model.
5. To study the Hadoop ecosystem.

Learning Outcomes:
a) To comprehend the reasons behind the popularity of Hadoop.
b) To be able to perform HDFS operations.
c) To comprehend the MapReduce framework.
d) To understand the read and write in HDFS.
e) To be able to understand the Hadoop ecosystem.
Session Plan

Lecture time 120 to 150 minutes

Q/A 15 minutes
Agenda
 Hadoop - An Introduction
 RDBMS versus Hadoop
 Distributed Computing Challenges
 History of Hadoop
 Hadoop Overview
 Key Aspects of Hadoop
 Hadoop Components
 High Level Architecture of Hadoop
 Use case for Hadoop
 ClickStream Data
 Hadoop Distributors
 HDFS
 HDFS Daemons
 Anatomy of File Read
 Anatomy of File Write
 Replica Placement Strategy
 Working with HDFS commands
 Special Features of HDFS
Agenda

 Processing Data with Hadoop


 What is MapReduce Programming?
 How does MapReduce Work?
 MapReduce Word Count Example

 Managing Resources and Applications with Hadoop YARN


 Limitations of Hadoop 1.0 Architecture
 Hadoop 2 YARN: Taking Hadoop Beyond Batch

 Hadoop Ecosystem
 Pig
 Hive
 Sqoop
 HBase
Hadoop – An Introduction
What is Hadoop?

Hadoop is an open-source framework for the distributed storage and processing of very large datasets across clusters of commodity hardware.

Ever wondered why Hadoop has been, and continues to be, one of the most sought-after technologies?

The key consideration (the rationale behind its huge popularity) is its capability to handle massive amounts of data, of different categories, fairly quickly.

The other considerations are covered in the slides that follow.


RDBMS versus HADOOP
Distributed Computing Challenges

• Hardware Failure

• How to Process This Gigantic Store of Data?


History of Hadoop
Hadoop Overview
Key Aspects of Hadoop
Hadoop Components

Hadoop Core Components:

HDFS:
(a) Storage component.
(b) Distributes data across several nodes.
(c) Natively redundant.

MapReduce:
(a) Computational framework.
(b) Splits a task across multiple nodes.
(c) Processes data in parallel.
Hadoop High Level Architecture
Use case for Hadoop
ClickStream Data Analysis

ClickStream data (mouse clicks) helps you to understand the purchasing behavior of customers. ClickStream analysis helps online marketers to optimize their product web pages, promotional content, etc., to improve their business.
Hadoop Distributors
HDFS
(HADOOP DISTRIBUTED FILE SYSTEM)
Hadoop Distributed File System
1. Storage component of Hadoop.

2. Distributed file system.

3. Modeled after the Google File System.

4. Optimized for high throughput (HDFS leverages a large block size and moves computation to where the data is stored).

5. A file can be replicated a configurable number of times, which makes HDFS tolerant of both software and hardware failures.

6. Automatically re-replicates data blocks that were stored on failed nodes.

7. You can realize the power of HDFS when you perform reads or writes on large files (gigabytes and larger).

8. Sits on top of a native file system such as ext3 or ext4.
HDFS Daemons

NameNode:

• Single NameNode per cluster.

• Keeps the metadata details.

DataNode:

• Multiple DataNodes per cluster.

• Handles read/write operations on the data blocks.

SecondaryNameNode:

• Housekeeping daemon.
Anatomy of File Read
Anatomy of File Write
Replica Placement Strategy

As per the Hadoop replica placement strategy, the first replica is placed on the same node as the client. The second replica is placed on a node on a different rack. The third replica is placed on the same rack as the second, but on a different node in that rack. Once the replica locations have been decided, a pipeline is built. This strategy provides good reliability.
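
A minimal Java sketch of this placement decision is shown below. It is purely illustrative (Node is a hypothetical helper type assumed here), not the actual HDFS BlockPlacementPolicy implementation.

import java.util.ArrayList;
import java.util.List;

public class ReplicaPlacementSketch {

    // Hypothetical node representation: a host name and the rack it sits on.
    record Node(String host, String rack) {}

    static List<Node> chooseTargets(Node clientNode, List<Node> cluster) {
        List<Node> targets = new ArrayList<>();

        // First replica: on the same node as the client.
        targets.add(clientNode);

        // Second replica: on a node in a different rack than the first.
        Node second = cluster.stream()
                .filter(n -> !n.rack().equals(clientNode.rack()))
                .findFirst().orElseThrow();
        targets.add(second);

        // Third replica: same rack as the second, but a different node.
        Node third = cluster.stream()
                .filter(n -> n.rack().equals(second.rack()) && !n.host().equals(second.host()))
                .findFirst().orElseThrow();
        targets.add(third);

        // The write pipeline is then built through these targets in order.
        return targets;
    }
}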
Working with HDFS Commands

Objective: To create a directory (say, sample) in HDFS.

Act:

hadoop fs -mkdir /sample

Objective: To copy a file from local file system to HDFS.

Act:

hadoop fs -put /root/sample/test.txt /sample/test.txt

Objective: To copy a file from HDFS to local file system.

Act:

hadoop fs -get /sample/test.txt /root/sample/testsample.txt
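
The same operations can also be performed programmatically through Hadoop's Java FileSystem API. A minimal sketch, reusing the paths from the commands above (exception handling omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsOperations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // handle to HDFS

        // hadoop fs -mkdir /sample
        fs.mkdirs(new Path("/sample"));

        // hadoop fs -put /root/sample/test.txt /sample/test.txt
        fs.copyFromLocalFile(new Path("/root/sample/test.txt"),
                             new Path("/sample/test.txt"));

        // hadoop fs -get /sample/test.txt /root/sample/testsample.txt
        fs.copyToLocalFile(new Path("/sample/test.txt"),
                           new Path("/root/sample/testsample.txt"));

        fs.close();
    }
}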


Special Features of HDFS

Data Replication: There is absolutely no need for a client application to track all blocks. HDFS directs the client to the nearest replica to ensure high performance.

Data Pipeline: A client application writes a block to the first DataNode in the pipeline. That DataNode then takes over and forwards the data to the next node in the pipeline. This process continues for all the data blocks, and subsequently all the replicas are written to disk.
Processing with Hadoop
What is MapReduce Programming?

MapReduce Programming is a software framework that helps you to process massive amounts of data in parallel.
How MapReduce Programming Works
MapReduce – Word Count Example
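
The word count example is typically implemented with a mapper that emits (word, 1) pairs and a reducer that sums them per word. A condensed version of the standard Apache Hadoop WordCount is sketched below; the driver (Job setup) is omitted for brevity.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}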
MANAGING RESOURCES AND APPLICATIONS
WITH HADOOP - YARN

(YET ANOTHER RESOURCE NEGOTIATOR)


Limitations of Hadoop 1.0 Architecture

1. A single NameNode is responsible for managing the entire namespace of the Hadoop cluster.

2. It has a restricted processing model, suitable only for batch-oriented MapReduce jobs.

3. Hadoop MapReduce is not suitable for interactive analysis.

4. Hadoop 1.0 is not suitable for machine learning algorithms, graphs, and other memory-intensive algorithms.

5. MapReduce is responsible for both cluster resource management and data processing.
Hadoop 2 YARN: Taking Hadoop beyond Batch

The fundamental idea behind this architecture is to split the JobTracker's responsibilities of resource management and job scheduling/monitoring into separate daemons. The daemons that form the YARN architecture are described below.

A global ResourceManager: Its main responsibility is to distribute resources among the various applications in the system. It has two main components: the Scheduler and the ApplicationsManager.

NodeManager: This is a per-machine slave daemon. The NodeManager's responsibility is to launch the application containers for application execution. It monitors resource usage such as memory, CPU, disk, and network, and reports this usage to the global ResourceManager.

Per-application ApplicationMaster: This is an application-specific entity. Its responsibility is to negotiate the resources required for execution from the ResourceManager. It works along with the NodeManager to execute and monitor the component tasks.
Interacting with Hadoop Ecosystem
Pig : Pig is a data flow system for Hadoop. It uses Pig Latin to specify data
flow. Pig is an alternative to MapReduce Programming. It abstracts some
details and allows you to focus on data processing.

Hive: Hive is a Data Warehousing Layer on top of Hadoop. Analysis and queries
can be done using an SQL-like language. Hive can be used to do ad-hoc queries,
summarization, and data analysis. Figure 5.31 depicts Hive in the Hadoop
ecosystem.

Sqoop: Sqoop is a tool that helps transfer data between Hadoop and relational databases. With the help of Sqoop, you can import data from an RDBMS to HDFS and vice versa. Figure 5.32 depicts Sqoop in the Hadoop ecosystem.

HBase: HBase is a NoSQL database for Hadoop. It is a column-oriented NoSQL database used to store billions of rows and millions of columns. HBase provides random read/write operations and also supports record-level updates, which are not possible with HDFS alone. HBase sits on top of HDFS. Figure 5.33 depicts HBase in the Hadoop ecosystem.
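
As an illustration of HBase's random, record-level read/write capability, a minimal sketch using the HBase Java client API follows; the table name, column family, and values are hypothetical.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Connect using the configuration on the classpath (hbase-site.xml).
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = connection.getTable(TableName.valueOf("customers"))) {

            // Write (or update) a single cell in row "row1" -- a record-level operation.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Random read of the same row.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}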
Answer a few quick questions…
Match the columns

Column A | Column B

HDFS | DataNode
MapReduce Programming | NameNode
Master node | Processing Data
Slave node | Google File System and MapReduce
Hadoop Implementation | Storage
Match the columns

Column A | Column B

JobTracker | Executes Task
MapReduce | Schedules Task
TaskTracker | Programming Model
Job Configuration | Converts input into Key Value pair
Map | Job Parameters
Thank You
