Introduction To Hadoop
Learning Objectives and Learning Outcomes
Q/A 15 minutes
Agenda
Hadoop - An Introduction
RDBMS versus Hadoop
Distributed Computing Challenges
History of Hadoop
Hadoop Overview
Key Aspects of Hadoop
Hadoop Components
High Level Architecture of Hadoop
Use case for Hadoop
ClickStream Data
Hadoop Distributors
HDFS
HDFS Daemons
Anatomy of File Read
Anatomy of File Write
Replica Placement Strategy
Working with HDFS commands
Special Features of HDFS
Hadoop Ecosystem
Pig
Hive
Sqoop
HBase
Hadoop – An Introduction
What is Hadoop
Hadoop is an open-source framework for distributed storage and processing of
large datasets across clusters of commodity hardware.
Ever wondered why Hadoop has been, and still is, one of the most sought-after
technologies?
The key consideration (the rationale behind its huge popularity) is its ability
to tolerate:
• Hardware Failure
HDFS:
(a) Storage component.
(b) Distributes data across several nodes.
(c) Natively redundant.
MapReduce:
(a) Computational framework.
(b) Splits a task across multiple nodes.
(c) Processes data in parallel.
Hadoop High Level Architecture
Use case for Hadoop
ClickStream Data Analysis
4. Optimized for high throughput (HDFS uses large block sizes and moves
computation to where the data is stored).
7. You realize the power of HDFS when you perform reads or writes on large
files (gigabytes and larger).
8. Sits on top of a native file system such as ext3 or ext4.
HDFS Daemons
NameNode: Master daemon; maintains the file system namespace and the metadata
of blocks.
DataNode: Slave daemon; stores the actual data blocks and serves read/write
requests from clients.
SecondaryNameNode:
• Housekeeping daemon; periodically merges the NameNode's edit log with the
file system image (it is not a hot standby for the NameNode).
Anatomy of File Read
Anatomy of File Write
Replica Placement Strategy
As per the Hadoop replica placement strategy, the first replica is placed on
the same node as the client. The second replica is placed on a node on a
different rack. The third replica is placed on the same rack as the second,
but on a different node in that rack. Once the replica locations have been
chosen, a pipeline is built between them. This strategy provides good
reliability while limiting cross-rack write traffic.
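The placement order described above can be sketched as a tiny script. The rack and node names here are hypothetical; the point is only the order in which the three replica locations are chosen and chained into a write pipeline.

```shell
# Hypothetical cluster layout rack<i>/node<j>; illustrates placement order only.
CLIENT_NODE="rack1/node1"

REPLICA1="$CLIENT_NODE"   # 1st replica: on the same node as the client
REPLICA2="rack2/node1"    # 2nd replica: on a node on a different rack
REPLICA3="rack2/node2"    # 3rd replica: same rack as the 2nd, different node

echo "write pipeline: $REPLICA1 -> $REPLICA2 -> $REPLICA3"
```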
Working with HDFS Commands
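A few commonly used HDFS shell commands are shown below. These assume a configured Hadoop installation; the paths and file names are hypothetical examples.

```shell
hdfs dfs -mkdir /user/demo             # create a directory in HDFS
hdfs dfs -put sample.txt /user/demo    # copy a local file into HDFS
hdfs dfs -ls /user/demo                # list the directory's contents
hdfs dfs -cat /user/demo/sample.txt    # print a file's contents
hdfs dfs -get /user/demo/sample.txt .  # copy a file back to the local file system
hdfs dfs -rm /user/demo/sample.txt     # delete a file
```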
4. Hadoop 1.0 is not suitable for machine learning algorithms, graphs, and
other memory-intensive algorithms.
Hive: Hive is a data warehousing layer on top of Hadoop. Analysis and queries
are expressed in an SQL-like language (HiveQL). Hive can be used for ad-hoc
queries, summarization, and data analysis. Figure 5.31 depicts Hive in the
Hadoop ecosystem.
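As a sketch of an ad-hoc query, HiveQL can be run from the command line with `hive -e`. This assumes a working Hive installation; the table name `clicks` and its columns are hypothetical.

```shell
# Top 10 most-visited pages from a hypothetical 'clicks' table
hive -e "SELECT page, COUNT(*) AS hits
         FROM clicks
         GROUP BY page
         ORDER BY hits DESC
         LIMIT 10;"
```

Behind the scenes, Hive compiles such a query into jobs that run on the Hadoop cluster.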
Sqoop: Sqoop is a tool that transfers data between Hadoop and relational
databases. With Sqoop, you can import data from an RDBMS into HDFS and vice
versa. Figure 5.32 depicts Sqoop in the Hadoop ecosystem.
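A minimal sketch of both transfer directions is shown below. It assumes a configured Sqoop installation; the JDBC connection string, database, user, and table names are hypothetical.

```shell
# Import a table from MySQL into HDFS (connection details are hypothetical)
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username analyst -P \
  --table orders \
  --target-dir /user/demo/orders \
  -m 1

# Export data from HDFS back into an RDBMS table
sqoop export \
  --connect jdbc:mysql://dbhost/sales \
  --username analyst -P \
  --table orders_summary \
  --export-dir /user/demo/orders_summary
```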
Match the Columns

Column A               Column B
HDFS                   DataNode
MapReduce Programming  NameNode
Master node            Processing Data
Slave node             Google File System and MapReduce
Hadoop Implementation  Storage