Introduction To Hadoop
Introducing Hadoop
● Large amounts of data are generated every day, every minute, every second.
Data: The treasure trove
● Provides business advantages such as generating product recommendations,
inventing new products, and analyzing markets.
● Provides a few key indicators that can turn the fortunes of a business.
● Provides room for precise analysis.
Why Hadoop
1. Distributes the data and duplicates chunks of each data file across
several nodes; for example, 25 - 30 is one chunk of data, as shown.
2. Locally available compute resources are used to process each chunk of
data in parallel.
3. The Hadoop framework handles failover smartly and automatically.
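The three points above can be illustrated with a small sketch. It is a simplified model, not Hadoop code: the function names, the node names, the chunk size of 6, and the round-robin placement are all assumptions made for illustration (real HDFS uses its own block placement policy).

```python
def split_into_chunks(records, chunk_size):
    """Split a list of records into fixed-size chunks (the last may be shorter)."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def place_replicas(num_chunks, nodes, replication_factor):
    """Assign each chunk to `replication_factor` distinct nodes, round-robin.

    Assumption: round-robin is used only to keep the sketch short.
    """
    if replication_factor > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return {
        chunk_id: [nodes[(chunk_id + r) % len(nodes)] for r in range(replication_factor)]
        for chunk_id in range(num_chunks)
    }


def surviving_chunks(placement, failed_node):
    """Return the chunk ids that still have a replica after one node fails."""
    return [
        chunk_id
        for chunk_id, holders in placement.items()
        if any(node != failed_node for node in holders)
    ]


# Hypothetical data: records 1..30 split so that 25..30 is one chunk.
records = list(range(1, 31))
chunks = split_into_chunks(records, 6)
nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = place_replicas(len(chunks), nodes, replication_factor=3)
```

Because every chunk lives on three different nodes in this model, losing any single node leaves all chunks readable, which is the property that lets the framework fail over automatically.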
Why Not RDBMS
Distributed processing challenges
1. Hardware Failure:
○ Replication
○ Replication factor
2. How to Process This Gigantic Store of Data?
● Data available across several servers must be integrated before
processing can proceed.
● Hadoop solves this problem by using MapReduce programming, a
programming model for processing the data.
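To make the MapReduce programming model concrete, here is a minimal single-process sketch of a word count. It only imitates the map, shuffle, and reduce phases in plain Python; the function names are illustrative assumptions and this is not the Hadoop MapReduce API.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

In a real cluster the map calls would run in parallel on the nodes that hold each chunk, and only the grouped intermediate pairs would move across the network.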
Hadoop overview
1. Open-source software framework to store and process massive amounts of
data in a distributed fashion
2. Basically, Hadoop accomplishes two tasks:
a. Massive data storage.
b. Faster data processing.
Hadoop components
Hadoop core components
Hadoop ecosystem
● The Hadoop ecosystem consists of supporting projects that enhance the
functionality of the Hadoop core components.
Hadoop conceptual layer
High-level architecture of Hadoop
● Hadoop has a distributed master-slave architecture.
● The master node is known as the NameNode.
● The slave node is known as the DataNode.
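The division of labour between the two node types can be sketched as a toy model. It assumes the master keeps only metadata while the slaves keep the block contents; the class names mirror the slide, but every method, the `put`/`get` helpers, and the round-robin placement are hypothetical and do not correspond to the HDFS API.

```python
class DataNode:
    """Slave: stores the actual block contents."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes


class NameNode:
    """Master: stores only metadata (which DataNodes hold which block)."""

    def __init__(self, datanode_names):
        self.datanode_names = list(datanode_names)
        self.block_map = {}  # file name -> list of (block id, holder names)

    def allocate(self, filename, num_blocks, replication):
        count = len(self.datanode_names)
        if replication > count:
            raise ValueError("replication cannot exceed the number of DataNodes")
        entries = []
        for i in range(num_blocks):
            # Assumption: simple round-robin placement for illustration.
            holders = [self.datanode_names[(i + r) % count] for r in range(replication)]
            entries.append((f"{filename}#{i}", holders))
        self.block_map[filename] = entries
        return entries

    def locate(self, filename):
        return self.block_map[filename]


def put(namenode, datanodes, filename, data, block_size, replication=2):
    """Client write: ask the master where to store, then send blocks to slaves."""
    pieces = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    plan = namenode.allocate(filename, len(pieces), replication)
    for (block_id, holders), piece in zip(plan, pieces):
        for name in holders:
            datanodes[name].blocks[block_id] = piece


def get(namenode, datanodes, filename, unavailable=()):
    """Client read: look up block locations, then read from any live holder."""
    parts = []
    for block_id, holders in namenode.locate(filename):
        live = [name for name in holders if name not in unavailable]
        if not live:
            raise IOError(f"no live replica for {block_id}")
        parts.append(datanodes[live[0]].blocks[block_id])
    return b"".join(parts)
```

A short usage example: build `datanodes = {n: DataNode(n) for n in ["dn1", "dn2", "dn3"]}`, create `NameNode(datanodes)`, call `put(...)` with some bytes, and `get(...)` returns the same bytes even when one DataNode is listed in `unavailable`.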
Use Case of Hadoop
ClickStream Data:
HDFS Limitation:
Single namespace for a cluster. Overwhelming load on a single NameNode. In
Hadoop 2.x, this is resolved using HDFS Federation.
Hadoop 2: HDFS
Fundamental Idea:
● The fundamental idea behind the YARN architecture is to split the
JobTracker's responsibilities of resource management and job
scheduling/monitoring into separate daemons.
● Daemons of YARN are as follows:
1. Global ResourceManager: Its responsibility is to distribute resources
among various applications in the system. It has 2 main components:
○ a) Scheduler: The pluggable scheduler of the ResourceManager decides
the allocation of resources to the various running applications. It does
not monitor the status of applications.
○ b) ApplicationsManager: It accepts job submissions, negotiates
resources for executing the application-specific ApplicationMaster, and
restarts the ApplicationMaster in case of failure.
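The split between the two ResourceManager components can be sketched as a toy model. The class names follow the slide, but the methods, the memory-only resource model, and the 512 MB master size are assumptions chosen for illustration; none of this is the YARN API.

```python
class Scheduler:
    """Allocates resources only; it does not track application status."""

    def __init__(self, total_memory_mb):
        self.free_memory_mb = total_memory_mb

    def allocate(self, memory_mb):
        """Grant the request if enough memory is free, otherwise refuse."""
        if memory_mb > self.free_memory_mb:
            return False
        self.free_memory_mb -= memory_mb
        return True

    def release(self, memory_mb):
        self.free_memory_mb += memory_mb


class ApplicationsManager:
    """Accepts submissions and (re)starts each application's master."""

    def __init__(self, scheduler, master_memory_mb=512):
        self.scheduler = scheduler
        self.master_memory_mb = master_memory_mb  # hypothetical fixed size
        self.master_attempts = {}  # application id -> number of master launches

    def submit(self, app_id):
        """Accept a job only if resources for its master can be obtained."""
        if not self.scheduler.allocate(self.master_memory_mb):
            return False
        self.master_attempts[app_id] = 1
        return True

    def on_master_failure(self, app_id):
        """Release the failed master's resources and try to launch it again."""
        self.scheduler.release(self.master_memory_mb)
        if not self.scheduler.allocate(self.master_memory_mb):
            return False
        self.master_attempts[app_id] += 1
        return True
```

Note that only `ApplicationsManager` keeps per-application state in this sketch; `Scheduler` sees nothing but resource requests, which mirrors the statement that the scheduler does not monitor application status.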
Hadoop 2 YARN: Taking Hadoop beyond batch
Basic Concepts: