
Intro to Hadoop

Agenda
• Introduction to Hadoop
• Hadoop nodes & daemons
• Hadoop Architecture
• Characteristics
• Hadoop Features
What is Hadoop?
An open source framework that allows distributed processing of large data-sets across a cluster of commodity hardware.
What is Hadoop?
The technology that empowers Yahoo, Facebook, Twitter, Walmart and others.
What is Hadoop?
An Open Source framework that allows distributed processing of large data-sets across a cluster of commodity hardware.

Open Source
• Source code is freely available
• It may be redistributed and modified
What is Hadoop?
An open source framework that allows Distributed Processing of large data-sets across a cluster of commodity hardware.

Distributed Processing
• Data is processed in a distributed manner on multiple nodes / servers
• Multiple machines process the data independently
What is Hadoop?
An open source framework that allows distributed processing of large data-sets across a Cluster of commodity hardware.

Cluster
• Multiple machines connected together
• Nodes are connected via LAN
What is Hadoop?
An open source framework that allows distributed processing of large data-sets across a cluster of Commodity Hardware.

Commodity Hardware
• Economic / affordable machines
• Typically low-performance hardware
What is Hadoop?
• Open source framework written in Java
• Inspired by Google's MapReduce programming model as well as its file system (GFS)
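To make the programming model concrete, here is a minimal sketch of the classic word-count job in Java, using Hadoop's Mapper and Reducer base classes. The class names are illustrative, not from the slides: the mapper emits (word, 1) pairs and the reducer sums the counts per word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every word in its input split
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);
    }
  }
}

// Reducer: sums the counts emitted for each word
class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }
}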
Hadoop History
• 2002: Doug Cutting started working on Nutch
• 2003: Google published the GFS paper
• 2004: Google published the MapReduce paper
• 2005: Doug Cutting added DFS & MapReduce to Nutch
• 2006: Development of Hadoop started as a Lucene sub-project
• 2007: The New York Times converted 4TB of image archives over 100 EC2 instances
• 2008: Hadoop became a top-level Apache project and defeated supercomputers as the fastest system to sort a terabyte of data; Facebook launched Hive, SQL support for Hadoop
• 2009: Doug Cutting joined Cloudera
Hadoop Components
Hadoop consists of three key parts: HDFS (storage), MapReduce (processing), and YARN (resource management).
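As a small illustration of the storage part, the following Java sketch reads a text file from HDFS through Hadoop's FileSystem API. The NameNode address and the file path are hypothetical placeholders; in a real deployment, fs.defaultFS normally comes from core-site.xml.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Reads a text file from HDFS line by line
public class HdfsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode address, set here only for illustration
    conf.set("fs.defaultFS", "hdfs://namenode:9000");

    try (FileSystem fs = FileSystem.get(conf);
         BufferedReader in = new BufferedReader(
             new InputStreamReader(fs.open(new Path("/data/sample.txt"))))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}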
Hadoop Nodes
Nodes are of two types:
• Master Node
• Slave Node

Hadoop Daemons
• Master Node: NameNode, ResourceManager
• Slave Node: DataNode, NodeManager
Basic Hadoop Architecture
The user submits work to the master(s). The master splits the work into many sub-works and distributes them across the slaves (e.g., 100 slaves), where each sub-work is processed in parallel.
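In code, the "user" side of this picture is a driver program that describes the work and hands it to the cluster; splitting it into sub-works (map and reduce tasks) and scheduling them on the slaves is done by the framework. A minimal sketch, reusing the hypothetical word-count classes from the earlier example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: describes the work and submits it to the master;
// the framework splits it into map/reduce tasks across the slaves
public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);   // from the earlier sketch
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}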
Hadoop Characteristics
• Open Source
• Distributed Processing
• Fault Tolerance
• Reliability
• Economic
• Easy to use
• High Availability
• Scalability
Open Source
• Source code is freely available
• Can be redistributed
• Can be modified

Open source brings with it: free availability, transparency, affordability, interoperability, a community, and no vendor lock-in.
Distributed Processing
• Data is processed in a distributed manner on the cluster
• Multiple nodes in the cluster process data independently

Unlike centralized processing, where a single machine handles all the data, distributed processing spreads the work across the cluster.
Fault Tolerance
• Failures of nodes are recovered automatically
• The framework takes care of hardware failures as well as task failures
Reliability
• Data is reliably stored on the cluster of machines despite machine failures
• Failure of nodes doesn't cause data loss
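Reliability comes from replication: each block is stored on several DataNodes (three by default, controlled by dfs.replication). As a sketch, the Java snippet below raises the replication factor of a single file so it can survive more simultaneous node failures; the file path is a hypothetical placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Raises the replication factor of one file
public class SetReplication {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/data/critical.txt");  // hypothetical path
    short replicas = 5;                          // default is usually 3 (dfs.replication)
    boolean ok = fs.setReplication(file, replicas);
    System.out.println("replication updated: " + ok);
    fs.close();
  }
}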
High Availability
• Data is highly available and accessible despite hardware failure
• There is no downtime for the end-user application due to unavailable data
Scalability
• Vertical scalability: new hardware can be added to existing nodes (scale up)
• Horizontal scalability: new nodes can be added on the fly (scale out)
Economic
• No need to purchase a costly license
• No need to purchase costly hardware

Open Source + Commodity Hardware = Economic
Easy to Use
• Distributed computing challenges are handled by the framework
• Clients just need to concentrate on business logic
Data Locality
• Move computation to data instead of data to computation
• Data is processed on the nodes where it is stored
• Traditional approach: data is shipped from storage servers to app servers for processing; in Hadoop, the algorithm is shipped to the servers that store the data
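The scheduler can move computation to the data because the NameNode knows which hosts store each block of a file. The following Java sketch prints that information via FileSystem.getFileBlockLocations; the file path is a hypothetical placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Prints which hosts hold each block of a file: the information
// used to run map tasks on the nodes that already store the data
public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path("/data/sample.txt")); // hypothetical path
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("offset " + block.getOffset()
          + " -> hosts " + String.join(", ", block.getHosts()));
    }
    fs.close();
  }
}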
Summary
• Every day we generate 2.3 trillion GBs of data
• Hadoop handles huge volumes of data efficiently
• Hadoop uses the power of distributed computing
• HDFS & YARN are two main components of Hadoop
• It is highly fault tolerant, reliable & available