Hadoop Release 2.0
www.edureka.in/hadoop
How It Works
LIVE online classes
Class recordings in the Learning Management System (LMS)
Module-wise quizzes and coding assignments
24x7 on-demand technical support
Project work on large datasets
Online certification exam
Lifetime access to the LMS
Slide 2
Course Topics
Module 1 – Understanding Big Data and Hadoop Architecture
Module 2
Module 3
Module 4
Module 5
Module 6
Module 7
Module 8
Slide 3
HDFS Architecture
MapReduce Job Execution
Anatomy of a File Write and Read
Slide 4
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
Systems and enterprises generate huge amounts of data, from terabytes to even petabytes of information.
The NYSE generates about one terabyte of new trade data per day, used to perform stock-trading analytics and determine trends for optimal trades.
Slide 5
Slide 6
IBM's Definition
Volume
Velocity
Variety
Slide 7
Annie's Introduction
Hello there! My name is Annie. I love quizzes and puzzles, and I am here to make you think and answer my questions.
Slide 8
Annie's Question
Map the following to the corresponding data type:
XML files
Word docs, PDF files, text files
Data from enterprise systems (ERP, CRM, etc.)
E-mail body
Slide 9
Annie's Answer
XML files -> Semi-structured data
Word docs, PDF files, text files -> Unstructured data
Data from enterprise systems (ERP, CRM, etc.) -> Structured data
E-mail body -> Unstructured data
Slide 10
Further Reading
More on Big Data: http://www.edureka.in/blog/the-hype-behind-big-data/
Why Hadoop: http://www.edureka.in/blog/why-hadoop/
Opportunities in Hadoop: http://www.edureka.in/blog/jobs-in-hadoop/
Big Data: http://en.wikipedia.org/wiki/Big_Data
Slide 11
Telecommunications
Customer churn prevention
Network performance optimization
Calling Data Record (CDR) analysis
Analyzing the network to predict failure
http://wiki.apache.org/hadoop/PoweredBy
Slide 12
Healthcare
Health information exchange
Gene sequencing
Serialization
Healthcare service quality improvements
Drug safety
http://wiki.apache.org/hadoop/PoweredBy
Slide 13
Retail
http://wiki.apache.org/hadoop/PoweredBy
Slide 14
Hidden Treasure
Case Study: Sears Holding Corporation
Sears was using traditional systems such as Oracle Exadata, Teradata, and SAS to store and process customer activity and sales data.
Slide 15
Processing
http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038
Slide 16
Sears moved to a 300-node Hadoop cluster to keep 100% of its data available for processing, rather than the meagre 10% that was available with its existing non-Hadoop solutions.
Slide 17
Differentiating Factors
Simple
Robust
Scalable
Slide 18
Hadoop vs. EDW, MPP RDBMS, and NoSQL, compared on: data types, processing, governance, schema, speed, cost, and resources.
Hadoop: handles multi-structured and unstructured data; processing is coupled with the data; governance is loosely structured; schema is required on read; writes are fast; cost, resources, and support are still growing, wide, with complexities.
Why DFS?
Read 1 TB of data:
1 machine, 4 I/O channels, each channel 100 MB/s -> 45 minutes
10 machines, 4 I/O channels each, 100 MB/s per channel -> 4.5 minutes
Slides 20-22
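The arithmetic behind these numbers can be checked with a short calculation (a sketch using decimal units; the exact figure is about 42 minutes, which the slide rounds to 45):

```python
def read_time_minutes(tb_to_read, machines, channels_per_machine, mb_per_s_per_channel):
    """Time to scan data when all machines read their share in parallel."""
    total_mb = tb_to_read * 1_000_000  # 1 TB ~ 10^6 MB (decimal units)
    throughput_mb_s = machines * channels_per_machine * mb_per_s_per_channel
    return total_mb / throughput_mb_s / 60  # minutes

print(read_time_minutes(1, 1, 4, 100))   # ~41.7 minutes on one machine
print(read_time_minutes(1, 10, 4, 100))  # ~4.2 minutes on ten machines
```

Ten machines give ten times the aggregate I/O bandwidth, so the scan time drops by a factor of ten: this is the core motivation for a distributed file system.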
What Is Hadoop?
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
Slide 23
Hadoop Features
Flexible
Economical
Scalable
Slide 24
Annie's Question
Hadoop is a framework that allows for the distributed processing of:
a) Small data sets
b) Large data sets
Slide 25
Annie's Answer
Large data sets. Hadoop can also process small data sets, but to experience its true power you need data in terabytes: at that scale an RDBMS takes hours or fails outright, while Hadoop does the same work in a couple of minutes.
Slide 26
Hadoop Eco-System
Apache Oozie (Workflow)
Hive (DW System)
Pig Latin (Data Analysis)
Mahout (Machine Learning)
MapReduce Framework
HBase
HDFS (Hadoop Distributed File System)
Flume and Sqoop (Import or Export)
Slide 27
https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout
Slide 28
Each slave node in the cluster runs a Task Tracker and a Data Node (the diagram shows four such nodes).
Slide 30
HDFS Architecture
[Diagram: a client writes blocks to DataNodes across Rack 1 and Rack 2; the NameNode holds the metadata; blocks are replicated across racks.]
Slide 31
DataNodes:
Slaves deployed on each machine that provide the actual storage, and are responsible for serving read and write requests from clients.
Slide 32
NameNode Metadata
Metadata in memory: the entire metadata is kept in main memory; there is no demand paging of file-system metadata.
Types of metadata: list of files; list of blocks for each file; list of DataNodes for each block; file attributes (e.g. access time, replication factor).
A transaction log records file creations, file deletions, etc.
NameNode (stores metadata only) — METADATA: /user/doug/hinfo -> 1 3 5; /user/doug/pdetail -> 4 2
NameNode: keeps track of the overall file directory structure and the placement of data blocks.
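A toy model of this metadata, mirroring the slide's example mappings, can be sketched as follows (illustrative structures only, not Hadoop's actual classes):

```python
# Hypothetical in-memory model of NameNode metadata (illustrative only).
class NameNodeMetadata:
    def __init__(self):
        self.file_blocks = {}       # file path -> ordered list of block ids
        self.block_locations = {}   # block id -> DataNodes holding a replica
        self.transaction_log = []   # records file creations, deletions, etc.

    def create_file(self, path, block_ids):
        self.file_blocks[path] = list(block_ids)
        self.transaction_log.append(("create", path))

    def add_replica(self, block_id, datanode):
        self.block_locations.setdefault(block_id, []).append(datanode)

    def blocks_of(self, path):
        return self.file_blocks[path]

meta = NameNodeMetadata()
meta.create_file("/user/doug/hinfo", [1, 3, 5])   # matches the slide's example
meta.create_file("/user/doug/pdetail", [4, 2])
meta.add_replica(1, "datanode-7")
print(meta.blocks_of("/user/doug/hinfo"))  # [1, 3, 5]
```

Because all of this lives in main memory, lookups are fast, but it also explains why the NameNode's metadata must be checkpointed to survive a failure.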
Slide 33
Secondary NameNode:
Not a hot standby for the NameNode.
Connects to the NameNode every hour (by default; configurable).
Performs housekeeping and backup of NameNode metadata.
The saved metadata can be used to rebuild a failed NameNode.
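The checkpoint idea, folding the edits log into the saved image so a replacement NameNode can be rebuilt, can be sketched like this (illustrative data structures, not Hadoop's actual fsimage/edits format):

```python
# Illustrative checkpoint: replay an edits log onto a saved metadata image.
def apply_edits(fsimage, edits):
    """Return a new namespace image with the edit records applied."""
    image = dict(fsimage)          # work on a copy; the saved image is kept
    for op, path, payload in edits:
        if op == "create":
            image[path] = payload  # payload: list of block ids
        elif op == "delete":
            image.pop(path, None)
    return image

saved_image = {"/user/doug/hinfo": [1, 3, 5]}
edits_log = [("create", "/user/doug/pdetail", [4, 2]),
             ("delete", "/user/doug/hinfo", None)]
rebuilt = apply_edits(saved_image, edits_log)
print(rebuilt)  # {'/user/doug/pdetail': [4, 2]}
```

This is exactly why the Secondary NameNode is not a hot standby: it only produces a merged snapshot that an operator can use to recover manually.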
Slide 34
Annie's Question
The NameNode:
a) is the Single Point of Failure in a cluster
b) runs on enterprise-class hardware
c) stores meta-data
d) All of the above
Slide 35
Annie's Answer
All of the above. The NameNode stores metadata and runs on reliable, high-quality hardware because it is a Single Point of Failure in a Hadoop cluster.
Slide 36
Annie's Question
When the NameNode fails, the Secondary NameNode takes over instantly and prevents cluster failure:
a) TRUE
b) FALSE
Slide 37
Annie's Answer
False. The Secondary NameNode is used for creating NameNode checkpoints. A failed NameNode can be manually recovered using the checkpointed metadata and the edits log.
Slide 38
JobTracker
1. Client copies the input files (Job.xml, Job.jar, input files) to the DFS
2. User submits the job
4. Client creates splits
6. Job is submitted to the JobTracker
Slide 39
JobTracker (Contd.)
The input splits are stored in the DFS; as many map tasks are created as there are splits.
6. The client submits the job to the JobTracker, which places it in the Job Queue.
Slide 40
JobTracker (Contd.)
Task Trackers (e.g. on hosts H2 and H4) send heartbeats to the JobTracker (step 10); the JobTracker picks tasks from the Job Queue, data-local if possible (step 11), assigning them across hosts H1, H3, H4, and H5.
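The data-local preference in step 11 can be sketched as follows (hypothetical structures; the real JobTracker's scheduling logic is far more involved):

```python
# Illustrative scheduler: prefer a task whose input split lives on the
# heartbeating TaskTracker's host; otherwise hand out any pending task.
def pick_task(job_queue, tracker_host):
    for task in job_queue:
        if tracker_host in task["split_hosts"]:   # data-local task found
            job_queue.remove(task)
            return task
    return job_queue.pop(0) if job_queue else None  # fall back to queue head

queue = [{"id": "map-0", "split_hosts": {"H1", "H3"}},
         {"id": "map-1", "split_hosts": {"H4", "H5"}}]
print(pick_task(queue, "H4")["id"])  # map-1: its split is local to H4
print(pick_task(queue, "H2")["id"])  # map-0: no local split, take the head
```

Running the computation on the host that already stores the data avoids shipping large splits over the network, which is the whole point of the heartbeat-driven assignment.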
Slide 41
Annie's Question
The Hadoop framework picks which of the following daemons for scheduling a task?
a) NameNode
b) DataNode
c) TaskTracker
d) JobTracker
Slide 42
Annie's Answer
d) JobTracker. The JobTracker schedules tasks; TaskTrackers execute them on the slave nodes.
Slide 43
Anatomy of a File Write
[Diagram: the HDFS Client talks to the NameNode; 4. Write packet — packets are streamed through a pipeline of DataNodes; 5. Ack packet — acknowledgements return along the pipeline; 7. Complete — reported back to the NameNode.]
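The packet/ack flow above can be sketched as a toy pipeline (illustrative only): each packet is pushed through the chain of DataNodes, and the acknowledgement travels back the same way.

```python
# Toy model of the HDFS write pipeline: a packet is forwarded DataNode to
# DataNode, and the ack path runs back from the last replica to the first.
def write_packet(packet, pipeline, storage):
    for node in pipeline:                 # forward the packet down the chain
        storage.setdefault(node, []).append(packet)
    return list(reversed(pipeline))       # ack path: last replica answers first

storage = {}
ack_path = write_packet("pkt-0", ["dn1", "dn2", "dn3"], storage)
print(ack_path)        # ['dn3', 'dn2', 'dn1']
print(storage["dn2"])  # ['pkt-0'] — every node in the pipeline holds the packet
```

The client only treats a packet as written once the ack has come back through the whole pipeline, which is how HDFS guarantees each packet reached every replica.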
Slide 44
Anatomy of a File Read
[Diagram: the HDFS Client contacts the NameNode for block locations, then reads the blocks directly from the DataNodes (steps 4 and 5: read).]
Slide 45
Slide 46
Annie's Question
Slide 47
Annie's Answer
True. A file is divided into blocks; these blocks are written in parallel, but the replication of each block happens in sequence.
Slide 48
Annie's Question
A file of 400 MB is being copied to HDFS. The system has finished copying 250 MB. What happens if a client tries to access that file?
a) Can read up to the block that has been successfully written.
b) Can read up to the last bit successfully written.
c) Will throw an exception.
d) Cannot see that file until the copy is finished.
Slide 49
Annie's Answer
(a) The client can read up to the last successfully written data block.
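Since readers see whole blocks only, a quick check (assuming a hypothetical 64 MB block size, the Hadoop 1.x default) shows how much of the 250 MB already copied a reader could actually see:

```python
def readable_mb(copied_mb, block_mb=64):
    """Readers of a file still being copied see only fully written blocks."""
    return (copied_mb // block_mb) * block_mb  # drop the partial last block

print(readable_mb(250))  # 192: three complete 64 MB blocks are visible
```

The remaining 58 MB sit in a block that is still being written, so they stay invisible until that block is complete.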
Slide 50
YARN (Hadoop 2.0)
[Diagram: the Client talks to the Resource Manager; each slave node runs a Node Manager hosting a Container and an App Master, alongside a Data Node; a Standby NameNode is also shown.]
Slide 51
Further Reading
Apache Hadoop and HDFS: http://www.edureka.in/blog/introduction-to-apache-hadoop-hdfs/
Apache Hadoop HDFS Architecture: http://www.edureka.in/blog/apache-hadoop-hdfs-architecture/
Hadoop 2.0 and YARN: http://www.edureka.in/blog/apache-hadoop-2-0-and-yarn/
Slide 52
Module-2 Pre-work
Setup the Hadoop development environment using the documents present in the LMS:
Hadoop Installation – Setup Cloudera CDH3 Demo VM
Hadoop Installation – Setup Cloudera CDH4 QuickStart VM
Execute Linux basic commands
Execute HDFS hands-on commands
Attempt the Module-1 assignments present in the LMS.
Slide 53
Thank You
See You in Class Next Week