Hadoop Release 2.0
www.edureka.in/hadoop
How It Works
LIVE online classes
Class recordings in the Learning Management System (LMS)
Module-wise quizzes and coding assignments
24x7 on-demand technical support
Project work on large datasets
Online certification exam
Lifetime access to the LMS
Slide 2
Course Topics
Module 1 – Understanding Big Data and Hadoop Architecture
Module 2
Module 3
Module 4
Module 5
Module 6
Module 7
Module 8
Slide 3
HDFS Architecture
MapReduce Job Execution
Anatomy of a File Write and Read
Slide 4
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
Systems and enterprises generate huge amounts of data, from terabytes to even petabytes of information.
The NYSE generates about one terabyte of new trade data per day, used to perform stock-trading analytics and determine trends for optimal trades.
Slide 5
Slide 6
IBM's Definition
Volume
Velocity
Variety
Slide 7
Annie's Introduction
Hello there! My name is Annie. I love quizzes and puzzles, and I am here to make you think and answer my questions.
Slide 8
Annie's Question
Map the following to the corresponding data type:
XML files
Word docs, PDF files, text files
Data from enterprise systems (ERP, CRM, etc.)
E-mail body
Slide 9
Annie's Answer
XML files -> Semi-structured data
Word docs, PDF files, text files -> Unstructured data
Data from enterprise systems (ERP, CRM, etc.) -> Structured data
E-mail body -> Unstructured data
Slide 10
Further Reading
More on Big Data: http://www.edureka.in/blog/the-hype-behind-big-data/
Why Hadoop: http://www.edureka.in/blog/why-hadoop/
Opportunities in Hadoop: http://www.edureka.in/blog/jobs-in-hadoop/
Big Data: http://en.wikipedia.org/wiki/Big_Data
Slide 11
Telecommunications
Customer churn prevention
Network performance optimization
Calling Data Record (CDR) analysis
Analyzing the network to predict failure
http://wiki.apache.org/hadoop/PoweredBy
Slide 12
Healthcare
Health information exchange
Gene sequencing
Serialization
Healthcare service quality improvements
Drug safety
http://wiki.apache.org/hadoop/PoweredBy
Slide 13
Retail
http://wiki.apache.org/hadoop/PoweredBy
Slide 14
Hidden Treasure
Case Study: Sears Holding Corporation
Sears was using traditional systems such as Oracle Exadata, Teradata, and SAS to store and process customer activity and sales data.
Slide 15
Processing
http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038
Slide 16
Sears moved to a 300-node Hadoop cluster to keep 100% of its data available for processing, rather than the meagre 10% that was available with its existing non-Hadoop solutions.
Slide 17
Differentiating Factors
Simple
Robust
Scalable
Slide 18
Hadoop vs. EDW, MPP RDBMS, and NoSQL, compared on: data types, processing, governance, schema, speed, cost, and resources.
Hadoop: handles multi-structured and unstructured data; processing is coupled with the data; governance is loosely structured; schema is required on read; writes are fast; cost, resources, and support are still growing, wide, with complexities.
Why DFS?
Read 1 TB of data:
1 machine, 4 I/O channels, each channel 100 MB/s -> 45 minutes
10 machines, 4 I/O channels each, 100 MB/s per channel -> 4.5 minutes
Slides 20-22
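The arithmetic behind these numbers can be checked with a short calculation (a sketch using decimal units; the exact figure is about 42 minutes, which the slide rounds to 45):

```python
def read_time_minutes(tb_to_read, machines, channels_per_machine, mb_per_s_per_channel):
    """Time to scan data when all machines read their share in parallel."""
    total_mb = tb_to_read * 1_000_000  # 1 TB ~ 10^6 MB (decimal units)
    throughput_mb_s = machines * channels_per_machine * mb_per_s_per_channel
    return total_mb / throughput_mb_s / 60  # minutes

print(read_time_minutes(1, 1, 4, 100))   # ~41.7 minutes on one machine
print(read_time_minutes(1, 10, 4, 100))  # ~4.2 minutes on ten machines
```

Ten machines give ten times the aggregate I/O bandwidth, so the scan time drops by a factor of ten: this is the core motivation for a distributed file system.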
What Is Hadoop?
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
Slide 23
Hadoop Features
Flexible
Economical
Scalable
Slide 24
Annie's Question
Hadoop is a framework that allows for the distributed processing of:
a) Small data sets
b) Large data sets
Slide 25
Annie's Answer
Large data sets. Hadoop can also process small data sets, but to experience its true power you need data in terabytes: at that scale an RDBMS takes hours or fails outright, while Hadoop does the same work in a couple of minutes.
Slide 26
Hadoop Eco-System
Apache Oozie (Workflow)
Hive (DW System)
Pig Latin (Data Analysis)
Mahout (Machine Learning)
MapReduce Framework
HBase
HDFS (Hadoop Distributed File System)
Flume and Sqoop (Import or Export)
Slide 27
https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout
Slide 28
Each slave node in the cluster runs a Task Tracker and a Data Node (the diagram shows four such nodes).
Slide 30
HDFS Architecture
[Diagram: a client writes blocks to DataNodes across Rack 1 and Rack 2; the NameNode holds the metadata; blocks are replicated across racks.]
Slide 31
DataNodes:
Slaves deployed on each machine that provide the actual storage, and are responsible for serving read and write requests from clients.
Slide 32
NameNode Metadata
Metadata in memory: the entire metadata is kept in main memory; there is no demand paging of file-system metadata.
Types of metadata: list of files; list of blocks for each file; list of DataNodes for each block; file attributes (e.g. access time, replication factor).
A transaction log records file creations, file deletions, etc.
NameNode (stores metadata only) — METADATA: /user/doug/hinfo -> 1 3 5; /user/doug/pdetail -> 4 2
NameNode: keeps track of the overall file directory structure and the placement of data blocks.
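A toy model of this metadata, mirroring the slide's example mappings, can be sketched as follows (illustrative structures only, not Hadoop's actual classes):

```python
# Hypothetical in-memory model of NameNode metadata (illustrative only).
class NameNodeMetadata:
    def __init__(self):
        self.file_blocks = {}       # file path -> ordered list of block ids
        self.block_locations = {}   # block id -> DataNodes holding a replica
        self.transaction_log = []   # records file creations, deletions, etc.

    def create_file(self, path, block_ids):
        self.file_blocks[path] = list(block_ids)
        self.transaction_log.append(("create", path))

    def add_replica(self, block_id, datanode):
        self.block_locations.setdefault(block_id, []).append(datanode)

    def blocks_of(self, path):
        return self.file_blocks[path]

meta = NameNodeMetadata()
meta.create_file("/user/doug/hinfo", [1, 3, 5])   # matches the slide's example
meta.create_file("/user/doug/pdetail", [4, 2])
meta.add_replica(1, "datanode-7")
print(meta.blocks_of("/user/doug/hinfo"))  # [1, 3, 5]
```

Because all of this lives in main memory, lookups are fast, but it also explains why the NameNode's metadata must be checkpointed to survive a failure.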
Slide 33
Secondary NameNode:
Not a hot standby for the NameNode.
Connects to the NameNode every hour (by default; configurable).
Performs housekeeping and backup of NameNode metadata.
The saved metadata can be used to rebuild a failed NameNode.
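The checkpoint idea, folding the edits log into the saved image so a replacement NameNode can be rebuilt, can be sketched like this (illustrative data structures, not Hadoop's actual fsimage/edits format):

```python
# Illustrative checkpoint: replay an edits log onto a saved metadata image.
def apply_edits(fsimage, edits):
    """Return a new namespace image with the edit records applied."""
    image = dict(fsimage)          # work on a copy; the saved image is kept
    for op, path, payload in edits:
        if op == "create":
            image[path] = payload  # payload: list of block ids
        elif op == "delete":
            image.pop(path, None)
    return image

saved_image = {"/user/doug/hinfo": [1, 3, 5]}
edits_log = [("create", "/user/doug/pdetail", [4, 2]),
             ("delete", "/user/doug/hinfo", None)]
rebuilt = apply_edits(saved_image, edits_log)
print(rebuilt)  # {'/user/doug/pdetail': [4, 2]}
```

This is exactly why the Secondary NameNode is not a hot standby: it only produces a merged snapshot that an operator can use to recover manually.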
Slide 34
Annie's Question
The NameNode:
a) is the Single Point of Failure in a cluster
b) runs on enterprise-class hardware
c) stores meta-data
d) All of the above
Slide 35
Annie's Answer
All of the above. The NameNode stores metadata and runs on reliable, high-quality hardware because it is a Single Point of Failure in a Hadoop cluster.
Slide 36
Annie's Question
When the NameNode fails, the Secondary NameNode takes over instantly and prevents cluster failure:
a) TRUE
b) FALSE
Slide 37
Annie's Answer
False. The Secondary NameNode is used for creating NameNode checkpoints. A failed NameNode can be manually recovered using the checkpointed metadata and the edits log.
Slide 38
JobTracker
1. Client copies the input files (Job.xml, Job.jar, input files) to the DFS
2. User submits the job
4. Client creates splits
6. Job is submitted to the JobTracker
Slide 39
JobTracker (Contd.)
The input splits are stored in the DFS; as many map tasks are created as there are splits.
6. The client submits the job to the JobTracker, which places it in the Job Queue.
Slide 40
JobTracker (Contd.)
Task Trackers (e.g. on hosts H2 and H4) send heartbeats to the JobTracker (step 10); the JobTracker picks tasks from the Job Queue, data-local if possible (step 11), assigning them across hosts H1, H3, H4, and H5.
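The data-local preference in step 11 can be sketched as follows (hypothetical structures; the real JobTracker's scheduling logic is far more involved):

```python
# Illustrative scheduler: prefer a task whose input split lives on the
# heartbeating TaskTracker's host; otherwise hand out any pending task.
def pick_task(job_queue, tracker_host):
    for task in job_queue:
        if tracker_host in task["split_hosts"]:   # data-local task found
            job_queue.remove(task)
            return task
    return job_queue.pop(0) if job_queue else None  # fall back to queue head

queue = [{"id": "map-0", "split_hosts": {"H1", "H3"}},
         {"id": "map-1", "split_hosts": {"H4", "H5"}}]
print(pick_task(queue, "H4")["id"])  # map-1: its split is local to H4
print(pick_task(queue, "H2")["id"])  # map-0: no local split, take the head
```

Running the computation on the host that already stores the data avoids shipping large splits over the network, which is the whole point of the heartbeat-driven assignment.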
Slide 41
Annie's Question
The Hadoop framework picks which of the following daemons for scheduling a task?
a) NameNode
b) DataNode
c) TaskTracker
d) JobTracker
Slide 42
Annie's Answer
d) JobTracker. The JobTracker schedules tasks; TaskTrackers execute them on the slave nodes.
Slide 43
Anatomy of a File Write
[Diagram: the HDFS Client talks to the NameNode; 4. Write packet — packets are streamed through a pipeline of DataNodes; 5. Ack packet — acknowledgements return along the pipeline; 7. Complete — reported back to the NameNode.]
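The packet/ack flow above can be sketched as a toy pipeline (illustrative only): each packet is pushed through the chain of DataNodes, and the acknowledgement travels back the same way.

```python
# Toy model of the HDFS write pipeline: a packet is forwarded DataNode to
# DataNode, and the ack path runs back from the last replica to the first.
def write_packet(packet, pipeline, storage):
    for node in pipeline:                 # forward the packet down the chain
        storage.setdefault(node, []).append(packet)
    return list(reversed(pipeline))       # ack path: last replica answers first

storage = {}
ack_path = write_packet("pkt-0", ["dn1", "dn2", "dn3"], storage)
print(ack_path)        # ['dn3', 'dn2', 'dn1']
print(storage["dn2"])  # ['pkt-0'] — every node in the pipeline holds the packet
```

The client only treats a packet as written once the ack has come back through the whole pipeline, which is how HDFS guarantees each packet reached every replica.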
Slide 44
Anatomy of a File Read
[Diagram: the HDFS Client contacts the NameNode for block locations, then reads the blocks directly from the DataNodes (steps 4 and 5: read).]
Slide 45
Slide 46
Annie's Question
Slide 47
Annie's Answer
True. A file is divided into blocks; these blocks are written in parallel, but the replication of each block happens in sequence.
Slide 48
Annie's Question
A file of 400 MB is being copied to HDFS. The system has finished copying 250 MB. What happens if a client tries to access that file?
a) Can read up to the block that has been successfully written.
b) Can read up to the last bit successfully written.
c) Will throw an exception.
d) Cannot see that file until the copy is finished.
Slide 49
Annie's Answer
(a) The client can read up to the last successfully written data block.
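Since readers see whole blocks only, a quick check (assuming a hypothetical 64 MB block size, the Hadoop 1.x default) shows how much of the 250 MB already copied a reader could actually see:

```python
def readable_mb(copied_mb, block_mb=64):
    """Readers of a file still being copied see only fully written blocks."""
    return (copied_mb // block_mb) * block_mb  # drop the partial last block

print(readable_mb(250))  # 192: three complete 64 MB blocks are visible
```

The remaining 58 MB sit in a block that is still being written, so they stay invisible until that block is complete.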
Slide 50
YARN (Hadoop 2.0)
[Diagram: the Client talks to the Resource Manager; each slave node runs a Node Manager hosting a Container and an App Master, alongside a Data Node; a Standby NameNode is also shown.]
Slide 51
Further Reading
Apache Hadoop and HDFS: http://www.edureka.in/blog/introduction-to-apache-hadoop-hdfs/
Apache Hadoop HDFS Architecture: http://www.edureka.in/blog/apache-hadoop-hdfs-architecture/
Hadoop 2.0 and YARN: http://www.edureka.in/blog/apache-hadoop-2-0-and-yarn/
Slide 52
Module-2 Pre-work
Setup the Hadoop development environment using the documents present in the LMS:
Hadoop Installation – Setup Cloudera CDH3 Demo VM
Hadoop Installation – Setup Cloudera CDH4 QuickStart VM
Execute Linux basic commands
Execute HDFS hands-on commands
Attempt the Module-1 assignments present in the LMS.
Slide 53
Thank You
See You in Class Next Week