Hadoop Intro and HDFS


By: Gaurav

Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult
to process them using traditional data processing applications.

 Domains with Large Datasets:
• Meteorology
• Complex physics simulations
• Biological and environmental research
• Internet search

 Key challenges:
• Capture & Store
• Search
• Sharing & Transfer
• Analysis
Dec 2004 : Google MapReduce paper published
July 2005 : Nutch uses MapReduce
Feb 2006 : Hadoop becomes a Lucene subproject
Apr 2007 : Yahoo! on 1000-node cluster
Jan 2008 : An Apache Top Level Project
April 2009 : Won the minute sort by sorting 500 GB in 59 seconds (on 1400 nodes)
April 2009 : 100 terabyte sort in 173 minutes (on 3400 nodes)
Nov 2011 : YARN packaged with Hadoop distribution
Advertising : Improve effectiveness of advertising and promotions
Telecoms : Telcos and cable companies use Hortonworks for service, security, and sales
Government : Decrease budget pressures by offloading expensive SQL workloads
Financial Services : Mitigate risk while creating opportunity
Manufacturing : Increase production, reduce costs, and improve quality
Oil & Gas : Maximize yields and reduce risk in the supply chain
Retail : Boost sales in-store and online
Healthcare : Deliver better care and streamline operations
Projects Powered by Hadoop
 The Apache™ Hadoop® project develops open-source software for reliable,
scalable, distributed computing.

 The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers using simple
programming models.

 It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

 The library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers.
Hadoop Distributed File System
• Distributed, scalable, and portable file system written in Java for the Hadoop framework
• Fault-tolerant storage system

Map-Reduce Programming Model
• High-performance parallel data processing
• Employs the divide-and-conquer principle
Map-Reduce Engine:
• Processes vast amounts of data in parallel on large clusters in a reliable and fault-tolerant manner
• Consists of: JobTracker & TaskTrackers

HDFS Layer:
• Stores files across storage nodes in a Hadoop cluster
• Consists of: NameNode & DataNodes
Hadoop Daemons (Hadoop 1.x)

NameNode
 Maps a block to the DataNodes
 Controls read/write access to files
 Manages the Replication Engine for blocks

DataNode
 Responsible for serving read and write requests (block creation, deletion, and replication)

JobTracker
 Accepts Map-Reduce jobs from users
 Assigns tasks to the TaskTrackers and monitors their status

TaskTracker
 Runs Map-Reduce tasks
 Sends heartbeats to the JobTracker
 Retrieves job resources from HDFS
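
A quick way to verify that these daemons are actually running on a node is the JDK's jps tool; a minimal sketch, where the process list shown in the comments is what a Hadoop 1.x single-node setup typically reports:

# List the JVM processes started by the Hadoop user on this machine
jps
# Typical output on a Hadoop 1.x single-node cluster:
#   NameNode
#   SecondaryNameNode
#   DataNode
#   JobTracker
#   TaskTracker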
Limitations of Hadoop 1.x

 Scalability

 Batch processing only

 Reliability & availability

 Partitioning of resources

 Coupling with MapReduce only


Hadoop 2.x Components

Map-Reduce Programming Model
• High-performance parallel data processing
• Employs the divide-and-conquer principle

YARN (Yet Another Resource Negotiator)
• A framework for cluster resource management
• Efficient task scheduling

Hadoop Distributed File System
• Distributed, scalable, and portable file system written in Java for the Hadoop framework
• HDFS High Availability
• Fault-tolerant storage system
YARN Engine:
• Processes vast amounts of data in parallel on large clusters in a reliable and fault-tolerant manner
• Consists of: ResourceManager & NodeManagers

HDFS Layer:
• Stores files across storage nodes in a Hadoop cluster
• Consists of: NameNode & DataNodes
Hadoop Daemons (Hadoop 2.x)

NameNode
 Maps a block to the DataNodes
 Controls read/write access to files
 Manages the Replication Engine for blocks

DataNode
 Responsible for serving read and write requests (block creation, deletion, and replication)

ResourceManager
 Accepts Map-Reduce or other application jobs from users
 Assigns tasks to the NodeManagers and monitors their status

NodeManager
 Runs application tasks
 Sends heartbeats to the ResourceManager
 Retrieves application resources from HDFS
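
To see these daemons in action, you can submit the WordCount example bundled with Hadoop 2 and watch the ResourceManager schedule it onto NodeManagers; a sketch assuming a running cluster with $HADOOP_HOME set, input already uploaded to /data_in, and /data_out not yet existing (paths are illustrative):

# List the NodeManagers registered with the ResourceManager
yarn node -list

# Submit the example job; the output directory must not exist beforehand
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount /data_in /data_out

# Watch running applications and their status
yarn application -list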
HDFS Design Goals
 Hardware Failure - Detection of faults and quick, automatic recovery

 Streaming Data Access - High throughput of data access (Batch Processing)

 Large Data Sets - Gigabytes to terabytes in size.

 Simple Coherency Model - Write-once-read-many access model for files

 Moving computation is cheaper than moving data


HDFS Architecture

[Diagram: a file is divided into blocks (Block 1 to Block 4); the Namenode holds the filesystem metadata, while the blocks are stored and replicated across Datanode_1, Datanode_2, and Datanode_3.]
Storage & Replication of Blocks in HDFS


NameNode and DataNodes : Java Processes responsible for HDFS operations

Data Replication : Blocks of a file are replicated for fault tolerance

Replica Placement : A rack-aware replica placement policy

Replica Selection : Minimize global bandwidth consumption and read latency

File System Namespace : Hierarchical file organization

Safemode : File system consistency check


Blocks
 Minimum unit of data that can be read or written - 128 MB by default

 Minimize the cost of seeks

 A file can be larger than any single disk in the network

 Simplifies the storage subsystem

 Provides fault tolerance and availability
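
To see how HDFS has actually split and replicated a file into blocks, fsck can list every block with its size and location; a sketch, where the file path is borrowed from the fsimage example later in this deck:

# Show files, blocks, block sizes, and the DataNodes holding each replica
hdfs fsck /data_in/stock1gbdata -files -blocks -locations

Each block reported should be at most the configured block size (134217728 bytes = 128 MB by default); only the last block of a file is usually smaller.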


Rack Awareness
 Get maximum performance out of
Hadoop

 Resolution of the slave's DNS name (also IP address) to a rack id.

 Interface DNSToSwitchMapping
Rack Topology - /rack1 & /rack2
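
Rack awareness is commonly enabled by pointing Hadoop at a topology script that maps a DataNode's IP or hostname to a rack id; a minimal sketch, where the property name net.topology.script.file.name applies to Hadoop 2 (Hadoop 1 used topology.script.file.name) and the subnets are illustrative:

#!/bin/bash
# topology.sh - the NameNode invokes this with one or more IPs/hostnames
# and expects one rack id printed per argument
for node in "$@"; do
  case "$node" in
    10.1.1.*) echo -n "/rack1 " ;;        # first rack's subnet (illustrative)
    10.1.2.*) echo -n "/rack2 " ;;        # second rack's subnet (illustrative)
    *)        echo -n "/default-rack " ;; # fallback for unknown nodes
  esac
done

The script's path is then set in core-site.xml via the net.topology.script.file.name property.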
Replica Placement
 Critical to HDFS reliability and
performance

 Improve data reliability, availability, and network bandwidth utilization

[Diagram: distance between nodes - e.g. 0 on the same node, 2 within a rack, 4 across racks]


Replica Placement cont..
Default Strategy:
a) First replica on the same node as the client
b) Second replica on a different rack from the first (off-rack), chosen at random
c) Third replica on the same rack as the second, but on a different node chosen at random
d) Further replicas on random nodes in the cluster

Replica Selection - HDFS tries to satisfy a read request from a replica that is closest to the
reader.
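
Replication is a per-file setting and can be changed after the fact; the NameNode's placement policy then adds or removes replicas to match. A small sketch using standard FS shell commands, with the file path again illustrative:

# Raise the replication factor of a file to 4 and wait for re-replication to finish
hadoop fs -setrep -w 4 /data_in/stock1gbdata

# Inspect where the new replicas were placed
hdfs fsck /data_in/stock1gbdata -blocks -locations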
FileSystem Image and Edit Logs
 fsimage file is a persistent checkpoint of the filesystem metadata

 When a client performs a write operation, it is first recorded in the edit log.

 The NameNode also has an in-memory representation of the filesystem metadata, which it updates after the edit log has been modified.

 The Secondary NameNode is used to produce checkpoints of the primary's in-memory filesystem metadata.
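
The fsimage and edit-log files are binary, but Hadoop ships offline viewers that render them as text; the listing on the next slide resembles such a dump. A sketch, where the file names are illustrative (real names carry transaction-id suffixes, and pre-2.4 image formats need the oiv_legacy variant):

# Dump a namespace checkpoint as XML
hdfs oiv -p XML -i fsimage_0000000000000062957 -o fsimage.xml

# Render the edit log of recent namespace changes as XML
hdfs oev -i edits_inprogress_0000000000000062958 -o edits.xml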
FileSystem Image Structure
<FS_IMAGE>
  <IMAGE_VERSION>-47</IMAGE_VERSION>
  <NAMESPACE_ID>415263518</NAMESPACE_ID>
  <GENERATION_STAMP>1000</GENERATION_STAMP>
  <GENERATION_STAMP_V2>6953</GENERATION_STAMP_V2>
  <GENERATION_STAMP_V1_LIMIT>0</GENERATION_STAMP_V1_LIMIT>
  <LAST_ALLOCATED_BLOCK_ID>1073747777</LAST_ALLOCATED_BLOCK_ID>
  <TRANSACTION_ID>62957</TRANSACTION_ID>
  <LAST_INODE_ID>24606</LAST_INODE_ID>
  <SNAPSHOT_COUNTER>0</SNAPSHOT_COUNTER>
  <NUM_SNAPSHOTS_TOTAL>0</NUM_SNAPSHOTS_TOTAL>
  <IS_COMPRESSED>false</IS_COMPRESSED>
  <INODES NUM_INODES="1076">
    <INODE>
      <INODE_PATH>/</INODE_PATH>
      <INODE_ID>16385</INODE_ID>
      <REPLICATION>0</REPLICATION>
      <MODIFICATION_TIME>2014-10-20 16:35</MODIFICATION_TIME>
      <ACCESS_TIME>1970-01-01 05:30</ACCESS_TIME>
      <BLOCK_SIZE>0</BLOCK_SIZE>
      <BLOCKS NUM_BLOCKS="-1"></BLOCKS>
      <NS_QUOTA>9223372036854775807</NS_QUOTA>
      <DS_QUOTA>-1</DS_QUOTA>
      <IS_SNAPSHOTTABLE_DIR>true</IS_SNAPSHOTTABLE_DIR>
      <PERMISSIONS>
        <USER_NAME>hduser</USER_NAME>
        <GROUP_NAME>supergroup</GROUP_NAME>
        <PERMISSION_STRING>rwxrwxrwx</PERMISSION_STRING>
      </PERMISSIONS>
    </INODE>
    <SNAPSHOTS NUM_SNAPSHOTS="0">
      <SNAPSHOT_QUOTA>0</SNAPSHOT_QUOTA>
    </SNAPSHOTS>
    <INODE>
      <INODE_PATH>/data_in/stock1gbdata</INODE_PATH>
      <INODE_ID>24568</INODE_ID>
      <REPLICATION>3</REPLICATION>
      <MODIFICATION_TIME>2014-10-28 15:58</MODIFICATION_TIME>
      <ACCESS_TIME>2014-10-28 15:58</ACCESS_TIME>
      <BLOCK_SIZE>134217728</BLOCK_SIZE>
      <BLOCKS NUM_BLOCKS="81">
        <BLOCK>
          <BLOCK_ID>1073747677</BLOCK_ID>
          <NUM_BYTES>134217670</NUM_BYTES>
          <GENERATION_STAMP>6853</GENERATION_STAMP>
        </BLOCK>
        <BLOCK>
          <BLOCK_ID>1073747678</BLOCK_ID>
          <NUM_BYTES>134217646</NUM_BYTES>
          <GENERATION_STAMP>6854</GENERATION_STAMP>
        </BLOCK>
        <!-- remaining blocks omitted on the slide -->
      </BLOCKS>
    </INODE>
  </INODES>
  <INODES_UNDER_CONSTRUCTION NUM_INODES_UNDER_CONSTRUCTION="0"></INODES_UNDER_CONSTRUCTION>
  <CURRENT_DELEGATION_KEY_ID>0</CURRENT_DELEGATION_KEY_ID>
  <DELEGATION_KEYS NUM_DELEGATION_KEYS="0"></DELEGATION_KEYS>
  <DELEGATION_TOKEN_SEQUENCE_NUMBER>0</DELEGATION_TOKEN_SEQUENCE_NUMBER>
  <DELEGATION_TOKENS NUM_DELEGATION_TOKENS="0"></DELEGATION_TOKENS>
</FS_IMAGE>
Safe Mode
 On start-up, the NameNode loads its image file (fsimage) into memory and applies the edits from the edit log (edits).

 It then performs this checkpointing process itself, without recourse to the Secondary NameNode.

 While doing so, the NameNode runs in safe mode, offering only a read-only view of the filesystem to clients.

 The locations of blocks in the system are not persisted by the NameNode - this information resides with
the DataNodes, in the form of a list of the blocks it is storing.

 Safe mode is needed to give the DataNodes time to check in to the NameNode with their block lists

 Safe mode is exited when the minimal replication condition is reached, plus an extension time of 30
seconds.
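
Safe mode can be inspected and controlled from the command line; a brief sketch:

# Check whether the NameNode is currently in safe mode
hdfs dfsadmin -safemode get

# Block until the NameNode leaves safe mode (useful in startup scripts)
hdfs dfsadmin -safemode wait

# Enter and leave safe mode manually, e.g. for maintenance
hdfs dfsadmin -safemode enter
hdfs dfsadmin -safemode leave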
Administration
 HDFS Trash

 HDFS Quotas

 Safe Mode

 FS Shell

 dfsadmin Command
HDFS Trash – Recycle Bin
When a file is deleted by a user, it is not immediately removed from HDFS. Instead, HDFS moves it to a trash directory (/user/<username>/.Trash in current releases).
File : core-site.xml

Property : fs.trash.interval

Description : Number of minutes after which the checkpoint gets deleted. If zero, the trash feature is disabled.

A file remains in the trash directory for a configurable amount of time. After the expiry of its life there, the NameNode deletes the file from the HDFS namespace.
File : core-site.xml

Property : fs.trash.checkpoint.interval

Description : Number of minutes between trash checkpoints. Should be smaller or equal to fs.trash.interval.

Undelete a file: The user needs to navigate to the trash directory and retrieve the file using the fs -mv command.
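
End to end, trash behaves as follows; a sketch assuming fs.trash.interval has been set to a non-zero value in core-site.xml (the one-day interval and file names are illustrative):

# core-site.xml (excerpt): keep deleted files recoverable for one day
#   <property>
#     <name>fs.trash.interval</name>
#     <value>1440</value>
#   </property>

# Delete a file: it is moved into the current user's trash, not removed
hadoop fs -rm /user/hduser/books/book.txt

# Undelete: move it back out of the trash directory
hadoop fs -mv /user/hduser/.Trash/Current/user/hduser/books/book.txt /user/hduser/books/

# Force-empty the trash immediately
hadoop fs -expunge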
HDFS Quotas
Name Quota - a hard limit on the number of file and directory names in the tree rooted at that directory.
dfsadmin -setQuota <N> <directory>... Set the name quota to be N for each directory.

dfsadmin -clrQuota <directory>... Remove any name quota for each directory.

Space Quota - a hard limit on the number of bytes used by files in the tree rooted at that directory.

dfsadmin -setSpaceQuota <N> <directory>... Set the space quota to be N bytes for each directory.

dfsadmin -clrSpaceQuota <directory>... Remove any space quota for each directory.

Reporting Quota - count command of the HDFS shell reports quota values and the current count of names and
bytes in use. With the -q option, also report the name quota value set for each directory, the available name quota
remaining, the space quota value set, and the available space quota remaining.
fs -count -q <directory>..
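
Putting the quota commands together; a sketch with illustrative limits and directory names:

# Allow at most 1000 names (files + directories) under the project directory
hdfs dfsadmin -setQuota 1000 /user/hduser/project

# Allow at most 10 GB of raw space; replicas count against the space quota
hdfs dfsadmin -setSpaceQuota 10g /user/hduser/project

# Report quota, remaining quota, space quota, and remaining space quota
hadoop fs -count -q /user/hduser/project

# Remove both quotas
hdfs dfsadmin -clrQuota /user/hduser/project
hdfs dfsadmin -clrSpaceQuota /user/hduser/project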
FS Shell – Some Basic Commands
 cat
 hadoop fs -cat URI [URI …]
 Copies source paths to stdout.
 chgrp
 hadoop fs -chgrp [-R] GROUP URI [URI …]
 Change group association of files. With -R, make the change recursively through the directory structure.

 chmod
 hadoop fs -chmod -R 777 hdfs://nn1.example.com/file1
 Change the permissions of files. With -R, make the change recursively through the directory structure.

 copyFromLocal / put
 hadoop fs -copyFromLocal <localsrc> URI
 Copy single src, or multiple srcs, from the local file system to the destination filesystem.

 copyToLocal / get
 hadoop fs -copyToLocal URI <localdst>
 Copy files to the local file system.
FS Shell – Commands Continued…
 expunge
 hadoop fs -expunge
 Empty the Trash.

 mkdir
 hadoop fs -mkdir <paths>
 Takes path URIs as argument and creates directories.

 rmr
 hadoop fs -rmr /user/hadoop/dir
 Recursive version of delete.

 touchz
 hadoop fs -touchz pathname
 Create a file of zero length.

 du
 hadoop fs -du URI [URI …]
 Displays aggregate length of files contained in the directory, or the length of a file in case it is just a file.
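
A typical round trip combining these commands; a sketch with illustrative paths and file names:

# Create a directory, upload a local file, read it back, and check sizes
hadoop fs -mkdir /user/hduser/books
hadoop fs -put book.txt /user/hduser/books/
hadoop fs -cat /user/hduser/books/book.txt
hadoop fs -du /user/hduser/books
hadoop fs -get /user/hduser/books/book.txt ./book_copy.txt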
DfsAdmin Command
 bin/hadoop dfsadmin [Generic Options] [Command Options]

-safemode enter / leave / get / wait : Safe mode maintenance command. Safe mode can also be entered manually, but then it can only be turned off manually as well.

-report Reports basic filesystem information and statistics.

-refreshNodes Re-read the hosts and exclude files to update the set of Datanodes that are allowed to connect to the
Namenode and those that should be decommissioned or recommissioned.

-metasave filename Save Namenode's primary data structures to filename in the directory specified by hadoop.log.dir
property. filename is overwritten if it exists. filename will contain one line for each of the following
1. Datanodes heart beating with Namenode
2. Blocks waiting to be replicated
3. Blocks currently being replicated
4. Blocks waiting to be deleted
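
For example, -report gives a quick health summary of the cluster; a sketch, where the explicit filesystem URI in the second form is illustrative (using the generic -fs option):

# Capacity, remaining space, and per-DataNode status
hdfs dfsadmin -report

# The same, directed at a specific NameNode
hdfs dfsadmin -fs hdfs://nn1.example.com:8020 -report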
Modes

Hadoop runs in one of three modes: Local (Standalone), Pseudo-Distributed, and Fully Distributed.
 Local Standalone (Non-distributed)
• Hadoop runs as a single Java process on a single system (no separate daemons)
• Useful for debugging

 Pseudo Distributed
• All daemons run on a single node
• Each Hadoop daemon runs in a separate Java process

 Fully Distributed
• Master-Slave Architecture
• One machine is designated as the NameNode and another as the ResourceManager (both can run on the same machine as well)
• Rest of the machines in the cluster act as both DataNode and NodeManager
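
The pseudo-distributed mode needs only a couple of configuration properties before formatting and starting the daemons; a minimal sketch for Hadoop 2, where localhost:9000 and single-replica storage are the usual single-node choices:

# core-site.xml (excerpt)
#   <property><name>fs.defaultFS</name><value>hdfs://localhost:9000</value></property>
# hdfs-site.xml (excerpt)
#   <property><name>dfs.replication</name><value>1</value></property>

# Format the NameNode once, then bring up HDFS and YARN
hdfs namenode -format
start-dfs.sh
start-yarn.sh

# Verify: each daemon now runs in its own JVM
jps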
Typical setup steps for a Fully Distributed cluster (steps (i) and (ii) are sketched below):
(i) Create dedicated user & group
(ii) Establish authentication among nodes
(iii) Create Hadoop folder
(iv) Hadoop configuration
(v) Remote-copy the Hadoop folder to slave nodes
(vi) Start Hadoop
(vii) Test the Hadoop cluster
(viii) Run a simple WordCount program
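
Steps (i) and (ii) are plain Linux administration; a sketch run on the master node, using Debian-style user tools, with host and user names illustrative:

# (i) Create a dedicated group and user for Hadoop
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser

# (ii) Passwordless SSH from the master to every slave (and to itself)
su - hduser
ssh-keygen -t rsa -P ""
ssh-copy-id hduser@localhost
ssh-copy-id hduser@slave1
ssh-copy-id hduser@slave2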
