Hadoop Intro and HDFS


By: Gaurav

Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult
to process them using traditional data processing applications.

 Domains with Large Datasets:
• Meteorology
• Complex physics simulations
• Biological and environmental research
• Internet search

 Key challenges:
• Capture & Store
• Search
• Sharing & Transfer
• Analysis
Dec 2004 : Google MapReduce paper published
July 2005 : Nutch uses MapReduce
Feb 2006 : Hadoop becomes a Lucene subproject
Apr 2007 : Yahoo! on 1000-node cluster
Jan 2008 : An Apache Top Level Project
April 2009 : Won the minute sort by sorting 500 GB in 59 seconds (on 1400 nodes)
April 2009 : 100 terabyte sort in 173 minutes (on 3400 nodes)
Nov 2011 : YARN packaged with Hadoop distribution
Advertising : Improve effectiveness of advertising and promotions
Telecoms : Telcos and cable companies use Hortonworks for service, security, and sales
Government : Decrease budget pressures by offloading expensive SQL workloads
Financial Services : Mitigate risk while creating opportunity
Manufacturing : Increase production, reduce costs, and improve quality
Oil & Gas : Maximize yields and reduce risk in the supply chain
Retail : Boost sales in-store and online
Healthcare : Deliver better care and streamline operations
Projects Powered by Hadoop
 The Apache™ Hadoop® project develops open-source software for reliable,
scalable, distributed computing.

 The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers using simple
programming models.

 It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

 The library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers.
Hadoop Distributed File System
• Distributed, scalable, and portable file system written in Java for the Hadoop framework
• Fault-tolerant storage system

Map-Reduce Programming Model
• High-performance parallel data processing
• Employs the divide-and-conquer principle
Map-Reduce Engine:
• Processes vast amounts of data in parallel on large clusters in a reliable and fault-tolerant manner
• Consists of: JobTracker & TaskTrackers

HDFS Layer:
• Stores files across storage nodes in a Hadoop cluster
• Consists of: NameNode & DataNodes
Hadoop Daemons (Hadoop 1.x)

NameNode
 Maps a block to the DataNodes
 Controls read/write access to files
 Manages the Replication Engine for blocks

DataNode
 Responsible for serving read and write requests (block creation, deletion, and replication)

JobTracker
 Accepts Map-Reduce jobs from users
 Assigns tasks to the TaskTrackers and monitors their status

TaskTracker
 Runs Map-Reduce tasks
 Sends heartbeats to the JobTracker
 Retrieves job resources from HDFS
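
A quick way to verify that these daemons are actually running on a node is the JDK's jps tool; a minimal sketch, where the process list shown in the comments is what a Hadoop 1.x single-node setup typically reports:

# List the JVM processes started by the Hadoop user on this machine
jps
# Typical output on a Hadoop 1.x single-node cluster:
#   NameNode
#   SecondaryNameNode
#   DataNode
#   JobTracker
#   TaskTracker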
Limitations of Hadoop 1.x

 Scalability

 Batch processing only

 Reliability & availability

 Partitioning of resources

 Coupling with MapReduce only


Hadoop 2.x Components

Map-Reduce Programming Model
• High-performance parallel data processing
• Employs the divide-and-conquer principle

YARN (Yet Another Resource Negotiator)
• A framework for cluster resource management
• Efficient task scheduling

Hadoop Distributed File System
• Distributed, scalable, and portable file system written in Java for the Hadoop framework
• HDFS High Availability
• Fault-tolerant storage system
YARN Engine:
• Processes vast amounts of data in parallel on large clusters in a reliable and fault-tolerant manner
• Consists of: ResourceManager & NodeManagers

HDFS Layer:
• Stores files across storage nodes in a Hadoop cluster
• Consists of: NameNode & DataNodes
Hadoop Daemons (Hadoop 2.x)

NameNode
 Maps a block to the DataNodes
 Controls read/write access to files
 Manages the Replication Engine for blocks

DataNode
 Responsible for serving read and write requests (block creation, deletion, and replication)

ResourceManager
 Accepts Map-Reduce or other application jobs from users
 Assigns tasks to the NodeManagers and monitors their status

NodeManager
 Runs application tasks
 Sends heartbeats to the ResourceManager
 Retrieves application resources from HDFS
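
To see these daemons in action, you can submit the WordCount example bundled with Hadoop 2 and watch the ResourceManager schedule it onto NodeManagers; a sketch assuming a running cluster with $HADOOP_HOME set, input already uploaded to /data_in, and /data_out not yet existing (paths are illustrative):

# List the NodeManagers registered with the ResourceManager
yarn node -list

# Submit the example job; the output directory must not exist beforehand
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount /data_in /data_out

# Watch running applications and their status
yarn application -list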
HDFS Design Goals
 Hardware Failure - Detection of faults and quick, automatic recovery

 Streaming Data Access - High throughput of data access (Batch Processing)

 Large Data Sets - Gigabytes to terabytes in size.

 Simple Coherency Model - Write-once-read-many access model for files

 Moving computation is cheaper than moving data


HDFS Architecture

[Diagram: a file is divided into blocks (Block 1 to Block 4); the Namenode holds the filesystem metadata, while the blocks are stored and replicated across Datanode_1, Datanode_2, and Datanode_3.]
Storage & Replication of Blocks in HDFS


NameNode and DataNodes : Java Processes responsible for HDFS operations

Data Replication : Blocks of a file are replicated for fault tolerance

Replica Placement : A rack-aware replica placement policy

Replica Selection : Minimize global bandwidth consumption and read latency

File System Namespace : Hierarchical file organization

Safemode : File system consistency check


Blocks
 Minimum unit of data that can be read or written - 128 MB by default

 Minimize the cost of seeks

 A file can be larger than any single disk in the network

 Simplifies the storage subsystem

 Provides fault tolerance and availability
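
To see how HDFS has actually split and replicated a file into blocks, fsck can list every block with its size and location; a sketch, where the file path is borrowed from the fsimage example later in this deck:

# Show files, blocks, block sizes, and the DataNodes holding each replica
hdfs fsck /data_in/stock1gbdata -files -blocks -locations

Each block reported should be at most the configured block size (134217728 bytes = 128 MB by default); only the last block of a file is usually smaller.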


Rack Awareness
 Get maximum performance out of
Hadoop

 Resolution of the slave's DNS name (also IP address) to a rack id.

 Interface DNSToSwitchMapping
Rack Topology - /rack1 & /rack2
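
Rack awareness is commonly enabled by pointing Hadoop at a topology script that maps a DataNode's IP or hostname to a rack id; a minimal sketch, where the property name net.topology.script.file.name applies to Hadoop 2 (Hadoop 1 used topology.script.file.name) and the subnets are illustrative:

#!/bin/bash
# topology.sh - the NameNode invokes this with one or more IPs/hostnames
# and expects one rack id printed per argument
for node in "$@"; do
  case "$node" in
    10.1.1.*) echo -n "/rack1 " ;;        # first rack's subnet (illustrative)
    10.1.2.*) echo -n "/rack2 " ;;        # second rack's subnet (illustrative)
    *)        echo -n "/default-rack " ;; # fallback for unknown nodes
  esac
done

The script's path is then set in core-site.xml via the net.topology.script.file.name property.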
Replica Placement
 Critical to HDFS reliability and
performance

 Improve data reliability, availability, and network bandwidth utilization

[Diagram: distance between nodes - e.g. 0 on the same node, 2 within a rack, 4 across racks]


Replica Placement cont..
Default Strategy:
a) First replica on the same node as the client
b) Second replica on a different rack from the first (off-rack), chosen at random
c) Third replica on the same rack as the second, but on a different node chosen at random
d) Further replicas on random nodes in the cluster

Replica Selection - HDFS tries to satisfy a read request from a replica that is closest to the
reader.
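
Replication is a per-file setting and can be changed after the fact; the NameNode's placement policy then adds or removes replicas to match. A small sketch using standard FS shell commands, with the file path again illustrative:

# Raise the replication factor of a file to 4 and wait for re-replication to finish
hadoop fs -setrep -w 4 /data_in/stock1gbdata

# Inspect where the new replicas were placed
hdfs fsck /data_in/stock1gbdata -blocks -locations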
FileSystem Image and Edit Logs
 fsimage file is a persistent checkpoint of the filesystem metadata

 When a client performs a write operation, it is first recorded in the edit log.

 The NameNode also has an in-memory representation of the filesystem metadata, which it updates after the edit log has been modified.

 The Secondary NameNode is used to produce checkpoints of the primary's in-memory filesystem metadata.
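
The fsimage and edit-log files are binary, but Hadoop ships offline viewers that render them as text; the listing on the next slide resembles such a dump. A sketch, where the file names are illustrative (real names carry transaction-id suffixes, and pre-2.4 image formats need the oiv_legacy variant):

# Dump a namespace checkpoint as XML
hdfs oiv -p XML -i fsimage_0000000000000062957 -o fsimage.xml

# Render the edit log of recent namespace changes as XML
hdfs oev -i edits_inprogress_0000000000000062958 -o edits.xml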
FileSystem Image Structure
<FS_IMAGE>
  <IMAGE_VERSION>-47</IMAGE_VERSION>
  <NAMESPACE_ID>415263518</NAMESPACE_ID>
  <GENERATION_STAMP>1000</GENERATION_STAMP>
  <GENERATION_STAMP_V2>6953</GENERATION_STAMP_V2>
  <GENERATION_STAMP_V1_LIMIT>0</GENERATION_STAMP_V1_LIMIT>
  <LAST_ALLOCATED_BLOCK_ID>1073747777</LAST_ALLOCATED_BLOCK_ID>
  <TRANSACTION_ID>62957</TRANSACTION_ID>
  <LAST_INODE_ID>24606</LAST_INODE_ID>
  <SNAPSHOT_COUNTER>0</SNAPSHOT_COUNTER>
  <NUM_SNAPSHOTS_TOTAL>0</NUM_SNAPSHOTS_TOTAL>
  <IS_COMPRESSED>false</IS_COMPRESSED>
  <INODES NUM_INODES="1076">
    <INODE>
      <INODE_PATH>/</INODE_PATH>
      <INODE_ID>16385</INODE_ID>
      <REPLICATION>0</REPLICATION>
      <MODIFICATION_TIME>2014-10-20 16:35</MODIFICATION_TIME>
      <ACCESS_TIME>1970-01-01 05:30</ACCESS_TIME>
      <BLOCK_SIZE>0</BLOCK_SIZE>
      <BLOCKS NUM_BLOCKS="-1"></BLOCKS>
      <NS_QUOTA>9223372036854775807</NS_QUOTA>
      <DS_QUOTA>-1</DS_QUOTA>
      <IS_SNAPSHOTTABLE_DIR>true</IS_SNAPSHOTTABLE_DIR>
      <PERMISSIONS>
        <USER_NAME>hduser</USER_NAME>
        <GROUP_NAME>supergroup</GROUP_NAME>
        <PERMISSION_STRING>rwxrwxrwx</PERMISSION_STRING>
      </PERMISSIONS>
    </INODE>
    <SNAPSHOTS NUM_SNAPSHOTS="0">
      <SNAPSHOT_QUOTA>0</SNAPSHOT_QUOTA>
    </SNAPSHOTS>
    <INODE>
      <INODE_PATH>/data_in/stock1gbdata</INODE_PATH>
      <INODE_ID>24568</INODE_ID>
      <REPLICATION>3</REPLICATION>
      <MODIFICATION_TIME>2014-10-28 15:58</MODIFICATION_TIME>
      <ACCESS_TIME>2014-10-28 15:58</ACCESS_TIME>
      <BLOCK_SIZE>134217728</BLOCK_SIZE>
      <BLOCKS NUM_BLOCKS="81">
        <BLOCK>
          <BLOCK_ID>1073747677</BLOCK_ID>
          <NUM_BYTES>134217670</NUM_BYTES>
          <GENERATION_STAMP>6853</GENERATION_STAMP>
        </BLOCK>
        <BLOCK>
          <BLOCK_ID>1073747678</BLOCK_ID>
          <NUM_BYTES>134217646</NUM_BYTES>
          <GENERATION_STAMP>6854</GENERATION_STAMP>
        </BLOCK>
        <!-- remaining blocks omitted on the slide -->
      </BLOCKS>
    </INODE>
  </INODES>
  <INODES_UNDER_CONSTRUCTION NUM_INODES_UNDER_CONSTRUCTION="0"></INODES_UNDER_CONSTRUCTION>
  <CURRENT_DELEGATION_KEY_ID>0</CURRENT_DELEGATION_KEY_ID>
  <DELEGATION_KEYS NUM_DELEGATION_KEYS="0"></DELEGATION_KEYS>
  <DELEGATION_TOKEN_SEQUENCE_NUMBER>0</DELEGATION_TOKEN_SEQUENCE_NUMBER>
  <DELEGATION_TOKENS NUM_DELEGATION_TOKENS="0"></DELEGATION_TOKENS>
</FS_IMAGE>
Safe Mode
 On start-up, the NameNode loads its image file (fsimage) into memory and applies the edits from the edit log (edits).

 It then performs this checkpointing process itself, without recourse to the Secondary NameNode.

 While doing so, the NameNode runs in safe mode, offering only a read-only view of the filesystem to clients.

 The locations of blocks in the system are not persisted by the NameNode - this information resides with
the DataNodes, in the form of a list of the blocks it is storing.

 Safe mode is needed to give the DataNodes time to check in to the NameNode with their block lists

 Safe mode is exited when the minimal replication condition is reached, plus an extension time of 30
seconds.
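
Safe mode can be inspected and controlled from the command line; a brief sketch:

# Check whether the NameNode is currently in safe mode
hdfs dfsadmin -safemode get

# Block until the NameNode leaves safe mode (useful in startup scripts)
hdfs dfsadmin -safemode wait

# Enter and leave safe mode manually, e.g. for maintenance
hdfs dfsadmin -safemode enter
hdfs dfsadmin -safemode leave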
Administration
 HDFS Trash

 HDFS Quotas

 Safe Mode

 FS Shell

 dfsadmin Command
HDFS Trash – Recycle Bin
When a file is deleted by a user, it is not immediately removed from HDFS. Instead, HDFS moves it to a trash directory (/user/<username>/.Trash in current releases).
File : core-site.xml

Property : fs.trash.interval

Description : Number of minutes after which the checkpoint gets deleted. If zero, the trash feature is disabled.

A file remains in the trash directory for a configurable amount of time. After the expiry of its life there, the NameNode deletes the file from the HDFS namespace.
File : core-site.xml

Property : fs.trash.checkpoint.interval

Description : Number of minutes between trash checkpoints. Should be smaller or equal to fs.trash.interval.

Undelete a file: The user needs to navigate to the trash directory and retrieve the file using the fs -mv command.
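
End to end, trash behaves as follows; a sketch assuming fs.trash.interval has been set to a non-zero value in core-site.xml (the one-day interval and file names are illustrative):

# core-site.xml (excerpt): keep deleted files recoverable for one day
#   <property>
#     <name>fs.trash.interval</name>
#     <value>1440</value>
#   </property>

# Delete a file: it is moved into the current user's trash, not removed
hadoop fs -rm /user/hduser/books/book.txt

# Undelete: move it back out of the trash directory
hadoop fs -mv /user/hduser/.Trash/Current/user/hduser/books/book.txt /user/hduser/books/

# Force-empty the trash immediately
hadoop fs -expunge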
HDFS Quotas
Name Quota - a hard limit on the number of file and directory names in the tree rooted at that directory.
dfsadmin -setQuota <N> <directory>... Set the name quota to be N for each directory.

dfsadmin -clrQuota <directory>... Remove any name quota for each directory.

Space Quota - a hard limit on the number of bytes used by files in the tree rooted at that directory.

dfsadmin -setSpaceQuota <N> <directory>... Set the space quota to be N bytes for each directory.

dfsadmin -clrSpaceQuota <directory>... Remove any space quota for each directory.

Reporting Quota - count command of the HDFS shell reports quota values and the current count of names and
bytes in use. With the -q option, also report the name quota value set for each directory, the available name quota
remaining, the space quota value set, and the available space quota remaining.
fs -count -q <directory>..
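
Putting the quota commands together; a sketch with illustrative limits and directory names:

# Allow at most 1000 names (files + directories) under the project directory
hdfs dfsadmin -setQuota 1000 /user/hduser/project

# Allow at most 10 GB of raw space; replicas count against the space quota
hdfs dfsadmin -setSpaceQuota 10g /user/hduser/project

# Report quota, remaining quota, space quota, and remaining space quota
hadoop fs -count -q /user/hduser/project

# Remove both quotas
hdfs dfsadmin -clrQuota /user/hduser/project
hdfs dfsadmin -clrSpaceQuota /user/hduser/project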
FS Shell – Some Basic Commands
 cat
 hadoop fs -cat URI [URI …]
 Copies source paths to stdout.
 chgrp
 hadoop fs -chgrp [-R] GROUP URI [URI …]
 Change group association of files. With -R, make the change recursively through the directory structure.

 chmod
 hadoop fs -chmod -R 777 hdfs://nn1.example.com/file1
 Change the permissions of files. With -R, make the change recursively through the directory structure.

 copyFromLocal / put
 hadoop fs -copyFromLocal <localsrc> URI
 Copy single src, or multiple srcs, from the local file system to the destination filesystem.

 copyToLocal / get
 hadoop fs -copyToLocal URI <localdst>
 Copy files to the local file system.
FS Shell – Commands Continued…
 expunge
 hadoop fs -expunge
 Empty the Trash.

 mkdir
 hadoop fs -mkdir <paths>
 Takes path URIs as argument and creates directories.

 rmr
 hadoop fs -rmr /user/hadoop/dir
 Recursive version of delete.

 touchz
 hadoop fs -touchz pathname
 Create a file of zero length.

 du
 hadoop fs -du URI [URI …]
 Displays aggregate length of files contained in the directory, or the length of a file in case it is just a file.
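
A typical round trip combining these commands; a sketch with illustrative paths and file names:

# Create a directory, upload a local file, read it back, and check sizes
hadoop fs -mkdir /user/hduser/books
hadoop fs -put book.txt /user/hduser/books/
hadoop fs -cat /user/hduser/books/book.txt
hadoop fs -du /user/hduser/books
hadoop fs -get /user/hduser/books/book.txt ./book_copy.txt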
DfsAdmin Command
 bin/hadoop dfsadmin [Generic Options] [Command Options]

-safemode enter / leave / get / wait : Safe mode maintenance command. Safe mode can also be entered manually, but then it can only be turned off manually as well.

-report Reports basic filesystem information and statistics.

-refreshNodes Re-read the hosts and exclude files to update the set of Datanodes that are allowed to connect to the
Namenode and those that should be decommissioned or recommissioned.

-metasave filename Save Namenode's primary data structures to filename in the directory specified by hadoop.log.dir
property. filename is overwritten if it exists. filename will contain one line for each of the following
1. Datanodes heart beating with Namenode
2. Blocks waiting to be replicated
3. Blocks currently being replicated
4. Blocks waiting to be deleted
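
For example, -report gives a quick health summary of the cluster; a sketch, where the explicit filesystem URI in the second form is illustrative (using the generic -fs option):

# Capacity, remaining space, and per-DataNode status
hdfs dfsadmin -report

# The same, directed at a specific NameNode
hdfs dfsadmin -fs hdfs://nn1.example.com:8020 -report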
Modes

Hadoop runs in one of three modes: Local (Standalone), Pseudo-Distributed, and Fully Distributed.
 Local Standalone (Non-distributed)
• Hadoop runs as a single Java process on a single system (no separate daemons)
• Useful for debugging

 Pseudo Distributed
• All daemons run on a single node
• Each Hadoop daemon runs in a separate Java process

 Fully Distributed
• Master-Slave Architecture
• One machine is designated as the NameNode and another as the ResourceManager (both can run on the same machine as well)
• Rest of the machines in the cluster act as both DataNode and NodeManager
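
The pseudo-distributed mode needs only a couple of configuration properties before formatting and starting the daemons; a minimal sketch for Hadoop 2, where localhost:9000 and single-replica storage are the usual single-node choices:

# core-site.xml (excerpt)
#   <property><name>fs.defaultFS</name><value>hdfs://localhost:9000</value></property>
# hdfs-site.xml (excerpt)
#   <property><name>dfs.replication</name><value>1</value></property>

# Format the NameNode once, then bring up HDFS and YARN
hdfs namenode -format
start-dfs.sh
start-yarn.sh

# Verify: each daemon now runs in its own JVM
jps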
Typical setup steps for a Fully Distributed cluster (steps (i) and (ii) are sketched below):
(i) Create dedicated user & group
(ii) Establish authentication among nodes
(iii) Create Hadoop folder
(iv) Hadoop configuration
(v) Remote-copy the Hadoop folder to slave nodes
(vi) Start Hadoop
(vii) Test the Hadoop cluster
(viii) Run a simple WordCount program
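
Steps (i) and (ii) are plain Linux administration; a sketch run on the master node, using Debian-style user tools, with host and user names illustrative:

# (i) Create a dedicated group and user for Hadoop
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser

# (ii) Passwordless SSH from the master to every slave (and to itself)
su - hduser
ssh-keygen -t rsa -P ""
ssh-copy-id hduser@localhost
ssh-copy-id hduser@slave1
ssh-copy-id hduser@slave2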
