Hadoop Distributed File System
Hadoop Distributed File System (HDFS) is a distributed file system designed to store and
manage large volumes of data across multiple nodes in a Hadoop cluster. Here's an overview
of how HDFS works:
1. **Distributed Storage:**
- HDFS divides large files into smaller blocks, typically 128 MB or 256 MB in size. Each block
is then replicated across multiple DataNodes in the cluster.
- The default replication factor is typically set to three, meaning each block is stored on
three different DataNodes to provide fault tolerance and data durability.
2. **Architecture:**
- HDFS has a master/slave architecture. The main components are the NameNode and
DataNodes.
- **NameNode:**
- The NameNode is the master server that manages the metadata for the file system.
- It keeps track of the structure of the file system tree and metadata for all the files and
directories.
- It does not store the actual data; it maintains the metadata, such as file names,
permissions, and the block locations.
- **DataNode:**
- DataNodes are slave servers responsible for storing and managing the actual data
blocks.
- They report to the NameNode about the list of blocks they are storing.
3. **Writing Data:**
- When a client wants to write a file to HDFS, the HDFS client communicates with the NameNode to create the new file.
- The NameNode returns a list of DataNodes where each block can be stored, and the client writes the block data directly to those DataNodes.
4. **Reading Data:**
- When a client wants to read a file, it contacts the NameNode to find out which DataNodes hold the file's blocks.
- The client then reads the data directly from those DataNodes.
5. **Fault Tolerance:**
- HDFS ensures fault tolerance through data replication. Each block is replicated to multiple
DataNodes.
- If a DataNode becomes unavailable or a block is corrupted, the system can retrieve the
data from another replica.
6. **Heartbeats and Block Reports:**
- DataNodes periodically send heartbeat signals to the NameNode to indicate that they are alive.
- They also send block reports listing the blocks they are storing, which lets the NameNode detect missing or under-replicated blocks; a toy sketch tying these mechanisms together follows this overview.
7. **Data Locality:**
- HDFS aims to optimize data locality, meaning that computation is performed on the same
node where the data is stored. This reduces network traffic and improves overall
performance.
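To make the points above concrete, here is a minimal Python sketch (not the real HDFS API): it splits a file into blocks, builds the kind of block-location metadata the NameNode keeps, walks the client read path, and re-replicates blocks after a DataNode stops sending heartbeats. Every name in it (`datanode1`, `/data/cars.csv`, `blk_0000`) is invented for illustration, and the round-robin placement is a deliberate simplification.

```python
import itertools
import math

BLOCK_SIZE_MB = 128
REPLICATION = 3
datanodes = ["datanode1", "datanode2", "datanode3", "datanode4"]

def split_file(size_mb):
    """Sizes of the blocks a file is split into (the last block is not padded)."""
    n = math.ceil(size_mb / BLOCK_SIZE_MB)
    return [BLOCK_SIZE_MB] * (n - 1) + [size_mb - (n - 1) * BLOCK_SIZE_MB]

def place_blocks(path, size_mb):
    """NameNode side: record which DataNodes hold each block (naive round-robin)."""
    cycle = itertools.cycle(datanodes)
    blocks = {}
    for i, block_size in enumerate(split_file(size_mb)):
        blocks[f"blk_{i:04d}"] = {"size_mb": block_size,
                                  "replicas": [next(cycle) for _ in range(REPLICATION)]}
    return {path: blocks}

namenode = place_blocks("/data/cars.csv", 500)          # a 500 MB file -> 4 blocks

# Client read path: ask the NameNode for block locations, then read from a DataNode.
for block_id, info in namenode["/data/cars.csv"].items():
    print(f"read {block_id} ({info['size_mb']} MB) from {info['replicas'][0]}")

# datanode2 stops sending heartbeats: re-replicate its blocks on another DataNode.
dead = "datanode2"
for info in namenode["/data/cars.csv"].values():
    if dead in info["replicas"]:
        spare = [d for d in datanodes if d != dead and d not in info["replicas"]]
        info["replicas"] = [d for d in info["replicas"] if d != dead] + spare[:1]
print(namenode)
```

Running it shows the 500 MB file split into three 128 MB blocks and one 116 MB block, each placed on three DataNodes, with the replicas lost on `datanode2` recreated on the remaining node.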
---------------------------------------------------------------------------------------------------------------------------
Questions:
1. Imagine you have a file that is 1 GB in size. How many blocks will this file be split into in
HDFS? How many copies of each block will be made by default?
2. You have a cluster with one namenode and three datanodes. You want to store a file in
HDFS that is 500 MB in size. How will this file be split across the datanodes?
3. You want to copy a file from your local file system to HDFS. What command would you
use?
4. You want to retrieve a file from HDFS and save it to your local file system. What command
would you use?
5. You have a file stored in HDFS that is 1 GB in size. You want to read the first 100 MB of this
file. How does HDFS handle this request?
6. You have a file stored in HDFS that is 500 MB in size. One of the datanodes storing a block
of this file has failed. How does HDFS handle this situation?
7. You want to delete a file from HDFS. What command would you use?
8. You have a file stored in HDFS that is 1 GB in size. You want to make a copy of this file
within HDFS. What command would you use?
Solutions:
Exercise 1:
- With the default block size of 128 MB, a 1 GB (1,024 MB) file is split into 8 blocks of 128 MB each. (With a 256 MB block size, it would be 4 blocks.)
- By default, 3 copies of each block are made (replication factor 3).
Exercise 2:
- With 128 MB blocks, a 500 MB file is split into 4 blocks: three of 128 MB and one of 116 MB (the last block is not padded to the full block size).
- With the default replication factor of 3 and only three datanodes, each datanode ends up holding a copy of every block (a short Python check of this placement follows these solutions).
Exercise 3:
- To copy a file from the local file system to HDFS, the command is: `hadoop fs -copyFromLocal <localsrc> <hdfs destination>`.
- For example, to copy a file named `example.txt` from the local file system to the root
directory of HDFS, the command would be: `hadoop fs -copyFromLocal example.txt /`.
Exercise 4:
- To retrieve a file from HDFS and save it to the local file system, the command is: `hadoop fs -get <src> <localdest>`.
- For example, to retrieve a file named `example.txt` from the root directory of HDFS and save it to the current directory in the local file system, the command would be: `hadoop fs -get /example.txt .`.
Exercise 5:
- With 128 MB blocks, the first 100 MB lies entirely within the first block of the file. The client asks the NameNode for that block's locations and reads just the requested 100 MB from one of the DataNodes holding it; the remaining blocks are never touched.
Exercise 6:
- The NameNode detects the failure through missed heartbeats and re-replicates the blocks that were on the failed datanode, using the surviving replicas, so the replication factor is restored; in the meantime, reads are served from the remaining replicas.
Exercise 7:
- To delete a file from HDFS, the command is: `hadoop fs -rm <path to file>`.
- For example, to delete a file named `example.txt` from the root directory of HDFS, the
command would be: `hadoop fs -rm /example.txt`.
Exercise 8:
- To make a copy of a file within HDFS, the command is: `hadoop fs -cp <src> <dest>`.
- For example, to make a copy of a file named `example.txt` in the root directory of HDFS and save it as `example_copy.txt` in the same directory, the command would be: `hadoop fs -cp /example.txt /example_copy.txt`.
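As a quick check of Exercise 2, the hypothetical snippet below places the four blocks of a 500 MB file on a three-datanode cluster with the default replication factor of 3; the node names and the round-robin placement are made up for illustration only.

```python
import itertools
import math

BLOCK_MB, REPL = 128, 3
datanodes = ["datanode1", "datanode2", "datanode3"]   # the cluster from Exercise 2

n_blocks = math.ceil(500 / BLOCK_MB)                  # a 500 MB file -> 4 blocks
cycle = itertools.cycle(datanodes)
placement = {f"blk_{i}": sorted(next(cycle) for _ in range(REPL)) for i in range(n_blocks)}
print(placement)  # with 3 replicas and only 3 datanodes, every node holds every block
```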
Questions 2:
1. Calculate the total storage capacity required to store 10 files, each of size 300 MB, in
HDFS. Consider the default replication factor of 3.
2. Given a directory in HDFS containing 5 files of sizes 150 MB, 200 MB, 350 MB, 400 MB,
and 500 MB respectively, calculate the total storage space occupied by these files, taking
into account the replication factor.
3. If the default block size in HDFS is 128 MB and a file of size 500 MB is stored, calculate the
number of blocks the file will be split into and the amount of storage space it will occupy,
considering the replication factor.
4. Suppose you have a cluster with 5 datanodes, each with 1 TB of storage capacity. If you
want to store a file of size 2.5 TB in HDFS, determine how the file will be distributed across
the datanodes and whether it can be accommodated within the cluster.
5. Given a scenario where a file of size 800 MB is stored in HDFS with a replication factor of
2, calculate the total storage space occupied by the file and its replicas.
These exercises will provide practical insight into how HDFS manages file sizes and storage
allocation.
Solutions:
Exercise 1:
- Total storage required = 10 files * 300 MB * 3 (replication factor) = 9,000 MB or 9 GB.
Exercise 2:
- File sizes: 150 MB, 200 MB, 350 MB, 400 MB, 500 MB (1,600 MB in total).
- Total = 1,600 MB * 3 (replication factor) = 4,800 MB or 4.8 GB.
Exercise 3:
- Number of blocks for a 500 MB file = 500 MB / 128 MB (block size) = 3.9, rounded up to 4 blocks (three of 128 MB and one of 116 MB).
- Total storage space occupied = 500 MB * 3 (replication factor) = 1,500 MB or 1.5 GB; the final partial block only consumes the 116 MB it actually holds, not a full 128 MB.
Exercise 4:
- With the default replication factor of 3, the file needs 2.5 TB * 3 = 7.5 TB of raw storage, but the cluster only offers 5 datanodes * 1 TB = 5 TB, so the file cannot be accommodated at the default replication factor.
- Ignoring replication, the 2.5 TB of blocks would be spread across the datanodes at roughly 500 GB each (2.5 TB / 5 datanodes) and would fit.
Exercise 5:
- Total storage occupied = 800 MB * 2 (replication factor) = 1,600 MB or 1.6 GB.
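The arithmetic in these solutions can be double-checked with a few lines of Python. This is only a rough sketch: sizes are in MB, 1 GB is treated as 1,000 MB and 1 TB as 1,000,000 MB, and the last block of each file is assumed not to be padded.

```python
REPL = 3

def storage_needed(file_sizes_mb, replication=REPL):
    """Raw space consumed: every byte of the file is stored `replication` times."""
    return sum(file_sizes_mb) * replication

print(storage_needed([300] * 10))                    # Exercise 1 -> 9000 MB (9 GB)
print(storage_needed([150, 200, 350, 400, 500]))     # Exercise 2 -> 4800 MB (4.8 GB)
print(storage_needed([500]))                         # Exercise 3 -> 1500 MB (1.5 GB)
print(storage_needed([2_500_000]) <= 5 * 1_000_000)  # Exercise 4 -> False: 7.5 TB > 5 TB
print(storage_needed([800], replication=2))          # Exercise 5 -> 1600 MB (1.6 GB)
```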
In Hadoop Distributed File System (HDFS), files are divided into fixed-size blocks, typically 128 MB or 256 MB in size. When a file is stored in HDFS, it is split into these blocks, and each block is replicated across multiple datanodes for fault tolerance and parallel access.
Here's an explanation of how the blocks are copied and split in HDFS:
1. Block Splitting:
- When a file is stored in HDFS, it is divided into fixed-size blocks. The last block of a file may be smaller than the fixed size if the file size is not an exact multiple of the block size.
- The blocks of the same file are not necessarily all stored on the same machine; they are distributed across different datanodes in the cluster.
2. Block Replication:
- Each block is replicated across multiple datanodes to ensure fault tolerance and data reliability.
- By default, each block is replicated three times, meaning there are three copies of each block stored on different datanodes.
- The replication factor can be configured based on the desired level of fault tolerance and data redundancy.
3. Namenode's Role:
- The namenode maintains the metadata about the blocks and their locations in the cluster. It keeps track of which blocks belong to which files and of the datanodes on which the blocks are located.
4. Copying Process:
- When a block is written to HDFS, the client interacts with the namenode to determine the datanodes where the block should be stored. The client then writes the block directly to the chosen datanodes.
- The datanodes receive the block data and store it locally. The namenode keeps track of the block locations and their replicas.
In summary, HDFS splits files into fixed-size blocks, replicates each block across multiple
datanodes, and the namenode maintains the metadata about block locations. This approach
provides fault tolerance, parallel access, and efficient data storage and retrieval in
distributed environments.
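The copying process in point 4 above can be reduced to its sequence of interactions. The sketch below is purely illustrative: the function and node names are invented, and the real client/DataNode write pipeline is considerably more involved.

```python
def namenode_choose_targets(block_id, datanodes, replication=3):
    """NameNode: pick the DataNodes that should hold this block's replicas."""
    return datanodes[:replication]

def write_file(path, blocks, datanodes):
    """Client: for each block, ask the NameNode where to put it, then send the data."""
    block_map = {}
    for i, data in enumerate(blocks):
        block_id = f"{path}#blk_{i}"
        targets = namenode_choose_targets(block_id, datanodes)  # 1. ask the namenode
        for node in targets:                                    # 2. write directly to datanodes
            print(f"sending {block_id} ({len(data)} bytes) to {node}")
        block_map[block_id] = targets                           # 3. namenode records the replicas
    return block_map

print(write_file("/data/cars.csv", [b"x" * 10, b"y" * 7], ["dn1", "dn2", "dn3", "dn4"]))
```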
One of the machines is the HDFS master, called the namenode. This machine holds all the file names and block locations, like a big phone book.
• Another machine is the secondary namenode, a kind of backup namenode, which saves backups of the directory at regular intervals.
• Some machines are clients. These are access points to the cluster, used to connect to it and work with it.
• All of the other machines are datanodes. They store the blocks of file content.
Datanodes contain blocks. The same blocks are duplicated (replicated) on different datanodes, generally 3 times. This ensures:
- reliability: if a datanode fails, its blocks are still available on other datanodes;
- parallel access: several tasks can read copies of the same block at the same time.
For example, FunctionM retrieves the price of a car, FunctionR calculates the max of a
set of values:
def function_m(car):                # FunctionM: retrieve the price of a car
    return car["price"]
def function_r(all_prices):        # FunctionR: calculate the max of a set of values
    return max(all_prices)
data = [{"price": 12000}, {"price": 15500}]           # example input (illustrative)
print(function_r([function_m(car) for car in data]))  # -> 15500
The Map function receives a pair as input and can produce any number of pairs as output: none, one, or many, at will. The types of the inputs and outputs are free. The MAP tasks each process one pair and produce 0..n pairs; the same keys and/or values may be produced.
By default, the value, of type text, is one of the lines (or one of the tuples) of the file to process, and the key, of type integer, is the position of this line in the file (called the offset).
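As a concrete, purely hypothetical illustration, suppose each line of the file to process has the form `brand,price`. A Map function matching the description above receives the pair (offset, line) and here emits exactly one (brand, price) pair per line:

```python
def map_car_prices(offset, line):
    """Map: input is one (offset, line) pair; output is 0..n (key, value) pairs."""
    brand, price = line.split(",")       # assumed line format: "brand,price"
    return [(brand, int(price))]         # here, exactly one output pair per line

print(map_car_prices(0, "renault,12000"))    # [('renault', 12000)]
print(map_car_prices(17, "peugeot,15500"))   # [('peugeot', 15500)]
```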
The Reduce function in MapReduce processes a set of key-value pairs, all sharing the
same key. YARN launches separate Reduce instances for each unique key from Map
outputs. Reduce aggregates or processes values associated with the key and produces
result pairs. YARN manages Reduce instances, each handling pairs with a common key.
Effective MapReduce design considers keys and values, tailoring the Reduce function
logic for specific processing needs.
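Continuing the same hypothetical example, a Reduce function is called once per distinct key with all of that key's values, and here returns the maximum price seen for each brand:

```python
def reduce_max_price(brand, prices):
    """Reduce: one call per distinct key, with every value produced for that key."""
    return (brand, max(prices))

# What the framework would hand to Reduce after grouping the Map output by key:
grouped = {"renault": [12000, 9500, 14000], "peugeot": [15500, 11000]}
print([reduce_max_price(b, p) for b, p in grouped.items()])
# [('renault', 14000), ('peugeot', 15500)]
```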
2. Split: separation of the data into separately processable blocks, put in the form of (key, value) pairs.
3. Map: application of the map function to all the (key, value) pairs formed from the input data; this produces other (key, value) pairs as output.