Hadoop Distributed File System
Hadoop Distributed File System (HDFS) is a distributed file system designed to store and
manage large volumes of data across multiple nodes in a Hadoop cluster. Here's an overview
of how HDFS works:
1. **Distributed Storage:**
- HDFS divides large files into smaller blocks, typically 128 MB or 256 MB in size. Each block
is then replicated across multiple DataNodes in the cluster.
- The default replication factor is typically set to three, meaning each block is stored on
three different DataNodes to provide fault tolerance and data durability.
2. **Architecture:**
- HDFS has a master/slave architecture. The main components are the NameNode and
DataNodes.
- **NameNode:**
- The NameNode is the master server that manages the metadata for the file system.
- It keeps track of the structure of the file system tree and metadata for all the files and
directories.
- It does not store the actual data; it maintains the metadata, such as file names,
permissions, and the block locations.
- **DataNode:**
- DataNodes are slave servers responsible for storing and managing the actual data
blocks.
- They report to the NameNode about the list of blocks they are storing.
3. **Writing Data:**
- When a client wants to write a file to HDFS, the HDFS client communicates with the NameNode to create the new file.
- The NameNode returns a list of DataNodes where each block can be stored, and the client writes the block data directly to those DataNodes.
4. **Reading Data:**
- When a client wants to read a file, it contacts the NameNode to find out which DataNodes hold the file's blocks.
- The client then reads the data directly from those DataNodes.
5. **Fault Tolerance:**
- HDFS ensures fault tolerance through data replication. Each block is replicated to multiple
DataNodes.
- If a DataNode becomes unavailable or a block is corrupted, the system can retrieve the
data from another replica.
6. **Heartbeats and Block Reports:**
- DataNodes periodically send heartbeat signals to the NameNode to indicate that they are alive.
- They also send block reports listing the blocks they are storing, which lets the NameNode detect missing or under-replicated blocks; a toy sketch tying these mechanisms together follows this overview.
7. **Data Locality:**
- HDFS aims to optimize data locality, meaning that computation is performed on the same
node where the data is stored. This reduces network traffic and improves overall
performance.
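To make the points above concrete, here is a minimal Python sketch (not the real HDFS API): it splits a file into blocks, builds the kind of block-location metadata the NameNode keeps, walks the client read path, and re-replicates blocks after a DataNode stops sending heartbeats. Every name in it (`datanode1`, `/data/cars.csv`, `blk_0000`) is invented for illustration, and the round-robin placement is a deliberate simplification.

```python
import itertools
import math

BLOCK_SIZE_MB = 128
REPLICATION = 3
datanodes = ["datanode1", "datanode2", "datanode3", "datanode4"]

def split_file(size_mb):
    """Sizes of the blocks a file is split into (the last block is not padded)."""
    n = math.ceil(size_mb / BLOCK_SIZE_MB)
    return [BLOCK_SIZE_MB] * (n - 1) + [size_mb - (n - 1) * BLOCK_SIZE_MB]

def place_blocks(path, size_mb):
    """NameNode side: record which DataNodes hold each block (naive round-robin)."""
    cycle = itertools.cycle(datanodes)
    blocks = {}
    for i, block_size in enumerate(split_file(size_mb)):
        blocks[f"blk_{i:04d}"] = {"size_mb": block_size,
                                  "replicas": [next(cycle) for _ in range(REPLICATION)]}
    return {path: blocks}

namenode = place_blocks("/data/cars.csv", 500)          # a 500 MB file -> 4 blocks

# Client read path: ask the NameNode for block locations, then read from a DataNode.
for block_id, info in namenode["/data/cars.csv"].items():
    print(f"read {block_id} ({info['size_mb']} MB) from {info['replicas'][0]}")

# datanode2 stops sending heartbeats: re-replicate its blocks on another DataNode.
dead = "datanode2"
for info in namenode["/data/cars.csv"].values():
    if dead in info["replicas"]:
        spare = [d for d in datanodes if d != dead and d not in info["replicas"]]
        info["replicas"] = [d for d in info["replicas"] if d != dead] + spare[:1]
print(namenode)
```

Running it shows the 500 MB file split into three 128 MB blocks and one 116 MB block, each placed on three DataNodes, with the replicas lost on `datanode2` recreated on the remaining node.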
---------------------------------------------------------------------------------------------------------------------------
Questions:
1. Imagine you have a file that is 1 GB in size. How many blocks will this file be split into in
HDFS? How many copies of each block will be made by default?
2. You have a cluster with one namenode and three datanodes. You want to store a file in
HDFS that is 500 MB in size. How will this file be split across the datanodes?
3. You want to copy a file from your local file system to HDFS. What command would you
use?
4. You want to retrieve a file from HDFS and save it to your local file system. What command
would you use?
5. You have a file stored in HDFS that is 1 GB in size. You want to read the first 100 MB of this
file. How does HDFS handle this request?
6. You have a file stored in HDFS that is 500 MB in size. One of the datanodes storing a block
of this file has failed. How does HDFS handle this situation?
7. You want to delete a file from HDFS. What command would you use?
8. You have a file stored in HDFS that is 1 GB in size. You want to make a copy of this file
within HDFS. What command would you use?
Solutions:
Exercise 1:
- With the default block size of 128 MB, a 1 GB (1,024 MB) file is split into 8 blocks of 128 MB each. (With a 256 MB block size, it would be 4 blocks.)
- By default, 3 copies of each block are made (replication factor 3).
Exercise 2:
- With 128 MB blocks, a 500 MB file is split into 4 blocks: three of 128 MB and one of 116 MB (the last block is not padded to the full block size).
- With the default replication factor of 3 and only three datanodes, each datanode ends up holding a copy of every block (a short Python check of this placement follows these solutions).
Exercise 3:
- To copy a file from the local file system to HDFS, the command is: `hadoop fs -copyFromLocal <localsrc> <hdfs destination>`.
- For example, to copy a file named `example.txt` from the local file system to the root
directory of HDFS, the command would be: `hadoop fs -copyFromLocal example.txt /`.
Exercise 4:
- To retrieve a file from HDFS and save it to the local file system, the command is: `hadoop fs -get <src> <localdest>`.
- For example, to retrieve a file named `example.txt` from the root directory of HDFS and save it to the current directory in the local file system, the command would be: `hadoop fs -get /example.txt .`.
Exercise 5:
- With 128 MB blocks, the first 100 MB lies entirely within the first block of the file. The client asks the NameNode for that block's locations and reads just the requested 100 MB from one of the DataNodes holding it; the remaining blocks are never touched.
Exercise 6:
- The NameNode detects the failure through missed heartbeats and re-replicates the blocks that were on the failed datanode, using the surviving replicas, so the replication factor is restored; in the meantime, reads are served from the remaining replicas.
Exercise 7:
- To delete a file from HDFS, the command is: `hadoop fs -rm <path to file>`.
- For example, to delete a file named `example.txt` from the root directory of HDFS, the
command would be: `hadoop fs -rm /example.txt`.
Exercise 8:
- To make a copy of a file within HDFS, the command is: `hadoop fs -cp <src> <dest>`.
- For example, to make a copy of a file named `example.txt` in the root directory of HDFS and save it as `example_copy.txt` in the same directory, the command would be: `hadoop fs -cp /example.txt /example_copy.txt`.
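As a quick check of Exercise 2, the hypothetical snippet below places the four blocks of a 500 MB file on a three-datanode cluster with the default replication factor of 3; the node names and the round-robin placement are made up for illustration only.

```python
import itertools
import math

BLOCK_MB, REPL = 128, 3
datanodes = ["datanode1", "datanode2", "datanode3"]   # the cluster from Exercise 2

n_blocks = math.ceil(500 / BLOCK_MB)                  # a 500 MB file -> 4 blocks
cycle = itertools.cycle(datanodes)
placement = {f"blk_{i}": sorted(next(cycle) for _ in range(REPL)) for i in range(n_blocks)}
print(placement)  # with 3 replicas and only 3 datanodes, every node holds every block
```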
Questions 2:
1. Calculate the total storage capacity required to store 10 files, each of size 300 MB, in
HDFS. Consider the default replication factor of 3.
2. Given a directory in HDFS containing 5 files of sizes 150 MB, 200 MB, 350 MB, 400 MB,
and 500 MB respectively, calculate the total storage space occupied by these files, taking
into account the replication factor.
3. If the default block size in HDFS is 128 MB and a file of size 500 MB is stored, calculate the
number of blocks the file will be split into and the amount of storage space it will occupy,
considering the replication factor.
4. Suppose you have a cluster with 5 datanodes, each with 1 TB of storage capacity. If you
want to store a file of size 2.5 TB in HDFS, determine how the file will be distributed across
the datanodes and whether it can be accommodated within the cluster.
5. Given a scenario where a file of size 800 MB is stored in HDFS with a replication factor of
2, calculate the total storage space occupied by the file and its replicas.
These exercises will provide practical insight into how HDFS manages file sizes and storage
allocation.
Solutions:
Exercise 1:
- Total storage required = 10 files * 300 MB * 3 (replication factor) = 9,000 MB or 9 GB.
Exercise 2:
- File sizes: 150 MB, 200 MB, 350 MB, 400 MB, 500 MB (1,600 MB in total).
- Total = 1,600 MB * 3 (replication factor) = 4,800 MB or 4.8 GB.
Exercise 3:
- Number of blocks for a 500 MB file = 500 MB / 128 MB (block size) = 3.9, rounded up to 4 blocks (three of 128 MB and one of 116 MB).
- Total storage space occupied = 500 MB * 3 (replication factor) = 1,500 MB or 1.5 GB; the final partial block only consumes the 116 MB it actually holds, not a full 128 MB.
Exercise 4:
- With the default replication factor of 3, the file needs 2.5 TB * 3 = 7.5 TB of raw storage, but the cluster only offers 5 datanodes * 1 TB = 5 TB, so the file cannot be accommodated at the default replication factor.
- Ignoring replication, the 2.5 TB of blocks would be spread across the datanodes at roughly 500 GB each (2.5 TB / 5 datanodes) and would fit.
Exercise 5:
- Total storage occupied = 800 MB * 2 (replication factor) = 1,600 MB or 1.6 GB.
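The arithmetic in these solutions can be double-checked with a few lines of Python. This is only a rough sketch: sizes are in MB, 1 GB is treated as 1,000 MB and 1 TB as 1,000,000 MB, and the last block of each file is assumed not to be padded.

```python
REPL = 3

def storage_needed(file_sizes_mb, replication=REPL):
    """Raw space consumed: every byte of the file is stored `replication` times."""
    return sum(file_sizes_mb) * replication

print(storage_needed([300] * 10))                    # Exercise 1 -> 9000 MB (9 GB)
print(storage_needed([150, 200, 350, 400, 500]))     # Exercise 2 -> 4800 MB (4.8 GB)
print(storage_needed([500]))                         # Exercise 3 -> 1500 MB (1.5 GB)
print(storage_needed([2_500_000]) <= 5 * 1_000_000)  # Exercise 4 -> False: 7.5 TB > 5 TB
print(storage_needed([800], replication=2))          # Exercise 5 -> 1600 MB (1.6 GB)
```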
In Hadoop Distributed File System (HDFS), files are divided into fixed-size blocks, typically 128 MB or 256 MB in size. When a file is stored in HDFS, it is split into these blocks, and each block is replicated across multiple datanodes for fault tolerance and parallel access.
Here's an explanation of how the blocks are copied and split in HDFS:
1. Block Splitting:
- When a file is stored in HDFS, it is divided into fixed-size blocks. The last block of a file may be smaller than the fixed size if the file size is not an exact multiple of the block size.
- The blocks of the same file are not necessarily all stored on the same machine; they are distributed across different datanodes in the cluster.
2. Block Replication:
- Each block is replicated across multiple datanodes to ensure fault tolerance and data reliability.
- By default, each block is replicated three times, meaning there are three copies of each block stored on different datanodes.
- The replication factor can be configured based on the desired level of fault tolerance and data redundancy.
3. Namenode's Role:
- The namenode maintains the metadata about the blocks and their locations in the cluster. It keeps track of which blocks belong to which files and of the datanodes on which the blocks are located.
4. Copying Process:
- When a block is written to HDFS, the client interacts with the namenode to determine the datanodes where the block should be stored. The client then writes the block directly to the chosen datanodes.
- The datanodes receive the block data and store it locally. The namenode keeps track of the block locations and their replicas.
In summary, HDFS splits files into fixed-size blocks, replicates each block across multiple
datanodes, and the namenode maintains the metadata about block locations. This approach
provides fault tolerance, parallel access, and efficient data storage and retrieval in
distributed environments.
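The copying process in point 4 above can be reduced to its sequence of interactions. The sketch below is purely illustrative: the function and node names are invented, and the real client/DataNode write pipeline is considerably more involved.

```python
def namenode_choose_targets(block_id, datanodes, replication=3):
    """NameNode: pick the DataNodes that should hold this block's replicas."""
    return datanodes[:replication]

def write_file(path, blocks, datanodes):
    """Client: for each block, ask the NameNode where to put it, then send the data."""
    block_map = {}
    for i, data in enumerate(blocks):
        block_id = f"{path}#blk_{i}"
        targets = namenode_choose_targets(block_id, datanodes)  # 1. ask the namenode
        for node in targets:                                    # 2. write directly to datanodes
            print(f"sending {block_id} ({len(data)} bytes) to {node}")
        block_map[block_id] = targets                           # 3. namenode records the replicas
    return block_map

print(write_file("/data/cars.csv", [b"x" * 10, b"y" * 7], ["dn1", "dn2", "dn3", "dn4"]))
```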
One of the machines is the HDFS master, called the namenode. This machine holds all the file names and block locations, like a big phone book.
• Another machine is the secondary namenode, a kind of backup namenode, which saves backups of the directory at regular intervals.
• Some machines are clients. These are access points to the cluster, used to connect to it and work with it.
• All of the other machines are datanodes. They store the blocks of file content.
Datanodes contain blocks. The same blocks are duplicated (replicated) on different datanodes, generally 3 times. This ensures:
- reliability: if a datanode fails, its blocks are still available on other datanodes;
- parallel access: several tasks can read copies of the same block at the same time.
For example, FunctionM retrieves the price of a car, FunctionR calculates the max of a
set of values:
def function_m(car):                # FunctionM: retrieve the price of a car
    return car["price"]
def function_r(all_prices):        # FunctionR: calculate the max of a set of values
    return max(all_prices)
data = [{"price": 12000}, {"price": 15500}]           # example input (illustrative)
print(function_r([function_m(car) for car in data]))  # -> 15500
The Map function receives a pair as input and can produce any number of pairs as output: none, one, or many, at will. The types of the inputs and outputs are free. The MAP tasks each process one pair and produce 0..n pairs; the same keys and/or values may be produced.
By default, the value, of type text, is one of the lines (or one of the tuples) of the file to process, and the key, of type integer, is the position of this line in the file (called the offset).
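As a concrete, purely hypothetical illustration, suppose each line of the file to process has the form `brand,price`. A Map function matching the description above receives the pair (offset, line) and here emits exactly one (brand, price) pair per line:

```python
def map_car_prices(offset, line):
    """Map: input is one (offset, line) pair; output is 0..n (key, value) pairs."""
    brand, price = line.split(",")       # assumed line format: "brand,price"
    return [(brand, int(price))]         # here, exactly one output pair per line

print(map_car_prices(0, "renault,12000"))    # [('renault', 12000)]
print(map_car_prices(17, "peugeot,15500"))   # [('peugeot', 15500)]
```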
The Reduce function in MapReduce processes a set of key-value pairs, all sharing the
same key. YARN launches separate Reduce instances for each unique key from Map
outputs. Reduce aggregates or processes values associated with the key and produces
result pairs. YARN manages Reduce instances, each handling pairs with a common key.
Effective MapReduce design considers keys and values, tailoring the Reduce function
logic for specific processing needs.
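Continuing the same hypothetical example, a Reduce function is called once per distinct key with all of that key's values, and here returns the maximum price seen for each brand:

```python
def reduce_max_price(brand, prices):
    """Reduce: one call per distinct key, with every value produced for that key."""
    return (brand, max(prices))

# What the framework would hand to Reduce after grouping the Map output by key:
grouped = {"renault": [12000, 9500, 14000], "peugeot": [15500, 11000]}
print([reduce_max_price(b, p) for b, p in grouped.items()])
# [('renault', 14000), ('peugeot', 15500)]
```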
2. Split: separation of the data into separately processable blocks, put in the form of (key, value) pairs.
3. Map: application of the map function to all the (key, value) pairs formed from the input data; this produces other (key, value) pairs as output.