
Unit-5: HDFS (Hadoop Distributed File System)

HDFS is designed to store and process vast amounts of data efficiently. It is inspired by the Google File System (GFS) and
optimized for big data workloads.

Key Features of HDFS Design

1. Scalability: Can handle petabytes of data across multiple machines.
2. Fault Tolerance: Data is replicated across multiple nodes to prevent data loss.
3. High Throughput: Optimized for large-scale sequential reads rather than random access.
4. Write-Once, Read-Many Model: Files are written once and accessed multiple times, improving data consistency.
5. Master-Slave Architecture: Comprises a NameNode (master) and multiple DataNodes (slaves).
6. Block-Based Storage: Files are split into fixed-size blocks (default 128 MB, commonly configured to 256 MB) for distribution across nodes.

HDFS Concepts

1. NameNode:
o Manages metadata and file directory structure.
o Keeps track of blocks and their locations.
o Heartbeat monitoring of DataNodes.

2. DataNode:
o Stores actual file blocks.
o Sends periodic heartbeat signals to NameNode.
o Performs read/write operations as per client requests.
3. Secondary NameNode:
o Not a backup of the NameNode but assists in merging edit logs with the filesystem image.
4. Blocks:
o Default block size: 128 MB (commonly configured to 256 MB).
o A file is split into blocks and distributed across multiple DataNodes.
5. Replication:
o Default replication factor is 3.
o Ensures fault tolerance and high availability.
6. Rack Awareness:
o Hadoop optimizes data placement based on rack topology to minimize network traffic.
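
Both the block size and the replication factor are configurable cluster-wide (and can be overridden per file). A minimal hdfs-site.xml sketch with illustrative values, not required settings:

xml

<configuration>
  <!-- Block size in bytes: 256 MB here (the stock default is 128 MB) -->
  <property>
    <name>dfs.blocksize</name>
    <value>268435456</value>
  </property>
  <!-- Number of replicas kept for each block (default is 3) -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>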

Hadoop File System Interfaces

Hadoop provides multiple ways to interact with HDFS:

1. File System API (Java-based API): Used for programmatic access (see the example after this list).
2. Command-Line Interface (CLI): Command-based interaction.
3. Web UI (NameNode Web Interface): Provides a browser-based view of HDFS.
4. Hadoop Archives (HAR): Optimizes storage by combining small files.
5. Other Interfaces:
o FS shell: hdfs dfs -ls /
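
As an illustration of the Java FileSystem API, the sketch below lists the contents of an HDFS directory, the programmatic equivalent of hdfs dfs -ls /. The class name and the fs.defaultFS URI are placeholders; a real client normally picks up the cluster address from core-site.xml on its classpath.

java

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsDirectory {
    public static void main(String[] args) throws Exception {
        // Configuration picks up core-site.xml / hdfs-site.xml from the classpath;
        // fs.defaultFS can also be set explicitly (placeholder URI shown here)
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        // FileSystem.get() returns the HDFS implementation (DistributedFileSystem)
        FileSystem fs = FileSystem.get(conf);

        // Equivalent of: hdfs dfs -ls /
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}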

Data Flow in HDFS

The data flow in Hadoop typically involves the following stages:

1. Data Ingestion: The process starts with data ingestion, where raw data from various sources, such as log files, databases, or streaming
data, is collected and ingested into the Hadoop ecosystem. This can be done using tools like Apache Flume or Apache Kafka.

2. Storage: In Hadoop, data is stored in a distributed file system called Hadoop Distributed File System (HDFS). HDFS breaks the data into
blocks and distributes them across multiple machines in a cluster. Each block is replicated for fault tolerance.

3. Data Processing: The core processing engine in Hadoop is Apache MapReduce, although newer frameworks like Apache Spark and Apache Flink are also commonly used. These frameworks enable distributed processing of data stored in HDFS. MapReduce divides the data into smaller chunks, assigns them to different nodes in the cluster, and processes them in parallel. The processing is performed in two main steps: the map step, where data is filtered and transformed, and the reduce step, where the processed data is aggregated or analyzed (see the word-count sketch after this list).

4. Data Transformation: Alongside processing, Hadoop provides tools for data transformation. Apache Hive and Apache Pig are popular
high-level languages that allow users to write SQL-like queries (HiveQL) or data transformation scripts (Pig Latin) to manipulate and
analyze the data stored in HDFS.

5. Data Storage: Once the data is processed and transformed, it can be stored in a structured format in systems like Apache HBase or
Apache Cassandra, which provide fast random read/write access to the data.

6. Data Analysis and Visualization: The processed data can be further analyzed and visualized using tools like Apache Spark, Apache
Impala, or business intelligence platforms like Tableau or Power BI. These tools provide interactive querying and visualization capabilities
on large datasets.

7. Data Export: Finally, the analyzed data or the derived insights can be exported to external systems or databases for further consumption
or integration with other applications.
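
To make the map and reduce steps from point 3 concrete, here is a minimal word-count sketch using the standard Hadoop MapReduce API; the class name and the HDFS input/output paths are illustrative placeholders.

java

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: filter/transform each input line into (word, 1) pairs
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce step: aggregate the counts emitted for each word
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/data/input"));    // HDFS input (placeholder)
        FileOutputFormat.setOutputPath(job, new Path("/user/data/output")); // HDFS output (placeholder)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}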

It's important to note that the Hadoop ecosystem is vast and constantly evolving, with new technologies and frameworks being introduced.
The above steps provide a general overview of the data flow in Hadoop, but specific implementations may vary based on the tools and
components used in a particular Hadoop cluster setup.

HDFS data flow includes various operations:

Write Operation

1. The client requests the NameNode to create a file.
2. The NameNode allocates DataNodes for block storage.
3. The client writes data to the first DataNode, which replicates it to the other DataNodes.
4. The NameNode updates the metadata.

Read Operation

1. The client asks the NameNode for the locations of the file's blocks.
2. The client reads data directly from the DataNodes.

Replication Process

1. The NameNode decides replica placement.
2. By default, the first copy is stored on the node local to the writer, the second on a node in a different rack, and the third on a different node in the same rack as the second.
3. DataNodes periodically send heartbeats and block reports to confirm node health and block availability.

Anatomy of Read and Write Operations in Hadoop

Anatomy of File Read in HDFS

Let’s walk through how data flows between the client, the NameNode, and the DataNodes during a file read.

Step 1: The client opens the file it wishes to read by calling open() on the FileSystem object (which for HDFS is an instance of DistributedFileSystem).
Step 2: DistributedFileSystem (DFS) calls the name node, using remote procedure calls (RPCs), to determine the locations of the first few blocks in the file. For each block, the name node returns the addresses of the data nodes that have a copy of that block. The DFS returns an FSDataInputStream to the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the data node and name node I/O.
Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the data node addresses for the first few blocks in the file, connects to the first (closest) data node holding the first block of the file.
Step 4: Data is streamed from the data node back to the client, which calls read() repeatedly on the stream.
Step 5: When the end of the block is reached, DFSInputStream closes the connection to that data node and finds the best data node for the next block. This happens transparently to the client, which from its point of view is simply reading a continuous stream. Blocks are read in order, with DFSInputStream opening new connections to data nodes as the client reads through the stream. It also calls the name node to retrieve the data node locations for the next batch of blocks as needed.
Step 6: When the client has finished reading the file, it calls close() on the FSDataInputStream.
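
The client-side API calls described above look like the following minimal sketch; the class name, file path, and cluster configuration are placeholders.

java

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);     // DistributedFileSystem for HDFS

        // Step 1: open() returns an FSDataInputStream wrapping a DFSInputStream
        Path file = new Path("/user/data/sample.txt"); // placeholder path
        FSDataInputStream in = fs.open(file);

        // Steps 3-5: read() streams data from the closest DataNode holding each block
        IOUtils.copyBytes(in, System.out, 4096, false);

        // Step 6: close the stream when finished
        IOUtils.closeStream(in);
        fs.close();
    }
}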

Anatomy of File Write in HDFS

Note: HDFS follows the write-once, read-many model. Files already stored in HDFS cannot be edited in place, but data can be appended by reopening them.

Step 1: The client creates the file by calling create() on DistributedFileSystem (DFS).
Step 2: DFS makes an RPC call to the name node to create a new file in the file system’s namespace, with no blocks associated with it. The name node performs various checks to make sure the file doesn’t already exist and that the client has the right permissions to create the file. If these checks pass, the name node makes a record of the new file; otherwise, the file can’t be created and an IOException is thrown to the client. The DFS returns an FSDataOutputStream for the client to start writing data to.
Step 3: As the client writes data, the DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the name node to allocate new blocks by picking a list of suitable data nodes to store the replicas. The list of data nodes forms a pipeline, and here we’ll assume the replication level is three, so there are three nodes in the pipeline. The DataStreamer streams the packets to the first data node in the pipeline, which stores each packet and forwards it to the second data node in the pipeline.
Step 4: Similarly, the second data node stores the packet and forwards it to the third (and last) data node in the pipeline.
Step 5: The DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by data nodes, called the “ack queue”.
Step 6: When the client has finished writing data, it calls close() on the stream. This flushes all the remaining packets to the data node pipeline and waits for acknowledgments before contacting the name node to signal that the file is complete.

HDFS follows the write-once, read-many model, so files already stored in HDFS cannot be edited in place, although data can be appended by reopening the file. This design allows HDFS to scale to a large number of concurrent clients because the data traffic is spread across all the data nodes in the cluster, which increases the availability, scalability, and throughput of the system.
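
A minimal client-side sketch of the write path follows; the class name and output path are placeholders.

java

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf); // DistributedFileSystem for HDFS

        // Steps 1-2: create() asks the NameNode to add the file to the namespace
        // and returns an FSDataOutputStream (wrapping a DFSOutputStream)
        Path file = new Path("/user/data/output.txt"); // placeholder path
        FSDataOutputStream out = fs.create(file);

        // Steps 3-5: written bytes are split into packets and pushed through the DataNode pipeline
        out.write("Hello HDFS\n".getBytes(StandardCharsets.UTF_8));

        // Step 6: close() flushes remaining packets and signals file completion to the NameNode
        out.close();
        fs.close();
    }
}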

Data Ingest with Flume and Sqoop

Hadoop provides tools like Flume and Sqoop to ingest large volumes of data.

Apache Flume (For Streaming Data)

Used to ingest log data, real-time event streams, etc.

Flume Architecture

• Source: Collects data (e.g., log files, HTTP).
• Channel: Temporary storage (e.g., memory, file, Kafka).
• Sink: Writes data to HDFS, HBase, etc.
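
As an illustration of how these three components fit together, the following is a minimal, hypothetical agent configuration (the agent name a1, the tailed log path, and the HDFS directory are placeholders) that tails a local log file into a memory channel and delivers it to HDFS:

# flume-agent.conf (illustrative sketch)
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Source: tail a local log file (placeholder path)
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1

# Channel: buffer events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: write events to HDFS (placeholder directory)
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:9000/flume/events
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.channel = c1

Such an agent is started with: flume-ng agent --conf-file flume-agent.conf --name a1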

Apache Sqoop (For Structured Data)

Used to transfer structured data between RDBMS and Hadoop.

Import Data from MySQL to HDFS
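
As a hedged illustration (the JDBC URL, credentials, table name, and target directory below are placeholders), a typical Sqoop import from MySQL into HDFS looks like:

sqoop import --connect jdbc:mysql://dbhost/sales --username dbuser -P --table customers --target-dir /user/data/customers --num-mappers 4

Sqoop runs map-only MapReduce tasks that read rows from the table in parallel and write them as files under the target HDFS directory; the companion sqoop export command moves data from HDFS back into the RDBMS.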

Hadoop Archives (HAR)

HAR files are used to combine many small files into larger ones to optimize storage.
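
This matters because every file and block consumes NameNode memory, so millions of tiny files bloat the namespace. A hedged example of creating and then listing an archive (the archive name and paths are placeholders):

hadoop archive -archiveName logs.har -p /user/raw logs /user/archives
hdfs dfs -ls har:///user/archives/logs.har

The archived files remain readable through the har:// filesystem scheme, while the NameNode only has to track the handful of files that make up the archive.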

Hadoop I/O

Hadoop uses optimized I/O mechanisms for efficient data storage and retrieval.

Hadoop is designed to handle large-scale data efficiently, and one of its key strengths is its optimized Input/Output (I/O)
mechanisms. These mechanisms help in improving storage efficiency, retrieval speed, and overall performance. The key
techniques include compression, serialization, Avro, and file-based data structures.

1. Compression

Compression plays a crucial role in Hadoop as it reduces the storage footprint and speeds up data transfer across the distributed system. Hadoop supports multiple compression formats, each with unique characteristics:

Common Compression Formats in Hadoop

• Gzip (*.gz):
o Popular and widely used compression format.
o Not splittable, meaning that when a large file is compressed using Gzip, Hadoop cannot split it across multiple nodes
for parallel processing.
o Works well for reducing file size but may slow down parallel processing in Hadoop.
• Bzip2 (*.bz2):
o Splittable, allowing Hadoop to process different parts of the file simultaneously, making it more efficient for distributed
computing.
o Provides a higher compression ratio than Gzip but is slower in terms of compression and decompression speed.
• LZO:
o Splittable, but requires an index file to enable splitting.
o Optimized for speed rather than high compression.
o Frequently used in real-time applications that require fast decompression.
• Snappy:
o Optimized for speed over compression ratio.
o Useful when quick read/write operations are required.
o Often used with Parquet and ORC formats to enhance performance in big data analytics.

By choosing the right compression format, Hadoop users can balance storage efficiency and processing speed based on their use
case.
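
For programmatic use, Hadoop exposes these formats through the CompressionCodec API. The sketch below writes a Gzip-compressed file to HDFS; the class name and output path are placeholders.

java

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class GzipWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Instantiate the Gzip codec through Hadoop's reflection utility
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

        // The codec wraps the raw HDFS output stream with a compressing stream
        Path out = new Path("/user/data/sample.txt.gz"); // placeholder path
        CompressionOutputStream cos = codec.createOutputStream(fs.create(out));
        cos.write("compressed payload written through GzipCodec".getBytes(StandardCharsets.UTF_8));
        cos.close();
        fs.close();
    }
}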

2. Serialization

Serialization is the process of converting objects into a byte stream, which allows data to be efficiently stored and transmitted across the
Hadoop ecosystem.

Writable Interface for Serialization

Hadoop provides a custom serialization framework using the Writable interface.

• Writable objects can be serialized into compact byte streams, making them efficient for network transmission and disk storage.
• Since Hadoop relies heavily on serialization for data exchange between nodes, the Writable interface significantly reduces data size
and improves performance compared to standard Java serialization.

Example of a Writable class in Hadoop:

java

import org.apache.hadoop.io.Writable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class CustomWritable implements Writable {

    private int id;
    private String name;

    // Serialization: write the fields to a compact byte stream
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(id);
        out.writeUTF(name);
    }

    // Deserialization: read the fields back in the same order they were written
    @Override
    public void readFields(DataInput in) throws IOException {
        id = in.readInt();
        name = in.readUTF();
    }
}

This lightweight serialization mechanism ensures faster read/write operations in Hadoop.
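
To see the Writable contract in action, the hedged round-trip below serializes Hadoop's built-in IntWritable and Text into a byte array and reads them back, which are the same calls Hadoop makes internally when moving records between nodes; the class name is a placeholder.

java

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class WritableRoundTrip {
    public static void main(String[] args) throws Exception {
        // Serialize two built-in Writables into a compact byte stream
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        new IntWritable(42).write(out);
        new Text("hadoop").write(out);
        out.close();

        // Deserialize them back by reading fields in the same order
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        IntWritable id = new IntWritable();
        Text name = new Text();
        id.readFields(in);
        name.readFields(in);
        System.out.println(id.get() + " " + name); // prints: 42 hadoop
    }
}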

3. Avro: A Schema-Based Efficient Data Format

Avro is a data serialization system used in Hadoop for efficient data storage and retrieval.

Features of Avro:

• Schema-Based:
o Avro uses a schema (written in JSON) to define the data structure, making it easy to process and evolve over time.
o The schema is stored along with the data, ensuring that files can be read without needing external metadata.
• Compact and Fast:
o Avro provides efficient binary serialization, reducing storage size and improving processing speed.
• Supports Multiple Programming Languages:
o Avro files can be used across languages like Java, Python, C, and Ruby, making it ideal for cross-platform data exchange.

Example of an Avro schema:

json

{
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"},
    {"name": "salary", "type": "float"}
  ]
}

This schema helps in self-describing data storage, improving interoperability in big data applications.
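
A hedged sketch of writing and reading this schema with Avro's Java API follows; the schema string matches the example above, while the class name, record values, and file name are placeholders.

java

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroEmployeeExample {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Employee\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"int\"},"
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"salary\",\"type\":\"float\"}]}");

        // Build a record that conforms to the schema
        GenericRecord emp = new GenericData.Record(schema);
        emp.put("id", 1);
        emp.put("name", "Asha");
        emp.put("salary", 55000.0f);

        // Write an Avro container file; the schema is embedded in the file header
        File file = new File("employees.avro"); // placeholder local file
        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, file);
        writer.append(emp);
        writer.close();

        // Read it back; no external metadata is needed because the schema travels with the data
        DataFileReader<GenericRecord> reader =
            new DataFileReader<>(file, new GenericDatumReader<GenericRecord>());
        for (GenericRecord r : reader) {
            System.out.println(r.get("id") + " " + r.get("name") + " " + r.get("salary"));
        }
        reader.close();
    }
}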

4. File-Based Data Structures in Hadoop

Hadoop supports specialized file formats that optimize data storage and retrieval, ensuring high performance for big data processing.

a. Sequence Files (Binary Format for Key-Value Storage)

• A binary file format designed for storing large numbers of key-value pairs efficiently.
• Often used in MapReduce jobs where intermediate data needs to be stored and processed quickly.
• Supports compression, making it more space-efficient than text files (a short writer sketch appears at the end of this section).

b. MapFiles (Indexed Sequence Files for Faster Lookups)

• A MapFile is an extension of a SequenceFile, but with an indexing mechanism for fast key-based retrieval.
• This format is useful when quick random access to data is needed.

c. Columnar Storage Formats: Parquet and ORC

Hadoop provides two optimized columnar storage formats for fast analytical queries:

• Parquet:
o Stores data in a column-oriented format.
o Reduces disk I/O by reading only the required columns instead of entire rows.
o Highly efficient for analytical queries in tools like Apache Spark, Hive, and Presto.
o Works best with structured and semi-structured data.
• ORC (Optimized Row Columnar):
o Similar to Parquet but optimized for Apache Hive.
o Provides high compression and indexing, making it faster for large-scale queries.
o Stores metadata within the file, improving query performance significantly.

Both Parquet and ORC improve query performance and storage efficiency, making them the preferred choices for big data analytics.
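
Below is the minimal SequenceFile writer sketch referenced under item (a); the class name and output path are placeholders. A MapFile is written similarly through MapFile.Writer, which additionally maintains the index used for fast key lookups.

java

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/user/data/pairs.seq"); // placeholder HDFS path

        // Create a writer for (Text, IntWritable) key-value pairs
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class));
        try {
            // Append a few key-value records; the file can also be block-compressed
            writer.append(new Text("alpha"), new IntWritable(1));
            writer.append(new Text("beta"), new IntWritable(2));
            writer.append(new Text("gamma"), new IntWritable(3));
        } finally {
            writer.close();
        }
    }
}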

Conclusion

Hadoop optimizes I/O operations through compression, serialization, Avro, and specialized file formats to ensure efficient data
storage and retrieval.

• Compression techniques help reduce storage space and improve data transfer speeds.
• Serialization (Writable Interface & Avro) enhances data storage and communication efficiency.
• File-based data structures like SequenceFiles, MapFiles, Parquet, and ORC provide optimized data storage formats for
different use cases.

By leveraging these techniques, Hadoop ensures scalability, performance, and reliability in processing massive datasets.
