UNIT-5-HDFS (Hadoop Distributed File System)
HDFS is designed to store and process vast amounts of data efficiently. It is inspired by the Google File System (GFS) and
optimized for big data workloads.
HDFS Concepts
1. NameNode:
o Manages metadata and file directory structure.
o Keeps track of blocks and their locations.
o Heartbeat monitoring of DataNodes.
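The NameNode answers metadata queries such as "which DataNodes hold the blocks of this file?". As a minimal sketch (the HDFS path and class name are assumed for illustration), a client can ask for block locations through the FileSystem API:
java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/sample.txt");   // hypothetical HDFS path

        // Metadata-only call answered by the NameNode; no block data is transferred
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " -> DataNodes: " + String.join(", ", block.getHosts()));
        }
    }
}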
Data Flow in Hadoop
1. Data Ingestion: The process starts with data ingestion, where raw data from various sources, such as log files, databases, or streaming data, is collected and ingested into the Hadoop ecosystem. This can be done using tools like Apache Flume or Apache Kafka.
2. Storage: In Hadoop, data is stored in a distributed file system called Hadoop Distributed File System (HDFS). HDFS breaks the data into
blocks and distributes them across multiple machines in a cluster. Each block is replicated for fault tolerance.
3. Data Processing: The core processing engine in Hadoop is Apache MapReduce, although newer frameworks like Apache Spark and Apache Flink are also commonly used. These frameworks enable distributed processing of data stored in HDFS. MapReduce divides the processing into a map phase and a reduce phase, whose tasks run in parallel across the cluster (a minimal mapper sketch follows after this overview).
4. Data Transformation: Alongside processing, Hadoop provides tools for data transformation. Apache Hive and Apache Pig are popular
high-level languages that allow users to write SQL-like queries (HiveQL) or data transformation scripts (Pig Latin) to manipulate and
analyze the data stored in HDFS.
5. Data Storage: Once the data is processed and transformed, it can be stored in a structured format in systems like Apache HBase or
Apache Cassandra, which provide fast random read/write access to the data.
6. Data Analysis and Visualization: The processed data can be further analyzed and visualized using tools like Apache Spark, Apache
Impala, or business intelligence platforms like Tableau or Power BI. These tools provide interactive querying and visualization capabilities
on large datasets.
7. Data Export: Finally, the analyzed data or the derived insights can be exported to external systems or databases for further consumption
or integration with other applications.
It's important to note that the Hadoop ecosystem is vast and constantly evolving, with new technologies and frameworks being introduced.
The above steps provide a general overview of the data flow in Hadoop, but specific implementations may vary based on the tools and
components used in a particular Hadoop cluster setup.
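To make the map/reduce split from step 3 concrete, here is a minimal word-count mapper sketch (the class name and tokenization are assumptions, not a prescribed implementation); the reduce phase would simply sum the emitted counts per word:
java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit (word, 1) for every token in the input line
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}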
Write Operation
Read Operation
Replication Process
Let’s get an idea of how data flows between the client interacting with HDFS, the NameNode, and the DataNodes.
Note: HDFS follows the Write Once, Read Many model. Files already stored in HDFS cannot be edited in place, but data can be appended by reopening the files. This design allows HDFS to scale to a large number of concurrent clients because the data traffic is spread across all the DataNodes in the cluster, which increases the availability, scalability, and throughput of the system.
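As a concrete illustration of the write-once/append model (paths and record contents are assumed), the sketch below writes a file through the FileSystem API, reads it back, and then appends to it instead of editing it in place:
java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class WriteOnceReadMany {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/data/events.log");   // hypothetical HDFS path

        // Write once: create the file (fail rather than overwrite an existing one)
        try (FSDataOutputStream out = fs.create(path, false)) {
            out.write("first record\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read many: any number of clients can open and stream the file
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }

        // No in-place edits, but new data can be appended (where the cluster supports append)
        try (FSDataOutputStream out = fs.append(path)) {
            out.write("second record\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}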
Hadoop provides tools such as Apache Flume (for collecting streaming and log data) and Apache Sqoop (for bulk transfer between HDFS and relational databases) to ingest large volumes of data.
Flume Architecture
A Flume deployment is built from agents, each consisting of a source that receives events, a channel that buffers them, and a sink that delivers them to a destination such as HDFS.
HAR (Hadoop Archive) files are used to combine many small files into larger archives, reducing NameNode metadata overhead and optimizing storage.
Hadoop I/O
Hadoop is designed to handle large-scale data efficiently, and one of its key strengths is its optimized Input/Output (I/O) mechanisms, which improve storage efficiency, retrieval speed, and overall performance. The key techniques include compression, serialization, Avro, and file-based data structures.
1. Compression
Compression plays a crucial role in Hadoop as it reduces the storage footprint and speeds up data transfer across the distributed system.
Hadoop supports multiple compression formats, each with unique characteristics:
• gzip (DEFLATE): good compression ratio, but a compressed file cannot be split across map tasks.
• bzip2: higher compression than gzip and splittable, at the cost of slower compression and decompression.
• LZO: fast compression and decompression; splittable when an index is built for the file.
• Snappy: very fast with moderate compression; typically used inside container formats such as SequenceFile, Avro, Parquet, or ORC.
By choosing the right compression format, Hadoop users can balance storage efficiency and processing speed based on their use
case.
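As a small illustration (the file name and sample data are assumed), the sketch below uses Hadoop's CompressionCodec API to gzip-compress an output stream; the same codec classes are used when compressing MapReduce outputs or SequenceFile blocks:
java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;
import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;

public class CompressionExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Instantiate the codec through Hadoop's reflection utility so it sees the configuration
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

        // Wrap an ordinary output stream; everything written to it is gzip-compressed
        try (CompressionOutputStream out =
                     codec.createOutputStream(new FileOutputStream("data.txt.gz"))) {
            out.write("some repetitive log data ...".getBytes(StandardCharsets.UTF_8));
        }
    }
}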
2. Serialization (Writable Interface)
Serialization is the process of converting objects into a byte stream so that data can be efficiently stored and transmitted across the Hadoop ecosystem. Hadoop defines its own compact serialization format through the Writable interface.
• Writable objects can be serialized into compact byte streams, making them efficient for network transmission and disk storage.
• Since Hadoop relies heavily on serialization for data exchange between nodes, the Writable interface significantly reduces data size
and improves performance compared to standard Java serialization.
java
import org.apache.hadoop.io.Writable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
// Example custom Writable (class name assumed) holding an employee id and name
public class EmployeeWritable implements Writable {
    private int id;
    private String name;
    @Override
    public void write(DataOutput out) throws IOException { // serialize fields to the byte stream
        out.writeInt(id);
        out.writeUTF(name);
    }
    @Override
    public void readFields(DataInput in) throws IOException { // deserialize fields in the same order
        id = in.readInt();
        name = in.readUTF();
    }
}
3. Avro
Avro is a data serialization system used in Hadoop for efficient data storage and retrieval.
Features of Avro:
• Schema-Based:
o Avro uses a schema (written in JSON) to define the data structure, making it easy to process and evolve over time.
o The schema is stored along with the data, ensuring that files can be read without needing external metadata.
• Compact and Fast:
o Avro provides efficient binary serialization, reducing storage size and improving processing speed.
• Supports Multiple Programming Languages:
o Avro files can be used across languages like Java, Python, C, and Ruby, making it ideal for cross-platform data exchange.
json
{
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"}
  ]
}
This schema helps in self-describing data storage, improving interoperability in big data applications (the field list shown is illustrative, matching the Writable example above).
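For illustration, here is a minimal sketch using Avro's generic API (the schema file name and field values are assumptions) that writes one Employee record into an Avro data file with the schema embedded in the file header:
java
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import java.io.File;

public class AvroWriteExample {
    public static void main(String[] args) throws Exception {
        // Parse the Employee schema shown above (stored here as employee.avsc)
        Schema schema = new Schema.Parser().parse(new File("employee.avsc"));

        GenericRecord employee = new GenericData.Record(schema);
        employee.put("id", 1);
        employee.put("name", "Alice");

        // The schema is written into the file header, making the file self-describing
        try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("employees.avro"));
            writer.append(employee);
        }
    }
}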
4. File-Based Data Structures
Hadoop supports specialized file formats that optimize data storage and retrieval, ensuring high performance for big data processing.
• A SequenceFile stores data as binary key-value pairs and is often used to pack many small files into a single, splittable container.
• A MapFile is an extension of a SequenceFile, with an indexing mechanism for fast key-based retrieval.
• This format is useful when quick random access to data is needed (a short SequenceFile sketch follows below).
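A minimal sketch (key/value types and output path are assumptions) of writing key-value pairs into a SequenceFile; a MapFile is written in much the same way through MapFile.Writer, which additionally maintains the index:
java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("employees.seq");   // hypothetical output path

        // A SequenceFile stores binary key-value pairs; keys and values are Writables
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class))) {
            writer.append(new IntWritable(1), new Text("Alice"));
            writer.append(new IntWritable(2), new Text("Bob"));
        }
    }
}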
Hadoop provides two optimized columnar storage formats for fast analytical queries:
• Parquet:
o Stores data in a column-oriented format.
o Reduces disk I/O by reading only the required columns instead of entire rows.
o Highly efficient for analytical queries in tools like Apache Spark, Hive, and Presto.
o Works best with structured and semi-structured data.
• ORC (Optimized Row Columnar):
o Similar to Parquet but optimized for Apache Hive.
o Provides high compression and indexing, making it faster for large-scale queries.
o Stores metadata within the file, improving query performance significantly.
Both Parquet and ORC improve query performance and storage efficiency, making them the preferred choices for big data analytics.
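As an illustration only (assuming the parquet-avro library is on the classpath; the class and file names below come from that assumption), Parquet files are commonly written in Java by pairing an Avro schema with AvroParquetWriter:
java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import java.io.File;

public class ParquetWriteExample {
    public static void main(String[] args) throws Exception {
        // Reuse the Employee Avro schema from the previous section
        Schema schema = new Schema.Parser().parse(new File("employee.avsc"));

        GenericRecord employee = new GenericData.Record(schema);
        employee.put("id", 1);
        employee.put("name", "Alice");

        // Column values are stored contiguously and compressed per column
        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(new Path("employees.parquet"))
                .withSchema(schema)
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .build()) {
            writer.write(employee);
        }
    }
}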
Hadoop optimizes I/O operations through compression, serialization, Avro, and specialized file formats to ensure efficient data
storage and retrieval.
• Compression techniques help reduce storage space and improve data transfer speeds.
• Serialization (Writable Interface & Avro) enhances data storage and communication efficiency.
• File-based data structures like SequenceFiles, MapFiles, Parquet, and ORC provide optimized data storage formats for
different use cases.
By leveraging these techniques, Hadoop ensures scalability, performance, and reliability in processing massive datasets.