
Experiment No. 1

AIM: Hadoop HDFS Practical:

i. HDFS Basics, Hadoop Ecosystem Tools Overview.
ii. Installing Hadoop.
iii. Copying files to Hadoop.
iv. Copying from the Hadoop file system and deleting files.
v. Moving and displaying files in HDFS.

Theory:
Hadoop Distributed File System (HDFS) is a key component of the Hadoop ecosystem
designed to handle large volumes of data across distributed clusters. Below is a detailed
overview of HDFS, including its architecture, key features, and how it operates:

HDFS Overview

HDFS is a distributed file system designed to run on commodity hardware and provides high-
throughput access to application data. It's optimized for large-scale data processing and is
highly scalable and fault-tolerant.

Key Components of HDFS

1. NameNode
o Role: The NameNode is the master server responsible for managing the
metadata of the HDFS. It keeps track of the structure of the file system
(directories and files) and their corresponding block locations on the
DataNodes.
o Responsibilities:
- Managing the namespace and metadata operations (e.g., opening, closing, and renaming files).
- Maintaining the file system tree and metadata for all the files and directories.
- Managing the list of DataNodes and their health.
2. DataNode
o Role: DataNodes are the worker nodes responsible for storing the actual data
blocks. They periodically send heartbeat signals and block reports to the
NameNode to report their status and the blocks they store.
o Responsibilities:
- Storing data blocks and serving read and write requests from the client.
- Handling block replication and deletion as directed by the NameNode.
3. Secondary NameNode
o Role: Despite the name, it is not a backup NameNode. The Secondary
NameNode periodically merges the namespace image and the edit log files
maintained by the NameNode to prevent the edit logs from growing
indefinitely.
o Responsibilities:
- Periodically checkpoints the namespace metadata and creates a new snapshot of the file system metadata.

HDFS Features

1. Fault Tolerance
o HDFS provides high fault tolerance by replicating each block of data multiple
times (typically three) across different DataNodes. If a DataNode fails, the
system can still access the data from other replicas.
2. Scalability
o HDFS is designed to scale out horizontally by adding more nodes to the
cluster. It can handle petabytes of data across thousands of machines.
3. High Throughput
o HDFS is optimized for high-throughput data access, making it suitable for
applications that process large datasets in parallel.
4. Data Locality
o HDFS tries to place computation close to the data to minimize network
bandwidth usage. This is crucial for performance in data-intensive
applications.
5. Write-Once, Read-Many Model
o HDFS is optimized for large files that are written once and read multiple
times. This model is well-suited for applications like big data processing.

HDFS Architecture

1. Block Size
o HDFS splits large files into fixed-size blocks (128 MB by default in current Hadoop versions; the block size is configurable, e.g., to 256 MB). Each block is replicated across multiple DataNodes to ensure reliability and availability.
2. Replication
o Each block of data is replicated to multiple DataNodes (the default replication factor is three). The NameNode ensures that the replication factor is maintained by re-replicating blocks as necessary. A command-line example follows this list.
3. Data Storage
o DataNodes store the actual data blocks and handle the read/write operations as
directed by clients and the NameNode.
4. Metadata Storage
o Metadata about the file system is stored in the NameNode's memory, and it
includes information about file names, permissions, and block locations.
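
For illustration, the block layout and replication factor of a file can be inspected and changed from the command line, assuming a running cluster; the path /user/hadoop/sample.txt below is a placeholder for an existing HDFS file:

    hdfs fsck /user/hadoop/sample.txt -files -blocks -locations   (show the file's blocks and their DataNode locations)
    hdfs dfs -setrep -w 2 /user/hadoop/sample.txt                 (change the replication factor to 2 and wait for completion)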

HDFS Operations

1. File Write
o A client writes data to HDFS by first contacting the NameNode to obtain a list
of DataNodes. The client then writes the data to these DataNodes in a pipeline
fashion, ensuring replication.
2. File Read
o When reading a file, the client contacts the NameNode to get the list of
DataNodes that have the file's blocks. It then reads the data from the
DataNodes directly.
3. Block Report
o DataNodes periodically send block reports to the NameNode, detailing the
blocks they hold. This helps the NameNode to track block replication and
detect any issues.
4. Heartbeat
o DataNodes send regular heartbeat signals to the NameNode to indicate their
availability. If a DataNode fails to send a heartbeat, the NameNode considers
it as failed and triggers replication of its blocks.
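
These mechanisms can be observed on a running cluster with the dfsadmin tool, which summarizes the DataNodes known to the NameNode, their status, and the overall cluster capacity:

    hdfs dfsadmin -report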

Administration and Management

1. HDFS Commands
o Hadoop provides a set of command-line tools for interacting with HDFS. Examples include hdfs dfs -ls (list files), hdfs dfs -put (upload files), and hdfs dfs -get (download files). A fuller walk-through is shown after this list.
2. Monitoring
o Hadoop includes web interfaces (e.g., NameNode UI and ResourceManager
UI) for monitoring cluster health, file system status, and job progress.
3. Data Integrity
o HDFS performs checksum verification on data blocks to ensure data integrity
and detect corruption.
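
The following is a minimal walk-through of the file operations listed in the AIM, assuming a running cluster and a local file named localfile.txt (all names and paths are placeholders):

    hdfs dfs -mkdir -p /user/hadoop/input /user/hadoop/archive            (create HDFS directories)
    hdfs dfs -put localfile.txt /user/hadoop/input/                       (copy a local file into HDFS)
    hdfs dfs -ls /user/hadoop/input                                       (list files in a directory)
    hdfs dfs -cat /user/hadoop/input/localfile.txt                        (display file contents)
    hdfs dfs -mv /user/hadoop/input/localfile.txt /user/hadoop/archive/   (move a file within HDFS)
    hdfs dfs -get /user/hadoop/archive/localfile.txt .                    (copy a file from HDFS to the local disk)
    hdfs dfs -rm /user/hadoop/archive/localfile.txt                       (delete a file from HDFS)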

Use Cases

- Big Data Processing: Ideal for large-scale data processing tasks like those performed by Apache Hadoop MapReduce or Apache Spark.
- Data Warehousing: Suitable for storing large datasets used in data warehousing applications.
- Log Storage: Commonly used to store logs from various applications for later analysis.

The Hadoop Distributed File System (HDFS) is a core component of the Apache Hadoop
ecosystem, which is a collection of open-source tools designed to handle big data processing
and storage. The Hadoop ecosystem integrates a range of tools that work together to manage,
process, and analyze large volumes of data. Here’s an overview of the key components of the
Hadoop ecosystem that interact with HDFS:

Core Components

1. HDFS (Hadoop Distributed File System)
o Role: A distributed file system designed for storing vast amounts of data across a cluster of machines with high fault tolerance.
2. YARN (Yet Another Resource Negotiator)
o Role: Resource management and job scheduling component. YARN manages and schedules resources in the Hadoop cluster and handles resource allocation for various applications.
3. MapReduce
o Role: A programming model and processing engine for large-scale data processing. MapReduce splits tasks into small chunks that are processed in parallel across the cluster.
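
To connect these components: a MapReduce job reads its input from HDFS and writes its output back to HDFS, with YARN allocating the resources. As a sketch, the word-count example bundled with Hadoop can be run as follows (replace <version> with your installed Hadoop version; the input and output paths are placeholders):

    hadoop jar %HADOOP_HOME%\share\hadoop\mapreduce\hadoop-mapreduce-examples-<version>.jar wordcount /user/hadoop/input /user/hadoop/output
    hdfs dfs -cat /user/hadoop/output/part-r-00000   (display the result)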

Steps for installation of Hadoop:

Step 1: Download Java JDK version 8 and install it.

Step 2: Create separate folders for Java and Hadoop on your C: drive.
Step 3: Install Java into the newly created folder on the C: drive.
Step 4: Set up the paths in the environment variables.
Step 5: Extract the downloaded Hadoop archive into the Hadoop folder.
Step 6: Configure the Hadoop configuration files (sample snippets follow this list):

- core-site.xml
- hadoop-env.cmd (set the JAVA_HOME path correctly)
- mapred-site.xml
- hdfs-site.xml
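
The exact properties depend on the Hadoop version and folder layout; a commonly used minimal single-node configuration looks like the following (the port and the C:/hadoop/data paths are assumptions to adapt to your setup).

core-site.xml:

    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

hdfs-site.xml (replication is set to 1 because there is only one DataNode):

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
      <property>
        <!-- placeholder path: adjust to your folder layout -->
        <name>dfs.namenode.name.dir</name>
        <value>file:///C:/hadoop/data/namenode</value>
      </property>
      <property>
        <!-- placeholder path: adjust to your folder layout -->
        <name>dfs.datanode.data.dir</name>
        <value>file:///C:/hadoop/data/datanode</value>
      </property>
    </configuration>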
Step 7: Check the Java and Hadoop versions to verify that both are installed properly.
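
For example, from a new Command Prompt:

    java -version      (prints the installed Java version)
    hadoop version     (prints the installed Hadoop version)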

Step 8: Open the Hadoop web interfaces in a browser to verify that the services are running (see the commands and links below).
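
Assuming HADOOP_HOME\sbin is on the PATH, the daemons can be started and then checked in the browser; the URLs below are the usual defaults and may differ by Hadoop version:

    hdfs namenode -format   (one-time only, before the first start)
    start-dfs.cmd           (starts the NameNode and DataNode)
    start-yarn.cmd          (starts the ResourceManager and NodeManager)

NameNode UI: http://localhost:9870 (Hadoop 3.x; port 50070 on Hadoop 2.x)
ResourceManager UI: http://localhost:8088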


Conclusion:
We studied HDFS basics, installed and configured Hadoop, and performed basic HDFS file operations such as copying, moving, displaying, and deleting files.
