
Experiment No. 1

AIM: Hadoop HDFS Practical:

i. HDFS Basics, Hadoop Ecosystem Tools Overview.
ii. Installing Hadoop.
iii. Copying files to Hadoop.
iv. Copying from the Hadoop file system and deleting files.
v. Moving and displaying files in HDFS.

Theory:
Hadoop Distributed File System (HDFS) is a key component of the Hadoop ecosystem
designed to handle large volumes of data across distributed clusters. Below is a detailed
overview of HDFS, including its architecture, key features, and how it operates:

HDFS Overview

HDFS is a distributed file system designed to run on commodity hardware and provides high-
throughput access to application data. It's optimized for large-scale data processing and is
highly scalable and fault-tolerant.

Key Components of HDFS

1. NameNode
o Role: The NameNode is the master server responsible for managing the
metadata of the HDFS. It keeps track of the structure of the file system
(directories and files) and their corresponding block locations on the
DataNodes.
o Responsibilities:
- Managing the namespace and metadata operations (e.g., opening, closing, and renaming files).
- Maintaining the file system tree and metadata for all the files and directories.
- Managing the list of DataNodes and their health.
2. DataNode
o Role: DataNodes are the worker nodes responsible for storing the actual data
blocks. They periodically send heartbeat signals and block reports to the
NameNode to report their status and the blocks they store.
o Responsibilities:
- Storing data blocks and serving read and write requests from the client.
- Handling block replication and deletion as directed by the NameNode.
3. Secondary NameNode
o Role: Despite the name, it is not a backup NameNode. The Secondary
NameNode periodically merges the namespace image and the edit log files
maintained by the NameNode to prevent the edit logs from growing
indefinitely.
o Responsibilities:
- Periodically checkpoints the namespace metadata and creates a new snapshot of the file system metadata.

HDFS Features

1. Fault Tolerance
o HDFS provides high fault tolerance by replicating each block of data multiple
times (typically three) across different DataNodes. If a DataNode fails, the
system can still access the data from other replicas.
2. Scalability
o HDFS is designed to scale out horizontally by adding more nodes to the
cluster. It can handle petabytes of data across thousands of machines.
3. High Throughput
o HDFS is optimized for high-throughput data access, making it suitable for
applications that process large datasets in parallel.
4. Data Locality
o HDFS tries to place computation close to the data to minimize network
bandwidth usage. This is crucial for performance in data-intensive
applications.
5. Write-Once, Read-Many Model
o HDFS is optimized for large files that are written once and read multiple
times. This model is well-suited for applications like big data processing.

HDFS Architecture

1. Block Size
o HDFS splits large files into fixed-size blocks (128 MB by default in current Hadoop versions; the block size is configurable, e.g., to 256 MB). Each block is replicated across multiple DataNodes to ensure reliability and availability.
2. Replication
o Each block of data is replicated to multiple DataNodes (the default replication factor is three). The NameNode ensures that the replication factor is maintained by re-replicating blocks as necessary. A command-line example follows this list.
3. Data Storage
o DataNodes store the actual data blocks and handle the read/write operations as
directed by clients and the NameNode.
4. Metadata Storage
o Metadata about the file system is stored in the NameNode's memory, and it
includes information about file names, permissions, and block locations.
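
For illustration, the block layout and replication factor of a file can be inspected and changed from the command line, assuming a running cluster; the path /user/hadoop/sample.txt below is a placeholder for an existing HDFS file:

    hdfs fsck /user/hadoop/sample.txt -files -blocks -locations   (show the file's blocks and their DataNode locations)
    hdfs dfs -setrep -w 2 /user/hadoop/sample.txt                 (change the replication factor to 2 and wait for completion)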

HDFS Operations

1. File Write
o A client writes data to HDFS by first contacting the NameNode to obtain a list
of DataNodes. The client then writes the data to these DataNodes in a pipeline
fashion, ensuring replication.
2. File Read
o When reading a file, the client contacts the NameNode to get the list of
DataNodes that have the file's blocks. It then reads the data from the
DataNodes directly.
3. Block Report
o DataNodes periodically send block reports to the NameNode, detailing the
blocks they hold. This helps the NameNode to track block replication and
detect any issues.
4. Heartbeat
o DataNodes send regular heartbeat signals to the NameNode to indicate their
availability. If a DataNode fails to send a heartbeat, the NameNode considers
it as failed and triggers replication of its blocks.
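
These mechanisms can be observed on a running cluster with the dfsadmin tool, which summarizes the DataNodes known to the NameNode, their status, and the overall cluster capacity:

    hdfs dfsadmin -report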

Administration and Management

1. HDFS Commands
o Hadoop provides a set of command-line tools for interacting with HDFS. Examples include hdfs dfs -ls (list files), hdfs dfs -put (upload files), and hdfs dfs -get (download files). A fuller walk-through is shown after this list.
2. Monitoring
o Hadoop includes web interfaces (e.g., NameNode UI and ResourceManager
UI) for monitoring cluster health, file system status, and job progress.
3. Data Integrity
o HDFS performs checksum verification on data blocks to ensure data integrity
and detect corruption.
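
The following is a minimal walk-through of the file operations listed in the AIM, assuming a running cluster and a local file named localfile.txt (all names and paths are placeholders):

    hdfs dfs -mkdir -p /user/hadoop/input /user/hadoop/archive            (create HDFS directories)
    hdfs dfs -put localfile.txt /user/hadoop/input/                       (copy a local file into HDFS)
    hdfs dfs -ls /user/hadoop/input                                       (list files in a directory)
    hdfs dfs -cat /user/hadoop/input/localfile.txt                        (display file contents)
    hdfs dfs -mv /user/hadoop/input/localfile.txt /user/hadoop/archive/   (move a file within HDFS)
    hdfs dfs -get /user/hadoop/archive/localfile.txt .                    (copy a file from HDFS to the local disk)
    hdfs dfs -rm /user/hadoop/archive/localfile.txt                       (delete a file from HDFS)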

Use Cases

- Big Data Processing: Ideal for large-scale data processing tasks like those performed by Apache Hadoop MapReduce or Apache Spark.
- Data Warehousing: Suitable for storing large datasets used in data warehousing applications.
- Log Storage: Commonly used to store logs from various applications for later analysis.

The Hadoop Distributed File System (HDFS) is a core component of the Apache Hadoop
ecosystem, which is a collection of open-source tools designed to handle big data processing
and storage. The Hadoop ecosystem integrates a range of tools that work together to manage,
process, and analyze large volumes of data. Here’s an overview of the key components of the
Hadoop ecosystem that interact with HDFS:

Core Components

1. HDFS (Hadoop Distributed File System)
o Role: A distributed file system designed for storing vast amounts of data across a cluster of machines with high fault tolerance.
2. YARN (Yet Another Resource Negotiator)
o Role: Resource management and job scheduling component. YARN manages and schedules resources in the Hadoop cluster and handles resource allocation for various applications.
3. MapReduce
o Role: A programming model and processing engine for large-scale data processing. MapReduce splits tasks into small chunks that are processed in parallel across the cluster.
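
To connect these components: a MapReduce job reads its input from HDFS and writes its output back to HDFS, with YARN allocating the resources. As a sketch, the word-count example bundled with Hadoop can be run as follows (replace <version> with your installed Hadoop version; the input and output paths are placeholders):

    hadoop jar %HADOOP_HOME%\share\hadoop\mapreduce\hadoop-mapreduce-examples-<version>.jar wordcount /user/hadoop/input /user/hadoop/output
    hdfs dfs -cat /user/hadoop/output/part-r-00000   (display the result)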

Steps for installation of Hadoop:

Step 1: Download Java JDK version 8 and install it.

Step 2: Create separate folders for Java and Hadoop on your C: drive.
Step 3: Install Java into the newly created folder on the C: drive.
Step 4: Set up the paths in the environment variables.
Step 5: Extract the downloaded Hadoop archive into the Hadoop folder.
Step 6: Configure the Hadoop configuration files (sample snippets follow this list):

- core-site.xml
- hadoop-env.cmd (set the JAVA_HOME path correctly)
- mapred-site.xml
- hdfs-site.xml
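
The exact properties depend on the Hadoop version and folder layout; a commonly used minimal single-node configuration looks like the following (the port and the C:/hadoop/data paths are assumptions to adapt to your setup).

core-site.xml:

    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

hdfs-site.xml (replication is set to 1 because there is only one DataNode):

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
      <property>
        <!-- placeholder path: adjust to your folder layout -->
        <name>dfs.namenode.name.dir</name>
        <value>file:///C:/hadoop/data/namenode</value>
      </property>
      <property>
        <!-- placeholder path: adjust to your folder layout -->
        <name>dfs.datanode.data.dir</name>
        <value>file:///C:/hadoop/data/datanode</value>
      </property>
    </configuration>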
Step 7: Check the Java and Hadoop versions to verify that both are installed properly.
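
For example, from a new Command Prompt:

    java -version      (prints the installed Java version)
    hadoop version     (prints the installed Hadoop version)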

Step 8: Open the Hadoop web interfaces in a browser to verify that the services are running (see the commands and links below).
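
Assuming HADOOP_HOME\sbin is on the PATH, the daemons can be started and then checked in the browser; the URLs below are the usual defaults and may differ by Hadoop version:

    hdfs namenode -format   (one-time only, before the first start)
    start-dfs.cmd           (starts the NameNode and DataNode)
    start-yarn.cmd          (starts the ResourceManager and NodeManager)

NameNode UI: http://localhost:9870 (Hadoop 3.x; port 50070 on Hadoop 2.x)
ResourceManager UI: http://localhost:8088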


Conclusion:
We studied HDFS basics, installed and configured Hadoop, and performed basic HDFS file operations such as copying, moving, displaying, and deleting files.
