BDA Experiment 1

Name – Dhruv Bedare | Year - BE | Batch – A1 | Roll No. – 136 | BDA
Experiment No: 1

Aim: Hadoop HDFS Practical:


- HDFS basics and Hadoop ecosystem tools overview.
- Installing Hadoop.
- Copying files to Hadoop.
- Copying files from the Hadoop file system and deleting files.
- Moving and displaying files in HDFS.
- Programming exercises on Hadoop.

Theory:

Hadoop –
Hadoop is an open-source, Java-based framework that manages the storage and processing of large
amounts of data for applications. Hadoop uses distributed storage and parallel processing to handle big data
and analytics jobs, breaking each workload down into smaller tasks that can run at the same time.
Four modules comprise the primary Hadoop framework and work collectively to form the Hadoop
ecosystem:
1. Hadoop Distributed File System (HDFS): As the primary component of the Hadoop ecosystem, HDFS
is a distributed file system in which individual Hadoop nodes operate on data that resides in their local
storage. This minimizes network latency and provides high-throughput access to application data. In addition,
administrators don’t need to define schemas up front.
2. Yet Another Resource Negotiator (YARN): YARN is a resource-management platform responsible for
managing compute resources in clusters and using them to schedule users’ applications. It performs
scheduling and resource allocation across the Hadoop system.
3. MapReduce: MapReduce is a programming model for large-scale data processing. In the MapReduce
model, subsets of a larger dataset, along with instructions for processing them, are dispatched to multiple
nodes, and each subset is processed in parallel with the others. After processing, the results from the
individual subsets are combined into a smaller, more manageable dataset; a sample run of a MapReduce job is shown after this list.
4. Hadoop Common: Hadoop Common includes the libraries and utilities used and shared by other
Hadoop modules.
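
As a quick illustration of the MapReduce model described in point 3, the word-count job that ships in the Hadoop examples jar can be run directly from the command line. This is a minimal sketch assuming the Cloudera QuickStart VM: the jar path below is where CDH typically installs it, the Input directory is assumed to already hold a text file in HDFS, and the Output directory must not exist beforehand.

# Map phase emits (word, 1) pairs for each line; Reduce phase sums them per word
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount Input Output
# with the default single reducer, the combined word counts land in one part file
hdfs dfs -cat Output/part-r-00000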
Beyond HDFS, YARN, and MapReduce, the entire Hadoop open-source ecosystem continues to grow and
includes many tools and applications to help collect, store, process, analyse, and manage big data. These
include Apache Pig, Apache Hive, Apache HBase, Apache Spark, Presto, and Apache Zeppelin.

Hadoop Ecosystem Tools –


The Hadoop ecosystem comprises a suite of tools designed to handle various aspects of big data
management, from storage and processing to querying and analysis. It includes components like HDFS
for distributed storage, MapReduce for batch processing, and Hive for SQL-like querying. These tools
work together to enable scalable, efficient handling of large-scale data across distributed systems.
1. HDFS (Hadoop Distributed File System): A distributed file system designed to store large datasets
across multiple machines, ensuring high fault tolerance. It splits data into blocks and replicates them
across nodes for reliability.
2. MapReduce: A programming model for processing large data sets in parallel across a cluster. It
simplifies data processing by dividing tasks into 'Map' and 'Reduce' phases for efficient batch processing.
3. YARN (Yet Another Resource Negotiator): Manages and allocates resources to applications running
on a Hadoop cluster. It allows multiple data-processing engines like MapReduce and Spark to run
simultaneously.
4. Hive: A data warehouse system that provides SQL-like querying capabilities over large datasets stored
in HDFS. Hive simplifies data analysis by abstracting MapReduce with HiveQL.
5. Pig: A high-level scripting platform for creating MapReduce programs using the Pig Latin language. It
is ideal for complex data transformations and processing pipelines.
6. HBase: A distributed NoSQL database that runs on top of HDFS, offering real-time read/write access to
large datasets. It's suited for random access patterns and sparse datasets.
7. Sqoop: A tool for transferring bulk data between Hadoop and relational databases. It simplifies the
import/export of data from databases into HDFS for analysis.

8. Flume: A service designed to efficiently collect, aggregate, and move large volumes of log and event
data into Hadoop. It is widely used for streaming data ingestion.
9. Oozie: A workflow scheduling system that orchestrates Hadoop jobs and ensures they are executed in
the correct sequence. It supports time-based scheduling and chaining of tasks.
10. Zookeeper: A distributed coordination service that provides centralized management for configuration,
synchronization, and group services in distributed systems. It ensures high availability.
11. Mahout: A library of scalable machine learning algorithms built to run on Hadoop using MapReduce.
It supports tasks like clustering, classification, and collaborative filtering.
12. Spark: A fast, in-memory data processing engine for both batch and real-time data. It supports a wide
range of tasks, from data processing to machine learning, with APIs in multiple languages.
13. Kafka: A distributed messaging platform designed for handling real-time data streams. It is highly
scalable and used for building real-time pipelines and streaming applications.
14. Flink: A stream processing framework for real-time and batch data analytics. It offers low-latency
processing and fault-tolerant distributed computation.
15. Ambari: A web-based management tool for provisioning, managing, and monitoring Hadoop clusters.
It provides an intuitive interface for configuring and monitoring Hadoop services and cluster health.
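
To illustrate how Hive (tool 4 above) hides MapReduce behind HiveQL, the statements below could be run from a terminal using the hive command-line client. This is only a sketch: the students table, its two columns, and the comma-delimited layout of StudentInfo.txt are assumptions made for illustration.

# create a hypothetical table matching a comma-separated student file
hive -e "CREATE TABLE students (roll INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';"
# move the file from HDFS into the Hive warehouse directory
hive -e "LOAD DATA INPATH 'Input/StudentInfo.txt' INTO TABLE students;"
# Hive compiles this aggregation into a MapReduce job (or another engine, depending on configuration)
hive -e "SELECT COUNT(*) FROM students;"

No Java code has to be written for the aggregation; Hive generates the job from the query.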

Hadoop Installation:
Step 1: Open Oracle VirtualBox.

Step 2: Click New in the top left corner.


Step 3: Give a name to your Cloudera virtual machine, select Type as ‘Linux’ and Version as
‘Other Linux (64-bit)’, and click Next.

Step 4: Allocate at least 4096 MB of base memory and a minimum of 2 CPUs, then click Next.

Step 5: Select the option Use an Existing Virtual Hard Disk File, click the browse link, browse to
and select the downloaded .vmdk file, click Open, and then click Create.

Step 6: Select the virtual machine and click Start, then wait for all configurations to be set up.
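
Once the VM has booted, a quick sanity check from a terminal inside the guest confirms that the Hadoop client and HDFS are reachable before trying the commands in the next section. This is a minimal sketch; the exact daemons listed by jps depend on which services the VM starts, and listing them may require sudo.

hadoop version        # prints the installed Hadoop version
hdfs dfs -ls /        # lists the HDFS root; works only if the NameNode is up
sudo jps              # shows running Java daemons such as NameNode, DataNode, and ResourceManager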

Hadoop Commands:

Sr.No.  Description                                       Command
1       Help                                              hdfs dfs -help
2       Listing of files                                  hdfs dfs -ls /
3       Creating a new directory                          hdfs dfs -mkdir Input
4       Listing files in the Input directory              hdfs dfs -ls Input
5       Listing files and directories recursively         hdfs dfs -ls -R /
6       Copying a file from the local system into HDFS    hdfs dfs -put /home/cloudera/Desktop/StudentInfo.txt Input/StudentInfo.txt
7       Retrieving a file from HDFS to the local system   hdfs dfs -get Input/StudentInfo.txt /home/cloudera/Desktop/StudentInfo1.txt
8       Displaying a file                                 hdfs dfs -cat Input/StudentInfo.txt
9       Displaying the first few lines of a text file     hdfs dfs -cat Input/StudentInfo.txt | head
10      Displaying the last few lines of a text file      hdfs dfs -tail Input/StudentInfo.txt
11      Deleting a file from HDFS                         hdfs dfs -rm Input/StudentInfo.txt
12      Deleting a directory                              hdfs dfs -rm -r Input
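
The commands above can be strung together into one short session covering the copy, display, retrieve, and delete steps from the aim. The aim also mentions moving files; the table does not list a move command, but moving and copying within HDFS are done with hdfs dfs -mv and hdfs dfs -cp. The Backup directory below is only an illustrative name.

hdfs dfs -mkdir Input
hdfs dfs -put /home/cloudera/Desktop/StudentInfo.txt Input/StudentInfo.txt
hdfs dfs -ls Input
hdfs dfs -cat Input/StudentInfo.txt
hdfs dfs -mkdir Backup
hdfs dfs -cp Input/StudentInfo.txt Backup/StudentInfo.txt         # copy within HDFS
hdfs dfs -mv Backup/StudentInfo.txt Backup/StudentInfo_old.txt    # move (rename) within HDFS
hdfs dfs -get Input/StudentInfo.txt /home/cloudera/Desktop/StudentInfo1.txt
hdfs dfs -rm Input/StudentInfo.txt
hdfs dfs -rm -r Input
hdfs dfs -rm -r Backup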

Conclusion:
In this experiment, we successfully installed Cloudera Hadoop, worked with HDFS, and performed
basic file operations such as copying, moving, displaying, and deleting files. HDFS handled large
datasets across multiple machines efficiently, providing fault tolerance and high availability
through its distributed, replicated storage.
