BDA Experiment 1
Experiment No: 1
Theory:
Hadoop –
Hadoop is an open-source, Java-based framework that manages the storage and processing of large
amounts of data for applications. Hadoop uses distributed storage and parallel processing to handle big data
and analytics jobs, breaking workloads down into smaller tasks that can be run at the same time.
Four modules comprise the primary Hadoop framework and work collectively to form the Hadoop
ecosystem:
1. Hadoop Distributed File System (HDFS): As the primary storage component of the Hadoop ecosystem, HDFS
is a distributed file system in which individual Hadoop nodes operate on data that resides in their local
storage. This reduces network latency and provides high-throughput access to application data. In addition,
administrators do not need to define schemas up front.
2. Yet Another Resource Negotiator (YARN): YARN is a resource-management platform responsible for
managing compute resources in clusters and using them to schedule users’ applications. It performs
scheduling and resource allocation across the Hadoop system.
3. MapReduce: MapReduce is a programming model for large-scale data processing. In the MapReduce
model, subsets of a larger dataset, together with the instructions for processing them, are dispatched to
multiple nodes, where each subset is processed in parallel with the other subsets. After processing, the
results from the individual subsets are combined into a smaller, more manageable dataset (a sample run is
shown after this list).
4. Hadoop Common: Hadoop Common includes the libraries and utilities used and shared by other
Hadoop modules.
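
As an illustration of how HDFS, YARN, and MapReduce work together, the commands below stage a file in HDFS and submit the bundled word count example job, which YARN then schedules on the cluster. This is only a sketch: the examples jar path shown is the usual Cloudera/CDH location, and the directory and file names (words.txt, /user/cloudera/wordcount) are illustrative.

# Stage input data in HDFS
hdfs dfs -mkdir -p /user/cloudera/wordcount/input
hdfs dfs -put words.txt /user/cloudera/wordcount/input

# Submit the bundled word count MapReduce job (jar location varies by distribution)
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount \
  /user/cloudera/wordcount/input /user/cloudera/wordcount/output

# List the applications currently scheduled by YARN
yarn application -list

# Display the combined output written by the reducers
hdfs dfs -cat /user/cloudera/wordcount/output/part-r-00000
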
Beyond HDFS, YARN, and MapReduce, the entire Hadoop open-source ecosystem continues to grow and
includes many tools and applications that help collect, store, process, analyse, and manage big data. These
include Apache Pig, Apache Hive, Apache HBase, Apache Spark, Presto, and Apache Zeppelin.
5. Pig: A high-level scripting platform whose language, Pig Latin, expresses data transformations that are
compiled into MapReduce jobs.
6. Hive: A data warehouse layer that provides an SQL-like query language (HiveQL) over data stored in HDFS.
7. HBase: A distributed, column-oriented NoSQL database built on top of HDFS for low-latency random
reads and writes on large tables.
8. Flume: A service designed to efficiently collect, aggregate, and move large volumes of log and event
data into Hadoop. It is widely used for streaming data ingestion.
9. Oozie: A workflow scheduling system that orchestrates Hadoop jobs and ensures they are executed in
the correct sequence. It supports time-based scheduling and chaining of tasks.
10. Zookeeper: A distributed coordination service that provides centralized management for configuration,
synchronization, and group services in distributed systems. It ensures high availability.
11. Mahout: A library of scalable machine learning algorithms built to run on Hadoop using MapReduce.
It supports tasks like clustering, classification, and collaborative filtering.
12. Spark: A fast, in-memory data processing engine for both batch and real-time data. It supports a wide
range of tasks, from data processing to machine learning, with APIs in multiple languages.
13. Kafka: A distributed messaging platform designed for handling real-time data streams. It is highly
scalable and used for building real-time pipelines and streaming applications.
14. Flink: A stream processing framework for real-time and batch data analytics. It offers low-latency
processing and fault-tolerant distributed computation.
15. Ambari: A web-based management tool for provisioning, managing, and monitoring Hadoop clusters.
It provides an intuitive interface for configuring and monitoring Hadoop services and cluster health.
Hadoop Installation:
Step 1: Open Oracle VirtualBox.
Step 4: Allocate at least 4096 MB of base memory and a minimum of 2 CPUs, then click Next.
Step 5: Select the option "Use an Existing Virtual Hard Disk File", click the browse icon, select the
downloaded .vmdk file, click Open, and then click Create.
Select the virtual machine and click Start. Wait for the configuration to complete.
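
Once the virtual machine has booted, the Hadoop installation can be verified from the terminal. The commands below are a minimal check, assuming a Cloudera QuickStart-style VM in which the Hadoop daemons start automatically:

# Print the installed Hadoop version
hadoop version

# List the running Hadoop daemon processes (NameNode, DataNode, ResourceManager, NodeManager, ...)
sudo jps

# Report HDFS capacity, usage, and the status of each DataNode
hdfs dfsadmin -report
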
Hadoop Commands:
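
The following is a representative set of HDFS shell commands covering the basic file operations summarised in the conclusion; the paths and file names (/user/cloudera/input, sample.txt) are illustrative.

# List the contents of an HDFS directory
hdfs dfs -ls /user/cloudera

# Create a directory in HDFS
hdfs dfs -mkdir /user/cloudera/input

# Copy a file from the local file system into HDFS
hdfs dfs -put sample.txt /user/cloudera/input

# Display the contents of a file stored in HDFS
hdfs dfs -cat /user/cloudera/input/sample.txt

# Copy a file within HDFS
hdfs dfs -cp /user/cloudera/input/sample.txt /user/cloudera/input/sample_copy.txt

# Move (rename) a file within HDFS
hdfs dfs -mv /user/cloudera/input/sample.txt /user/cloudera/input/data.txt

# Copy a file from HDFS back to the local file system
hdfs dfs -get /user/cloudera/input/data.txt .

# Delete a file from HDFS
hdfs dfs -rm /user/cloudera/input/data.txt

# Delete a directory and its contents
hdfs dfs -rm -r /user/cloudera/input
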
Conclusion:
In this experiment, we successfully installed the Cloudera Hadoop virtual machine, worked with HDFS, and
performed basic file operations such as copying, moving, displaying, and deleting files. HDFS is designed to
handle large datasets across multiple machines, providing fault tolerance and high availability through
replication and its distributed architecture.