Experiment No - 01
Aim: To compare different versions of Hadoop (Hadoop 1.x, Hadoop 2.x, and Hadoop 3.x).
Also, to set up a Hadoop 1.x single-node cluster.
Theory:
Comparison of different versions of Hadoop (Hadoop 1.x, Hadoop 2.x, and Hadoop 3.x):
Architecture:
Hadoop 1.x: The architecture revolves around the MapReduce programming model, where
processing and resource management are closely tied. HDFS serves as the underlying storage
system.
Hadoop 2.x: YARN is introduced to separate resource management from job execution. It
enables multiple applications to share resources efficiently, supporting not only MapReduce but
also other processing engines.
Hadoop 3.x: Improves YARN for better resource utilization, scalability, and support for diverse
workloads beyond batch processing.
Multi-tenancy:
Hadoop 1.x: Primarily designed for single-purpose batch processing jobs with limited support for
concurrent workloads.
Hadoop 2.x: Introduces better multi-tenancy, allowing various workloads like batch, interactive,
and real-time processing to coexist on the same cluster.
Hadoop 3.x: Enhances multi-tenancy features, ensuring efficient resource sharing and isolation
among different applications.
High Availability:
Hadoop 1.x: Limited high availability; the NameNode and JobTracker are single points of failure.
Hadoop 2.x: Introduces high availability for HDFS and ResourceManager, reducing the risk of
downtime.
Hadoop 3.x: Builds on previous high availability mechanisms, further improving system reliability
and fault tolerance.
Compatibility:
Hadoop 1.x: Primarily designed for MapReduce, limited compatibility with non-MapReduce
applications.
Hadoop 2.x: Enhances compatibility, supporting various processing engines like Apache Spark
and Apache Flink alongside MapReduce.
Hadoop 3.x: Continues to improve compatibility, ensuring seamless integration with emerging
big data frameworks and libraries.
Security:
Hadoop 1.x: Basic security features, with limited authentication and authorization.
Hadoop 2.x: Introduces Hadoop Secure Mode with Kerberos authentication, enhancing security.
Hadoop 3.x: Builds upon previous security features, addressing vulnerabilities and introducing
encryption improvements for data at rest and in transit.
Scalability:
Hadoop 1.x: Limited scalability, because the monolithic JobTracker handles both resource
management and job scheduling; in practice clusters topped out at roughly 4,000 nodes.
Hadoop 2.x: Scales better with YARN, allowing clusters to grow and adapt to increasing
workloads and data volumes.
Hadoop 3.x: Continues to improve scalability, addressing performance bottlenecks and
optimizing resource allocation for larger deployments.
Containerization:
Hadoop 1.x: Limited support for containerization, typically reliant on physical or virtual
machines.
Hadoop 2.x: Embraces containerization with YARN, adding experimental support for running
applications in Docker containers for better isolation and resource utilization.
Hadoop 3.x: Further refines containerization support, optimizing resource utilization and
providing more flexibility in managing dependencies.
Erasure Coding:
Hadoop 1.x: Does not support erasure coding, relying on three-way replication for fault tolerance.
Hadoop 2.x: Also lacks erasure coding; replication remains the only fault-tolerance mechanism,
at the cost of a 200% storage overhead.
Hadoop 3.x: Introduces erasure coding in HDFS, reducing the storage overhead to roughly 50%
while preserving fault tolerance, contributing to better storage utilization.
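For example, under three-way replication a 1 GB file consumes 3 GB of raw storage, while the
built-in RS-6-3 Reed-Solomon policy stores it in about 1.5 GB (6 data blocks plus 3 parity
blocks). A minimal sketch of enabling erasure coding with the Hadoop 3.x hdfs ec tool (the
/data path is a placeholder):
$ hdfs ec -enablePolicy -policy RS-6-3-1024k
$ hdfs ec -setPolicy -path /data -policy RS-6-3-1024k
$ hdfs ec -getPolicy -path /data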
NameNode Federation:
Hadoop 1.x: Utilizes a single NameNode for the entire namespace, which is both a scalability
bottleneck and a single point of failure.
Hadoop 2.x: Introduces HDFS Federation, in which multiple independent NameNodes each
manage a slice of the namespace, alongside active/standby NameNode High Availability.
Hadoop 3.x: Extends High Availability to support more than two NameNodes (one active with
multiple standbys), ensuring system robustness, especially in large-scale deployments.
Hadoop facts:
Hadoop, an open-source framework designed for distributed storage and processing of large
datasets, has become a cornerstone in the realm of big data analytics. Originating from the
Apache Software Foundation, Hadoop provides a scalable and fault-tolerant ecosystem that
empowers organizations to manage, store, and analyze vast amounts of data efficiently.
At the heart of Hadoop lies the Hadoop Distributed File System (HDFS), a distributed storage
system that allows data to be stored across multiple machines in a cluster. This architecture
ensures high availability and fault tolerance by replicating data across various nodes. HDFS
divides large files into smaller blocks, 128 megabytes by default in current releases (64
megabytes in Hadoop 1.x) and configurable per file,
distributing them across the cluster. This distributed storage model facilitates parallel processing
and efficient data retrieval for analytical purposes.
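One way to observe this block layout on a running cluster is with the fsck tool; a small sketch
(the file path is a placeholder):
$ hdfs dfs -put bigfile.csv /data/bigfile.csv
$ hdfs fsck /data/bigfile.csv -files -blocks -locations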
Hadoop's evolution has seen significant milestones with different versions bringing new features
and improvements. Hadoop 1.x marked the inception, introducing the basic HDFS and
MapReduce components. However, it had limitations in terms of scalability and lacked some
essential features, which led to the development of Hadoop 2.x.
Hadoop 2.x, a major leap forward, introduced the YARN (Yet Another Resource Negotiator)
architecture. YARN decoupled the resource management and job scheduling capabilities from
MapReduce, allowing various processing engines to coexist on the same cluster. This
enhanced multi-tenancy, enabling not only batch processing but also real-time and interactive
analytics. Hadoop 2.x also brought high availability to HDFS and ResourceManager, improving
system robustness.
Building upon the successes of Hadoop 2.x, Hadoop 3.x continued to refine and enhance the
ecosystem. It introduced features like erasure coding in HDFS to improve storage efficiency,
containerization support for better resource utilization, and various performance optimizations.
Hadoop 3.x maintained compatibility with its predecessors while addressing their limitations and
embracing emerging big data technologies.
One of the critical features of Hadoop is its adaptability to various data types and structures. It
accommodates structured and unstructured data, making it suitable for a wide range of
applications, from traditional data warehousing to modern data science and machine learning
tasks. Hadoop's flexibility enables organizations to extract value from diverse data sources,
fostering innovation and informed decision-making.
Security is another pivotal aspect of Hadoop's architecture. Hadoop 2.x introduced the Hadoop
Secure Mode, incorporating Kerberos authentication for user verification and access control.
Hadoop 3.x continued to strengthen security, addressing vulnerabilities and enhancing
encryption for data at rest and in transit. These security measures ensure the integrity and
confidentiality of data, making Hadoop a reliable platform for enterprises dealing with sensitive
information.
In addition to its core components, Hadoop has a rich ecosystem of tools and frameworks that
extend its capabilities. Apache Hive provides a SQL-like interface for querying and analyzing
data stored in Hadoop, making it accessible to users familiar with relational databases. Apache
Pig simplifies the development of complex data processing tasks through a high-level scripting
language. Apache Spark, although not originally part of Hadoop, has become an integral part of
the ecosystem, offering in-memory processing and improved performance over MapReduce.
The Hadoop ecosystem also includes Apache HBase for NoSQL database capabilities, Apache
ZooKeeper for distributed coordination, Apache Sqoop for data integration, and many more.
This extensive set of tools caters to diverse use cases and ensures that Hadoop remains a
versatile and comprehensive solution for big data processing.
The deployment and management of Hadoop clusters have been facilitated by various tools and
platforms. Apache Ambari and Cloudera Manager are examples of management tools that
simplify the installation, configuration, and monitoring of Hadoop clusters. Cloud-based solutions
like Amazon EMR, Google Cloud Dataproc, and Microsoft Azure HDInsight provide managed
Hadoop services, enabling organizations to leverage the power of Hadoop without the
complexities of cluster administration.
Output:
Installing Java:
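A typical command sequence for this step (assuming an Ubuntu virtual machine and OpenJDK 8;
the package name is an assumption, so adjust it to your distribution and to the Java version
your Hadoop release supports):
$ sudo apt update
$ sudo apt install -y openjdk-8-jdk
$ java -version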
Next, move the extracted Hadoop directory to /usr/local/hadoop using the following command:
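Assuming the Hadoop tarball has already been downloaded and extracted in the current
directory (the version number below is an assumption; substitute the release you downloaded):
$ sudo mv hadoop-3.3.6 /usr/local/hadoop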
Next, change your current working directory to /usr/local/hadoop/lib and download the
javax.activation JAR file:
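A sketch of this step; the Maven Central URL is an assumption (this JAR is only needed on
newer Java versions, where javax.activation was removed from the JDK):
$ cd /usr/local/hadoop/lib
$ sudo wget https://repo1.maven.org/maven2/javax/activation/javax.activation-api/1.2.0/javax.activation-api-1.2.0.jar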
Next, create a directory to store node metadata using the following command:
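The paths below are assumptions; any location writable by the Hadoop user works:
$ sudo mkdir -p /home/hadoop/hdfs/namenode /home/hadoop/hdfs/datanode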
And change the ownership of the created directory to the hadoop user:
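Assuming the cluster runs under a user named hadoop (substitute your own username):
$ sudo chown -R hadoop:hadoop /home/hadoop/hdfs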
By configuring the hdfs-site.xml file, you define where the node metadata, including the
fsimage file, is stored.
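A minimal sketch of the relevant properties, written with tee (the file location assumes the
standard etc/hadoop layout, and the directory paths match the ones created above; both are
assumptions):
$ sudo tee /usr/local/hadoop/etc/hadoop/hdfs-site.xml >/dev/null <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value> <!-- single-node cluster: no replication -->
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hadoop/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/hadoop/hdfs/datanode</value>
  </property>
</configuration>
EOF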
By editing the mapred-site.xml file, you define which framework executes MapReduce jobs.
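A minimal sketch, pointing MapReduce at YARN (same assumed file layout as above):
$ sudo tee /usr/local/hadoop/etc/hadoop/mapred-site.xml >/dev/null <<'EOF'
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
EOF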
The last configuration file that needs to be edited is yarn-site.xml, which configures the
YARN services.
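A minimal sketch, enabling the auxiliary shuffle service that MapReduce jobs require:
$ sudo tee /usr/local/hadoop/etc/hadoop/yarn-site.xml >/dev/null <<'EOF'
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
EOF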
Finally, use the following command to validate the Hadoop configuration and to format the
HDFS NameNode:
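The standard command is shown below; run it only once, before the first start, since
reformatting erases existing HDFS metadata:
$ hdfs namenode -format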
ayaan@ayaan-virtual-machine:/usr/local/hadoop/lib$ start-dfs.sh
Starting namenodes on [0.0.0.0]
0.0.0.0: Warning: Permanently added '0.0.0.0' (ED25519) to the list of known hosts.
Starting datanodes
Starting secondary namenodes [ayaan-virtual-machine]
ayaan-virtual-machine: Warning: Permanently added 'ayaan-virtual-machine' (ED25519) to the
list of known hosts.
ayaan@ayaan-virtual-machine:/usr/local/hadoop/lib$ jps
9568 ResourceManager
9697 NodeManager
9284 SecondaryNameNode
10118 Jps
8906 NameNode
9039 DataNode