Experiment No - 01
Aim: To compare different versions of Hadoop (Hadoop 1.x, Hadoop 2.x, and Hadoop 3.x).
Also, to set up a Hadoop 1.x single-node cluster.
Theory:
Comparison of different versions of Hadoop (Hadoop 1.x, Hadoop 2.x, and Hadoop 3.x):
Architecture:
Hadoop 1.x: The architecture revolves around the MapReduce programming model, where
processing and resource management are closely tied. HDFS serves as the underlying storage
system.
Hadoop 2.x: YARN is introduced to separate resource management from job execution. It
enables multiple applications to share resources efficiently, supporting not only MapReduce but
also other processing engines.
Hadoop 3.x: Improves YARN for better resource utilization, scalability, and support for diverse
workloads beyond batch processing.
Multi-tenancy:
Hadoop 1.x: Primarily designed for single-purpose batch processing jobs with limited support for
concurrent workloads.
Hadoop 2.x: Introduces better multi-tenancy, allowing various workloads like batch, interactive,
and real-time processing to coexist on the same cluster.
Hadoop 3.x: Enhances multi-tenancy features, ensuring efficient resource sharing and isolation
among different applications.
High Availability:
Hadoop 1.x: Limited high availability; the NameNode and JobTracker are single points of failure.
Hadoop 2.x: Introduces high availability for HDFS and ResourceManager, reducing the risk of
downtime.
Hadoop 3.x: Builds on previous high availability mechanisms, further improving system reliability
and fault tolerance.
Compatibility:
Hadoop 1.x: Primarily designed for MapReduce, limited compatibility with non-MapReduce
applications.
Hadoop 2.x: Enhances compatibility, supporting various processing engines like Apache Spark
and Apache Flink alongside MapReduce.
Hadoop 3.x: Continues to improve compatibility, ensuring seamless integration with emerging
big data frameworks and libraries.
Security:
Hadoop 1.x: Basic security features, with limited authentication and authorization.
Hadoop 2.x: Introduces Hadoop Secure Mode with Kerberos authentication, enhancing security.
Hadoop 3.x: Builds upon previous security features, addressing vulnerabilities and introducing
encryption improvements for data at rest and in transit.
Scalability:
Hadoop 1.x: Limited scalability, because the monolithic JobTracker handles both resource
management and job scheduling; in practice clusters topped out at roughly 4,000 nodes.
Hadoop 2.x: Scales better with YARN, allowing clusters to grow and adapt to increasing
workloads and data volumes.
Hadoop 3.x: Continues to improve scalability, addressing performance bottlenecks and
optimizing resource allocation for larger deployments.
Containerization:
Hadoop 1.x: Limited support for containerization, typically reliant on physical or virtual
machines.
Hadoop 2.x: Embraces containerization with YARN, adding experimental support for running
applications in Docker containers for better isolation and resource utilization.
Hadoop 3.x: Further refines containerization support, optimizing resource utilization and
providing more flexibility in managing dependencies.
Erasure Coding:
Hadoop 1.x: Does not support erasure coding, relying on three-way replication for fault tolerance.
Hadoop 2.x: Also lacks erasure coding; replication remains the only fault-tolerance mechanism,
at the cost of a 200% storage overhead.
Hadoop 3.x: Introduces erasure coding in HDFS, reducing the storage overhead to roughly 50%
while preserving fault tolerance, contributing to better storage utilization.
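For example, under three-way replication a 1 GB file consumes 3 GB of raw storage, while the
built-in RS-6-3 Reed-Solomon policy stores it in about 1.5 GB (6 data blocks plus 3 parity
blocks). A minimal sketch of enabling erasure coding with the Hadoop 3.x hdfs ec tool (the
/data path is a placeholder):
$ hdfs ec -enablePolicy -policy RS-6-3-1024k
$ hdfs ec -setPolicy -path /data -policy RS-6-3-1024k
$ hdfs ec -getPolicy -path /data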
NameNode Federation:
Hadoop 1.x: Utilizes a single NameNode for the entire namespace, which is both a scalability
bottleneck and a single point of failure.
Hadoop 2.x: Introduces HDFS Federation, in which multiple independent NameNodes each
manage a slice of the namespace, alongside active/standby NameNode High Availability.
Hadoop 3.x: Extends High Availability to support more than two NameNodes (one active with
multiple standbys), ensuring system robustness, especially in large-scale deployments.
Hadoop facts:
Hadoop, an open-source framework designed for distributed storage and processing of large
datasets, has become a cornerstone in the realm of big data analytics. Originating from the
Apache Software Foundation, Hadoop provides a scalable and fault-tolerant ecosystem that
empowers organizations to manage, store, and analyze vast amounts of data efficiently.
At the heart of Hadoop lies the Hadoop Distributed File System (HDFS), a distributed storage
system that allows data to be stored across multiple machines in a cluster. This architecture
ensures high availability and fault tolerance by replicating data across various nodes. HDFS
divides large files into smaller blocks, 128 megabytes by default in current releases (64
megabytes in Hadoop 1.x) and configurable per file,
distributing them across the cluster. This distributed storage model facilitates parallel processing
and efficient data retrieval for analytical purposes.
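One way to observe this block layout on a running cluster is with the fsck tool; a small sketch
(the file path is a placeholder):
$ hdfs dfs -put bigfile.csv /data/bigfile.csv
$ hdfs fsck /data/bigfile.csv -files -blocks -locations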
Hadoop's evolution has seen significant milestones with different versions bringing new features
and improvements. Hadoop 1.x marked the inception, introducing the basic HDFS and
MapReduce components. However, it had limitations in terms of scalability and lacked some
essential features, which led to the development of Hadoop 2.x.
Hadoop 2.x, a major leap forward, introduced the YARN (Yet Another Resource Negotiator)
architecture. YARN decoupled the resource management and job scheduling capabilities from
MapReduce, allowing various processing engines to coexist on the same cluster. This
enhanced multi-tenancy, enabling not only batch processing but also real-time and interactive
analytics. Hadoop 2.x also brought high availability to HDFS and ResourceManager, improving
system robustness.
Building upon the successes of Hadoop 2.x, Hadoop 3.x continued to refine and enhance the
ecosystem. It introduced features like erasure coding in HDFS to improve storage efficiency,
containerization support for better resource utilization, and various performance optimizations.
Hadoop 3.x maintained compatibility with its predecessors while addressing their limitations and
embracing emerging big data technologies.
One of the critical features of Hadoop is its adaptability to various data types and structures. It
accommodates structured and unstructured data, making it suitable for a wide range of
applications, from traditional data warehousing to modern data science and machine learning
tasks. Hadoop's flexibility enables organizations to extract value from diverse data sources,
fostering innovation and informed decision-making.
Security is another pivotal aspect of Hadoop's architecture. Hadoop 2.x introduced the Hadoop
Secure Mode, incorporating Kerberos authentication for user verification and access control.
Hadoop 3.x continued to strengthen security, addressing vulnerabilities and enhancing
encryption for data at rest and in transit. These security measures ensure the integrity and
confidentiality of data, making Hadoop a reliable platform for enterprises dealing with sensitive
information.
In addition to its core components, Hadoop has a rich ecosystem of tools and frameworks that
extend its capabilities. Apache Hive provides a SQL-like interface for querying and analyzing
data stored in Hadoop, making it accessible to users familiar with relational databases. Apache
Pig simplifies the development of complex data processing tasks through a high-level scripting
language. Apache Spark, although not originally part of Hadoop, has become an integral part of
the ecosystem, offering in-memory processing and improved performance over MapReduce.
The Hadoop ecosystem also includes Apache HBase for NoSQL database capabilities, Apache
ZooKeeper for distributed coordination, Apache Sqoop for data integration, and many more.
This extensive set of tools caters to diverse use cases and ensures that Hadoop remains a
versatile and comprehensive solution for big data processing.
The deployment and management of Hadoop clusters have been facilitated by various tools and
platforms. Apache Ambari and Cloudera Manager are examples of management tools that
simplify the installation, configuration, and monitoring of Hadoop clusters. Cloud-based solutions
like Amazon EMR, Google Cloud Dataproc, and Microsoft Azure HDInsight provide managed
Hadoop services, enabling organizations to leverage the power of Hadoop without the
complexities of cluster administration.
Output:
Installing Java:
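A typical command sequence for this step (assuming an Ubuntu virtual machine and OpenJDK 8;
the package name is an assumption, so adjust it to your distribution and to the Java version
your Hadoop release supports):
$ sudo apt update
$ sudo apt install -y openjdk-8-jdk
$ java -version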
Next, move the extracted Hadoop directory to /usr/local/hadoop using the following command:
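Assuming the Hadoop tarball has already been downloaded and extracted in the current
directory (the version number below is an assumption; substitute the release you downloaded):
$ sudo mv hadoop-3.3.6 /usr/local/hadoop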
Next, change your current working directory to /usr/local/hadoop/lib and download the
javax.activation JAR file:
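A sketch of this step; the Maven Central URL is an assumption (this JAR is only needed on
newer Java versions, where javax.activation was removed from the JDK):
$ cd /usr/local/hadoop/lib
$ sudo wget https://repo1.maven.org/maven2/javax/activation/javax.activation-api/1.2.0/javax.activation-api-1.2.0.jar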
Next, create a directory to store node metadata using the following command:
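The paths below are assumptions; any location writable by the Hadoop user works:
$ sudo mkdir -p /home/hadoop/hdfs/namenode /home/hadoop/hdfs/datanode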
And change the ownership of the created directory to the hadoop user:
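Assuming the cluster runs under a user named hadoop (substitute your own username):
$ sudo chown -R hadoop:hadoop /home/hadoop/hdfs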
By configuring the hdfs-site.xml file, you define where the node metadata, including the
fsimage file, is stored.
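A minimal sketch of the relevant properties, written with tee (the file location assumes the
standard etc/hadoop layout, and the directory paths match the ones created above; both are
assumptions):
$ sudo tee /usr/local/hadoop/etc/hadoop/hdfs-site.xml >/dev/null <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value> <!-- single-node cluster: no replication -->
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hadoop/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/hadoop/hdfs/datanode</value>
  </property>
</configuration>
EOF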
By editing the mapred-site.xml file, you define which framework executes MapReduce jobs.
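A minimal sketch, pointing MapReduce at YARN (same assumed file layout as above):
$ sudo tee /usr/local/hadoop/etc/hadoop/mapred-site.xml >/dev/null <<'EOF'
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
EOF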
The last configuration file that needs to be edited is yarn-site.xml, which configures the
YARN services.
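A minimal sketch, enabling the auxiliary shuffle service that MapReduce jobs require:
$ sudo tee /usr/local/hadoop/etc/hadoop/yarn-site.xml >/dev/null <<'EOF'
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
EOF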
Finally, use the following command to validate the Hadoop configuration and to format the
HDFS NameNode:
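The standard command is shown below; run it only once, before the first start, since
reformatting erases existing HDFS metadata:
$ hdfs namenode -format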
ayaan@ayaan-virtual-machine:/usr/local/hadoop/lib$ start-dfs.sh
Starting namenodes on [0.0.0.0]
0.0.0.0: Warning: Permanently added '0.0.0.0' (ED25519) to the list of known hosts.
Starting datanodes
Starting secondary namenodes [ayaan-virtual-machine]
ayaan-virtual-machine: Warning: Permanently added 'ayaan-virtual-machine' (ED25519) to the
list of known hosts.
ayaan@ayaan-virtual-machine:/usr/local/hadoop/lib$ jps
9568 ResourceManager
9697 NodeManager
9284 SecondaryNameNode
10118 Jps
8906 NameNode
9039 DataNode