BDA Lab File

The document outlines the installation process of Hadoop in two modes: Pseudo Distributed Mode, which runs on a single machine for development and testing, and Fully Distributed Mode, which operates on a cluster of multiple machines for production environments. It details the requirements and step-by-step procedures for setting up Hadoop, including configuring environment variables, modifying configuration files, and starting Hadoop daemons. Additionally, it highlights the differences between the two modes in terms of deployment, use case, resource requirements, performance, fault tolerance, and complexity.


Department of Information Technology

Hadoop Installation

Aim: Perform the setup and installation of Hadoop in its two operating modes
1. Pseudo distributed mode
2. Fully distributed mode

1. Pseudo Distributed Mode

Definition: In Pseudo Distributed Mode, Hadoop runs on a single machine (a single node), but
all the Hadoop daemons (NameNode, DataNode, ResourceManager, NodeManager) run as
separate processes on that machine. It simulates a distributed environment while running on
one machine, making it suitable for development and testing.

Requirements:

• A single machine or VM running Linux or macOS (or Windows with Cygwin).
• Java installed (version 8 or later).
• Passwordless SSH access to localhost configured (required even on a single
machine).
• Sufficient memory and storage (at least 4 GB RAM and 20 GB+ disk space).

Steps to Install in Pseudo Distributed Mode:

1. Install Java:

Hadoop requires Java to run, so you need to install the correct version of Java (usually Java 8
or later).

sudo apt-get update
sudo apt-get install openjdk-8-jdk

Verify installation:
java -version

2. Download Hadoop:

wget https://downloads.apache.org/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz

3. Extract Hadoop:

tar -xzvf hadoop-3.3.0.tar.gz
sudo mv hadoop-3.3.0 /usr/local/hadoop

Harsimran Singh, 2237869



4. Set Environment Variables: Add Hadoop and Java environment variables in ~/.bashrc
or ~/.bash_profile:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_COMMON_HOME=$HADOOP_HOME

Reload the shell configuration to apply the changes:

source ~/.bashrc

5. Configure Hadoop: Modify the Hadoop configuration files in the
$HADOOP_HOME/etc/hadoop/ directory:

o core-site.xml: Add the following configuration:

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

o hdfs-site.xml: Define the directories for HDFS:

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop/hdfs/datanode</value>
</property>
</configuration>
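The directories named in hdfs-site.xml must exist on disk before the NameNode is formatted. A minimal sketch, assuming the /usr/local/hadoop paths used above and that the daemons will run as the current (non-root) user:

```shell
# Create the local storage directories referenced by
# dfs.namenode.name.dir and dfs.datanode.data.dir above.
sudo mkdir -p /usr/local/hadoop/hdfs/namenode /usr/local/hadoop/hdfs/datanode

# Make them writable by the user that will run the Hadoop daemons.
sudo chown -R "$USER" /usr/local/hadoop/hdfs
```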

o mapred-site.xml (required if running MapReduce jobs on YARN):

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
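Because the daemons started below include YARN, a minimal yarn-site.xml is usually needed as well so the NodeManager exposes the MapReduce shuffle service. This file is not mentioned in the steps above; the value shown is the standard one:

```xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
```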

6. Format the Hadoop file system: Format the HDFS NameNode to initialize the file
system:

hdfs namenode -format




7. Start Hadoop Daemons:


o Start the HDFS and YARN daemons:

start-dfs.sh
start-yarn.sh

8. Verify Installation:
o Run jps to confirm that the NameNode, DataNode, ResourceManager, and
NodeManager processes are running.
o Access the Hadoop Web UI at http://localhost:9870 for HDFS (Hadoop 3.x;
port 50070 in Hadoop 2.x) and http://localhost:8088 for YARN.
o Run a sample Hadoop job (e.g., word count) to verify the cluster functionality.
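The word-count run can be sketched as follows. It assumes the daemons started in step 7 are running and uses the examples jar that ships with the Hadoop 3.3.0 download; the /input and /output HDFS paths are placeholders:

```shell
# Put a local file into HDFS, run the bundled word-count example,
# and print the result.
hdfs dfs -mkdir -p /input
hdfs dfs -put "$HADOOP_HOME/etc/hadoop/core-site.xml" /input
hadoop jar "$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar" \
  wordcount /input /output
hdfs dfs -cat /output/part-r-00000
```

Note that the job fails if the /output directory already exists in HDFS; remove it first with hdfs dfs -rm -r /output when re-running.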

2. Fully Distributed Mode

Definition: In Fully Distributed Mode, Hadoop runs on a cluster of multiple machines (nodes).
The master node runs the NameNode and ResourceManager, while each worker (slave) node
runs a DataNode and NodeManager. This mode is intended for production environments and
requires proper configuration and management of multiple machines, providing data
replication, fault tolerance, and high availability.

Requirements:

• Multiple machines (a minimum of 3 nodes is recommended).
• SSH access configured on each machine.
• Java installed on all nodes.
• Network communication set up between all the machines.
• Hadoop configured to use multiple nodes (NameNode, DataNodes, etc.).

Steps to Install in Fully Distributed Mode:

1. Set Up Network and SSH:

• Ensure all the machines have a unique hostname and proper network
connectivity.
• Set up passwordless SSH between all nodes in the cluster (using ssh-keygen
and copying the public key to all machines).
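The passwordless SSH step can be sketched as below. A temporary key location is used here for illustration (in practice the default ~/.ssh/id_rsa is used), and node2 is a placeholder for one of your own hosts:

```shell
# Generate an RSA key pair with an empty passphrase (run once per node).
KEYDIR="$(mktemp -d)"
ssh-keygen -t rsa -N "" -f "$KEYDIR/id_rsa" -q

# Copy the public key to every other node in the cluster, e.g.:
#   ssh-copy-id -i "$KEYDIR/id_rsa.pub" user@node2
# (left commented out here because it needs a live remote host)

# The private and public key files now exist locally.
ls "$KEYDIR"
```

Afterwards, `ssh node2` from the master should log in without prompting for a password.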

2. Install Java:

• Install Java on all nodes (similar to Pseudo Distributed Mode).

3. Install Hadoop on All Nodes:

• Follow the same process of downloading and extracting Hadoop on each node
in the cluster.

4. Configure Hadoop on Master Node (NameNode and ResourceManager):

• Similar to the steps above, configure the core-site.xml, hdfs-site.xml, and
mapred-site.xml on the master node (usually the first node in the cluster).
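One piece the steps above do not show is telling the start-up scripts which machines are workers. On the master, this is done by listing one worker hostname per line in $HADOOP_HOME/etc/hadoop/workers (named slaves in Hadoop 2.x); the hostnames below are placeholders:

```
worker1
worker2
```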




5. Configure Hadoop on Slave Nodes (DataNodes and NodeManagers):

• The configuration on the slave nodes is very similar, but fs.defaultFS in
core-site.xml must point to the master node's NameNode URI (e.g.
hdfs://master-node:9000).

6. Format the NameNode: On the master node, format HDFS:

hdfs namenode -format

7. Start Hadoop Cluster:

• On the master node, start the HDFS and YARN daemons:

start-dfs.sh
start-yarn.sh

• These scripts use SSH and the workers file to start the DataNode and
NodeManager daemons on all worker nodes automatically, so they do not
need to be run again on the slave nodes.

8. Verify the Cluster:

• Check the Hadoop Web UI on the master node (http://master-node:9870 for
HDFS in Hadoop 3.x, port 50070 in Hadoop 2.x, and http://master-node:8088
for YARN).
• Ensure that the cluster is functioning with all nodes listed.
9. Differences Between Pseudo Distributed Mode and Fully Distributed
Mode:

Feature              | Pseudo Distributed Mode                        | Fully Distributed Mode
---------------------|------------------------------------------------|------------------------------------------------
Deployment           | Runs on a single machine.                      | Runs on a cluster of multiple machines.
Use Case             | Development, testing, learning.                | Production environments with fault tolerance.
Hadoop Daemons       | All daemons run on the same machine.           | Daemons are spread across different machines.
Resource Requirement | Single machine with sufficient resources.      | Multiple machines, each with sufficient resources.
Performance          | Limited by the hardware of the single machine. | Scales with the number of machines.
Fault Tolerance      | None (a single-node failure affects everything).| High availability and fault tolerance.
Complexity           | Easier to set up, minimal configuration.       | More complex; requires proper setup of nodes, networking, and configuration.

