Hadoop Installation on Ubuntu (Single Node Cluster)

This document provides a step-by-step guide for installing Hadoop on a single-node Ubuntu cluster, starting with checking Java and Hadoop versions, updating the system, and installing Java. It covers downloading and configuring Hadoop, setting up environment variables, enabling SSH, and configuring Hadoop files for core, HDFS, MapReduce, and YARN settings. Finally, it details formatting the NameNode, starting Hadoop services, verifying running services, and accessing the Hadoop web interfaces for monitoring.

Step 1: Check Java and Hadoop Versions

Before installing Hadoop, it is important to check whether Java is installed, because Hadoop is built on Java and requires it to function. The java -version command verifies the Java installation, ensuring compatibility. Similarly, hadoop version checks whether Hadoop is already installed, to avoid conflicts when setting up a new version. If Java is missing, it must be installed before proceeding with the Hadoop installation.

Before installing Hadoop, ensure that Java is installed.

java -version

hadoop version
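
If you are unsure whether either tool is present, a quick shell check makes the result explicit. This is a minimal sketch, assuming a Debian/Ubuntu system where the binaries would be on the PATH if installed:

# Print the version if the command exists, otherwise note that it is missing
command -v java >/dev/null 2>&1 && java -version || echo "Java not found - install it in Step 3"
command -v hadoop >/dev/null 2>&1 && hadoop version || echo "Hadoop not found - continue with this guide"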

Step 2: Update and Upgrade the System

Running sudo apt update and sudo apt upgrade -y ensures that all system packages are up
to date. This step prevents dependency issues while installing new software like Java and
Hadoop. Updating the package list ensures we get the latest versions, and upgrading
applies security patches and software improvements.

Updating ensures that all the installed packages are up to date.

sudo apt update

sudo apt upgrade -y

The -y flag automatically confirms updates.

Step 3: Install Java

Hadoop requires Java to execute its processes. OpenJDK 11 is a stable, widely used version
that works well with Hadoop 3.x. By installing it with sudo apt install openjdk-11-jdk -y, we
ensure that Hadoop has the necessary Java runtime environment. This step is crucial
because, without Java, Hadoop will not function.

Hadoop requires Java to run. Install OpenJDK 11 using:

sudo apt install openjdk-11-jdk -y

java -version
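
Step 5 will set JAVA_HOME to /usr/lib/jvm/java-11-openjdk-amd64, so it is worth confirming where the JDK actually landed. The following is a small check, assuming Ubuntu's update-alternatives layout:

# Resolve the real path of the java binary; JAVA_HOME is that path without the trailing /bin/java
readlink -f "$(which java)"
# List the Java installations registered on the system
update-alternatives --list java
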
Step 4: Download and Extract Hadoop

Hadoop is downloaded from Apache’s official website using wget. The command fetches
the Hadoop package (hadoop-3.3.6.tar.gz), which is then extracted using tar -xvzf. This
unpacks Hadoop into a directory. Finally, the extracted folder is moved to
/usr/local/hadoop, a common location for system-wide software installations. This makes
Hadoop easily accessible to all users on the system.

Download Hadoop from the official Apache website.

wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz

Extract the downloaded file:

tar -xvzf hadoop-3.3.6.tar.gz

Move Hadoop to the /usr/local directory for system-wide access:

sudo mv hadoop-3.3.6 /usr/local/hadoop
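
Optionally, verify the integrity of the download before extracting it. This sketch assumes the .sha512 companion file is published alongside the tarball on the same Apache download server; compare the two digests by eye or with diff:

# Download the published checksum and compare it with a locally computed digest
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz.sha512
cat hadoop-3.3.6.tar.gz.sha512
sha512sum hadoop-3.3.6.tar.gz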

Step 5: Configure Environment Variables

After installation, we need to configure environment variables to make Hadoop and Java
easily executable from any terminal session. This is done by adding the Hadoop and Java
paths to ~/.bashrc. We define JAVA_HOME, HADOOP_HOME, PATH, and
HADOOP_CONF_DIR, ensuring that the system recognizes Hadoop commands without
requiring full paths.

Edit the ~/.bashrc file to set up Hadoop and Java paths.

nano ~/.bashrc

Add the following lines at the end of the file:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

export HADOOP_HOME=/usr/local/hadoop

export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

Save and exit (Ctrl + X, then Y, then Enter).


Once the environment variables are added, they need to be applied to the current session. Running source ~/.bashrc reloads the bash profile so that the changes take effect immediately, without restarting the terminal. This ensures that Hadoop-related commands work as expected.

Apply the changes:

source ~/.bashrc
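
A quick way to confirm the variables took effect (a minimal check; these should print the paths configured above rather than empty lines):

# Confirm the environment variables and that the hadoop command is found on the PATH
echo $JAVA_HOME
echo $HADOOP_HOME
which hadoop
hadoop version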

Step 6: Enable SSH for Hadoop

Hadoop requires SSH (Secure Shell) for communication between nodes in a distributed
environment. Even in a single-node setup, SSH is needed to start and stop Hadoop services
without manually logging in each time. This step is essential because Hadoop’s daemons
interact over SSH.

To enable password-less SSH login, we generate an SSH key pair using ssh-keygen -t rsa -P
"" -f ~/.ssh/id_rsa. The public key is then added to the authorized_keys file using cat
~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys. This setup allows Hadoop daemons to
communicate securely without repeatedly asking for passwords, which is crucial for
automation.

Hadoop requires passwordless SSH access.

ssh localhost

If SSH is not installed, install it using:

sudo apt install ssh -y

Generate SSH keys and configure passwordless SSH:

ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

chmod 600 ~/.ssh/authorized_keys

Now, verify SSH:

ssh localhost
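
If ssh localhost still prompts for a password or refuses the connection, confirm that the SSH server is running and that the key files have the expected permissions. A quick check, assuming a systemd-based Ubuntu:

# Confirm the OpenSSH server is active
sudo systemctl status ssh
# The .ssh directory should be 700 and authorized_keys 600
ls -ld ~/.ssh ~/.ssh/authorized_keys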

Step 7: Configure Hadoop Files

Core-Site Configuration:

The core-site.xml file specifies Hadoop’s core settings. The <fs.defaultFS> property is set to
hdfs://localhost:9000, defining the default Hadoop filesystem as HDFS. The
<hadoop.tmp.dir> property sets a temporary directory for Hadoop’s intermediate
operations. This configuration is necessary to initialize and manage HDFS correctly.

Edit the core-site.xml file:

nano $HADOOP_HOME/etc/hadoop/core-site.xml

Add the following content:


<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop/tmp</value>
    <description>A base directory for HDFS and other temporary files.</description>
  </property>
</configuration>

Save and exit.
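
Since hadoop.tmp.dir points to /usr/local/hadoop/tmp, it can help to create that directory up front and give your user ownership of the Hadoop tree; Hadoop typically creates the directory on demand, but pre-creating it avoids permission surprises after moving Hadoop into /usr/local with sudo. A small sketch, assuming you run Hadoop as the current user:

# Create the temporary directory and hand ownership of the Hadoop tree to the current user
sudo mkdir -p /usr/local/hadoop/tmp
sudo chown -R $USER:$USER /usr/local/hadoop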

HDFS-Site Configuration:

The hdfs-site.xml file configures the Hadoop Distributed File System (HDFS). The
<dfs.replication> property is set to 1, meaning each file block is stored only once, which is
ideal for a single-node setup. <dfs.namenode.name.dir> and <dfs.datanode.data.dir>
specify directories for storing metadata and actual file data, ensuring proper data
organization.

Edit the hdfs-site.xml file:

nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Add the following content:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Number of replicas for HDFS blocks (set to 1 for a single-node cluster).</description>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///usr/local/hadoop/hdfs/namenode</value>
    <description>Directory for NameNode metadata.</description>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///usr/local/hadoop/hdfs/datanode</value>
    <description>Directory for DataNode storage.</description>
  </property>
</configuration>

Save and exit.
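
The two storage directories referenced above do not exist yet; creating them ahead of time and making sure your user can write to them keeps the NameNode format in Step 8 from failing on permissions. A minimal sketch, assuming the paths configured above:

# Create the NameNode and DataNode directories and make them writable by the current user
sudo mkdir -p /usr/local/hadoop/hdfs/namenode /usr/local/hadoop/hdfs/datanode
sudo chown -R $USER:$USER /usr/local/hadoop/hdfs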

MapReduce Configuration:

This file configures the MapReduce framework in Hadoop. The <mapreduce.framework.name> property is set to yarn, meaning Hadoop will use YARN to manage computational resources. <mapreduce.jobhistory.address> is set to localhost:10020, enabling the job history server to track completed MapReduce jobs. This configuration is essential for executing and monitoring Hadoop jobs.

Edit the mapred-site.xml file:

nano $HADOOP_HOME/etc/hadoop/mapred-site.xml

Add the following content:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>localhost:10020</value>
  </property>
</configuration>

Save and exit.
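
Because mapreduce.jobhistory.address is configured, you can also run the MapReduce JobHistory Server once the cluster is up (after Step 9). In Hadoop 3.x it is started with its own daemon command; this is optional for a single-node test setup:

# Start (and later stop) the MapReduce JobHistory Server
mapred --daemon start historyserver
mapred --daemon stop historyserver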

YARN Configuration:

The yarn-site.xml file sets up YARN, the resource management layer of Hadoop.
<yarn.resourcemanager.hostname> is set to localhost, defining where the
ResourceManager will run. <yarn.nodemanager.aux-services> is set to mapreduce_shuffle,
enabling data shuffling for MapReduce jobs. These settings ensure that YARN efficiently
schedules and executes tasks.

Edit the yarn-site.xml file:

nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

Add the following content:

<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>localhost</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>

Save and exit.
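
After editing the files, you can confirm that Hadoop is actually reading them from $HADOOP_CONF_DIR by querying a configured key with the getconf tool that ships with Hadoop:

# Should print hdfs://localhost:9000 if core-site.xml is being picked up
hdfs getconf -confKey fs.defaultFS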


Step 8: Format the NameNode

Before starting Hadoop for the first time, the NameNode must be formatted using hdfs
namenode -format. This command initializes the HDFS metadata and clears any previous
data. Without formatting, the system might face inconsistencies, preventing Hadoop from
functioning correctly. This step is only required for the first setup.

Before starting Hadoop, format the HDFS Namenode:

hdfs namenode -format
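
If the format succeeds, the NameNode metadata directory configured in hdfs-site.xml is populated; a quick look confirms it. The current/ subdirectory containing a VERSION file is the usual sign of a freshly formatted NameNode:

# A freshly formatted NameNode creates a current/ directory containing a VERSION file
ls /usr/local/hadoop/hdfs/namenode/current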

Step 9: Start Hadoop Services

To launch Hadoop, we run start-dfs.sh to start HDFS services (NameNode and DataNode)
and start-yarn.sh to start YARN (ResourceManager and NodeManager). These scripts
initialize the distributed storage and resource management layers of Hadoop. Running
them ensures that the cluster is up and ready for processing tasks.

Start the HDFS and YARN services:

start-dfs.sh

start-yarn.sh
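
Once both scripts have run, a short HDFS smoke test shows that the storage layer is actually usable (a minimal sketch; the directory name is only an example):

# Create a home directory in HDFS for the current user and list the root of the filesystem
hdfs dfs -mkdir -p /user/$USER
hdfs dfs -ls /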

Step 10: Verify Running Services

After setting up Hadoop, we use the jps command to list all running Java processes. This
helps verify if essential Hadoop daemons like NameNode, DataNode, ResourceManager,
and NodeManager are running properly. If any service is missing, troubleshooting is
needed before proceeding.

Check if Hadoop processes are running:

jps

Expected output:

 NameNode

 DataNode

 SecondaryNameNode

 ResourceManager

 NodeManager
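
If one of the daemons is missing from the jps output, its log file usually explains why. Hadoop writes per-daemon logs under $HADOOP_HOME/logs; the exact file names include your username and hostname, so the pattern below is indicative rather than exact:

# List the daemon logs and inspect the end of the NameNode log (file name pattern may differ)
ls $HADOOP_HOME/logs
tail -n 50 $HADOOP_HOME/logs/hadoop-*-namenode-*.log
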
Step 11: Hadoop Web Interfaces

You can access the following Hadoop web UIs:

Service              URL                      Description
NameNode UI          http://localhost:9870/   Shows HDFS file system status.
ResourceManager UI   http://localhost:8088/   Monitors running applications in YARN.
DataNode UI          http://localhost:9864/   Displays DataNode status.
NodeManager UI       http://localhost:8042/   Shows NodeManager details.

Hadoop provides web interfaces for real-time monitoring:

 NameNode UI (http://localhost:9870/): Shows HDFS status, including storage capacity and active nodes.
 ResourceManager UI (http://localhost:8088/): Displays running and completed YARN applications.
 DataNode UI (http://localhost:9864/): Monitors individual DataNode health.
 NodeManager UI (http://localhost:8042/): Shows the status of compute nodes.

These web UIs are useful for troubleshooting and observing cluster activity.
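
From the command line, you can quickly confirm that each UI is reachable without opening a browser (a small curl sketch; it only checks that the HTTP port answers):

# Print the HTTP status code for each web UI; 200 means the service is responding
for port in 9870 8088 9864 8042; do
  curl -s -o /dev/null -w "localhost:$port -> %{http_code}\n" http://localhost:$port/
done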
