Install Hadoop
Before starting, you can check whether Java and Hadoop are already installed on the system:
java -version
hadoop version
Running sudo apt update and sudo apt upgrade -y ensures that all system packages are up
to date. This step prevents dependency issues while installing new software like Java and
Hadoop. Updating the package list ensures we get the latest versions, and upgrading
applies security patches and software improvements.
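For reference, the update commands described above (assuming a Debian/Ubuntu system with apt):
sudo apt update
sudo apt upgrade -y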
Hadoop requires Java to execute its processes. OpenJDK 11 is a stable, widely used version
that works well with Hadoop 3.x. By installing it with sudo apt install openjdk-11-jdk -y, we
ensure that Hadoop has the necessary Java runtime environment. This step is crucial
because, without Java, Hadoop will not function.
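The installation command mentioned above:
sudo apt install openjdk-11-jdk -y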
java -version
Step 4: Download and Extract Hadoop
Hadoop is downloaded from Apache’s official website using wget. The command fetches
the Hadoop package (hadoop-3.3.6.tar.gz), which is then extracted using tar -xvzf. This
unpacks Hadoop into a directory. Finally, the extracted folder is moved to
/usr/local/hadoop, a common location for system-wide software installations. This makes
Hadoop easily accessible to all users on the system.
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
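The extraction and move steps described above, as a sketch; the target path /usr/local/hadoop follows the paragraph, while the use of sudo and the extracted folder name hadoop-3.3.6 are assumptions based on the archive name:
tar -xvzf hadoop-3.3.6.tar.gz
sudo mv hadoop-3.3.6 /usr/local/hadoop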
After installation, we need to configure environment variables to make Hadoop and Java
easily executable from any terminal session. This is done by adding the Hadoop and Java
paths to ~/.bashrc. We define JAVA_HOME, HADOOP_HOME, PATH, and
HADOOP_CONF_DIR, ensuring that the system recognizes Hadoop commands without
requiring full paths.
nano ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
Once the environment variables are added, they need to be applied for the current
session. Running source ~/.bashrc reloads the bash profile so that changes take effect
immediately without needing to restart the terminal. This ensures that any Hadoop-
related commands work as expected.
source ~/.bashrc
Hadoop requires SSH (Secure Shell) for communication between nodes in a distributed
environment. Even in a single-node setup, SSH is needed to start and stop Hadoop services
without manually logging in each time. This step is essential because Hadoop’s daemons
interact over SSH.
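If an SSH server is not already installed (an assumption about the base system, not a step from the original instructions), it can be added with the standard Ubuntu package:
sudo apt install openssh-server -y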
To enable password-less SSH login, we generate an SSH key pair using ssh-keygen -t rsa -P
"" -f ~/.ssh/id_rsa. The public key is then added to the authorized_keys file using cat
~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys. This setup allows Hadoop daemons to
communicate securely without repeatedly asking for passwords, which is crucial for
automation.
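The key-generation commands described above, collected for convenience:
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys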
ssh localhost
Core-Site Configuration:
The core-site.xml file specifies Hadoop’s core settings. The fs.defaultFS property is set to
hdfs://localhost:9000, defining HDFS as the default Hadoop filesystem. The
hadoop.tmp.dir property sets a temporary directory for Hadoop’s intermediate
operations. This configuration is necessary to initialize and manage HDFS correctly.
nano $HADOOP_HOME/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/tmp</value>
</property>
</configuration>
HDFS-Site Configuration:
The hdfs-site.xml file configures the Hadoop Distributed File System (HDFS). The
dfs.replication property is set to 1, meaning each file block is stored only once, which is
appropriate for a single-node setup. The dfs.namenode.name.dir and dfs.datanode.data.dir
properties specify the directories for storing metadata and actual file data, ensuring proper
data organization.
nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///usr/local/hadoop/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///usr/local/hadoop/hdfs/datanode</value>
</property>
</configuration>
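The directories referenced in hdfs-site.xml may need to exist and be writable by the user running Hadoop. Creating them up front, as sketched below, is an assumption about your setup rather than a step from the original instructions:
sudo mkdir -p /usr/local/hadoop/hdfs/namenode /usr/local/hadoop/hdfs/datanode
sudo chown -R $USER:$USER /usr/local/hadoop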
MapReduce Configuration:
The mapred-site.xml file controls how MapReduce jobs are executed. The
mapreduce.framework.name property is set to yarn, so jobs run on the YARN resource
management layer rather than locally. The mapreduce.jobhistory.address property points to
the JobHistory Server at localhost:10020, which keeps records of completed jobs.
nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>localhost:10020</value>
</property>
</configuration>
YARN Configuration:
The yarn-site.xml file sets up YARN, the resource management layer of Hadoop. The
yarn.resourcemanager.hostname property is set to localhost, defining where the
ResourceManager will run. The yarn.nodemanager.aux-services property is set to
mapreduce_shuffle, enabling the shuffle phase of MapReduce jobs. These settings ensure
that YARN can schedule and execute tasks.
nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>localhost</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
Before starting Hadoop for the first time, the NameNode must be formatted using hdfs
namenode -format. This command initializes the HDFS metadata and clears any previous
data. Without formatting, the system might face inconsistencies, preventing Hadoop from
functioning correctly. This step is only required for the first setup.
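The formatting command from the paragraph above:
hdfs namenode -format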
To launch Hadoop, we run start-dfs.sh to start HDFS services (NameNode and DataNode)
and start-yarn.sh to start YARN (ResourceManager and NodeManager). These scripts
initialize the distributed storage and resource management layers of Hadoop. Running
them ensures that the cluster is up and ready for processing tasks.
start-dfs.sh
start-yarn.sh
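Because mapred-site.xml points the JobHistory Server at localhost:10020, you may also want to start that daemon; this is optional and not part of the original steps:
mapred --daemon start historyserver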
After setting up Hadoop, we use the jps command to list all running Java processes. This
helps verify if essential Hadoop daemons like NameNode, DataNode, ResourceManager,
and NodeManager are running properly. If any service is missing, troubleshooting is
needed before proceeding.
jps
Expected output:
NameNode
DataNode
SecondaryNameNode
ResourceManager
NodeManager
Step 11: Hadoop Web Interfaces
By default in Hadoop 3.x, the NameNode web UI is available at http://localhost:9870 and the
YARN ResourceManager UI at http://localhost:8088. These web UIs are useful for
troubleshooting and observing cluster activity.
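A quick way to confirm the UIs are reachable from a terminal, assuming the default Hadoop 3.x ports above:
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9870   # NameNode web UI
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8088   # ResourceManager web UI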