BDA Lab File
Hadoop Installation
Aim: Set up and install Hadoop in two of its operating modes:
1. Pseudo distributed mode
2. Fully distributed mode
Definition: In Pseudo Distributed Mode, Hadoop runs on a single machine (a single node), but
all the Hadoop daemons (such as Name Node, Data Node, Resource Manager, Node Manager)
run as separate processes on the same machine. It simulates a distributed environment while
running on one machine, making it suitable for development and testing.
Requirements:
1. Install Java:
Hadoop requires Java to run. Hadoop 3.3.x supports Java 8 and Java 11 at runtime, so install
one of these versions.
Verify installation:
java -version
2. Download Hadoop:
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
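3. Extract Hadoop: Unpack the archive and move it to the directory used as HADOOP_HOME in the
next step:
tar -xzf hadoop-3.3.0.tar.gz
sudo mv hadoop-3.3.0 /usr/local/hadoop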
4. Set Environment Variables: Add Hadoop and Java environment variables in ~/.bashrc
or ~/.bash_profile:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_COMMON_HOME=$HADOOP_HOME
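The exported variables only take effect in a new shell or after the file is sourced again. A minimal sketch of a sanity check, assuming the variable names used above; check_hadoop_env is a hypothetical helper and not part of Hadoop:

```shell
# Hypothetical helper: report whether each variable exported above is set
# and points at an existing directory. Returns non-zero if any check fails.
check_hadoop_env() {
  rc=0
  for name in JAVA_HOME HADOOP_HOME HADOOP_CONF_DIR; do
    # Indirect lookup of the variable called $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name is not set"
      rc=1
    elif [ ! -d "$value" ]; then
      echo "$name points to a missing directory: $value"
      rc=1
    else
      echo "$name=$value"
    fi
  done
  return $rc
}

# Typical use after editing ~/.bashrc:
#   source ~/.bashrc
#   check_hadoop_env && hadoop version
```

If the check passes, hadoop version should print the release that was installed.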
5. Configure Hadoop: Edit the following files in $HADOOP_HOME/etc/hadoop.
core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop/hdfs/datanode</value>
  </property>
</configuration>
mapred-site.xml:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
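No yarn-site.xml appears among the configuration blocks above. Because mapreduce.framework.name is set to yarn, single-node setups commonly also enable the MapReduce shuffle service on the Node Manager. A minimal sketch under that assumption, written as a shell helper to match the other commands in this file; write_yarn_site is a hypothetical name, and other YARN properties may still be needed depending on the environment:

```shell
# Hypothetical helper: write a minimal yarn-site.xml into the given
# configuration directory (defaults to $HADOOP_CONF_DIR).
# Note: this overwrites any existing yarn-site.xml in that directory.
write_yarn_site() {
  conf_dir=${1:-${HADOOP_CONF_DIR:-/usr/local/hadoop/etc/hadoop}}
  cat > "$conf_dir/yarn-site.xml" <<'EOF'
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
EOF
}

# Usage:
#   write_yarn_site "$HADOOP_HOME/etc/hadoop"
```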
6. Format the Hadoop file system: Format the HDFS Name Node to initialize the file
system.
hdfs namenode -format
7. Start the Hadoop daemons: Start HDFS first, then YARN.
start-dfs.sh
start-yarn.sh
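Once start-dfs.sh and start-yarn.sh return, the jps tool shipped with the JDK lists the running Java processes. A minimal sketch of a check, assuming the daemon names of a default single-node setup; missing_daemons is a hypothetical helper that reads jps-style output from standard input:

```shell
# Hypothetical helper: print the name of every expected daemon that does
# not appear in jps-style input ("<pid> <ClassName>" per line).
missing_daemons() {
  listing=$(cat)
  for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    # Compare against the whole second field so that "NameNode" does not
    # match "SecondaryNameNode".
    if ! printf '%s\n' "$listing" |
         awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
      echo "$daemon"
    fi
  done
}

# Usage:
#   jps | missing_daemons
```

An empty result means all five daemons are running; any name printed points to a daemon whose log under $HADOOP_HOME/logs is worth checking.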
8. Verify Installation:
• Access the Hadoop Web UI at http://localhost:9870 for HDFS (Hadoop 2.x used port 50070) and
http://localhost:8088 for YARN.
• Run a sample Hadoop job (e.g., word count) to verify the cluster functionality.
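The word count check can be sketched as follows, assuming the examples jar that ships inside the Hadoop 3.3.0 binary distribution. The HDFS paths /input and /output and the helper names examples_jar and run_wordcount are hypothetical choices for illustration:

```shell
# Path of the MapReduce examples jar inside a Hadoop binary distribution.
examples_jar() {
  version=${1:-3.3.0}
  echo "${HADOOP_HOME:-/usr/local/hadoop}/share/hadoop/mapreduce/hadoop-mapreduce-examples-$version.jar"
}

# Hypothetical helper: copy one local file into HDFS, run word count on it,
# and print the result. The job fails if /output already exists in HDFS.
run_wordcount() {
  local_file=$1
  hdfs dfs -mkdir -p /input &&
  hdfs dfs -put -f "$local_file" /input/ &&
  hadoop jar "$(examples_jar)" wordcount /input /output &&
  hdfs dfs -cat /output/part-r-00000
}

# Usage (requires the daemons to be running):
#   run_wordcount "$HADOOP_HOME/etc/hadoop/core-site.xml"
```

While the job runs it should also appear in the YARN web UI.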
Definition: In Fully Distributed Mode, Hadoop runs on a cluster of multiple machines (nodes).
The master node runs the Name Node and Resource Manager, while each slave (worker) node runs a
Data Node and a Node Manager. This mode is intended for production environments and requires
proper configuration and management of multiple machines to provide data replication, fault
tolerance, and high availability.
Requirements:
1. Prepare the cluster nodes:
• Ensure all the machines have a unique hostname and proper network
connectivity.
• Set up passwordless SSH between all nodes in the cluster (using ssh-keygen
and copying the public key to all machines).
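The passwordless SSH setup can be sketched as follows, assuming OpenSSH's ssh-keygen and ssh-copy-id are available on the master. The user name hadoop, the host names, and the helper names run and setup_passwordless_ssh are hypothetical. By default the sketch only prints the commands it would run:

```shell
# Print the command when DRY_RUN is 1 (the default); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical helper: create a key pair once (empty passphrase), then
# install the public key on every listed node for the given user.
setup_passwordless_ssh() {
  user=$1
  shift
  key="$HOME/.ssh/id_rsa"
  [ -f "$key" ] || run ssh-keygen -t rsa -b 4096 -N "" -f "$key"
  for host in "$@"; do
    run ssh-copy-id -i "$key.pub" "$user@$host"
  done
}

# Usage:
#   setup_passwordless_ssh hadoop master worker1 worker2            # preview
#   DRY_RUN=0 setup_passwordless_ssh hadoop master worker1 worker2  # execute
```

Including the master's own host name in the list lets the start scripts log in to the local machine as well. Afterwards, ssh hadoop@worker1 from the master should log in without asking for a password.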
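2. Install Java: Install the same Java version on every node, as in pseudo distributed mode.
3. Install Hadoop:
• Follow the same process of downloading and extracting Hadoop on each node
in the cluster.
4. Configure Hadoop:
• The configuration on slave nodes is very similar to that of the master, but the
core-site.xml file on each slave must set fs.defaultFS to the master node's Name Node
URI instead of localhost.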
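A minimal sketch of that core-site.xml change, assuming the master is reachable under a host name such as master (hypothetical) and keeps port 9000 from the single-node example; write_core_site is a hypothetical helper:

```shell
# Hypothetical helper: write a core-site.xml whose fs.defaultFS points at
# the Name Node on the given master host.
# Note: this overwrites any existing core-site.xml in the target directory.
write_core_site() {
  master_host=$1
  conf_dir=${2:-${HADOOP_CONF_DIR:-/usr/local/hadoop/etc/hadoop}}
  cat > "$conf_dir/core-site.xml" <<EOF
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://$master_host:9000</value>
  </property>
</configuration>
EOF
}

# Usage on each slave node:
#   write_core_site master "$HADOOP_HOME/etc/hadoop"
```

Using the same value on the master keeps every node pointing at one Name Node address.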
6. Format the Name Node: On the Name Node (master), format the HDFS:
hdfs namenode -format
7. Start the Hadoop daemons: On the master node, start HDFS and then YARN:
start-dfs.sh
start-yarn.sh