Experiment-2 BDA Lab

The document outlines the installation process for Hadoop in three modes: Standalone, Pseudo Distributed, and Fully Distributed. It provides detailed steps for setting up the environment, configuring necessary files, and verifying the installation for each mode. Additionally, it includes instructions for monitoring the Hadoop setup using web-based tools.


Week 3, 4:

2. (i) Perform setting up and installing Hadoop in its three operating modes:

a) Standalone,

b) Pseudo distributed,

c) Fully distributed

(ii) Use web-based tools to monitor your Hadoop setup.

Hadoop Operation Modes :

Local/Standalone Mode : After downloading Hadoop onto your system, it is configured in standalone mode by default and runs as a single Java process.

Pseudo-Distributed Mode : This is a distributed simulation on a single machine. Each Hadoop daemon (HDFS, YARN, MapReduce, etc.) runs as a separate Java process. This mode is useful for development.

Fully Distributed Mode : This mode is fully distributed and requires a cluster of at least two machines.

Hadoop is supported on the GNU/Linux platform and its flavors, so a Linux operating system is required for setting up the Hadoop environment. If you have an OS other than Linux, you can install VirtualBox and run Linux inside it.

Installing Hadoop in Standalone Mode :

Here we will discuss the installation of Hadoop 2.7.0 in standalone mode.

There are no daemons running and everything runs in a single JVM. Standalone mode is
suitable for running MapReduce programs during development, since it is easy to test and
debug them.

STEP-1 :

To get the required packages

This command downloads the package lists from the repositories and "updates" them with information on the newest versions of packages and their dependencies. It does this for all configured repositories and PPAs (Personal Package Archives).

$ sudo apt-get update


STEP -2 :

Installing Java

Java is the main prerequisite for Hadoop. First of all, you should verify the existence of java in
your system using the command “java -version”.

$ java -version

If Java is not installed, install it using the following commands:

$ sudo add-apt-repository ppa:webupd8team/java

$ sudo apt-get update

$ sudo apt-get install oracle-java7-installer

Pre-installation Setup

Before installing Hadoop into the Linux environment, we need to set up Linux using ssh
(Secure Shell). Follow the steps given below for setting up the Linux environment.

SSH setup is required to carry out different operations on a cluster, such as starting, stopping, and distributed daemon shell operations. To authenticate the different users of Hadoop, a public/private key pair is generated for the Hadoop user and shared across nodes.

The following commands generate a key pair using SSH, copy the public key from id_dsa.pub to authorized_keys, and give the owner read and write permissions on the authorized_keys file.

$ sudo apt-get install ssh

$ sudo apt-get install rsync

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa

$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
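
The permission step mentioned above is not shown as a command; a minimal way to do it, and to verify that passwordless login works (assuming a local single-node setup), is:

$ chmod 600 ~/.ssh/authorized_keys   # restrict the key file to the owner
$ ssh localhost                      # should log in without prompting for a password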

Downloading Hadoop:

$ wget -c http://apache.mirrors.lucidnetworks.net/hadoop/common/hadoop-2.7.0/hadoop-2.7.0.tar.gz

Extracting and installing Hadoop

$ sudo tar -zxvf hadoop-2.7.0.tar.gz

To make Hadoop accessible to all users, move the extracted directory to /usr/local/hadoop:

$ sudo mv hadoop-2.7.0 /usr/local/hadoop

update-java-alternatives updates all alternatives belonging to one runtime or development kit for the Java language. Each package provides information about its alternatives in /usr/lib/jvm/.<jname>.jinfo. The following command lists the installed Java alternatives and lets you select which one to use:

$ update-alternatives --config java
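
If you are unsure which path to use for JAVA_HOME in the steps below, one optional way to discover it (not part of the original procedure) is:

$ readlink -f /usr/bin/java
# strip the trailing /jre/bin/java (or /bin/java) from the printed path;
# the remaining directory (e.g. /usr/lib/jvm/java-7-oracle) is your JAVA_HOME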

You can set the Hadoop environment variables by appending the following lines to the ~/.bashrc file.

$ sudo gedit ~/.bashrc

#Hadoop Variables

export JAVA_HOME=/usr/lib/jvm/java-7-oracle

export HADOOP_HOME=/usr/local/hadoop

export PATH=$PATH:$HADOOP_HOME/bin

export PATH=$PATH:$HADOOP_HOME/sbin

export HADOOP_MAPRED_HOME=$HADOOP_HOME

export HADOOP_COMMON_HOME=$HADOOP_HOME

export HADOOP_HDFS_HOME=$HADOOP_HOME

export YARN_HOME=$HADOOP_HOME

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native

export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"

Reload ~/.bashrc so that the new Hadoop environment variables take effect in the current shell:

$ source ~/.bashrc
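
As a quick optional check that the variables are active in the current shell:

$ echo $HADOOP_HOME   # should print /usr/local/hadoop
$ echo $JAVA_HOME     # should print /usr/lib/jvm/java-7-oracle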

You can find all the Hadoop configuration files in the location
“$HADOOP_HOME/etc/hadoop”. It is required to make changes in those configuration files
according to your Hadoop infrastructure.

Change to the Hadoop configuration directory:

$ cd /usr/local/hadoop/etc/hadoop

Setting up JAVA_HOME path :


In order to develop Hadoop programs in Java, you have to set the Java environment variable in the hadoop-env.sh file by replacing the JAVA_HOME value with the location of Java on your system.

$ sudo gedit hadoop-env.sh

#The java implementation to use.

export JAVA_HOME="/usr/lib/jvm/java-7-oracle"

That is all that is required for the Hadoop standalone installation.

Output :-

Before proceeding further, make sure that Hadoop is working fine.

Just issue the following command:

$ hadoop version

If everything is fine with your setup, then you should see the following result:

aliet@lab1-9:~$ hadoop version

Hadoop 2.7.0

Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r d4c8d4d4d203c934e8074b31289a28724c0842cf
Compiled by jenkins on 2015-04-10T18:40Z
Compiled with protoc 2.5.0
From source with checksum a9e90912c37a35c3195d23951fd18f
This command was run using /usr/local/hadoop/hadoop-
2.7.0/share/hadoop/common/hadoop-common-2.7.0.jar

This means your Hadoop standalone mode setup is working fine. By default, Hadoop is configured to run in a non-distributed mode on a single machine.
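
Optionally, you can run one of the example MapReduce jobs bundled with the release as a standalone smoke test. This is a sketch based on the standard Apache single-node example; the jar path assumes the default layout of the Hadoop 2.7.0 distribution under $HADOOP_HOME, so adjust it if your layout differs.

$ cd ~
$ mkdir input
$ cp $HADOOP_HOME/etc/hadoop/*.xml input
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar grep input output 'dfs[a-z.]+'
$ cat output/*   # prints any configuration property names matching the pattern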

----------------------------Hadoop Stand alone Mode installation process ends here-----------------


Installing Hadoop in Pseudo Distributed Mode :

Continuing from the Hadoop standalone setup, make the following configuration changes to convert the setup to pseudo-distributed mode.

You can find all the Hadoop configuration files in the location "/usr/local/hadoop/etc/hadoop". It is required to make changes in those configuration files according to your Hadoop infrastructure, so change to that directory before editing the files:

$ cd /usr/local/hadoop/etc/hadoop

The following are the list of files that you have to edit to configure Hadoop.

$ sudo gedit core-site.xml

The core-site.xml file contains information such as the port number used for Hadoop
instance, memory allocated for the file system, memory limit for storing the data, and size of
Read/Write buffers.

Open the core-site.xml and add the following properties in between <configuration>,
</configuration> tags.

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

The hdfs-site.xml file contains information such as the value of replication data, namenode
path, and datanode paths of your local file systems. It means the place where you want to
store the Hadoop infrastructure.

$ sudo gedit hdfs-site.xml

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop/hadoop_data/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop/hadoop_data/hdfs/datanode</value>
</property>
</configuration>
Note: In the above file, all the property values are user-defined and you can make changes
according to your Hadoop infrastructure.

$sudo gedit yarn-site.xml

This file is used to configure yarn into Hadoop. Open the yarn-site.xml file and add the
following properties in between the <configuration>, </configuration> tags in this file.

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>

mapred-site.xml

This file is used to specify which MapReduce framework we are using. By default, Hadoop ships only a template of this file, so first copy mapred-site.xml.template to mapred-site.xml using the following command.

$ sudo cp mapred-site.xml.template mapred-site.xml

$sudo gedit mapred-site.xml

Open mapred-site.xml file and add the following properties in between the <configuration>,
</configuration>tags in this file.

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Verifying Hadoop Pseudo mode Installation

The following steps are used to verify the Hadoop installation.

$ cd ~

Creating the directories for the namenode and datanode

$ mkdir -p /usr/local/hadoop/hadoop_data/hdfs/namenode

$ mkdir -p /usr/local/hadoop/hadoop_data/hdfs/datanode

Changing the ownership and providing access to hadoop directories

$ sudo chown aliet:aliet -R /usr/local/hadoop

Set up the namenode using the command “hdfs namenode -format” as follows.

$ hdfs namenode -format

The following command is used to start all services (daemons). Executing this command will
start your Hadoop file system.

$ start-all.sh

$ jps

Output :-

If everything is fine with your setup, then you should see the following result:

3796 Jps
3205 SecondaryNameNode
2849 NameNode
3362 ResourceManager
3492 DataNode
3498 NodeManager
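
Once the daemons are running, an optional way to confirm that HDFS itself is usable is to create and list a directory (the directory name here is only an example):

$ hdfs dfs -mkdir -p /user/aliet
$ hdfs dfs -ls /user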
FULLY DISTRIBUTED MODE :

Hadoop Multi-Node cluster on a distributed environment

As the whole cluster cannot be demonstrated, we are explaining the Hadoop cluster
environment using three systems (one master and two slaves); given below are their IP
addresses.

Hadoop Master : 192.168.1.15 (hadoop-master)
Hadoop Slave : 192.168.1.16 (hadoop-slave-1)
Hadoop Slave : 192.168.1.17 (hadoop-slave-2)

1. Installing Java and checking the installation

2. Installing SSH and adding the key

Clone the Hadoop single-node setup as Hadoop master, slave-1, and slave-2 (take 3 systems installed with the Hadoop single-node setup and make the following changes to configure fully distributed mode, i.e. a multi-node cluster).

In Hadoop master node and slave nodes :

$ sudo gedit /etc/hosts

192.168.1.15 master
192.168.1.16 slave1
192.168.1.17 slave2

$ sudo gedit /etc/hostname

master

$ cd /usr/local/hadoop/etc/hadoop

$ sudo gedit core-site.xml

replace localhost with master (the resulting property is shown below)
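
For reference, after this change the fs.defaultFS property in core-site.xml would read as follows (keeping port 9000 from the pseudo-distributed configuration):

<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>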

$ sudo gedit hdfs-site.xml

change the dfs.replication value from 1 to 3 (the resulting property is shown below)
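
For reference, with three nodes in the cluster the dfs.replication property would become:

<property>
<name>dfs.replication</name>
<value>3</value>
</property>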

$ sudo gedit yarn-site.xml

add the following configuration


<configuration>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>master:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>master:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>master:8050</value>
</property>
</configuration>

$ sudo gedit /usr/local/hadoop/etc/hadoop/hdfs-site.xml

remove dfs.namenode.name.dir property section

$ sudo rm -rf /usr/local/hadoop/hadoop_data

$ sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/datanode

$ sudo chown -R aliet:aliet /usr/local/hadoop

Reboot master node

// In slave nodes only

Hadoop slave nodes (the configuration should be done on each slave node)

$ sudo nano /etc/hostname

slave<number>

reboot all nodes

// In Hadoop master node only

$ sudo nano /usr/local/hadoop/etc/hadoop/masters

master

$ sudo nano /usr/local/hadoop/etc/hadoop/slaves


remove localhost and add

slave1
slave2

$ sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml

replace the dfs.datanode.data.dir property section with dfs.namenode.name.dir, as shown below
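
For reference, after this replacement the property on the master would look like the following (using the namenode path configured earlier):

<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop/hadoop_data/hdfs/namenode</value>
</property>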

$ sudo rm -rf /usr/local/hadoop/hadoop_data

$ sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/namenode

$ sudo chown -R aliet:aliet /usr/local/hadoop

$ hadoop namenode -format

$ start-all.sh

$ jps (check on all 3 nodes)
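
As an optional additional check from the master node, a cluster report should list both slaves as live datanodes:

$ hdfs dfsadmin -report   # the "Live datanodes" section should show slave1 and slave2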

Output :

Checking in Browser :

http://master:8088/

http://master:50070/

(ii) Use web-based tools to monitor your Hadoop setup.

The default port number to access Hadoop (the NameNode web UI) is 50070. Use the following URL to view the Hadoop services in a browser.

http://localhost:50070

Verify All Applications for Cluster

The default port number to access all applications of the cluster (the ResourceManager web UI) is 8088. Use the following URL to visit this service.

http://localhost:8088
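
If a browser is not available on the machine, the same endpoints can also be checked from a terminal, for example with curl (an optional check):

$ curl -s http://localhost:50070 | head   # should return an HTML response from the NameNode
$ curl -s http://localhost:8088 | head    # should return an HTML response from the ResourceManager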
Fully Distributed mode

http://master:50070 (or) http://192.168.1.15:50070


http://master:8088 (or) http://192.168.1.15:8088
