Experiment-2 BDA Lab
2. (i) Perform setting up and Installing Hadoop in its three operating modes:
a) Standalone,
b) Pseudo distributed,
c) Fully distributed
Fully Distributed Mode : This mode is fully distributed, with a minimum of two or more machines forming a cluster.
Hadoop is supported on the GNU/Linux platform and its flavors. Therefore, we have to install a Linux operating system to set up the Hadoop environment. In case you have an OS other than Linux, you can install VirtualBox and run Linux inside it.
Standalone Mode : There are no daemons running and everything runs in a single JVM. Standalone mode is suitable for running MapReduce programs during development, since it is easy to test and debug them.
STEP-1 : Update the package index.
The apt-get update command downloads the package lists from the repositories and "updates" them to get information on the newest versions of packages and their dependencies. It does this for all repositories and PPAs (Personal Package Archives).
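On Ubuntu this step is typically run as follows (assuming sudo privileges):
$ sudo apt-get update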
Installing Java
Java is the main prerequisite for Hadoop. First of all, you should verify the existence of java in
your system using the command “java -version”.
$ java -version
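If Java is not installed, one common route on Ubuntu for Oracle Java 7 (an assumption that matches the JAVA_HOME path used later, /usr/lib/jvm/java-7-oracle) is the WebUpd8 PPA:
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer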
Pre-installation Setup
Before installing Hadoop into the Linux environment, we need to set up Linux using ssh
(Secure Shell). Follow the steps given below for setting up the Linux environment.
The following commands are used for generating a key pair using SSH. Copy the public key from id_dsa.pub to authorized_keys, and provide the owner with read and write permissions to the authorized_keys file.
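A sketch of the usual commands, matching the id_dsa.pub file named above:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
$ chmod 600 ~/.ssh/authorized_keys
You can verify passwordless login with "ssh localhost".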
Downloading Hadoop:
$ wget -c http://apache.mirrors.lucidnetworks.net/hadoop/common/hadoop-2.7.0/hadoop-2.7.0.tar.gz
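Once downloaded, extract the archive and move it to the install location assumed by the environment variables below (/usr/local/hadoop is a local choice, not a requirement):
$ tar xzf hadoop-2.7.0.tar.gz
$ sudo mv hadoop-2.7.0 /usr/local/hadoop
$ sudo chown -R $USER /usr/local/hadoop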
You can set Hadoop environment variables by appending the following commands to
~/.bashrc file.
#Hadoop Variables
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
$ source ~/.bashrc
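source reloads ~/.bashrc into the current shell so the new variables take effect immediately. You can confirm with:
$ echo $HADOOP_HOME
/usr/local/hadoop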
You can find all the Hadoop configuration files in the location
“$HADOOP_HOME/etc/hadoop”. It is required to make changes in those configuration files
according to your Hadoop infrastructure.
$ cd /usr/local/hadoop/etc/hadoop
In hadoop-env.sh, reset the Java environment variable by replacing its JAVA_HOME value with the location of Java on your system:
export JAVA_HOME="/usr/lib/jvm/java-7-oracle"
$ hadoop version
Output :-
If everything is fine with your setup, then you should see the following result:
Hadoop 2.7.0
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r
d4c8d4d4d203c934e8074b31289a28724c0842cf
Compiled by jenkins on 2015-04-10T18:40Z
Compiled with protoc 2.5.0
From source with checksum a9e90912c37a35c3195d23951fd18f
This command was run using /usr/local/hadoop/hadoop-2.7.0/share/hadoop/common/hadoop-common-2.7.0.jar
This means your Hadoop standalone mode setup is working fine. By default, Hadoop is configured to run in a non-distributed mode on a single machine.
Continuing from the standalone mode setup, we have to make the following configuration changes to convert the Hadoop setup into Pseudo-Distributed mode.
As noted above, the Hadoop configuration files are in "/usr/local/hadoop/etc/hadoop" and must be changed according to your Hadoop infrastructure, so change into that directory before editing the files as follows :
$ cd /usr/local/hadoop/etc/hadoop
The following is the list of files that you have to edit to configure Hadoop.
core-site.xml
The core-site.xml file contains information such as the port number used for the Hadoop instance, the memory allocated for the file system, the memory limit for storing the data, and the size of the Read/Write buffers.
Open core-site.xml and add the following properties between the <configuration> and </configuration> tags.
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
hdfs-site.xml
The hdfs-site.xml file contains information such as the replication value, the namenode path, and the datanode paths of your local file systems, that is, the places where you want to store the Hadoop data.
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop/hadoop_data/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop/hadoop_store/hdfs/datanode</value>
</property>
</configuration>
Note: In the above file, all the property values are user-defined and you can make changes
according to your Hadoop infrastructure.
yarn-site.xml
This file is used to configure YARN in Hadoop. Open the yarn-site.xml file and add the following properties between the <configuration> and </configuration> tags.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
mapred-site.xml
This file is used to specify which MapReduce framework we are using. By default, Hadoop contains a template of this file named mapred-site.xml.template. First of all, copy mapred-site.xml.template to mapred-site.xml using the following command.
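$ cp mapred-site.xml.template mapred-site.xml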
Open the mapred-site.xml file and add the following properties between the <configuration> and </configuration> tags.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Verifying the Hadoop Pseudo-Distributed Mode Installation
$ cd ~
$ mkdir -p /usr/local/hadoop/hadoop_data/hdfs/namenode
$ mkdir -p /usr/local/hadoop/hadoop_data/hdfs/datanode
Set up the namenode using the command "hdfs namenode -format" as follows:
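$ hdfs namenode -format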
The following command is used to start all the services (daemons). Executing this command will start your Hadoop file system and the YARN daemons.
$ start-all.sh
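Note : start-all.sh is deprecated in Hadoop 2.x; running start-dfs.sh followed by start-yarn.sh is the equivalent. jps then lists the Java daemons that are running: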
$ jps
Output :-
If everything is fine with your setup, then you should see the following result:
3796 Jps
3205 SecondaryNameNode
2849 NameNode
3362 ResourceManager
3492 DataNode
3498 NodeManager
FULLY DISTRIBUTED MODE :
As the whole cluster cannot be demonstrated, we explain the Hadoop cluster environment using three systems (one master and two slaves); their IP addresses are given below.
Clone the Hadoop single-node cluster as Hadoop master, slave-1 and slave-2.
(Take 3 systems installed with the Hadoop single-node cluster and make the following changes to configure fully distributed mode (a multi-node cluster).)
192.168.1.15 master
192.168.1.16 slave1
192.168.1.17 slave2
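These three entries go into /etc/hosts on every machine (master and both slaves) so that the nodes can resolve one another by hostname; the IP addresses shown are examples from this lab's network:
$ sudo nano /etc/hosts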
On the master :
$ cd /usr/local/hadoop/etc/hadoop
In hdfs-site.xml, replace the dfs.replication value 1 with 3, so that blocks are replicated across all three nodes.
Edit the slaves file so that it lists every node that should run a datanode, one hostname per line (the slave<number> names defined above):
master
slave1
slave2
Only the master runs the namenode, so keep the dfs.namenode.name.dir property in the master's hdfs-site.xml; the slaves need only dfs.datanode.data.dir.
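Passwordless SSH from the master to each slave is also required, since the start scripts log in to the slaves over SSH; copy the master's public key into each slave's authorized_keys as in the pseudo-distributed setup. Then, on the master, format the namenode once before the first start:
$ hdfs namenode -format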
$ start-all.sh
Output :
Checking in Browser :
http://master:8088/
http://master:50070/
(ii) Use web-based tools to monitor your Hadoop setup.
The default port number of the Hadoop NameNode web UI is 50070. Use the following URL to get Hadoop services in a browser.
http://localhost:50070
The default port number of the ResourceManager web UI, which shows all applications of the cluster, is 8088. Use the following URL to visit this service.
http://localhost:8088