Experiment 1 Hadoop Installation

The document provides step-by-step instructions for installing Java and Hadoop on a single node. It covers downloading and extracting Hadoop, configuring environment variables, formatting HDFS, and starting the daemons to launch a single node Hadoop cluster in pseudo-distributed mode for testing.

1. Installation of Hadoop
Aim: To install Java and Hadoop.
Introduction:
The Apache Hadoop software library is a framework that allows for the distributed processing of
large data sets across clusters of computers using simple programming models. It is designed to
scale up from single servers to thousands of machines, each offering local computation and storage.
Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and
handle failures at the application layer, so delivering a highly-available service on top of a cluster of
computers, each of which may be prone to failures.
Modules:
Hadoop includes these main modules (components):

• Hadoop Common: The common utilities and libraries that support the other Hadoop modules.
• Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data. It is responsible for persisting data to disk. HDFS is a file system that is distributed over numerous nodes.
• Hadoop YARN: A framework for job scheduling and cluster resource management. Yet Another Resource Negotiator (YARN) acts as the "operating system" of the cluster, managing and allocating resources across its layers.
• Hadoop MapReduce: A YARN-based system for parallel processing of large data sets. It is a framework for developing applications that handle a massive amount of data. It distributes work across the cluster (the map step), then organizes and reduces the results from the nodes into a response to a query (the reduce step). Many other processing models are available for the 3.x version of Hadoop. A rough single-machine analogy of the map/reduce flow is sketched below.
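That analogy is an ordinary shell pipeline, not a Hadoop command, with input.txt standing in for any text file. It is the classic word-count flow that MapReduce parallelizes: split the input into records, group identical keys, and aggregate each group:

cat input.txt | tr -s ' ' '\n' | sort | uniq -c

Here tr plays the role of map (one word per line), sort plays the shuffle (grouping identical words together), and uniq -c plays reduce (counting each group).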
Who Uses Hadoop?
A wide variety of companies and organizations use Hadoop for both research and production. Users are encouraged to add themselves to the Hadoop PoweredBy wiki page.
Hadoop is a Java-based programming framework that supports the processing and storage of extremely large datasets on a cluster of inexpensive machines. It was the first major open source project in the big data playing field and is sponsored by the Apache Software Foundation. Hadoop has been used for machine learning and data mining, and for managing multiple dedicated servers.

Single Node Hadoop Deployment (Pseudo-Distributed Mode)


Hadoop excels when deployed in a fully distributed mode on a large cluster of networked servers.
However, if you are new to Hadoop and want to explore basic commands or test applications, you
can configure Hadoop on a single node. It is suitable for learning about Hadoop, performing simple
operations, and debugging.
This setup, also called pseudo-distributed mode, runs each Hadoop daemon as a separate Java process on a single machine. At the end of this tutorial, you can run one of the example MapReduce programs Hadoop ships with to verify the installation.
Installation Starts Here
First remove any existing ssh and pdsh packages; a fresh OpenSSH is installed in the next step, and pdsh is known to interfere with Hadoop's start-up scripts.
sudo apt-get remove ssh

sudo apt-get remove pdsh

Configure Password-less SSH

sudo apt install openssh-server openssh-client -y

ssh-keygen -t rsa -P ''

Generating public/private rsa key pair.


(Note: in place of "user", your own system username will be shown; type the path with your username substituted.)

Enter file in which to save the key (/home/csm-6/.ssh/id_rsa): /home/user/.ssh/id_rsa


/home/user/.ssh/id_rsa already exists.
Overwrite (y/n)? y

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

chmod 640 ~/.ssh/authorized_keys

ssh localhost
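The first connection will ask you to confirm the host's key; answer yes. If you are logged in without being asked for a password, the key setup worked. Leave the test session and return to your original shell:

exit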

Installation Steps for Java and Hadoop


1. Install Java (Note: do this only if Java is not already installed; otherwise skip this step.)
sudo apt install default-jdk default-jre -y

2. Check Java version


java -version
3. Download a stable Hadoop release (3.3.1 is used here)
sudo wget https://downloads.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
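Optionally, verify the download against the checksum Apache publishes next to the archive (a sanity check, assuming the release is still hosted at downloads.apache.org; older releases move to archive.apache.org):

wget https://downloads.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz.sha512
sha512sum hadoop-3.3.1.tar.gz

Compare the hash printed by sha512sum with the contents of the .sha512 file before extracting.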

4. Extract the downloaded tar file


sudo tar -xvzf hadoop-3.3.1.tar.gz

5. Create Hadoop directory


To ensure that all of your files are organised in one location, move the extracted directory to
/usr/local/.
sudo mv hadoop-3.3.1 /usr/local/hadoop

6. To store Hadoop's logs, create a separate directory called logs inside /usr/local/hadoop.
sudo mkdir /usr/local/hadoop/logs

7. Use the following command to change the directory's ownership.


sudo chown -R hadoop:hadoop /usr/local/hadoop
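The command above assumes a dedicated hadoop user and group exist on the machine. If you are doing the whole experiment as your normal login user (as the note in step 15 suggests), give ownership to that user instead:

sudo chown -R $USER:$USER /usr/local/hadoop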
8. Configure Hadoop
sudo nano ~/.bashrc

9. After executing the above command, the nano editor opens in your terminal; paste the following lines at the end of the file:

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS=" -Djava.library.path=$HADOOP_HOME/lib/native"

Press CTRL + S to save and CTRL + X to exit the nano editor after copying the lines above.
10. After closing the nano editor, use the following command to load the environment variables into the current shell.
source ~/.bashrc
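To confirm that the variables took effect, check that the shell now finds Hadoop's binaries in the new location:

echo $HADOOP_HOME
which hadoop

The second command should print /usr/local/hadoop/bin/hadoop.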

Configure Java Environment Variables

Hadoop reads its Java settings from the hadoop-env.sh configuration file; the variables defined there apply to all of its components, including YARN, HDFS, and MapReduce.

11. Find the Java path and the OpenJDK directory with the help of the following commands:

which javac

readlink -f /usr/bin/javac
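readlink prints the full path of the javac binary (it ends in /bin/javac). JAVA_HOME is that path with the trailing /bin/javac stripped off, which can be computed in one line:

dirname $(dirname $(readlink -f /usr/bin/javac))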

12. Edit the hadoop-env.sh file

Open the hadoop-env.sh file in your preferred text editor first. In this case, I'll use nano.

sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Now add the following lines at the end of the file:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_CLASSPATH+=" $HADOOP_HOME/lib/*.jar"
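The java-8-openjdk-amd64 path above is correct only if Java 8 is what was installed; default-jdk on recent Ubuntu releases pulls in Java 11 or newer. If the directory you found in step 11 differs, use that instead, for example by deriving it dynamically:

export JAVA_HOME=$(dirname $(dirname $(readlink -f /usr/bin/javac)))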

13. Javax activation

Download the javax.activation jar into the Hadoop lib directory. First change into that directory:
cd /usr/local/hadoop/lib

Now copy and paste the following command into your terminal to download the javax.activation jar:
sudo wget https://jcenter.bintray.com/javax/activation/javax.activation-api/1.2.0/javax.activation-api-1.2.0.jar
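jcenter.bintray.com has been sunset and may refuse downloads. The same artifact is mirrored on Maven Central, so if the command above fails, this should fetch an identical jar:

sudo wget https://repo1.maven.org/maven2/javax/activation/javax.activation-api/1.2.0/javax.activation-api-1.2.0.jar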
14. Verify your Hadoop installation by typing:
hadoop version
After a successful installation, you will see output like the following:
Hadoop 3.3.1
Source code repository https://github.com/apache/hadoop.git -r a3b9c37a397ad4188041dd80621bdeefc46885f2
Compiled by ubuntu on 2021-06-15T05:13Z
Compiled with protoc 3.7.1
From source with checksum 88a4ddb2299aca054416d6b7f81ca55
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-3.3.1.jar

15. Make directories for node metadata storage and give them hadoop's ownership.
(In the commands below, instead of the word "hadoop", use your system username, e.g. csm-1 or a213a-6.)

sudo mkdir -p /home/hadoop/hdfs/{namenode,datanode}


sudo chown -R hadoop:hadoop /home/hadoop/hdfs
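Before formatting, HDFS also needs to know where to run and where these directories live. Here is a minimal sketch following the Apache single-node guide, assuming the default port 9000 and the paths created above (adjust /home/hadoop/... if you used your own username). Open each file with nano and place the properties inside the <configuration> element.

sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>

sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///home/hadoop/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///home/hadoop/hdfs/datanode</value>
</property>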

Format the HDFS NameNode and validate the Hadoop configuration.

16. Format namenode


hdfs namenode -format

Launch the Apache Hadoop Cluster


17. Launch the namenode and datanode
start-dfs.sh

Starting namenodes on [dkrao-HP-EliteBook-Folio-9470m]


Starting datanodes
Starting secondary namenodes [dkrao-HP-EliteBook-Folio-9470m]

18. Launch the YARN ResourceManager and NodeManager


start-yarn.sh

Starting resourcemanager
resourcemanager is running as process 2676. Stop it first and ensure /tmp/hadoop-dkrao-resourcemanager.pid file is empty before retry.
Starting nodemanagers

(The "resourcemanager is running" warning above only means a ResourceManager was left over from an earlier attempt; on a clean first run you will simply see "Starting resourcemanager" and "Starting nodemanagers".)

19. Verify running components


jps

15921 NodeManager
16035 Jps
2676 ResourceManager

(A fully healthy pseudo-distributed deployment also lists NameNode, DataNode, and SecondaryNameNode in the jps output; if those are missing, revisit step 17.)
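As the introduction promised, you can now verify the whole stack with one of the example MapReduce programs bundled with Hadoop, such as the pi estimator (the jar path assumes the 3.3.1 layout under /usr/local/hadoop; the two arguments are the number of map tasks and of samples per map):

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar pi 2 5

The job should finish by printing an estimated value of Pi, confirming that HDFS and YARN can run jobs end to end.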

20. Stop all the Hadoop daemons

stop-all.sh

WARNING: Stopping all Apache Hadoop daemons as dkrao in 10 seconds.


WARNING: Use CTRL-C to abort.
Stopping namenodes on [dkrao-HP-EliteBook-Folio-9470m]
Stopping datanodes
Stopping secondary namenodes [dkrao-HP-EliteBook-Folio-9470m]
Stopping nodemanagers
Stopping resourcemanager

For more information, see https://blog.devgenius.io/install-configure-and-setup-hadoop-in-ubuntu-a3cdd6305a0e

References:
https://hadoop.apache.org/
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html
https://cwiki.apache.org/confluence/display/HADOOP2/Hadoop2OnWindows
