Experiment 1 Hadoop Installation
Installation of Hadoop
Aim: To install Java and Hadoop.
Introduction:
The Apache Hadoop software library is a framework that allows for the distributed processing of
large data sets across clusters of computers using simple programming models. It is designed to
scale up from single servers to thousands of machines, each offering local computation and storage.
Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and
handle failures at the application layer, so delivering a highly-available service on top of a cluster of
computers, each of which may be prone to failures.
Modules:
Hadoop includes these main modules ( Components):
Hadoop Common: The common utilities and libraries that support the other Hadoop
modules.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data. It is responsible for persisting data to disk. HDFS is a file system that is distributed over numerous nodes.
Hadoop YARN: A framework for job scheduling and cluster resource management. Yet Another Resource Negotiator (YARN) acts as the "operating system" of Hadoop, managing and allocating cluster resources among the applications that run on it.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets. It is a framework for developing applications that handle massive amounts of data. It distributes work across the cluster (the map phase), then gathers and reduces the results from the nodes into a response to a query (the reduce phase). Many other processing models are also available for the 3.x line of Hadoop.
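The map-then-reduce flow described above can be imitated locally with ordinary shell pipes (a toy illustration only, not Hadoop itself): each word is emitted as a key (map), sorting brings equal keys together (shuffle), and counting collapses each group (reduce).

```shell
# Toy word count mirroring the MapReduce phases with plain shell tools:
#   tr = map (emit one word per line), sort = shuffle, uniq -c = reduce.
printf 'deer bear river\ncar car river\ndeer car bear\n' |
  tr ' ' '\n' |
  sort |
  uniq -c
```

A real MapReduce job performs the same three phases, but with the map and reduce steps running in parallel on many nodes.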
Who Uses Hadoop?
A wide variety of companies and organizations use Hadoop for both research and production. Users are encouraged to add themselves to the Hadoop PoweredBy wiki page.
Hadoop is a Java-based programming framework that supports the processing and storage of
extremely large datasets on a cluster of inexpensive machines. It was the first major open source
project in the big data playing field and is sponsored by the Apache Software Foundation.
Hadoop has been used for machine learning and data mining. It is also used for managing multiple dedicated servers.
ssh localhost
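If `ssh localhost` prompts for a password, passwordless login can be set up first with a passphrase-less key (a common sketch; the paths shown are the OpenSSH defaults and may differ on your system):

```shell
# Create a passphrase-less RSA key pair (if none exists) and authorize it
# for logins to this machine, so Hadoop's scripts can ssh without prompts.
mkdir -p "$HOME/.ssh" && chmod 700 "$HOME/.ssh"
[ -f "$HOME/.ssh/id_rsa" ] || ssh-keygen -t rsa -P '' -f "$HOME/.ssh/id_rsa" -q
cat "$HOME/.ssh/id_rsa.pub" >> "$HOME/.ssh/authorized_keys"
chmod 600 "$HOME/.ssh/authorized_keys"
```

After this, `ssh localhost` should log in without asking for a password.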
6. To keep the Hadoop logs separate, create a logs directory inside /usr/local/hadoop:
sudo mkdir /usr/local/hadoop/logs
9. After executing the above command, the nano editor opens in your terminal; paste in the
following lines:
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS=" -Djava.library.path=$HADOOP_HOME/lib/native"
After copying in the lines above, press CTRL + O and Enter to save (CTRL + S also works in recent versions of nano), then CTRL + X to exit.
10. After closing the nano editor, use the following command to load the environment variables.
source ~/.bashrc
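A quick self-contained check that the variables resolve as intended (the two key values are repeated inline here so the snippet stands alone even before ~/.bashrc is edited):

```shell
# Re-declare the two key variables and confirm PATH now includes the
# Hadoop tool directories added in the previous step.
export HADOOP_HOME=/usr/local/hadoop
export PATH="$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin"
echo "HADOOP_HOME=$HADOOP_HOME"
case ":$PATH:" in
  *":$HADOOP_HOME/bin:"*) echo "hadoop bin is on PATH" ;;
  *)                      echo "hadoop bin missing from PATH" ;;
esac
```

If the second line prints "missing", re-check the lines pasted into ~/.bashrc and source it again.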
which javac
readlink -f /usr/bin/javac
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_CLASSPATH+=" $HADOOP_HOME/lib/*.jar"
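The `which javac` and `readlink` commands above reveal where the JDK lives; the derived directory (minus the trailing /bin/javac) is what JAVA_HOME should point to. A small sketch that automates this (the fallback path is the Ubuntu OpenJDK 8 location used in this manual, an assumption for other systems):

```shell
# Derive JAVA_HOME from whichever javac is on PATH; if none is found,
# fall back to the Ubuntu OpenJDK 8 location assumed by this manual.
if command -v javac >/dev/null 2>&1; then
  JAVAC_PATH="$(readlink -f "$(command -v javac)")"
  export JAVA_HOME="${JAVAC_PATH%/bin/javac}"
else
  export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
fi
echo "$JAVA_HOME"
```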
Now, copy and paste the following command into your terminal to download the javax.activation JAR:
sudo wget https://jcenter.bintray.com/javax/activation/javax.activation-api/1.2.0/javax.activation-api-1.2.0.jar
14. Verify your Hadoop installation by typing:
hadoop version
After a successful installation, you will see output similar to the following:
Hadoop 3.3.1
Source code repository https://github.com/apache/hadoop.git -r a3b9c37a397ad4188041dd80621bdeefc46885f2
Compiled by ubuntu on 2021-06-15T05:13Z
Compiled with protoc 3.7.1
From source with checksum 88a4ddb2299aca054416d6b7f81ca55
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-3.3.1.jar
15. Make a directory for node metadata storage and give it hadoop's ownership.
In the commands below, replace the word "hadoop" with your system username, e.g.
csm-1, a213a-6, etc.
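A sketch of the directory setup this step describes. The layout below is a typical choice, not one fixed by this manual; a scratch base directory is used so the snippet runs without root, whereas on a real node you would use a stable path such as /home/hadoop/hadoopdata and prefix the mkdir/chown commands with sudo, substituting your own username for "hadoop".

```shell
# Typical NameNode/DataNode metadata layout (paths are an assumption).
# BASE is a scratch directory here; on a real node use something like
#   sudo mkdir -p /home/hadoop/hadoopdata/hdfs/{namenode,datanode}
#   sudo chown -R hadoop:hadoop /home/hadoop/hadoopdata
BASE="$(mktemp -d)/hadoopdata"
mkdir -p "$BASE/hdfs/namenode" "$BASE/hdfs/datanode"
chown -R "$(id -un)" "$BASE"
ls "$BASE/hdfs"
```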
Starting resourcemanager
resourcemanager is running as process 2676. Stop it first and ensure /tmp/hadoop-dkrao-resourcemanager.pid file is empty before retry.
Starting nodemanagers
15921 NodeManager
16035 Jps
2676 ResourceManager
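The "Stop it first" message above means an earlier ResourceManager left its pid file behind in /tmp. The usual remedy is to stop YARN, delete the stale pid file, and start YARN again; the snippet below simulates this with a dummy file so it runs without a cluster (the real file name follows the /tmp/hadoop-<user>-resourcemanager.pid pattern shown in the log).

```shell
# Simulate clearing a stale ResourceManager pid file. On a real node:
#   stop-yarn.sh
#   rm -f /tmp/hadoop-<user>-resourcemanager.pid
#   start-yarn.sh
PID_FILE="$(mktemp /tmp/hadoop-demo-resourcemanager.pid.XXXXXX)"
echo 2676 > "$PID_FILE"   # stale pid recorded by the dead daemon
rm -f "$PID_FILE"         # remove it before retrying start-yarn.sh
[ ! -e "$PID_FILE" ] && echo "stale pid file cleared"
```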
20. To stop all running Hadoop daemons, execute:
stop-all.sh
References:
https://hadoop.apache.org/
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html
https://cwiki.apache.org/confluence/display/HADOOP2/Hadoop2OnWindows