
AJAY S RAM

M2 CSE
ROLL NO 1

EXPERIMENT 1

Aim: Study and configure Hadoop for Big Data

Hadoop is an open-source software framework that is used for storing and processing large amounts of
data in a distributed computing environment. It is designed to handle big data and is based on the
MapReduce programming model, which allows for the parallel processing of large datasets.
Hadoop is developed by the Apache Software Foundation and was created in 2006, based on papers published by Google in 2003 and 2004 that described the Google File System (GFS) and the MapReduce programming model. The Hadoop framework allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale from a single server to thousands of machines, each providing local computation and storage.
In the traditional approach, data to be processed was stored and processed on local machines. As data volumes grew, local machines were no longer capable of holding these huge data sets, so the data started to be stored on remote servers, from which it is complex and expensive to fetch. In the Hadoop approach, instead of fetching the data to local machines, the query is sent to the data; the query needed to process the data is far smaller than the data itself. Moreover, at the server, the query is divided into several parts, and all these parts process the data simultaneously. Thus Hadoop makes data storage, processing, and analysis much easier than the traditional approach.

COMPONENTS OF HADOOP
1. HDFS: The Hadoop Distributed File System is a dedicated file system for storing big data on a cluster of commodity (cheaper) hardware with a streaming access pattern. It enables data to be stored at multiple nodes in the cluster, which ensures data security and fault tolerance. On a local PC the default block size on the hard disk is 4 KB, whereas HDFS, because it stores huge data sets, uses a much larger block size of 64 MB by default (128 MB in Hadoop 2.x and later), and the block size can be changed if required. HDFS works with a NameNode and DataNodes: the NameNode is a master service that keeps the metadata recording which commodity hardware the data resides on, while the DataNodes store the actual data. Because the block size is large, the storage required for metadata is reduced, which makes HDFS efficient. Hadoop also stores three copies of every block at three different locations by default, which ensures that HDFS is not prone to a single point of failure.
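
Once the cluster from the installation section below is running, the block size and replication factor in effect can be checked from the command line, for example:

hdfs getconf -confKey dfs.blocksize      # block size in bytes
hdfs getconf -confKey dfs.replication    # number of copies kept per block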

2. MapReduce: Data stored in HDFS also needs to be processed. When a query is sent to process a data set in HDFS, Hadoop identifies where the data is stored; this is called mapping. The query is then broken into multiple parts, the results of all these parts are combined, and the overall result is sent back to the user; this is called the reduce process. Thus, while HDFS is used to store the data, MapReduce is used to process it. This parallel execution helps queries execute faster and makes Hadoop a suitable and optimal choice for dealing with Big Data.

3. YARN: Yet Another Resource Negotiator works like an operating system for Hadoop; as operating systems are resource managers, YARN manages Hadoop's resources so that Hadoop can serve big data more effectively.
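
Once the Hadoop daemons are started (see the installation section below), YARN's view of the cluster resources can be checked with, for example:

yarn node -list    # lists the NodeManagers registered with the ResourceManager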

Key Features of Hadoop


● Distributed Storage: Hadoop stores large data sets across multiple machines, allowing for the storage and processing of extremely large amounts of data.

● Scalability: Hadoop can scale from a single server to thousands of machines, making it easy to add more capacity as needed.

● Fault Tolerance: Hadoop is designed to be highly fault-tolerant, meaning it can continue to operate even in the presence of hardware failures.

● Data Locality: Hadoop stores data on the same node where it will be processed, which reduces network traffic and improves performance.

● High Availability: Hadoop provides a High Availability feature, which helps ensure that data is always available and is not lost.

● Flexible Data Processing: Hadoop's MapReduce programming model allows data to be processed in a distributed fashion, making it easy to implement a wide variety of data processing tasks.


Advantages:

● Ability to store a large amount of data
● High flexibility
● Cost effective
● High computational power
● Tasks are independent
● Linear scaling
● Large community

Disadvantages:

● Not very effective for small data
● Hard cluster management
● Has stability issues
● Security concerns
● Complexity
● Limited Support for Real-time Processing
● Limited Support for Ad-hoc Queries
● Limited Support for Graph and Machine Learning

INSTALLATION

1. Install OpenJDK on Ubuntu

sudo apt update

sudo apt install openjdk-8-jdk -y
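
Once the installation completes, verify the Java version:

java -version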


Set Up a Non-Root User for Hadoop Environment

Install OpenSSH on Ubuntu

sudo apt install openssh-server openssh-client -y

Create Hadoop User

sudo adduser hdoop


su - hdoop

Enable Passwordless SSH for Hadoop User

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

Use the cat command to store the public key as authorized_keys in the .ssh directory:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Set the permissions for your user with the chmod command
chmod 0600 ~/.ssh/authorized_keys

The new user is now able to SSH without needing to enter a password every time. Verify everything is
set up correctly by using the hdoop user to SSH to localhost

ssh localhost
Download and Install Hadoop on Ubuntu

Visit the official Apache Hadoop project page and select the version of Hadoop you want to implement.
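
As an illustration, assuming Hadoop 3.2.1 (the version referenced by the environment variables below; adjust the URL for the release you selected), the release can be downloaded and extracted in the hdoop user's home directory:

wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz   # download the release tarball
tar xzf hadoop-3.2.1.tar.gz                                                          # extracts to /home/hdoop/hadoop-3.2.1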
Single Node Hadoop Deployment (Pseudo-Distributed Mode)

Configure Hadoop Environment Variables (bashrc)


sudo nano .bashrc

Define the Hadoop environment variables by adding the following content to the end of the file

#Hadoop Related Options


export HADOOP_HOME=/home/hdoop/hadoop-3.2.1
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

It is vital to apply the changes to the current running environment by using the following command
source ~/.bashrc
Edit hadoop-env.sh file
The hadoop-env.sh file serves as a master file to configure YARN, HDFS, MapReduce, and
Hadoop-related project settings. When setting up a single-node Hadoop cluster, you need to define
which Java implementation is to be utilized. Use the previously created $HADOOP_HOME variable
to access the hadoop-env.sh file:
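
sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh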

Uncomment the $JAVA_HOME variable (i.e., remove the # sign) and add the full path to the
OpenJDK installation on your system. If you have installed the same version as presented in the first
part of this tutorial, add the following line:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
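
If you are unsure of the correct path on your system, it can be located (assuming OpenJDK was installed through apt as above) with:

readlink -f /usr/bin/javac | sed 's:/bin/javac::'    # prints the JDK installation directory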

Edit core-site.xml file


The core-site.xml file defines HDFS and Hadoop core properties. To set up Hadoop in a
pseudo-distributed mode, you need to specify the URL for your NameNode, and the temporary
directory Hadoop uses for the map and reduce process.

Open the core-site.xml file in a text editor:


sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml
Add the following configuration to override the default values for the temporary directory and add
your HDFS URL to replace the default local file system setting:

<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hdoop/tmpdata</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://127.0.0.1:9000</value>
</property>
</configuration>
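
Note that Hadoop does not create the temporary directory automatically; assuming the path above is used unchanged, create it before formatting the NameNode:

mkdir -p /home/hdoop/tmpdata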

Edit hdfs-site.xml file

The properties in the hdfs-site.xml file govern the location for storing node metadata, fsimage file, and
edit log file. Configure the file by defining the NameNode and DataNode storage
directories. Additionally, the default dfs.replication value of 3 needs to be changed to 1 to match the
single node setup.

Use the following command to open the hdfs-site.xml file for editing:
sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Add the following configuration to the file


<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/hdoop/dfsdata/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/hdoop/dfsdata/datanode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
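
As with the temporary directory, the NameNode and DataNode storage directories referenced above should exist on disk; assuming the paths are used unchanged:

mkdir -p /home/hdoop/dfsdata/namenode /home/hdoop/dfsdata/datanode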

Edit mapred-site.xml file


Use the following command to access the mapred-site.xml file and define MapReduce values:
sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml

Add the following configuration to change the default MapReduce framework name value to yarn

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Edit yarn-site.xml file
The yarn-site.xml file is used to define settings relevant to YARN. It contains configurations for the
Node Manager, Resource Manager, Containers, and Application Master.
Open the yarn-site.xml file in a text editor:

sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

Append the following configuration to the file:

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>127.0.0.1</value>
</property>
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
</configuration>

Format HDFS NameNode

hdfs namenode -format


Start Hadoop Cluster

start-dfs.sh
start-yarn.sh
jps
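
In this pseudo-distributed setup, jps should show the following processes (each preceded by its process ID):

NameNode
DataNode
SecondaryNameNode
ResourceManager
NodeManager
Jps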

Testing the setup using the WordCount example


Create a text file, data.txt, containing a few lines of words; for simplicity and faster debugging a small file is used.
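
The exact contents used in the experiment are not reproduced here; as an illustration, a small file can be created with:

echo "hello hadoop hello world" > data.txt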

Create an input folder in HDFS and put data.txt into it using the following commands:
hdfs dfs -mkdir -p /input
hdfs dfs -put data.txt /input

Now call the MapReduce examples jar; the jar version must match the installed Hadoop release:

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.5.jar wordcount /input /output
The output will be stored in the /output folder. To view the output:

hdfs dfs -cat /output/*
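
For the illustrative data.txt shown above, the output contains one line per word with its count, for example:

hadoop  1
hello   2
world   1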

Cleaning up can be done by removing the directories:


hdfs dfs -rm -r /output
hdfs dfs -rm -r /input
Visiting UI
The YARN ResourceManager web UI can be accessed at the url http://localhost:8088

Hadoop NameNode's web user interface can be accessed through http://localhost:9870
