M2 CSE
ROLL NO 1
EXPERIMENT 1
Hadoop is an open-source software framework that is used for storing and processing large amounts of
data in a distributed computing environment. It is designed to handle big data and is based on the
MapReduce programming model, which allows for the parallel processing of large datasets.
Hadoop is developed by the Apache Software Foundation. It was created in 2006, based on papers published by Google describing the Google File System (GFS, 2003) and the MapReduce programming model (2004). The Hadoop framework allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale from a single server to thousands of machines, each of which provides local computation and storage.
In the traditional approach, data was stored and processed on local machines. As data volumes grew, local machines were no longer capable of storing such huge data sets, so data began to be stored on remote servers; in practice, fetching all of that data back for processing is complex and expensive. In the Hadoop approach, instead of fetching the data to local machines, we send the query to the data. The query needed to process the data is far smaller than the data itself, and at the server the query is divided into several parts that process the data simultaneously. Thus Hadoop makes data storage, processing, and analysis much easier than the traditional approach.
COMPONENTS OF HADOOP
1. HDFS: The Hadoop Distributed File System is a dedicated file system for storing big data on a cluster of commodity (cheaper) hardware with a streaming access pattern. It enables data to be stored at multiple nodes in the cluster, which ensures data security and fault tolerance. On a local PC, the default block size on the hard disk is typically 4 KB; when Hadoop is installed, HDFS uses a much larger default block size (64 MB in Hadoop 1.x, 128 MB in later versions), since it is meant to store huge data sets. HDFS works with a NameNode and DataNodes: the NameNode is the master service that keeps the metadata describing which commodity hardware each block of data resides on, while the DataNodes store the actual data. Because blocks are large, the storage required for metadata is reduced, which makes HDFS more efficient. Also, by default Hadoop stores three copies of every block at different locations, which ensures that HDFS is not prone to a single point of failure. On a running installation, these defaults can be checked with the commands shown just below.
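The block size and replication factor that are actually in effect can be queried with the hdfs getconf utility once Hadoop is installed (the installation itself is covered later in this experiment):
hdfs getconf -confKey dfs.blocksize
hdfs getconf -confKey dfs.replication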
2. MapReduce: Data stored in HDFS also needs to be processed. Suppose a query is sent to process a data set in HDFS. Hadoop first identifies where that data is stored; this is called mapping. The query is then broken into multiple parts, these parts process the data simultaneously on the nodes where it resides, and their results are combined into an overall result that is sent back to the user; this is called the reduce process. Thus, while HDFS is used to store the data, MapReduce is used to process it. This parallel execution helps to execute a query faster and makes Hadoop an efficient framework for big-data processing (a minimal sketch of such a program is shown below).
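As an illustration of the map and reduce phases described above, the following is a minimal word-count MapReduce program in Java. This is a standard textbook sketch, not code from this experiment; Hadoop ships an equivalent job in its examples jar, which is what this experiment runs later, and the class names here are chosen only for illustration.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    // Configure the job: input and output paths come from the command line.
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}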
3. YARN: Yet Another Resource Negotiator works like an operating system for Hadoop, and just as operating systems are resource managers, YARN manages the resources of the Hadoop cluster, allocating memory and CPU to running applications and scheduling them across the nodes.

FEATURES OF HADOOP
● Scalability: Hadoop can scale from a single server to thousands of machines, making it easy
to add more capacity as needed.
● Data locality: Hadoop provides a data locality feature, where the data is stored on the same node where it will be processed; this helps to reduce network traffic and improve performance.
● High Availability: Hadoop provides a High Availability feature, which helps to make sure that the data is always available and is not lost.
● Flexible Data Processing: Hadoop’s MapReduce programming model allows for the processing of data in a distributed fashion, making it easy to implement a wide variety of data processing tasks.
● Distributed Storage: Hadoop stores large data sets across multiple machines, allowing for
the storage and processing of extremely large amounts of data.
Advantages:
Disadvantages:
INSTALLATION
Use the cat command to store the public key as authorized_keys in the ssh directory
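Assuming the SSH key pair was generated with the default name id_rsa for the hdoop user, the command is:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys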
Set the permissions for your user with the chmod command
chmod 0600 ~/.ssh/authorized_keys
The new user is now able to SSH without needing to enter a password every time. Verify everything is
set up correctly by using the hdoop user to SSH to localhost
ssh localhost
Download and Install Hadoop on Ubuntu
Visit the official Apache Hadoop project page, and select the version of Hadoop you want to
implement
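The release archive can also be downloaded and unpacked from the command line; the URL below assumes version 3.3.5, the version used later in this experiment (the exact mirror may differ):
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.5/hadoop-3.3.5.tar.gz
tar xzf hadoop-3.3.5.tar.gz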
Single Node Hadoop Deployment (Pseudo-Distributed Mode)
Define the Hadoop environment variables by adding them to the end of the hdoop user's .bashrc file.
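A typical set of variables, assuming Hadoop 3.3.5 was extracted into the hdoop user's home directory (adjust HADOOP_HOME to the actual installation path):
export HADOOP_HOME=/home/hdoop/hadoop-3.3.5
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"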
It is vital to apply the changes to the current running environment by using the following command
source ~/.bashrc
Edit hadoop-env.sh file
The hadoop-env.sh file serves as a master file to configure YARN, HDFS, MapReduce, and Hadoop-related project settings. When setting up a single node Hadoop cluster, you need to define
which Java implementation is to be utilized. Use the previously created $HADOOP_HOME variable
to access the hadoop-env.sh file:
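For example:
sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh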
Uncomment the $JAVA_HOME variable (i.e., remove the # sign) and add the full path to the
OpenJDK installation on your system. If you have installed the same version as presented in the first
part of this tutorial, add the following line:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
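Edit core-site.xml file
The core-site.xml file defines the default file system address and the directory Hadoop uses for temporary data. Open it the same way as the other configuration files, for example:
sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml
Then add the following configuration (the tmpdata path under the hdoop home directory is only an example and can be changed):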
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hdoop/tmpdata</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://127.0.0.1:9000</value>
</property>
</configuration>
Edit hdfs-site.xml file
The properties in the hdfs-site.xml file govern the location for storing node metadata, the fsimage file, and the edit log file. Configure the file by defining the NameNode and DataNode storage directories. Additionally, the default dfs.replication value of 3 needs to be changed to 1 to match the single-node setup.
Use the following command to open the hdfs-site.xml file for editing:
sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
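A minimal single-node configuration looks like the following; the dfsdata paths under the hdoop home directory are only examples and can be placed elsewhere:
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/hdoop/dfsdata/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/hdoop/dfsdata/datanode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>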
Edit mapred-site.xml file
Open the mapred-site.xml file in the same way and add the following configuration to change the default MapReduce framework name value to yarn:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Edit yarn-site.xml file
The yarn-site.xml file is used to define settings relevant to YARN. It contains configurations for the
Node Manager, Resource Manager, Containers, and Application Master.
Open the yarn-site.xml file in a text editor:
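For example:
sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml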
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>127.0.0.1</value>
</property>
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
</configuration>
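Before launching Hadoop for the first time, format the NameNode so that HDFS can initialize the metadata directory defined above (run this as the hdoop user):
hdfs namenode -format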
Then start the HDFS daemons (NameNode, DataNode, SecondaryNameNode) and the YARN daemons (ResourceManager, NodeManager), and use jps to confirm that the Java processes are running:
start-dfs.sh
start-yarn.sh
jps
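The word-count job needs an input file on the local file system. If data.txt does not exist yet, a small sample file can be created, for example:
echo "hello hadoop hello mapreduce" > data.txt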
Create an input folder in HDFS and put data.txt into it using the commands:
hdfs dfs -mkdir -p /input
hdfs dfs -put data.txt /input
Run the word-count example MapReduce job on the input directory:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.5.jar wordcount /input /output
The output will be stored in the /output folder. To view the output:
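The word-count result is written as part files inside /output, which can be printed with:
hdfs dfs -cat /output/*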