M2 CSE
ROLL NO 1
EXPERIMENT 1
Hadoop is an open-source software framework that is used for storing and processing large amounts of
data in a distributed computing environment. It is designed to handle big data and is based on the
MapReduce programming model, which allows for the parallel processing of large datasets.
Hadoop is developed by the Apache Software Foundation. It was created in 2006, based on papers published by Google describing the Google File System (GFS, 2003) and the MapReduce programming model (2004). The Hadoop framework allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale from a single server to thousands of machines, each of which provides local computation and storage.
In the traditional approach, data was stored and processed on local machines. As data volumes grew, local machines were no longer capable of storing such huge data sets, so data began to be stored on remote servers; in practice, fetching all of that data back for processing is complex and expensive. In the Hadoop approach, instead of fetching the data to local machines, we send the query to the data. The query needed to process the data is far smaller than the data itself, and at the server the query is divided into several parts that process the data simultaneously. Thus Hadoop makes data storage, processing, and analysis much easier than the traditional approach.
COMPONENTS OF HADOOP
1. HDFS: The Hadoop Distributed File System is a dedicated file system for storing big data on a cluster of commodity (cheaper) hardware with a streaming access pattern. It enables data to be stored at multiple nodes in the cluster, which ensures data security and fault tolerance. On a local PC, the default block size on the hard disk is typically 4 KB; when Hadoop is installed, HDFS uses a much larger default block size (64 MB in Hadoop 1.x, 128 MB in later versions), since it is meant to store huge data sets. HDFS works with a NameNode and DataNodes: the NameNode is the master service that keeps the metadata describing which commodity hardware each block of data resides on, while the DataNodes store the actual data. Because blocks are large, the storage required for metadata is reduced, which makes HDFS more efficient. Also, by default Hadoop stores three copies of every block at different locations, which ensures that HDFS is not prone to a single point of failure. On a running installation, these defaults can be checked with the commands shown just below.
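The block size and replication factor that are actually in effect can be queried with the hdfs getconf utility once Hadoop is installed (the installation itself is covered later in this experiment):
hdfs getconf -confKey dfs.blocksize
hdfs getconf -confKey dfs.replication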
2. MapReduce: Data stored in HDFS also needs to be processed. Suppose a query is sent to process a data set in HDFS. Hadoop first identifies where that data is stored; this is called mapping. The query is then broken into multiple parts, these parts process the data simultaneously on the nodes where it resides, and their results are combined into an overall result that is sent back to the user; this is called the reduce process. Thus, while HDFS is used to store the data, MapReduce is used to process it. This parallel execution helps to execute a query faster and makes Hadoop an efficient framework for big-data processing (a minimal sketch of such a program is shown below).
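As an illustration of the map and reduce phases described above, the following is a minimal word-count MapReduce program in Java. This is a standard textbook sketch, not code from this experiment; Hadoop ships an equivalent job in its examples jar, which is what this experiment runs later, and the class names here are chosen only for illustration.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    // Configure the job: input and output paths come from the command line.
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}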
3. YARN: Yet Another Resource Negotiator works like an operating system for Hadoop, and just as operating systems are resource managers, YARN manages the resources of the Hadoop cluster, allocating memory and CPU to running applications and scheduling them across the nodes.

FEATURES OF HADOOP
● Scalability: Hadoop can scale from a single server to thousands of machines, making it easy
to add more capacity as needed.
● Data locality: Hadoop provides a data locality feature, where the data is stored on the same node where it will be processed; this helps to reduce network traffic and improve performance.
● High Availability: Hadoop provides a High Availability feature, which helps to make sure that the data is always available and is not lost.
● Flexible Data Processing: Hadoop’s MapReduce programming model allows for the processing of data in a distributed fashion, making it easy to implement a wide variety of data processing tasks.
● Distributed Storage: Hadoop stores large data sets across multiple machines, allowing for
the storage and processing of extremely large amounts of data.
Advantages:
Disadvantages:
INSTALLATION
Use the cat command to store the public key as authorized_keys in the ssh directory
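Assuming the SSH key pair was generated with the default name id_rsa for the hdoop user, the command is:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys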
Set the permissions for your user with the chmod command
chmod 0600 ~/.ssh/authorized_keys
The new user is now able to SSH without needing to enter a password every time. Verify everything is
set up correctly by using the hdoop user to SSH to localhost
ssh localhost
Download and Install Hadoop on Ubuntu
Visit the official Apache Hadoop project page, and select the version of Hadoop you want to
implement
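The release archive can also be downloaded and unpacked from the command line; the URL below assumes version 3.3.5, the version used later in this experiment (the exact mirror may differ):
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.5/hadoop-3.3.5.tar.gz
tar xzf hadoop-3.3.5.tar.gz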
Single Node Hadoop Deployment (Pseudo-Distributed Mode)
Define the Hadoop environment variables by adding them to the end of the hdoop user's .bashrc file.
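A typical set of variables, assuming Hadoop 3.3.5 was extracted into the hdoop user's home directory (adjust HADOOP_HOME to the actual installation path):
export HADOOP_HOME=/home/hdoop/hadoop-3.3.5
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"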
It is vital to apply the changes to the current running environment by using the following command
source ~/.bashrc
Edit hadoop-env.sh file
The hadoop-env.sh file serves as a master file to configure YARN, HDFS, MapReduce, and Hadoop-related project settings. When setting up a single node Hadoop cluster, you need to define
which Java implementation is to be utilized. Use the previously created $HADOOP_HOME variable
to access the hadoop-env.sh file:
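For example:
sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh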
Uncomment the $JAVA_HOME variable (i.e., remove the # sign) and add the full path to the
OpenJDK installation on your system. If you have installed the same version as presented in the first
part of this tutorial, add the following line:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
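Edit core-site.xml file
The core-site.xml file defines the default file system address and the directory Hadoop uses for temporary data. Open it the same way as the other configuration files, for example:
sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml
Then add the following configuration (the tmpdata path under the hdoop home directory is only an example and can be changed):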
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hdoop/tmpdata</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://127.0.0.1:9000</value>
</property>
</configuration>
Edit hdfs-site.xml file
The properties in the hdfs-site.xml file govern the location for storing node metadata, the fsimage file, and the edit log file. Configure the file by defining the NameNode and DataNode storage directories. Additionally, the default dfs.replication value of 3 needs to be changed to 1 to match the single-node setup.
Use the following command to open the hdfs-site.xml file for editing:
sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
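A minimal single-node configuration looks like the following; the dfsdata paths under the hdoop home directory are only examples and can be placed elsewhere:
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/hdoop/dfsdata/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/hdoop/dfsdata/datanode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>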
Edit mapred-site.xml file
Open the mapred-site.xml file in the same way and add the following configuration to change the default MapReduce framework name value to yarn:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Edit yarn-site.xml file
The yarn-site.xml file is used to define settings relevant to YARN. It contains configurations for the
Node Manager, Resource Manager, Containers, and Application Master.
Open the yarn-site.xml file in a text editor:
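For example:
sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml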
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>127.0.0.1</value>
</property>
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
</configuration>
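Before launching Hadoop for the first time, format the NameNode so that HDFS can initialize the metadata directory defined above (run this as the hdoop user):
hdfs namenode -format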
Then start the HDFS daemons (NameNode, DataNode, SecondaryNameNode) and the YARN daemons (ResourceManager, NodeManager), and use jps to confirm that the Java processes are running:
start-dfs.sh
start-yarn.sh
jps
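The word-count job needs an input file on the local file system. If data.txt does not exist yet, a small sample file can be created, for example:
echo "hello hadoop hello mapreduce" > data.txt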
Create an input folder in HDFS and put data.txt into it using the commands:
hdfs dfs -mkdir -p /input
hdfs dfs -put data.txt /input
Run the word-count example MapReduce job on the input directory:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.5.jar wordcount /input /output
The output will be stored in the /output folder. To view the output:
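The word-count result is written as part files inside /output, which can be printed with:
hdfs dfs -cat /output/*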