
B M S COLLEGE OF ENGINEERING

(An Autonomous Institution Affiliated to VTU, Belagavi)

Post Box No.: 1908, Bull Temple Road, Bengaluru – 560 019

DEPARTMENT OF MACHINE LEARNING

Academic Year: 2022-2023 (Session: April 2023 - July 2023)

SOCIAL MEDIA ANALYTICS (22AM6PESMA)

ALTERNATIVE ASSESSMENT TOOL (AAT)

AAT
HADOOP
Submitted by

Student Name:
Team Member 1: A Vaishnavi
Team Member 2: Abitha Bala Subramani
USN:
Team Member 1: 1BM20AI053
Team Member 2: 1BM20AI065
Semester & Section: 6A
Student Signature:
Team Member 1:
Team Member 2:

Valuation Report (to be filled by the faculty)

Score:
Faculty In-charge: Dr. Arun
Faculty Signature with date:
Comments:


Introduction

Hadoop is an open-source framework designed for distributed storage and processing
of large datasets across clusters of computers. It provides a scalable, fault-tolerant
solution for handling big data. Hadoop allows you to process and analyze massive
amounts of structured, semi-structured, and unstructured data in a distributed
computing environment.

The key components of Hadoop are:

1. Hadoop Distributed File System (HDFS): HDFS is a distributed file system
that stores data across multiple machines in a cluster. It provides high-throughput
access to data and ensures fault tolerance by replicating data across multiple
nodes.

2. Yet Another Resource Negotiator (YARN): YARN is the resource
management framework in Hadoop. It manages cluster resources and schedules
tasks for processing data. YARN allows different data processing engines, such as
MapReduce, Spark, and Hive, to run on the same cluster.

3. MapReduce: MapReduce is a programming model and processing engine for
distributed processing of large datasets in Hadoop. It divides the data into smaller
chunks and processes them in parallel across multiple nodes in the cluster.
MapReduce consists of two phases: the map phase, which processes the input
data and produces intermediate results, and the reduce phase, which aggregates
the intermediate results to produce the final output.

Hadoop has gained popularity due to its ability to handle massive volumes of data
and its scalability. It allows organizations to store, process, and analyze data that
would be impractical or infeasible to handle using traditional databases or single-
node systems. Hadoop is widely used in various industries, including finance,
healthcare, retail, social media, and more, for tasks like log processing, data
warehousing, recommendation systems, and large-scale data analytics.
Working Principle

The working principle of Hadoop revolves around its key components: the Hadoop
Distributed File System (HDFS) and the MapReduce processing framework. Let's
break down the working principles of each component:

1. Hadoop Distributed File System (HDFS):


• Data Storage: HDFS divides large datasets into blocks and distributes them across
multiple machines in a cluster. Each block is replicated across different nodes for
fault tolerance.
• Master-Slave Architecture: HDFS follows a master-slave architecture. The master
node, called the NameNode, manages the file system namespace and tracks the
location of each data block. The slave nodes, known as DataNodes, store and
manage the actual data blocks.
• Data Replication: HDFS replicates data blocks across multiple DataNodes. By
default, each block is replicated three times to ensure data reliability. The
replication factor and block placement policies can be configured based on the
desired level of fault tolerance and data locality.

2. MapReduce Processing Framework:


• Data Processing: MapReduce is a programming model that allows distributed
processing of large datasets across a Hadoop cluster. It consists of two stages: the
map stage and the reduce stage.
• Map Stage: In the map stage, data is processed in parallel across the nodes in the
cluster. Each node processes a portion of the input data and produces intermediate
key-value pairs.
• Shuffle and Sort: After the map stage, the intermediate key-value pairs are grouped
and sorted based on the keys. This process is known as shuffle and sort.
• Reduce Stage: In the reduce stage, the sorted intermediate data is processed to
produce the final output. Each node processes a subset of the sorted data and
produces the final output key-value pairs.
• Fault Tolerance: MapReduce provides fault tolerance by automatically handling
failures. If a node fails during processing, the framework redistributes the failed
task to another node in the cluster.

The overall working principle of Hadoop involves storing large datasets across a
distributed file system (HDFS) and processing the data using the MapReduce
framework. HDFS ensures data reliability and fault tolerance, while MapReduce
enables parallel processing of the data for distributed computing. This combination
allows Hadoop to handle large-scale data processing tasks efficiently and reliably
across a cluster of machines.
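
To make the map and reduce phases concrete, the sketch below shows the classic
word-count job written against the Hadoop MapReduce Java API. It follows the
standard WordCount example distributed with Hadoop; the class name and the
input/output paths supplied on the command line are placeholders. The mapper
emits a (word, 1) pair for every token, and the reducer sums the counts for each
word after the shuffle and sort.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: split each input line into words and emit (word, 1)
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sum the counts collected for each word after shuffle and sort
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A compiled jar of this class can be submitted with a command such as
hadoop jar wordcount.jar WordCount /input /output, where /input and /output are
hypothetical HDFS directories.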

Installation

Step 1: Download the Java 8 Package. Save this file in your home
directory.
Step 2: Extract the Java Tar File.
Command: tar -xvf jdk-8u101-linux-i586.tar.gz

Fig: Hadoop Installation – Extracting Java Files

Step 3: Download the Hadoop 2.7.3 Package.

Command: wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz

Fig: Hadoop Installation – Downloading Hadoop

Step 4: Extract the Hadoop tar File.


Command: tar -xvf hadoop-2.7.3.tar.gz

Fig: Hadoop Installation – Extracting Hadoop Files

Step 5: Add the Hadoop and Java paths in the bash file (.bashrc).
Open the .bashrc file and add the Hadoop and Java paths as shown below.


Command: vi .bashrc

Fig: Hadoop Installation – Setting Environment Variable
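
Since the screenshot is not reproduced here, the lines typically added to .bashrc
look like the following. The paths assume the JDK and Hadoop archives from the
earlier steps were extracted into the home directory as jdk1.8.0_101 and
hadoop-2.7.3; adjust them to the actual locations on your system.

export JAVA_HOME=$HOME/jdk1.8.0_101
export HADOOP_HOME=$HOME/hadoop-2.7.3
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin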

Then, save the bash file and close it.

To apply all these changes to the current Terminal, execute the source
command.

Command: source .bashrc

Fig: Hadoop Installation – Refreshing environment variables

To make sure that Java and Hadoop have been properly installed
on your system and can be accessed through the Terminal, execute
the java -version and hadoop version commands.

Command: java -version

Fig: Hadoop Installation – Checking Java Version

Command: hadoop version

Fig: Hadoop Installation – Checking Hadoop Version

Step 6: Edit the Hadoop Configuration files.


Command: cd hadoop-2.7.3/etc/hadoop/

Command: ls

All the Hadoop configuration files are located in the hadoop-2.7.3/etc/hadoop
directory, as you can see in the snapshot below:

Fig: Hadoop Installation – Hadoop Configuration Files


Step 7: Open core-site.xml and edit the property mentioned below inside
configuration tag:
core-site.xml informs Hadoop daemon where NameNode runs in the
cluster. It contains configuration settings of Hadoop core such as I/O
settings that are common to HDFS & MapReduce.

Command: vi core-site.xml

Fig: Hadoop Installation – Configuring core-site.xml
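
The screenshot is not reproduced here; a typical single-node (pseudo-distributed)
entry inside the configuration tag is shown below. The NameNode URI
hdfs://localhost:9000 is a common choice for a local setup and is an assumption,
not a value dictated by Hadoop. (In newer configurations the equivalent property
is fs.defaultFS.)

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>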

Step 8: Edit hdfs-site.xml and edit the property mentioned below inside
configuration tag:
hdfs-site.xml contains configuration settings of HDFS daemons (i.e.
NameNode, DataNode, Secondary NameNode). It also includes the
replication factor and block size of HDFS.

Command: vi hdfs-site.xml

Fig: Hadoop Installation – Configuring hdfs-site.xml
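
As a representative example for a single-node cluster, the replication factor is
usually lowered to 1, since only one DataNode is available to hold block copies:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>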

Step 9: Edit the mapred-site.xml file and edit the property mentioned
below inside configuration tag:
mapred-site.xml contains configuration settings of MapReduce application
like number of JVM that can run in parallel, the size of the mapper and the
reducer process, CPU cores available for a process, etc.

In some cases, the mapred-site.xml file is not available, so we have to create it
from the mapred-site.xml.template file.

Command: cp mapred-site.xml.template mapred-site.xml

Command: vi mapred-site.xml

Fig: Hadoop Installation – Configuring mapred-site.xml
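
On Hadoop 2.x, the essential property tells MapReduce to run on top of YARN; a
minimal example looks like this:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>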

Step 10: Edit yarn-site.xml and edit the property mentioned below inside
configuration tag:
yarn-site.xml contains configuration settings of ResourceManager and
NodeManager like application memory management size, the operation
needed on program & algorithm, etc.

Command: vi yarn-site.xml

Fig: Hadoop Installation – Configuring yarn-site.xml
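
A common minimal yarn-site.xml enables the MapReduce shuffle service on the
NodeManager, for example:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>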


Step 11: Edit hadoop-env.sh and add the Java Path as mentioned below:
hadoop-env.sh contains the environment variables that are used in the
script to run Hadoop like Java home path, etc.

Command: vi hadoop-env.sh

Fig: Hadoop Installation – Configuring hadoop-env.sh
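
The added line typically points JAVA_HOME at the directory where the JDK was
extracted in Step 2 (the path below is an assumption; use your actual JDK
location):

export JAVA_HOME=$HOME/jdk1.8.0_101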

Step 12: Go to Hadoop home directory and format the NameNode.


Command: cd

Command: cd hadoop-2.7.3

Command: bin/hadoop namenode -format

Fig: Hadoop Installation – Formatting NameNode


This formats HDFS via the NameNode. This command is executed only the
first time. Formatting the file system means initializing the directory specified
by the dfs.name.dir property.

Never format a running Hadoop file system; you will lose all the data stored in
HDFS.

Step 13: Once the NameNode is formatted, go to the hadoop-2.7.3/sbin directory
and start all the daemons.

Command: cd hadoop-2.7.3/sbin

You can either start all the daemons with a single command or start them
individually.

Command: ./start-all.sh

The above command is a combination of start-dfs.sh, start-yarn.sh &
mr-jobhistory-daemon.sh.

Or you can run all the services individually as below:

Start NameNode:
The NameNode is the centerpiece of an HDFS file system. It keeps the
directory tree of all files stored in HDFS and tracks where the file data is kept
across the cluster.

Command: ./hadoop-daemon.sh start namenode

Fig: Hadoop Installation – Starting NameNode


Start DataNode:
On startup, a DataNode connects to the NameNode and responds to requests
from the NameNode for different operations.

Command: ./hadoop-daemon.sh start datanode

Fig: Hadoop Installation – Starting DataNode

Start ResourceManager:
ResourceManager is the master that arbitrates all the available cluster
resources and thus helps in managing the distributed applications running
on the YARN system. Its job is to manage each NodeManager and each
application's ApplicationMaster.

Command: ./yarn-daemon.sh start resourcemanager

Fig: Hadoop Installation – Starting ResourceManager

Start NodeManager:
The NodeManager is the per-machine agent responsible for managing
containers, monitoring their resource usage, and reporting the same to the
ResourceManager.

Command: ./yarn-daemon.sh start nodemanager


Fig: Hadoop Installation – Starting NodeManager

Start JobHistoryServer:
JobHistoryServer is responsible for servicing all job-history-related requests
from clients.

Command: ./mr-jobhistory-daemon.sh start historyserver

Step 14: To check that all the Hadoop services are up and running, run the
below command.
Command: jps

Fig: Hadoop Installation – Checking Daemons
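
If all the daemons started correctly, jps lists one Java process per daemon,
similar to the representative output below (the process IDs will differ on every
machine):

2445 NameNode
2583 DataNode
2970 ResourceManager
3110 NodeManager
3365 JobHistoryServer
3512 Jps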

Step 15: Now open the Mozilla browser and go to
localhost:50070/dfshealth.html to check the NameNode interface.

Fig: Hadoop Installation – Starting WebUI

CASE STUDY

Title: Big Data Analytics Using Apache Hadoop: A Case Study on Different
Fertiliser Requirements and Availability in Different States of India from
2012-2013 to 2014-2015

Abstract:
This case study explores the application of big data analytics using Apache
Hadoop to analyse the fertiliser requirements and availability across
various states in India from 2012-2013 to 2014-2015. The study aims to
identify patterns, trends, and insights that can help policymakers and
stakeholders make informed decisions regarding fertiliser allocation and
distribution.

1. Introduction:
The agriculture sector plays a crucial role in India's economy, and fertiliser
usage is a critical factor in achieving higher agricultural productivity. By
harnessing big data analytics, this study seeks to understand the fertiliser
requirements and availability in different states of India during the
specified time period.

2. Methodology:

2.1 Data Collection:


The study gathers data from various sources, including government
reports, agricultural surveys, and fertiliser production and distribution
records. The dataset includes information on fertiliser types, quantities
used, states, and years.

2.2 Data Preprocessing:


The collected data is cleaned, integrated, and transformed to ensure
consistency and compatibility. Missing values, outliers, and
inconsistencies are handled appropriately.

2.3 Data Storage and Processing:


Apache Hadoop, a widely used big data framework, is employed to store
and process the large-scale dataset. Hadoop Distributed File System
(HDFS) stores the data, while Hadoop MapReduce facilitates distributed
processing.
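
As an illustration of this step, loading the dataset into HDFS and submitting a
MapReduce job could look like the commands below. The file name
fertiliser_2012_2015.csv, the HDFS paths, and the analysis jar are hypothetical
placeholders used only for illustration.

Command: hdfs dfs -mkdir -p /data/fertiliser
Command: hdfs dfs -put fertiliser_2012_2015.csv /data/fertiliser/
Command: hadoop jar fertiliser-analysis.jar FertiliserAnalysis /data/fertiliser /output/fertiliser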

3. Analysis and Results:


3.1 Fertiliser Requirements:
The study examines the fertiliser requirements across different states
during the specified time period. It analyses the demand for various
fertiliser types, such as nitrogen, phosphorus, and potassium-based
fertilisers, and identifies the states with the highest requirements.

3.2 Fertiliser Availability:


The availability of different fertilisers in each state is investigated. The
study explores the distribution patterns and identifies any supply-demand
gaps in specific regions.

3.3 Temporal Analysis:


Temporal analysis is conducted to observe the trends and changes in
fertiliser requirements and availability over the three-year period. Seasonal
variations and long-term patterns are studied to understand the dynamics
of fertiliser usage.

4. Insights and Recommendations:


Based on the analysis, the study provides valuable insights into the fertiliser
requirements and availability in different states of India. These insights can be
used by policymakers, agricultural experts, and fertiliser manufacturers to
optimise allocation, distribution, and production strategies.

5. Conclusion:
The case study demonstrates the application of big data analytics using
Apache Hadoop for analysing fertiliser requirements and availability in
various states of India. The study's findings offer valuable insights into
optimising fertiliser allocation and distribution, ultimately contributing to
improved agricultural productivity and sustainability.

6. Limitations and Future Work:


The study acknowledges potential limitations such as data quality,
representativeness, and the scope of analysis. Future work can focus on
incorporating more recent data, including additional variables such as crop
types, rainfall, and soil quality, to enhance the analysis and provide more
comprehensive recommendations.

By leveraging big data analytics, this case study contributes to data-driven
decision-making in the agricultural domain, enabling stakeholders to make
informed choices for a more efficient and sustainable fertiliser management
system in India.

In the above case study, Apache Hadoop is used as the primary big data
framework to store, process, and analyse the large-scale dataset related to
fertiliser requirements and availability in different states of India from
2012-2013 to 2014-2015. Here's how Hadoop is utilised:

1. Data Storage:
Hadoop Distributed File System (HDFS) is employed to store the dataset.
HDFS is a distributed file system designed to store large volumes of data
across multiple machines in a cluster. It provides fault tolerance and high
scalability, allowing the dataset to be partitioned and distributed across the
cluster.

2. Data Processing:

Hadoop MapReduce, a programming model and processing framework, is used
for distributed data processing. MapReduce allows for parallel processing of
data across the Hadoop cluster, enabling efficient computation on large
datasets.

3. Data Preprocessing:
Before analysis, the collected data is preprocessed using Hadoop. This
involves cleaning the data, handling missing values, removing outliers, and
transforming the data to ensure consistency and compatibility. Hadoop's
distributed processing capability helps handle the preprocessing tasks
efficiently.

4. Analysis:
Hadoop's MapReduce is leveraged to perform various analytical tasks on
the dataset. MapReduce divides the analysis into two phases: the Map
phase and the Reduce phase. During the Map phase, the dataset is
processed in parallel across the cluster, generating intermediate results. In
the Reduce phase, the intermediate results are combined to produce the
final analysis output.

5. Scalability and Performance:


Hadoop's distributed nature allows for scalability and improved
performance in handling large datasets. By distributing the data and
computation across multiple nodes, Hadoop can process vast amounts of
data in a parallel and distributed manner, reducing the overall processing
time.

6. Insights and Recommendations:


The analysis performed using Hadoop helps derive insights and
recommendations regarding fertilizer requirements and availability. The
results obtained from Hadoop-based analysis can guide policymakers,
agricultural experts, and fertilizer manufacturers in making informed
decisions regarding allocation, distribution, and production strategies.

Overall, Hadoop serves as a foundational technology in this case study,
enabling efficient storage, processing, and analysis of big data related to
fertilizer requirements and availability in India. It empowers data-driven
decision-making in the agricultural domain by handling the complexity and
scale of the dataset effectively.

OUTPUT
