BDA AAT
Post Box No.: 1908, Bull Temple Road, Bengaluru – 560 019
AAT
HADOOP
Submitted by
Student Name:
Team Member 1: A Vaishnavi
Team Member 2: Abitha Bala Subramani
USN:
Team Member 1: 1BM20AI053
Team Member 2: 1BM20AI065
Semester & Section: 6A
Student Signature:
Team Member 1:
Team Member 2:
Introduction
Hadoop has gained popularity due to its ability to handle massive volumes of data
and its scalability. It allows organizations to store, process, and analyze data that
would be impractical or infeasible to handle using traditional databases or single-
node systems. Hadoop is widely used in various industries, including finance,
healthcare, retail, social media, and more, for tasks like log processing, data
warehousing, recommendation systems, and large-scale data analytics.
Working Principle
The working principle of Hadoop revolves around its two key components: the Hadoop Distributed File System (HDFS) and the MapReduce processing framework.
The overall working principle of Hadoop involves storing large datasets across a
distributed file system (HDFS) and processing the data using the MapReduce
framework. HDFS ensures data reliability and fault tolerance, while MapReduce
enables parallel processing of the data for distributed computing. This combination
allows Hadoop to handle large-scale data processing tasks efficiently and reliably
across a cluster of machines.
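As a minimal sketch of this flow (assuming Hadoop 2.7.3, the release used later in this guide, and an illustrative file name), a file can be loaded into HDFS and processed with the bundled word-count MapReduce example:
Command: hdfs dfs -mkdir -p /input
Command: hdfs dfs -put sample.txt /input
Command: hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /input /output
Command: hdfs dfs -cat /output/part-r-00000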
Installation
Step 1: Download the Java 8 Package. Save this file in your home
directory.
Step 2: Extract the Java Tar File.
Command: tar -xvf jdk-8u101-linux-i586.tar.gz
Step 5: Add the Hadoop and Java paths in the bash file (.bashrc).
Open the .bashrc file and add the Hadoop and Java paths as shown below.
Command: vi .bashrc
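For example, assuming both archives were extracted in the home directory (adjust the paths to wherever they were actually unpacked), the entries could look like this:
export JAVA_HOME=$HOME/jdk1.8.0_101
export HADOOP_HOME=$HOME/hadoop-2.7.3
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin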
To apply these changes to the current terminal session, execute the source command.
Command: source .bashrc
To make sure that Java and Hadoop have been properly installed
on your system and can be accessed through the Terminal, execute
the java -version and hadoop version commands.
Command: java -version
Command: hadoop version
Command: ls
Step 7: Open core-site.xml and edit the property mentioned below inside the configuration tag:
core-site.xml informs the Hadoop daemon where the NameNode runs in the cluster. It contains configuration settings of Hadoop core, such as I/O settings that are common to HDFS and MapReduce.
Command: vi core-site.xml
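For a single-node setup, the property typically placed inside the configuration tag is the default file system URI; the host and port below are common defaults, not mandated values:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>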
Step 8: Open hdfs-site.xml and edit the property mentioned below inside the configuration tag:
hdfs-site.xml contains configuration settings of HDFS daemons (i.e.
NameNode, DataNode, Secondary NameNode). It also includes the
replication factor and block size of HDFS.
Command: vi hdfs-site.xml
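For example, on a single-node cluster the replication factor is usually lowered to 1 (the default of 3 is kept on multi-node clusters):
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>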
Step 9: Open the mapred-site.xml file and edit the property mentioned below inside the configuration tag:
mapred-site.xml contains configuration settings for the MapReduce application, such as the number of JVMs that can run in parallel, the size of the mapper and reducer processes, the CPU cores available for a process, etc.
Command: vi mapred-site.xml
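A typical entry, assuming YARN is used as the execution framework, is:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>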
Step 10: Open yarn-site.xml and edit the property mentioned below inside the configuration tag:
yarn-site.xml contains configuration settings for the ResourceManager and NodeManager, such as the application memory management size, the operations needed on programs and algorithms, etc.
Command: vi yarn-site.xml
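A common minimal entry, which enables the shuffle service that MapReduce jobs need on YARN, looks like this:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>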
Step 11: Edit hadoop-env.sh and add the Java Path as mentioned below:
hadoop-env.sh contains the environment variables that are used in the
script to run Hadoop like Java home path, etc.
Command: vi hadoop-env.sh
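Assuming the same JDK location used in .bashrc above (adjust to the actual path on the system), the line to add is:
export JAVA_HOME=$HOME/jdk1.8.0_101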
Command: cd hadoop-2.7.3
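The format command itself is not reproduced here; on a Hadoop 2.7.3 layout it is typically run from the Hadoop home directory as:
Command: bin/hadoop namenode -format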
This formats HDFS via the NameNode. This command is executed only the first time. Formatting the file system means initializing the directory specified by the dfs.name.dir variable. Never format a running Hadoop file system; you will lose all the data stored in HDFS.
Command: ./start-all.sh
Start NameNode:
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files stored in HDFS and tracks all the files stored across the cluster.
Start DataNode:
On startup, a DataNode connects to the NameNode and responds to requests from the NameNode for different operations.
Start ResourceManager:
The ResourceManager is the master that arbitrates all the available cluster resources and thus helps manage the distributed applications running on the YARN system. Its job is to manage each NodeManager and each application's ApplicationMaster.
Start NodeManager:
The NodeManager on each machine is the framework agent responsible for managing containers, monitoring their resource usage, and reporting the same to the ResourceManager.
Start JobHistoryServer:
The JobHistoryServer is responsible for servicing all job-history-related requests from clients.
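If the daemons are to be started individually instead of with start-all.sh, the usual scripts shipped in the sbin directory of Hadoop 2.7.3 are:
Command: ./hadoop-daemon.sh start namenode
Command: ./hadoop-daemon.sh start datanode
Command: ./yarn-daemon.sh start resourcemanager
Command: ./yarn-daemon.sh start nodemanager
Command: ./mr-jobhistory-daemon.sh start historyserver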
Step 14: To check that all the Hadoop services are up and running, run the
below command.
Command: jps
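If all services started correctly, the jps output should list roughly the following daemons (process IDs will differ from machine to machine):
NameNode
DataNode
SecondaryNameNode
ResourceManager
NodeManager
JobHistoryServer
Jps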
CASE STUDY
Abstract:
This case study explores the application of big data analytics using Apache
Hadoop to analyse the fertiliser requirements and availability across
various states in India from 2012-2013 to 2014-2015. The study aims to
identify patterns, trends, and insights that can help policymakers and
stakeholders make informed decisions regarding fertiliser allocation and
distribution.
1. Introduction:
The agriculture sector plays a crucial role in India's economy, and fertiliser
usage is a critical factor in achieving higher agricultural productivity. By
harnessing big data analytics, this study seeks to understand the fertiliser
requirements and availability in different states of India during the
specified time period.
2. Methodology:
5. Conclusion:
The case study demonstrates the application of big data analytics using
Apache Hadoop for analysing fertiliser requirements and availability in
various states of India. The study's findings offer valuable insights into
optimising fertiliser allocation and distribution, ultimately contributing to
improved agricultural productivity and sustainability.
In the above case study, Apache Hadoop is used as the primary big data
framework to store, process, and analyse the large-scale dataset related to
fertiliser requirements and availability in different states of India from
2012-2013 to 2014-2015. Here's how Hadoop is utilised:
1. Data Storage:
Hadoop Distributed File System (HDFS) is employed to store the dataset.
HDFS is a distributed file system designed to store large volumes of data
across multiple machines in a cluster. It provides fault tolerance and high
scalability, allowing the dataset to be partitioned and distributed across the
cluster.
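As an illustration of this step (the directory layout and file name are hypothetical), the state-wise fertiliser dataset could be loaded into HDFS with commands such as:
Command: hdfs dfs -mkdir -p /fertiliser/input
Command: hdfs dfs -put fertiliser_data_2012_2015.csv /fertiliser/input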
2. Data Processing:
3. Data Preprocessing:
Before analysis, the collected data is preprocessed using Hadoop. This
involves cleaning the data, handling missing values, removing outliers, and
transforming the data to ensure consistency and compatibility. Hadoop's
distributed processing capability helps handle the preprocessing tasks
efficiently.
4. Analysis:
Hadoop's MapReduce is leveraged to perform various analytical tasks on
the dataset. MapReduce divides the analysis into two phases: the Map
phase and the Reduce phase. During the Map phase, the dataset is
processed in parallel across the cluster, generating intermediate results. In
the Reduce phase, the intermediate results are combined to produce the
final analysis output.
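A MapReduce job packaged for this analysis would be submitted in the usual way; the jar and driver class names below are hypothetical placeholders for whatever job implements the Map and Reduce phases described above:
Command: hadoop jar fertiliser-analysis.jar FertiliserAnalysisDriver /fertiliser/input /fertiliser/output
Command: hdfs dfs -cat /fertiliser/output/part-r-00000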
OUTPUT