ADM Hadoop
A Report On
“HADOOP Configuration”
MSBTE, PUNE
Page 1 of 25
Computer Engineering
CERTIFICATE
This is to certify that:
SR NO. Enrollment No Name
1 2109830054 Thorat Om Avinash
2 2209830253 Bhandari Mansi Tejkumar
3 2209830259 Inamdar Jameer Rafik
TABLE OF CONTENTS
2. Introduction
3. Project Objective
4. What is Hadoop
5. Features of Hadoop
7. Conclusion
8. Reference
ABSTRACT
What is Hadoop
Hadoop is an open-source framework from Apache that is used to store,
process, and analyze very large volumes of data. Hadoop is written in Java
and is not an OLAP (online analytical processing) system; it is used for
batch/offline processing. It is used by Facebook, Yahoo, Google, Twitter,
LinkedIn, and many more.
Moreover, it can be scaled out simply by adding nodes to the cluster.
ACTION PLAN:
INTRODUCTION :
Applications built using Hadoop run on large data sets distributed across
clusters of commodity computers. Commodity computers are cheap and widely
available, which makes them useful for achieving greater computational power
at low cost.
PROJECT OBJECTIVE
2) HDFS (Hadoop Distributed File System): HDFS takes care of the storage part
of Hadoop applications. MapReduce applications consume data from HDFS.
HDFS creates multiple replicas of data blocks and distributes them on compute
nodes in a cluster. This distribution enables reliable and extremely rapid
computations.
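The way HDFS splits a file into blocks and spreads replicas over nodes can be illustrated with a small sketch. It is only a simplified model: the node names and the round-robin placement rule are assumptions made for illustration (real HDFS uses a rack-aware placement policy), while the 128 MB block size and replication factor of 3 are the usual Hadoop 2.x defaults.

```python
import math

def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return how many fixed-size blocks are needed to hold the file."""
    return math.ceil(file_size_mb / block_size_mb)

def place_replicas(num_blocks, nodes, replication=3):
    """Assign every block to `replication` distinct nodes (simple round-robin)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            nodes[(block + r) % len(nodes)] for r in range(replication)
        ]
    return placement

# Hypothetical four-node cluster storing a 300 MB file.
nodes = ["node1", "node2", "node3", "node4"]
blocks = split_into_blocks(300)
placement = place_replicas(blocks, nodes)
for block, holders in placement.items():
    print(block, holders)
```

Because every block lives on several nodes, a MapReduce task can usually be scheduled on a node that already holds the data it needs.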
Features Of ‘Hadoop’
• Fault Tolerance
The Hadoop ecosystem has a provision to replicate the input data onto other
cluster nodes. That way, if a cluster node fails, data processing can still
proceed by using the copy of the data stored on another cluster node.
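The fault-tolerance behaviour described above can be sketched as a small simulation. The block-to-node mapping and node names below are hypothetical, and the lookup is a simplification of what an HDFS client actually does; the point is only that a read succeeds as long as one replica is on a live node.

```python
def read_block(block_id, placement, live_nodes):
    """Return the first live node that holds a replica of the block."""
    for node in placement[block_id]:
        if node in live_nodes:
            return node
    raise IOError(f"block {block_id} has no live replica")

# Hypothetical placement: each block is stored on two nodes.
placement = {0: ["node1", "node2"], 1: ["node2", "node3"]}
live = {"node1", "node2", "node3"}

live.discard("node2")  # simulate the failure of one cluster node
sources = {block: read_block(block, placement, live) for block in placement}
print(sources)
```

Both blocks remain readable after node2 fails, because each has a second replica on a different node.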
Modules of Hadoop
1) HDFS: Hadoop Distributed File System. HDFS was developed on the basis of
the Google File System (GFS) paper published by Google. Files are broken into
blocks and stored on nodes across the distributed architecture.
2) YARN: Yet Another Resource Negotiator is used for job scheduling and for
managing the cluster.
3) MapReduce: This is a framework that helps Java programs perform parallel
computation on data using key-value pairs. The Map task takes input data
and converts it into a data set of key-value pairs. The output of the Map
task is consumed by the Reduce task, and the output of the reducer gives the
desired result.
4) Hadoop Common: These Java libraries are used to start Hadoop and are used
by the other Hadoop modules.
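The key-value flow of MapReduce described in the list above can be sketched with a word count. This is plain Python running in a single process, not the Hadoop Java API; the function names and sample lines are chosen only for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)

lines = ["Hadoop stores data", "Hadoop processes data"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

In a real cluster the map calls run in parallel on the nodes that hold the input blocks, and the framework performs the shuffle between the map and reduce tasks.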
Installation of Hadoop
Step 1: Download the Java 8 package and save the file in your home
directory.
Step 2: Extract the Java Tar File.
Command: tar -xvf jdk-8u101-linux-i586.tar.gz
Step 5: Add the Hadoop and Java paths to the bash configuration file
(.bashrc). Open the .bashrc file and add the Hadoop and Java paths to it.
Command: vi .bashrc
To apply these changes to the current terminal, execute the source command
(source .bashrc).
To make sure that Java and Hadoop have been properly installed on your system and
can be accessed through the Terminal, execute the java -version and hadoop version
commands.
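Before running the version commands, it can help to confirm that both executables are actually reachable through PATH. The following is a minimal sketch using only the Python standard library; the helper name find_tools is hypothetical.

```python
import shutil

def find_tools(names=("java", "hadoop")):
    """Map each command name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}

for name, path in find_tools().items():
    print(f"{name}: {path or 'not found on PATH'}")
```

If either entry is reported as not found, the paths added to .bashrc are the first thing to recheck.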
Command: cd hadoop-2.7.3/etc/hadoop/
Command: ls
Step 7: Open core-site.xml and edit the required property inside the
configuration tag:
core-site.xml tells the Hadoop daemons where the NameNode runs in the cluster.
It contains configuration settings of the Hadoop core, such as I/O settings
that are common to HDFS and MapReduce.
Command: vi core-site.xml
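As an illustration of the file's structure, the sketch below builds a Hadoop-style configuration document with one property. The property fs.defaultFS is the standard setting for the NameNode address; the value hdfs://localhost:9000 is an assumption that is common in single-node setups, not a value taken from this report, and build_site_xml is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

def build_site_xml(properties):
    """Render a dict of settings as a Hadoop <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):  # pretty-print where available (Python 3.9+)
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")

# Assumed single-node value; adjust host and port for a real cluster.
core_site = build_site_xml({"fs.defaultFS": "hdfs://localhost:9000"})
print(core_site)
```

The same name/value layout is used by the other site files edited in the following steps.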
Step 8: Open hdfs-site.xml and edit the required property inside the
configuration tag:
hdfs-site.xml contains configuration settings of the HDFS daemons (i.e.
NameNode, DataNode, Secondary NameNode). It also includes the replication
factor and block size of HDFS.
Command: vi hdfs-site.xml
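To show how the replication factor and block size appear in this file, the sketch below parses a sample document and reads both values. The sample content is an assumption: dfs.replication and dfs.blocksize are real HDFS property names, but the values (1 replica, 128 MB) are illustrative choices for a single-node setup, and read_properties is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Assumed sample content, not the file from this report.
SAMPLE_HDFS_SITE = """<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
  </property>
</configuration>"""

def read_properties(xml_text):
    """Return the name/value pairs of a Hadoop configuration document."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value")
            for p in root.findall("property")}

props = read_properties(SAMPLE_HDFS_SITE)
replication = int(props["dfs.replication"])
block_mb = int(props["dfs.blocksize"]) // (1024 * 1024)  # bytes to MB
print(replication, block_mb)
```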
Step 9: Open the mapred-site.xml file and edit the required property
inside the configuration tag:
mapred-site.xml contains configuration settings of MapReduce applications,
such as the number of JVMs that can run in parallel, the size of the mapper
and reducer processes, the CPU cores available for a process, etc.
In some cases, the mapred-site.xml file is not available. In that case, we
have to create it from the mapred-site.xml.template file.
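Creating the file from its template amounts to a copy that must not overwrite an existing configuration. The sketch below shows that logic; ensure_mapred_site is a hypothetical helper, and the demonstration runs in a temporary directory with a stand-in template rather than in a real Hadoop installation.

```python
import shutil
import tempfile
from pathlib import Path

def ensure_mapred_site(conf_dir):
    """Copy the template to mapred-site.xml unless the file already exists.

    Returns True if the file was created, False if it was already present.
    """
    conf_dir = Path(conf_dir)
    target = conf_dir / "mapred-site.xml"
    template = conf_dir / "mapred-site.xml.template"
    if target.exists():
        return False
    shutil.copy(template, target)
    return True

# Demonstration in a throwaway directory with a stand-in template file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "mapred-site.xml.template").write_text(
        "<configuration>\n</configuration>\n"
    )
    created_first = ensure_mapred_site(tmp)
    created_again = ensure_mapred_site(tmp)
    print(created_first, created_again)
```

On a real installation the directory would be the Hadoop configuration directory entered earlier (hadoop-2.7.3/etc/hadoop/).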
Step 10: Open yarn-site.xml and edit the properties shown below inside the
configuration tag:
yarn-site.xml contains configuration settings of the ResourceManager and
NodeManager, such as application memory management size and the operations
needed on programs and algorithms.
Command: vi yarn-site.xml
<?xml version="1.0"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
CONCLUSION