Hadoop

An open-source framework that allows distributed processing of large data sets across clusters of commodity hardware.

It provides a scalable, fault-tolerant, and cost-effective solution for handling big data.

The key components of the Hadoop framework are the Hadoop Distributed File System (HDFS), Hadoop YARN, and MapReduce.
Key features of Hadoop:
 Distributed processing
 Fault tolerance
 Easy to use
 Economic
 Scalability
 Open source

Nodes: Master Node and Slave (Worker) Node


 The NameNode manages the distributed file system and knows where data blocks are stored inside the cluster.

 The ResourceManager manages the YARN jobs and takes care of scheduling and executing processes on the worker nodes.

 The DataNode manages the physical data stored on the node.

 The NodeManager manages the execution of tasks on the node.


[Diagram: a job is divided into sub-works that are distributed across the nodes of the cluster]


1. Prerequisites
2. Hadoop Installation
3. SSH Configuration
4. Configuration
5. Hadoop Daemons Setup
6. Verify the Cluster Setup
7. Scaling the Cluster
 GNU/Linux based operating system

 Java installation

 Hardware requirements such as RAM, hard disk drives for data storage, and processors
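Quick checks for these prerequisites on each node, a minimal sketch (exact package names and outputs vary by distribution):

# Check the Java installation (Hadoop 3.1.x runs on Java 8)
java -version

# Check available memory, disk space, and CPU cores
free -h
df -h
nproc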
wget http://apache.cs.utah.edu/hadoop/common/current/hadoop-3.1.2.tar.gz

tar -xzf hadoop-3.1.2.tar.gz

mv hadoop-3.1.2 hadoop
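Optionally, environment variables are commonly added to ~/.bashrc so that the hdfs, yarn, and start-*.sh commands used later resolve on the PATH; the path below assumes the hadoop directory created above lives under /home/hadoop:

# Point HADOOP_HOME at the unpacked installation and expose its scripts
export HADOOP_HOME=/home/hadoop/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin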
Enable passwordless SSH access between all machines in the cluster to facilitate communication and remote execution.

Generate an SSH key pair on the machine designated as the master node:
• ssh-keygen -t rsa (press Enter 4 times; do this on all the nodes)
• id_rsa.pub (in the .ssh directory; copy all the public keys into a new file named authorized_keys)

Test SSH connectivity by logging into each machine from the master node using SSH, without requiring a password.
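One common way to carry out these steps, a sketch assuming a hadoop user and worker hostnames node1 and node2 (substitute your own):

# On the master node: generate the key pair (accept the defaults)
ssh-keygen -t rsa

# Authorize the key locally and copy it to each worker
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh-copy-id hadoop@node1
ssh-copy-id hadoop@node2

# Verify passwordless login from the master
ssh hadoop@node1 hostname
ssh hadoop@node2 hostname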
Configure the Hadoop files on both master and slave nodes for clustering
 “etc/hadoop” (the Hadoop configuration files are located here)
 Core Hadoop configuration files
▪ hadoop-env.sh (set the JAVA_HOME path)
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre

▪ core-site.xml [set the default file system to HDFS (fs.defaultFS) and the path to the tmp folder (hadoop.tmp.dir)]
<property>
<name>fs.defaultFS</name>
<value>hdfs://master_node_IP:9000</value>
</property>
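The hadoop.tmp.dir property mentioned above can be set in the same file; a sketch assuming /home/hadoop/hadooptmp as the temporary directory (any local path writable by the Hadoop user works):
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/hadooptmp</value>
</property>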
▪ hdfs-site.xml (set the HDFS replication factor and the data directories)
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/path/to/data/dir1,/path/to/data/dir2</value>
</property>
On the master node, also set the NameNode metadata directory (dfs.namenode.name.dir).
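A sketch of that master-node property, assuming /home/hadoop/namenode as the NameNode metadata directory:
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/hadoop/namenode</value>
</property>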

▪ yarn-site.xml (set the ResourceManager properties)
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master_node_IP</value>
</property>

▪ mapred-site.xml (set the MapReduce framework to YARN)
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
 slave / worker nodes
Add the list of all worker nodes in the cluster to /hadoop/etc/hadoop/workers:
Node1
Node2
(In Hadoop 3.x this file is named workers; a separate slaves file is not required.)
 /home/hadoop/hadoop/etc/hadoop/yarn-site.xml
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>1536</value>
</property>

<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>1536</value>
</property>

<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>128</value>
</property>

<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
 /hadoop/etc/hadoop/mapred-site.xml
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>512</value>
</property>

<property>
<name>mapreduce.map.memory.mb</name>
<value>256</value>
</property>

<property>
<name>mapreduce.reduce.memory.mb</name>
<value>256</value>
</property>

Summary of memory settings (values in MB):
yarn.nodemanager.resource.memory-mb      1536
yarn.scheduler.maximum-allocation-mb     1536
yarn.scheduler.minimum-allocation-mb     128
yarn.app.mapreduce.am.resource.mb        512
mapreduce.map.memory.mb                  256
mapreduce.reduce.memory.mb               256
 bin/hdfs namenode -format (format the HDFS file system before first use)

 On the master node, start the NameNode daemon and the ResourceManager daemon by running the appropriate scripts (start-dfs.sh and start-yarn.sh, respectively).
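A sketch of this sequence, run from the Hadoop installation directory on the master node (the JobHistoryServer step is optional but serves the history UI mentioned in the next section):

# Format the HDFS NameNode (first time only; this erases existing HDFS metadata)
bin/hdfs namenode -format

# Start HDFS daemons (NameNode on the master, DataNodes on the workers)
sbin/start-dfs.sh

# Start YARN daemons (ResourceManager on the master, NodeManagers on the workers)
sbin/start-yarn.sh

# Optionally start the MapReduce JobHistoryServer
bin/mapred --daemon start historyserver

# List the running Java daemons to confirm they started
jps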
Access the Hadoop web interfaces to ensure the daemons are running correctly. The primary interfaces are:
 NameNode UI (HDFS status): http://localhost:9870/
 ResourceManager UI (YARN status): http://localhost:8088/
 JobHistoryServer UI: http://localhost:19888/
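A quick check from the master node that the daemons and UIs respond, assuming the default ports above:

# HTTP status of each web UI (expect 200)
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9870/
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8088/
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:19888/

# HDFS capacity report, including the list of live DataNodes
bin/hdfs dfsadmin -report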

Test HDFS functionality by creating directories, uploading files, and listing or retrieving data from HDFS using the Hadoop command-line interface (CLI).
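A minimal sketch of such a test, assuming a hadoop user and a local file named sample.txt (both are placeholders):

# Create a directory in HDFS and upload a local file
bin/hdfs dfs -mkdir -p /user/hadoop/test
bin/hdfs dfs -put sample.txt /user/hadoop/test/

# List and read the data back
bin/hdfs dfs -ls /user/hadoop/test
bin/hdfs dfs -cat /user/hadoop/test/sample.txt

# Retrieve the file from HDFS to the local file system
bin/hdfs dfs -get /user/hadoop/test/sample.txt /tmp/sample_copy.txt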

Submit sample MapReduce jobs to validate the functionality of the MapReduce framework.
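For example, the pi estimator bundled in the Hadoop examples jar can be submitted; the jar version below matches the hadoop-3.1.2 release installed earlier:

# Estimate pi with 4 map tasks and 100 samples per map
bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar pi 4 100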
 To expand the cluster, repeat the steps for setting up
additional worker nodes, ensuring they have the
same Hadoop installation, configuration files, and
SSH access.
 Update the configuration files (workers, yarn-
site.xml) on the master node to include the new
worker nodes.
 Restart the necessary Hadoop daemons (start-dfs.sh
and start-yarn.sh) to incorporate the changes.
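A sketch of bringing one new worker online, assuming it is named node3 and already has the same installation, configuration, and SSH access; starting its daemons directly is an alternative to a full restart with start-dfs.sh and start-yarn.sh:

# On the master: add the new node to the workers file
echo "node3" >> /home/hadoop/hadoop/etc/hadoop/workers

# On the new node: start the DataNode and NodeManager daemons
bin/hdfs --daemon start datanode
bin/yarn --daemon start nodemanager

# On the master: confirm the new node has joined the cluster
bin/hdfs dfsadmin -report
bin/yarn node -list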
