Lab 1
Hadoop Cluster
What is Hadoop?
Hadoop is an open-source Apache project that allows creation of parallel processing
applications on large data sets, distributed across networked nodes. It is composed of the
Hadoop Distributed File System (HDFS™) that handles scalability and redundancy of
data across nodes, and Hadoop YARN, a framework for job scheduling that executes data
processing tasks on all nodes.
Run the steps in this guide from masterX unless otherwise specified.
2. Add a private IP address to each node so that the nodes in your cluster can
communicate over the private network, adding a layer of security.
3. Follow the Securing Your Server guide to harden each of the three servers. Create a
normal user for the Hadoop installation, and a user called dis for the Hadoop
daemons. Do not create SSH keys for the dis user yet; SSH keys will be addressed in a
later section.
4. Install the JDK using the appropriate guide for your distribution (Debian, CentOS,
or Ubuntu), or install the latest JDK from Oracle.
5. The steps below use example IPs for each node. Adjust each example according to
your configuration:
o masterX: 10.10.28.(X*10+10)
o workerX1: 10.10.28.(X*10+11)
o workerX2: 10.10.28.(X*10+12)
Note
- This guide is written for a non-root user. Commands that require elevated
privileges are prefixed with sudo. If you’re not familiar with the sudo command, see
the Users and Groups guide. All commands in this guide are run with the dis user if
not specified otherwise.
- SSH port for masterX is (2010+10*X)
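For example, if X = 2, the addresses are 10.10.28.30 (masterX), 10.10.28.31 (workerX1), and
10.10.28.32 (workerX2), and the masterX SSH port is 2030.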
A master node maintains knowledge about the distributed file system, like the inode table
on an ext3 filesystem, and schedules resource allocation. masterX will handle this role in
this guide, and host two daemons:
• The NameNode manages the distributed file system and knows where data blocks
are stored inside the cluster.
• The ResourceManager manages the YARN jobs and takes care of scheduling and
executing processes on worker nodes.
Worker nodes store the actual data and provide processing power to run the jobs. They’ll
be workerX1 and workerX2, and will host two daemons:
• The DataNode manages the data physically stored on the node.
• The NodeManager manages execution of tasks on the node.
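Each node needs to be able to reach the others by hostname. Edit /etc/hosts on all three
servers to include the following entries: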
File: /etc/hosts
1 10.10.28.(X*10+10) masterX
2 10.10.28.(X*10+11) workerX1
3 10.10.28.(X*10+12) workerX2
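1. Log in to masterX as the dis user and generate an SSH key pair for passwordless logins
to the worker nodes. The -b 4096 key size below is a suggestion; accept the default file
location when prompted:
2. ssh-keygen -b 4096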
When generating this key, leave the password field blank so your Hadoop user can
communicate unprompted.
3. View the masterX public key and copy it to your clipboard to use with each of your
worker nodes.
4. less /home/dis/.ssh/id_rsa.pub
5. In each node, make a new file master.pub in the /home/dis/.ssh directory. Paste
your public key into this file and save your changes.
6. Copy your key file into the authorized key store.
7. cat ~/.ssh/master.pub >> ~/.ssh/authorized_keys
cd /opt
wget http://apache.cs.utah.edu/hadoop/common/current/hadoop-3.3.1.tar.gz
tar -xzf hadoop-3.3.1.tar.gz
mv hadoop-3.3.1 hadoop
File: ~/.profile
1 PATH=/opt/hadoop/bin:/opt/hadoop/sbin:$PATH
2. Add Hadoop to your PATH for the shell. Edit .bashrc and add the following lines:
File: ~/.bashrc
1 export HADOOP_HOME=/opt/hadoop
2 export PATH=${PATH}:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin
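3. To apply the new PATH in your current shell without logging out and back in, source the
file:
source ~/.bashrc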
Set JAVA_HOME
1. Find your Java installation path. This is known as JAVA_HOME. If you installed
OpenJDK from your package manager, you can find the path with the command:
2. update-alternatives --display java
Take the value of the current link and remove the trailing /bin/java. For example on
Debian, the link is /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java, so
JAVA_HOME should be /usr/lib/jvm/java-8-openjdk-amd64/jre.
If you installed Java from Oracle, JAVA_HOME is the path where you unzipped the
Java archive.
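3. Set JAVA_HOME for the Hadoop daemons in /opt/hadoop/etc/hadoop/hadoop-env.sh.
The line below is a sketch using the Debian OpenJDK 8 path from the example above;
adjust it to the JAVA_HOME value you found in the previous step:

File: /opt/hadoop/etc/hadoop/hadoop-env.sh

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre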
File: /opt/hadoop/etc/hadoop/core-site.xml
1 <?xml version="1.0" encoding="UTF-8"?>
2 <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
3 <configuration>
4 <property>
5 <name>fs.default.name</name>
6 <value>hdfs://masterX:9000</value>
7 </property>
8 </configuration>
File: /opt/hadoop/etc/hadoop/hdfs-site.xml
1 <configuration>
2 <property>
3 <name>dfs.namenode.name.dir</name>
4 <value>/opt/hadoop/data/nameNode</value>
5 </property>
6
7 <property>
8 <name>dfs.datanode.data.dir</name>
9 <value>/opt/hadoop/data/dataNode</value>
10 </property>
11
12 <property>
13 <name>dfs.replication</name>
14 <value>1</value>
15 </property>
16 </configuration>
The last property, dfs.replication, indicates how many times data is replicated in the
cluster. You can set it to 2 to have all the data duplicated on both worker nodes. Don't enter
a value higher than the actual number of worker nodes.
Set YARN as Job Scheduler
Edit the mapred-site.xml file, setting YARN as the default framework for MapReduce
operations:
File: /opt/hadoop/etc/hadoop/mapred-site.xml
1 <configuration>
2 <property>
3 <name>mapreduce.framework.name</name>
4 <value>yarn</value>
5 </property>
6 <property>
7 <name>yarn.app.mapreduce.am.env</name>
8 <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
9 </property>
10 <property>
11 <name>mapreduce.map.env</name>
12 <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
13 </property>
14 <property>
15 <name>mapreduce.reduce.env</name>
16 <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
17 </property>
18 </configuration>
Configure YARN
Edit yarn-site.xml, which contains the configuration options for YARN. In the value field
of the yarn.resourcemanager.hostname property, replace 203.0.113.0 with the IP address of
masterX:
File: /opt/hadoop/etc/hadoop/yarn-site.xml
1 <configuration>
2 <property>
3 <name>yarn.acl.enable</name>
4 <value>0</value>
5 </property>
6
7 <property>
8 <name>yarn.resourcemanager.hostname</name>
9 <value>203.0.113.0</value>
10 </property>
11
12 <property>
13 <name>yarn.nodemanager.aux-services</name>
14 <value>mapreduce_shuffle</value>
15 </property>
16 </configuration>
Configure Workers
The file workers is used by startup scripts to start required daemons on all nodes. Edit
/opt/hadoop/etc/hadoop/workers to include both of the nodes:
File: /opt/hadoop/etc/hadoop/workers
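workerX1
workerX2

Configure Memory Allocation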
In YARN, both the ApplicationMaster and the map and reduce tasks run in containers on
worker nodes. Each worker node runs a NodeManager daemon that's responsible for
container creation on the node. The whole cluster is managed by a ResourceManager that
schedules container allocation on all the worker nodes, depending on capacity
requirements and current load.
Four types of resource allocations need to be configured properly for the cluster to work.
These are:
1. How much memory can be allocated for YARN containers on a single node. This
limit should be higher than all the others; otherwise, container allocation will be
rejected and applications will fail. However, it should not be the entire amount of
RAM on the node.
2. How much memory a single container can consume, and the minimum memory
allocation allowed. A container will never be bigger than the maximum (otherwise
allocation fails), and is always allocated as a multiple of the minimum amount
of RAM.
3. How much memory will be allocated to the ApplicationMaster. This is a constant
value that should fit within the maximum container size.
4. How much memory will be allocated to each map or reduce operation. This should
be less than the maximum size.
Use the following sample values for these properties:
Property Value
yarn.nodemanager.resource.memory-mb 1536
yarn.scheduler.maximum-allocation-mb 1536
yarn.scheduler.minimum-allocation-mb 128
yarn.app.mapreduce.am.resource.mb 512
mapreduce.map.memory.mb 256
mapreduce.reduce.memory.mb 256
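The first three values in the table are YARN settings; add them to yarn-site.xml alongside
the properties configured earlier. The sketch below also includes
yarn.nodemanager.vmem-check-enabled, which is discussed next (the property names are
standard Hadoop settings; the values are the sample values from the table):

File: /opt/hadoop/etc/hadoop/yarn-site.xml

<property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>1536</value>
</property>

<property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>1536</value>
</property>

<property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>128</value>
</property>

<property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
</property>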
The yarn.nodemanager.vmem-check-enabled property disables virtual-memory checking,
which can prevent containers from being allocated properly on JDK 8 if left enabled. The
last three values from the table configure the ApplicationMaster and the map and reduce
tasks; add them to mapred-site.xml:
File: /opt/hadoop/etc/hadoop/mapred-site.xml
1 <property>
2 <name>yarn.app.mapreduce.am.resource.mb</name>
3 <value>512</value>
4 </property>
5
6 <property>
7 <name>mapreduce.map.memory.mb</name>
8 <value>256</value>
9 </property>
10
11 <property>
12 <name>mapreduce.reduce.memory.mb</name>
13 <value>256</value>
14 </property>
Format HDFS
HDFS needs to be formatted like any classical file system. On masterX, run the following
command:
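hdfs namenode -format

Run and Monitor HDFS
1. Start HDFS by running the following script from masterX (the start-dfs.sh and
stop-dfs.sh scripts ship with Hadoop in /opt/hadoop/sbin):
2. start-dfs.sh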
3. Check that every process is running with the jps command on each node. On
masterX, you should see the following (the PID number will be different):
4. 21922 Jps
5. 21603 NameNode
6. 21787 SecondaryNameNode
And on workerX1 and workerX2, you should see the following:
19728 DataNode
19819 Jps
7. To stop HDFS on master and worker nodes, run the following command from
masterX:
8. stop-dfs.sh
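Monitor HDFS
1. Get useful information about your HDFS cluster with the dfsadmin tool:
2. hdfs dfsadmin -report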
This will print information (e.g., capacity and usage) for all running DataNodes. To
get the description of all available commands, type:
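hdfs dfsadmin -help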
3. You can also use the friendlier web user interface. Point your browser
to http://masterX-IP:9870, where masterX-IP is the IP address of your masterX, and
you’ll get a user-friendly monitoring console.
1. Create a books directory in HDFS. The following command will create it in the home
directory, /user/dis/books:
2. hdfs dfs -mkdir books
3. Grab a few books from the Gutenberg project:
4. cd /opt/hadoop
5. wget -O alice.txt https://www.gutenberg.org/files/11/11-0.txt
6. wget -O holmes.txt https://www.gutenberg.org/files/1661/1661-0.txt
7. wget -O frankenstein.txt https://www.gutenberg.org/files/84/84-0.txt
8. Put the three books into HDFS, in the books directory:
9. hdfs dfs -put alice.txt holmes.txt frankenstein.txt books
10. List the contents of the books directory:
11. hdfs dfs -ls books
12. Copy one of the books to the local filesystem:
13. hdfs dfs -get books/alice.txt
14. You can also directly print the books from HDFS:
15. hdfs dfs -cat books/alice.txt
There are many commands to manage your HDFS. For a complete list, you can look at the
Apache HDFS shell documentation, or print help with:
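hdfs dfs -help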
Run YARN
HDFS is a distributed storage system, and doesn’t provide any services for running and
scheduling tasks in the cluster. This is the role of the YARN framework. The following
section is about starting, monitoring, and submitting jobs to YARN.
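1. Start YARN from masterX with the following script (its counterpart, stop-yarn.sh,
stops it):
2. start-yarn.sh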
Monitor YARN
1. The yarn command provides utilities to manage your YARN cluster. You can also
print a report of running nodes with the command:
2. yarn node -list
To get all available parameters of the yarn command, see Apache YARN
documentation.
3. As with HDFS, YARN provides a friendlier web UI, started by default on port 8088
of the Resource Manager. Point your browser to http://masterX-IP:8088, where
masterX-IP is the IP address of your masterX, and browse the UI:
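Submit MapReduce Jobs to YARN
1. Jobs are submitted to YARN with the yarn jar command. As an example, run the
wordcount sample that ships with Hadoop against the books loaded into HDFS earlier.
The jar path below assumes the default layout of the hadoop-3.3.1 archive extracted to
/opt/hadoop:
2. yarn jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar wordcount "books/*" output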
The last argument, output, is the HDFS directory where the output of the job will be saved.
3. After the job is finished, you can get the result by querying HDFS with hdfs dfs -ls
output. If the job succeeded, the output will resemble:
4. Found 2 items
5. -rw-r--r-- 2 hadoop supergroup 0 2019-05-31 17:21 output/_SUCCESS
6. -rw-r--r-- 2 hadoop supergroup 789726 2019-05-31 17:21 output/part-r-00000
7. Print the result with:
8. hdfs dfs -cat output/part-r-00000 | less
Next Lab
• Install Spark on top of your YARN cluster.
More Information
You may wish to consult the following resources for additional information on this topic.
While these are provided in the hope that they will be useful, please note that we cannot
vouch for the accuracy or timeliness of externally hosted materials.