Hadoop 3
Cluster Setup
Purpose
Prerequisites
Installation
Configuration
Configuration Files
Site Configuration
Configuring the Environment of the Hadoop Daemons
Configuring the Hadoop Daemons
Memory monitoring
Slaves
Logging
Cluster Restartability
MapReduce
Hadoop Rack Awareness
Hadoop Startup
Hadoop Shutdown
Purpose
This document describes how to install, configure and manage non-trivial Hadoop clusters ranging from a few nodes to extremely large
clusters with thousands of nodes.
To play with Hadoop, you may first want to install Hadoop on a single machine (see Single Node Setup).
Prerequisites
1. Make sure all required software is installed on all nodes in your cluster.
2. Download the Hadoop software.
Installation
Installing a Hadoop cluster typically involves unpacking the software on all the machines in the cluster.
Typically one machine in the cluster is designated as the NameNode and another machine as the JobTracker, exclusively. These are
the masters. The rest of the machines in the cluster act as both DataNode and TaskTracker. These are the slaves.
The root of the distribution is referred to as HADOOP_HOME. All machines in the cluster usually have the same HADOOP_HOME path.
Configuration
The following sections describe how to configure a Hadoop cluster.
Configuration Files
Hadoop configuration is driven by two types of important configuration files: read-only default configuration (core-default.xml, hdfs-default.xml and mapred-default.xml) and site-specific configuration (conf/core-site.xml, conf/hdfs-site.xml and conf/mapred-site.xml).
To learn more about how the Hadoop framework is controlled by these configuration files, see the Hadoop configuration documentation.
Additionally, you can control the Hadoop scripts found in the bin/ directory of the distribution, by setting site-specific values via the
conf/hadoop-env.sh.
Site Configuration
To configure the Hadoop cluster you will need to configure the environment in which the Hadoop daemons execute as well as the
configuration parameters for the Hadoop daemons.
At the very least you should specify the JAVA_HOME so that it is correctly defined on each remote node.
In most cases you should also specify HADOOP_PID_DIR to point to a directory that can only be written to by the users that are going to
run the hadoop daemons. Otherwise there is the potential for a symlink attack.
Administrators can configure individual daemons using the configuration options HADOOP_*_OPTS (for example, HADOOP_NAMENODE_OPTS for the NameNode). Other useful configuration options include:
HADOOP_LOG_DIR - The directory where the daemons' log files are stored. Log files are automatically created if they don't exist.
HADOOP_HEAPSIZE - The maximum heap size to use, in MB, e.g. 1000. This is used to configure the heap size for the Hadoop daemons; by default, the value is 1000 MB.
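As a minimal sketch, the environment settings described above can be collected in conf/hadoop-env.sh; the paths and values below are assumptions and should be adapted to your installation:
# conf/hadoop-env.sh (illustrative values)
# JAVA_HOME must point to the JDK installation on every remote node (assumed path).
export JAVA_HOME=/usr/lib/jvm/java-6-sun
# Directory for daemon pid files; writable only by the users running the daemons (assumed path).
export HADOOP_PID_DIR=/var/hadoop/pids
# Directory where the daemons' log files are stored (assumed path).
export HADOOP_LOG_DIR=/var/log/hadoop
# Maximum heap size for the Hadoop daemons, in MB.
export HADOOP_HEAPSIZE=2000
# Extra JVM options for an individual daemon.
export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC ${HADOOP_NAMENODE_OPTS}"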
conf/hdfs-site.xml:
dfs.name.dir - Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently. If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.
dfs.data.dir - Comma-separated list of paths on the local filesystem of a DataNode where it should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices.
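For illustration, a conf/hdfs-site.xml carrying these two properties might look as follows; the directory paths are assumptions, not recommendations:
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <!-- Two directories, so the name table is replicated for redundancy (assumed paths). -->
    <value>/srv/hadoop/name1,/srv/hadoop/name2</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <!-- One directory per physical disk on the DataNode (assumed paths). -->
    <value>/disk1/hadoop/data,/disk2/hadoop/data</value>
  </property>
</configuration>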
conf/mapred-site.xml:
conf/mapred-queue-acls.xml
Some non-default configuration values used to run sort900, that is 9TB of data sorted on a cluster with 900 nodes:
Task Controllers
Task controllers are classes in the Hadoop MapReduce framework that define how users' map and reduce tasks are launched and
controlled. They can be used in clusters that require some customization in the process of launching or controlling the user tasks. For
example, in some clusters, there may be a requirement to run tasks as the user who submitted the job, instead of as the task tracker user,
which is how tasks are launched by default. This section describes how to configure and use task controllers.
In order to use the LinuxTaskController, a setuid executable should be built and deployed on the compute nodes. The executable is
named task-controller. To build the executable, execute ant task-controller -Dhadoop.conf.dir=/path/to/conf/dir. The path passed in
-Dhadoop.conf.dir should be the path on the cluster nodes where a configuration file for the setuid executable would be located. The
executable would be built to build.dir/dist.dir/bin and should be installed to $HADOOP_HOME/bin.
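As a sketch, building and installing the executable might look like the following; /etc/hadoop/conf is an assumed configuration-directory path and <build.dir>/<dist.dir> stands for the build output directory mentioned above:
# Build the setuid executable against the configuration directory that will exist on the cluster nodes
$ ant task-controller -Dhadoop.conf.dir=/etc/hadoop/conf
# Install the resulting binary into HADOOP_HOME/bin
$ cp <build.dir>/<dist.dir>/bin/task-controller ${HADOOP_HOME}/bin/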
The executable must have specific permissions, as follows. The executable should have 6050 or --Sr-s--- permissions, be user-owned by
root (super-user) and group-owned by a special group of which the TaskTracker's user is a member and no job submitter is. If
any job submitter belongs to this special group, security will be compromised. This special group name should be specified for the
configuration property "mapreduce.tasktracker.group" in both mapred-site.xml and task-controller.cfg. For example, let's say that the
TaskTracker is run as user mapred, who is part of the groups users and specialGroup, either of which may be its primary group. Suppose also that
users has both mapred and another user (job submitter) X as its members, and X does not belong to specialGroup. Going by the above
description, the setuid/setgid executable should be set 6050 or --Sr-s--- with user-owner as root and group-owner as specialGroup, which
has mapred as its member (and not users, which has X as a member besides mapred).
The LinuxTaskController also requires that the paths leading up to, and including, the directories specified in mapred.local.dir and hadoop.log.dir
be set to 755 permissions.
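A sketch of the corresponding commands, using the group name from the example above; the local and log directories are assumed locations:
# Ownership and setuid/setgid bits on the task-controller binary
$ chown root:specialGroup ${HADOOP_HOME}/bin/task-controller
$ chmod 6050 ${HADOOP_HOME}/bin/task-controller
# 755 permissions on the directories named in mapred.local.dir and hadoop.log.dir (and on the paths leading up to them)
$ chmod 755 /srv/hadoop /srv/hadoop/mapred-local /var/log/hadoop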
task-controller.cfg
The executable requires a configuration file called taskcontroller.cfg to be present in the configuration directory passed to the ant target
mentioned above. If the binary was not built with a specific conf directory, the path defaults to /path-to-binary/../conf. The configuration
file must be owned by root, group-owned by anyone and should have the permissions 0400 or r--------.
The executable requires the following configuration items to be present in the taskcontroller.cfg file. The items should be mentioned as simple
key=value pairs.
hadoop.log.dir - Path to the hadoop log directory. Should be the same as the value with which the TaskTracker is started. This is required to set proper permissions on the log files so that they can be written to by the user's tasks and read by the TaskTracker for serving on the web UI.
mapreduce.tasktracker.group - Group to which the TaskTracker belongs. The group owner of the task-controller binary should be this group. Should be the same as the value with which the TaskTracker is configured. This configuration is required for validating the secure access of the task-controller binary.
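A minimal sketch of a taskcontroller.cfg, reusing the log directory and group name from the examples above:
# taskcontroller.cfg (illustrative values)
# Must match the log directory the TaskTracker is started with (assumed path)
hadoop.log.dir=/var/log/hadoop
# Must match mapreduce.tasktracker.group in mapred-site.xml
mapreduce.tasktracker.group=specialGroup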
The following properties, set in conf/mapred-site.xml, configure the node health-check script that the TaskTracker periodically runs:
mapred.healthChecker.script.path - Absolute path to the script which is periodically run by the TaskTracker to determine if the node is healthy or not. The file should be executable by the TaskTracker. If the value of this key is empty or the file does not exist or is not executable, node health monitoring is not started.
mapred.healthChecker.interval - Frequency at which the node health script is run, in milliseconds.
mapred.healthChecker.script.timeout - Time after which the node health script will be killed by the TaskTracker if unresponsive. The node is marked unhealthy if the node health script times out.
mapred.healthChecker.script.args - Extra arguments that can be passed to the node health script when launched. These should be a comma-separated list of arguments.
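A sketch of these properties in conf/mapred-site.xml; the script path and the interval/timeout values are assumptions:
<configuration>
  <property>
    <name>mapred.healthChecker.script.path</name>
    <!-- Health-check script; must exist and be executable by the TaskTracker (assumed path). -->
    <value>/etc/hadoop/conf/node-health.sh</value>
  </property>
  <property>
    <name>mapred.healthChecker.interval</name>
    <!-- Run the script every 60 seconds (value in milliseconds). -->
    <value>60000</value>
  </property>
  <property>
    <name>mapred.healthChecker.script.timeout</name>
    <!-- Kill the script and mark the node unhealthy if it is unresponsive for 10 minutes (milliseconds). -->
    <value>600000</value>
  </property>
</configuration>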
Memory monitoring
A TaskTracker(TT) can be configured to monitor memory usage of tasks it spawns, so that badly-behaved jobs do not bring down a
machine due to excess memory consumption. With monitoring enabled, every task is assigned a task-limit for virtual memory (VMEM).
In addition, every node is assigned a node-limit for VMEM usage. A TT ensures that a task is killed if it, and its descendants, use VMEM
over the task's per-task limit. It also ensures that one or more tasks are killed if the sum total of VMEM usage by all tasks, and their
descendants, crosses the node-limit.
Users can, optionally, specify the VMEM task-limit per job. If no such limit is provided, a default limit is used. A node-limit can be set per
node.
Currently, memory monitoring and management are only supported on the Linux platform.
To enable monitoring for a TT, the following parameters all need to be set:
1. If one or more of the configuration parameters described above are missing or -1 is specified, memory monitoring is disabled for
the TT.
2. Periodically, the TT checks the following:
If any task's current VMEM usage is greater than that task's VMEM task-limit, the task is killed and the reason for killing the task
is logged in the task diagnostics. Such a task is considered failed, i.e., the killing counts towards the task's failure count.
If the sum total of VMEM used by all tasks and descendants is greater than the node-limit, the TT kills enough tasks, in the
order of least progress made, till the overall VMEM usage falls below the node-limit. Such killed tasks are not considered
failed and their killing does not count towards the tasks' failure counts.
Schedulers can choose to ease the monitoring pressure on the TT by preventing too many tasks from running on a node and by
scheduling tasks only if the TT has enough VMEM free. In addition, Schedulers may choose to consider the physical memory (RAM)
available on the node as well. To enable Scheduler support, TTs report their memory settings to the JobTracker in every heartbeat.
Slaves
Typically you choose one machine in the cluster to act as the NameNode and one machine to act as the JobTracker, exclusively.
The rest of the machines act as both a DataNode and TaskTracker and are referred to as slaves.
List all slave hostnames or IP addresses in your conf/slaves file, one per line.
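For example, a conf/slaves file for a small cluster might look like this (the hostnames are assumptions):
slave01.example.com
slave02.example.com
slave03.example.com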
Logging
Hadoop uses Apache log4j via the Apache Commons Logging framework for logging. Edit the conf/log4j.properties file to
customize the Hadoop daemons' logging configuration (log formats and so on). Edit the conf/task-log4j.properties file to customize the
logging configuration for MapReduce tasks.
History Logging
The job history files are stored in the central location hadoop.job.history.location, which can also be on DFS, and whose default
value is ${HADOOP_LOG_DIR}/history. The history web UI is accessible from the JobTracker web UI.
The history files are also logged to the user-specified directory hadoop.job.history.user.location, which defaults to the job output
directory. The files are stored in "_logs/history/" in the specified directory. Hence, by default, they will be in "mapred.output.dir/_logs
/history/". Users can stop this logging by giving the value none for hadoop.job.history.user.location.
Users can view a summary of the history logs in a specified directory using the following command:
$ bin/hadoop job -history output-dir
This command will print job details, and failed and killed tip details.
More details about the job, such as successful tasks and the task attempts made for each task, can be viewed using the following command:
$ bin/hadoop job -history all output-dir
Once all the necessary configuration is complete, distribute the files to the HADOOP_CONF_DIR directory on all the machines, typically
${HADOOP_HOME}/conf.
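One way to do this is to push the configuration directory to every node listed in conf/slaves, for example with rsync; this is only a sketch and assumes passwordless ssh to the slaves:
$ for host in $(cat ${HADOOP_CONF_DIR}/slaves); do rsync -az ${HADOOP_CONF_DIR}/ ${host}:${HADOOP_HOME}/conf/; done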
Cluster Restartability
MapReduce
The JobTracker can recover running jobs on restart if mapred.jobtracker.restart.recover is set to true and JobHistory logging is
enabled. Also, the mapred.jobtracker.job.history.block.size value should be set to an optimal value to dump job history to
disk as soon as possible; the typical value is 3145728 (3MB).
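A sketch of the corresponding conf/mapred-site.xml entries:
<configuration>
  <property>
    <name>mapred.jobtracker.restart.recover</name>
    <!-- Recover running jobs when the JobTracker restarts. -->
    <value>true</value>
  </property>
  <property>
    <name>mapred.jobtracker.job.history.block.size</name>
    <!-- Flush job history to disk in 3 MB blocks, as suggested above. -->
    <value>3145728</value>
  </property>
</configuration>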
Hadoop Rack Awareness
The NameNode and the JobTracker obtain the rack id of the slaves in the cluster by invoking an API resolve in an administrator-configured
module. The API resolves the slave's DNS name (or IP address) to a rack id. Which module to use can be configured using
the configuration item topology.node.switch.mapping.impl. The default implementation of the same runs a script/command
configured using topology.script.file.name. If topology.script.file.name is not set, the rack id /default-rack is returned
for any passed IP address. The additional configuration in the Map/Reduce part is mapred.cache.task.levels, which determines
the number of levels (in the network topology) of caches. So, for example, if it is the default value of 2, two levels of caches will be
constructed - one for hosts (host -> task mapping) and another for racks (rack -> task mapping).
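As an illustration, the script referenced by topology.script.file.name receives one or more DNS names or IP addresses as arguments and prints one rack id per argument. A minimal sketch, assuming a hypothetical lookup file /etc/hadoop/conf/topology.data containing "host rack" pairs, one per line:
#!/bin/bash
# Hypothetical topology script: maps each argument to a rack id using a lookup file.
MAP=/etc/hadoop/conf/topology.data
for node in "$@"; do
  rack=$(awk -v n="$node" '$1 == n { print $2 }' "$MAP")
  # Fall back to the default rack when the node is not listed.
  echo "${rack:-/default-rack}"
done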
Hadoop Startup
To start a Hadoop cluster you will need to start both the HDFS and Map/Reduce cluster.
Start the HDFS with the following command, run on the designated NameNode:
$ bin/start-dfs.sh
The bin/start-dfs.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and starts the
DataNode daemon on all the listed slaves.
Start Map-Reduce with the following command, run on the designated JobTracker:
$ bin/start-mapred.sh
The bin/start-mapred.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the JobTracker and starts the
TaskTracker daemon on all the listed slaves.
Hadoop Shutdown
Stop HDFS with the following command, run on the designated NameNode:
$ bin/stop-dfs.sh
The bin/stop-dfs.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and stops the DataNode
daemon on all the listed slaves.
Stop Map/Reduce with the following command, run on the designated JobTracker:
$ bin/stop-mapred.sh
The bin/stop-mapred.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the JobTracker and stops the
TaskTracker daemon on all the listed slaves.