Monitoring Hadoop Using Ambari
Hortonworks Data Platform (Jul 15, 2014)
The Hortonworks Data Platform, powered by Apache Hadoop, is a massively scalable and 100% open
source platform for storing, processing and analyzing large volumes of data. It is designed to deal with
data from many sources and formats in a very quick, easy and cost-effective manner. The Hortonworks
Data Platform consists of the essential set of Apache Hadoop projects including MapReduce, Hadoop
Distributed File System (HDFS), HCatalog, Pig, Hive, HBase, Zookeeper and Ambari. Hortonworks is the
major contributor of code and patches to many of these projects. These projects have been integrated and
tested as part of the Hortonworks Data Platform release process and installation and configuration tools
have also been included.
Unlike other providers of platforms built using Apache Hadoop, Hortonworks contributes 100% of our
code back to the Apache Software Foundation. The Hortonworks Data Platform is Apache-licensed and
completely open source. We sell only expert technical support, training and partner-enablement services.
All of our technology is, and will remain, free and open source.
Please visit the Hortonworks Data Platform page for more information on Hortonworks technology. For
more information on Hortonworks services, please visit either the Support or Training page. Feel free to
Contact Us directly to discuss your specific needs.
Except where otherwise noted, this document is licensed under
Creative Commons Attribution ShareAlike 3.0 License.
http://creativecommons.org/licenses/by-sa/3.0/legalcode
Table of Contents
1. Introducing Ambari Web
  1.1. Architecture
    1.1.1. Sessions
  1.2. Starting and Accessing Ambari Web
2. Monitoring and Managing HDP Clusters Using Ambari Web
  2.1. Viewing Metrics on the Dashboard
    2.1.1. Scanning System Metrics
    2.1.2. Viewing Heatmaps
    2.1.3. Scanning Services Status
  2.2. Monitoring and Managing Services
    2.2.1. Starting and Stopping All Services
    2.2.2. Selecting a Service
    2.2.3. Viewing Summary, Alert, and Health Information
    2.2.4. Configuring Services
  2.3. Managing Hosts
    2.3.1. Working with Hosts
    2.3.2. Determining Host Status
    2.3.3. Filtering the Hosts List
    2.3.4. Performing Host-Level Actions
    2.3.5. Viewing Components on a Host
    2.3.6. Decommissioning Masters and Slaves
    2.3.7. Deleting a Host from a Cluster
    2.3.8. Setting Maintenance Mode
    2.3.9. Adding Hosts to a Cluster
  2.4. Administering Ambari
    2.4.1. Managing Ambari Web Users
    2.4.2. Enabling High Availability of HDP Components
    2.4.3. Enabling Kerberos Security
    2.4.4. Checking Stack and Component Versions
    2.4.5. Managing Stack Repositories
    2.4.6. Checking Service User Accounts and Groups
    2.4.7. Accessing Jobs Monitoring Information
3. Using Nagios With Hadoop
  3.1. Basic Nagios Architecture
  3.2. Installing Nagios
  3.3. Configuration File Locations
  3.4. Configuring Nagios Alerts For Hadoop Services
  3.5. Nagios Alerts For Hadoop Services
    3.5.1. HDFS Service Alerts
    3.5.2. NameNode HA Alerts (Hadoop 2 only)
    3.5.3. YARN Alerts (Hadoop 2 only)
    3.5.4. MapReduce2 Alerts (Hadoop 2 only)
    3.5.5. MapReduce Service Alerts (Hadoop 1 only)
    3.5.6. HBase Service Alerts
    3.5.7. Hive Alerts
    3.5.8. WebHCat Alerts
    3.5.9. Oozie Alerts
    3.5.10. Ganglia Alerts
List of Figures
1.1. Architectural Overview
List of Tables
2.1. Ambari Service Metrics and Descriptions
2.2. Ambari Cluster-Wide Metrics and Descriptions
2.3. Links to More Metrics for HDP Services
2.4. Service Status
2.5. Host Roles Required for Added Services
2.6. Validation Rules for Rolling Restart Parameters
1. Introducing Ambari Web
Note
At this time, Ambari Web is supported only in deployments made using the
Ambari Install Wizard.
1.1. Architecture
The Ambari Server serves as the collection point for data from across your cluster. Each
host has a copy of the Ambari Agent - either installed automatically by the Install wizard or
manually - which allows the Ambari Server to control each host. In addition, each host has
a copy of Ganglia Monitor (gmond), which collects metric information that is passed to the
Ganglia Connector, and then on to the Ambari Server.
Figure 1.1. Architectural Overview
1.1.1. Sessions
Ambari Web is a client-side JavaScript application, which calls the Ambari REST API
(accessible from the Ambari Server) to access cluster information and perform cluster
operations. After authenticating to Ambari Web, the application authenticates to the Ambari Server, and communication between the browser and server occurs asynchronously via the REST API.
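Because Ambari Web is itself just a REST client, you can exercise the same API from a script. The sketch below, which uses only the Python standard library, authenticates with HTTP Basic auth and lists the clusters the Ambari Server manages. The server host name and the default admin/admin credentials are placeholders for your own values, and the response layout should be verified against your Ambari version.

    # Minimal sketch: call the Ambari REST API directly (placeholder host and
    # credentials; verify the endpoint against your Ambari version).
    import base64
    import json
    import urllib.request

    AMBARI_URL = "http://your.ambari.server:8080"             # placeholder host
    AUTH = base64.b64encode(b"admin:admin").decode("ascii")   # default credentials

    def ambari_get(path):
        """Issue an authenticated GET against the Ambari REST API, return parsed JSON."""
        request = urllib.request.Request(
            AMBARI_URL + path,
            headers={"Authorization": "Basic " + AUTH},
        )
        with urllib.request.urlopen(request) as response:
            return json.loads(response.read().decode("utf-8"))

    # List the clusters managed by this Ambari Server.
    for item in ambari_get("/api/v1/clusters")["items"]:
        print(item["Clusters"]["cluster_name"])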
Note
Ambari Web sessions do not time out, since the application is constantly accessing the REST API, which resets the session timeout. As well, if there is a
1.2. Starting and Accessing Ambari Web
To access Ambari Web, open a supported browser and enter the Ambari Web URL:
http://{your.ambari.server}:8080
Enter your user name and password. If this is the first time Ambari Web is accessed, use the
default values, admin/admin. These values can be changed, and new users provisioned,
using the Admin view in Ambari Web itself.
2. Monitoring and Managing HDP Clusters Using Ambari Web
• Managing Hosts
• Monitoring Jobs
• Administering Ambari
• Viewing Heatmaps
Note
Metrics data for Storm is buffered and sent as a batch to Ambari every 5
minutes. After adding the Storm service, anticipate a five-minute delay for
Storm metrics to appear.
You can add and remove individual widgets, and rearrange the mashup by dragging and
dropping each widget to a new location in the mashup.
Status information appears as simple pie and bar charts, more complex charts showing usage and load, sets of links to additional data sources, and values for operating parameters such as uptime and average RPC queue wait times. Most widgets display a single fact by default. For example, HDFS Disk Usage displays a load chart and a percentage figure. The Ambari Dashboard includes metrics for the following services:
Metric: Description:
HBase Master Heap: The percentage of HBase Master JVM heap used.
HBase Ave Load: The average load on the HBase server.
HBase Master Uptime: The HBase Master uptime calculation.
Region in Transition: The number of HBase regions in transition.
Storm (HDP 2.1 Stack only):
Supervisors Live: The number of Supervisors live, as reported from the Nimbus server.
MapReduce (HDP 1.3 Stack only):
JobTracker Heap: The percentage of JobTracker JVM heap used.
TaskTrackers Live: The number of TaskTrackers live, as reported from the JobTracker.
More detailed information about the service displays, as shown in the following example:
• To edit the display of information in a widget, click the pencil icon. For more information
about editing a widget, see Customizing Metrics Display .
• Hover your cursor over each cluster-wide metric to magnify the chart or itemize the
widget display.
• To remove or add metric items from each cluster-wide metric widget, select the item on
the widget legend.
• To see a larger view of the chart, select the magnifying glass icon.
Ambari displays a larger version of the widget in a pop-out window, as shown in the
following example:
Use the pop-up window in the same ways that you use cluster-wide metric widgets on the
dashboard.
2. Choose Add.
4. Choose Apply.
2. Choose Edit.
2. Choose Edit.
2. Select the pencil-shaped edit icon that appears in the upper-right corner.
The Customize Widget pop-up window displays properties that you can edit, as shown in
the following example.
In this example, you can adjust the thresholds at which the HDFS Capacity bar chart
changes color, from green to orange to red.
Note
Not all widgets support editing.
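The thresholds you can edit here behave like a simple banding rule: values below the first threshold render green, values between the two thresholds render orange, and values above the second threshold render red. The sketch below only illustrates that banding logic; the 80 and 90 percent defaults are illustrative assumptions, not values taken from this guide.

    # Illustrative sketch of the two-threshold color banding used by editable
    # widgets such as HDFS Capacity; the 80/90 defaults are assumptions.
    def widget_color(value_percent, warning_threshold=80.0, critical_threshold=90.0):
        """Map a widget value to green, orange, or red using two thresholds."""
        if value_percent < warning_threshold:
            return "green"
        if value_percent < critical_threshold:
            return "orange"
        return "red"

    print(widget_color(72.0))   # green
    print(widget_color(85.0))   # orange
    print(widget_color(95.0))   # red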
Choose the More drop-down to select from the list of links available for each service. The
Ambari Dashboard includes More links to metrics for the following services:
2.1.2. Viewing Heatmaps
Heatmaps provide a graphical representation of your overall cluster utilization, using simple color coding.
A colored block represents each host in your cluster. To see more information about a
specific host, hover over the block representing the host in which you are interested. A pop-
up window displays metrics about HDP components installed on that host. Colors displayed
in the block represent usage in a unit appropriate for the selected set of metrics. If any data
necessary to determine state is not available, the block displays "Invalid Data". Changing
the default maximum values for the heatmap lets you fine tune the representation. Use the
Select Metric drop-down to select the metric type.
Metric: Uses:
Host/Disk Space Used %: disk.disk_free and disk.disk_total
Host/Memory Used %: memory.mem_free and memory.mem_total
Host/CPU Wait I/O %: cpu.cpu_wio
HDFS/Bytes Read: dfs.datanode.bytes_read
HDFS/Bytes Written: dfs.datanode.bytes_written
HDFS/Garbage Collection Time: jvm.gcTimeMillis
HDFS/JVM Heap Memory Used: jvm.memHeapUsedM
YARN/Garbage Collection Time: jvm.gcTimeMillis
YARN/JVM Heap Memory Used: jvm.memHeapUsedM
YARN/Memory Used %: UsedMemoryMB and AvailableMemoryMB
HBase/RegionServer read request count: hbase.regionserver.readRequestsCount
HBase/RegionServer write request count: hbase.regionserver.writeRequestsCount
HBase/RegionServer compaction queue size: hbase.regionserver.compactionQueueSize
HBase/RegionServer regions: hbase.regionserver.regions
HBase/RegionServer memstore sizes: hbase.regionserver.memstoreSizeMB
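The percentage metrics above are derived from pairs of raw Ganglia values, for example disk.disk_free and disk.disk_total. The arithmetic is straightforward; the sketch below is only an illustration of those formulas (it assumes you already have the raw values in hand), not Ambari's actual heatmap code.

    # Illustrative sketch of how the percentage heatmap metrics are derived
    # from free/total (or used/available) metric pairs.
    def used_percent(free, total):
        """Used %, from a free/total pair such as disk.disk_free and disk.disk_total."""
        if total <= 0:
            return None   # comparable to the heatmap's "Invalid Data" case
        return 100.0 * (total - free) / total

    def yarn_memory_used_percent(used_mb, available_mb):
        """YARN / Memory Used %, assumed to be UsedMemoryMB over the total."""
        total = used_mb + available_mb
        return None if total <= 0 else 100.0 * used_mb / total

    # Example: a host reporting 120 GB free out of 600 GB total disk space.
    print(used_percent(free=120.0, total=600.0))                      # 80.0
    print(yarn_memory_used_percent(used_mb=6144, available_mb=2048))  # 75.0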
2.1.3. Scanning Services Status
Table 2.4. Service Status
Color Name: Status:
Solid Green: All masters are running
Click the service name to open the Services screen, where you can see more detailed
information on each service.
All services installed in your cluster are listed in the leftmost Services panel.
2.2. Monitoring and Managing Services
• Selecting a Service
• Configuring Services
• Rolling Restarts
2.2.2. Selecting a Service
Selecting a service name from the list shows current summary, alert, and health information
for the selected service. To refresh the monitoring panels and show information about a
different service, select a different service name from the list.
Notice the colored dot next to each service name, indicating service operating status, and a small, red, numbered rectangle indicating any alerts generated for the service.
2.2.2.1. Adding a Service
The Ambari install wizard installs all available Hadoop services by default. You may choose
to deploy only some services initially, then add other services at later times. For example,
many customers deploy only core Hadoop services initially. Add Service supports deploying
additional services without interrupting operations in your Hadoop cluster. When you have
deployed all available services, Add Service displays disabled.
For example, if you are using HDP 2.1 Stack and did not install Falcon or Storm, you can use
the Add Service capability to add those services to your cluster.
Note
After installing Storm via Ambari, you should configure the Storm service to run
as a supervised service.
To add a service, select Actions -> Add Service, then complete the following procedure
using the Add Service Wizard.
1. Choose Services.
Choose an available service. Alternatively, choose all to add all available services to your
cluster. Then, choose Next.
The Add Services wizard displays installed services highlighted green and marked
unavailable for selection.
2. In Assign Masters, confirm the default host assignment. Alternatively, choose a different
host machine to which master components for your selected service will be added. Then,
choose Next.
The Add Services Wizard indicates hosts on which the master components for a chosen
service will be installed. A service chosen for addition displays as:
• A green label located on the host to which its master components will be added, or
3. In Assign Slaves and Clients, accept the default assignment of slave and client
components to hosts. Then, choose Next.
Alternatively, select hosts to which you want to assign slave and client components.
You must select at least one host for the slave of each service being added.
The Add Service Wizard skips and disables the Assign Slaves and Clients step for a service that requires neither slave nor client assignment.
4. In Customize Services, accept the default configuration properties. Then, choose Next.
5. In Review, make sure the configuration settings match your intentions. Then, choose
Deploy.
6. Monitor the progress of installing, starting, and testing the service. When the service
installs and starts successfully, choose Next.
8. Restart the Nagios service and any other components having stale configurations.
Important
If you do not restart Nagios service after completing the Add Service Wizard,
alerts and notifications may not work properly.
Select a View Host link, as shown in the following example, to view components and the
host on which the selected service is running.
2.2.4. Configuring Services
Select a service, then select Configs to view and update configuration properties for the
selected service. For example, select MapReduce2, then select Configs. Expand a config
category to view configurable service properties.
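You can also read the configuration that the Configs tab displays through the REST API, which is convenient for auditing a change before and after you save it. The following sketch reads the currently desired mapred-site configuration. The server address, credentials, cluster name, and config type are placeholders, and the two-step desired_configs/tag lookup is an assumption to verify against your Ambari version.

    # Sketch: read the currently desired configuration for one config type.
    import base64
    import json
    import urllib.request

    AMBARI_URL = "http://your.ambari.server:8080"             # placeholder host
    AUTH = base64.b64encode(b"admin:admin").decode("ascii")   # placeholder credentials
    CLUSTER = "MyCluster"                                     # placeholder cluster name

    def ambari_get(path):
        request = urllib.request.Request(
            AMBARI_URL + path, headers={"Authorization": "Basic " + AUTH})
        with urllib.request.urlopen(request) as response:
            return json.loads(response.read().decode("utf-8"))

    # 1. Find the tag of the active (desired) mapred-site configuration version.
    desired = ambari_get("/api/v1/clusters/%s?fields=Clusters/desired_configs" % CLUSTER)
    tag = desired["Clusters"]["desired_configs"]["mapred-site"]["tag"]

    # 2. Fetch that configuration version and print its properties.
    config = ambari_get("/api/v1/clusters/%s/configurations?type=mapred-site&tag=%s"
                        % (CLUSTER, tag))["items"][0]["properties"]
    for name, value in sorted(config.items()):
        print(name, "=", value)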
2. Edit values for one or more properties that have the Override option.
3. Choose Save.
4. Select Components and choose from available Hosts to add hosts to the new group.
The Select Configuration Group Hosts control enforces host membership in each group, based on installed components for the selected service.
5. Choose OK.
2. Select a Config Group, then expand components to expose settings that allow Override.
a. Select an existing configuration group (to which the property value override provided
in step 3 will apply), or
b. Create a new configuration group (which will include default properties, plus the
property override provided in step 3).
2.2.4.4. Restarting components
After editing and saving a service configuration, Restart indicates components that you
must restart.
Select the Components or Hosts links to view details about components or hosts requiring a
restart.
Then, choose an option appearing in Restart. For example, options to restart YARN
components include:
Optionally, choose Turn On Maintenance Mode to suppress alerts about this service before
performing a service action. Maintenance Mode suppresses alerts and status indicator
changes generated by the service, while allowing you to start, stop, restart, move, or
perform maintenance tasks on the service. For more information about how Maintenance
Mode affects bulk operations for host components, see Maintenance Mode.
2.2.4.6. Rolling Restarts
When you restart multiple services, components, or hosts, use rolling restarts to distribute
the task, minimizing cluster downtime and service disruption. A rolling restart stops,
1. Select a Service, then link to a list of specific components or hosts that Require Restart.
4. Optionally, reset the flag to only restart components with changed configurations.
If you trigger a rolling restart of components, Restart components with stale configs
defaults to true. If you trigger a rolling restart of services, Restart services with stale configs
defaults to false.
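The stale-configuration flag that drives these defaults is also exposed through the REST API, so you can list exactly which host components Ambari currently marks as needing a restart. A sketch follows; the server address, credentials, and cluster name are placeholders, and the HostRoles/stale_configs predicate should be treated as an assumption to verify against your Ambari version.

    # Sketch: list host components whose configuration is stale (restart required).
    import base64
    import json
    import urllib.request

    AMBARI_URL = "http://your.ambari.server:8080"             # placeholder host
    AUTH = base64.b64encode(b"admin:admin").decode("ascii")   # placeholder credentials
    CLUSTER = "MyCluster"                                     # placeholder cluster name

    url = (AMBARI_URL + "/api/v1/clusters/%s/host_components"
           "?HostRoles/stale_configs=true"
           "&fields=HostRoles/service_name,HostRoles/component_name,HostRoles/host_name"
           % CLUSTER)
    request = urllib.request.Request(url, headers={"Authorization": "Basic " + AUTH})
    with urllib.request.urlopen(request) as response:
        stale = json.loads(response.read().decode("utf-8"))

    for item in stale["items"]:
        role = item["HostRoles"]
        print(role["service_name"], role["component_name"], "on", role["host_name"])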
To abort future restart operations in the batch, choose Abort Rolling Restart.
Background Operations opens by default when you run a job that executes bulk
operations.
1. Select the right-arrow for each operation to show restart operation progress on each
host.
2. After restarts complete, select the right-arrow, or a host name, to view log files and any error messages generated on the selected host.
3. Select links at the upper-right to copy or open text files containing log and error
information.
Optionally, select the option to not show the bulk operations dialog.
information about using metrics widgets, see Scanning System Metrics. To see more metrics
information, select the link located at the upper right of the Metrics panel that opens the native Ganglia GUI.
2.3. Managing Hosts
Use Ambari Hosts to manage multiple HDP components such as DataNodes, NameNodes,
TaskTrackers and RegionServers, running on hosts throughout your cluster. For example,
you can restart all DataNode components, optionally controlling that task with rolling
restarts. Ambari Hosts supports filtering your selection of host components, based on
operating status, host health, and defined host groupings.
View individual hosts, listed by fully-qualified domain name, on the Hosts landing page.
• Red - At least one master component on that host is down. Hover to see a tooltip that
lists affected components.
• Orange - At least one slave component on that host is down. Hover to see a tooltip that
lists affected components.
• Yellow - Ambari Server has not received a heartbeat from that host for more than 3
minutes.
A red condition flag overrides an orange condition flag, which overrides a yellow condition
flag. In other words, a host having a master component down may also have other issues.
The following example shows three hosts, one having a master component down, one
having a slave component down, and one healthy. Warning indicators appear next to hosts
having a component down.
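The precedence just described, red over orange over yellow, can be summarized as a small ordering rule. The sketch below only illustrates that precedence; it is not Ambari code, and the green fall-through for a healthy host is an assumption.

    # Illustrative sketch of the host condition-flag precedence described above.
    def host_condition(master_down, slave_down, heartbeat_lost):
        """Red (master down) overrides orange (slave down), which overrides
        yellow (no heartbeat for more than 3 minutes)."""
        if master_down:
            return "red"
        if slave_down:
            return "orange"
        if heartbeat_lost:
            return "yellow"
        return "green"   # assumed healthy indicator

    print(host_condition(master_down=True, slave_down=True, heartbeat_lost=False))  # red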
For example, to limit the list of hosts appearing on Hosts home to only those with Healthy
status, select Filters, then choose the Healthy option. In this case, one host name appears
on Hosts home. Alternatively, to limit the list of hosts appearing on Hosts home to only
those having Maintenance Mode on, select Filters, then choose the Maintenance Mode
option. In this case, three host names appear on Hosts home.
Use the general filter tool to apply specific search and sort criteria that limits the list of
hosts appearing on the Hosts page.
Actions comprises three menus that list the following option types:
• Hosts - lists selected, filtered or all hosts options, based on your selections made using
Hosts home and Filters.
• Objects - lists component objects that match your host selection criteria.
• Operations - lists all operations available for the component objects you selected.
2. In Actions, choose Selected Hosts > DataNodes > Restart, as shown in the following
image.
Choose options in Host Actions, to start, stop, restart, delete, or turn on maintenance
mode for all components installed on the selected host.
Alternatively, choose action options from the drop-down menu next to an individual
component on a host. The drop-down menu shows current operation status for each
component. For example, you can decommission, restart, or stop the DataNode
component (started) for HDFS, by selecting one of the options shown in the following
example:
• DataNodes
• NodeManagers
• TaskTrackers
• RegionServers
• For DataNodes, safely replicates the HDFS data to other DataNodes in the cluster.
• For NodeManagers and TaskTrackers, stops accepting new job requests from the masters
and stops the component.
For example:
The UI shows "Decommissioning" status while steps process, then "Decommissioned" when
complete.
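While DataNodes decommission, the NameNode reports their progress through its JMX servlet, which gives you a way to watch the "Decommissioning" state outside the UI. In the sketch below, the NameNode host, the 50070 web port, and the DecomNodes attribute of the NameNodeInfo bean are assumptions to verify against your Hadoop version.

    # Sketch: ask the NameNode which DataNodes it currently sees as decommissioning.
    import json
    import urllib.request

    NAMENODE_JMX = "http://namenode.example.com:50070/jmx"   # placeholder host/port

    def decommissioning_datanodes():
        """Return the NameNode's view of DataNodes in the Decommissioning state."""
        url = NAMENODE_JMX + "?qry=Hadoop:service=NameNode,name=NameNodeInfo"
        with urllib.request.urlopen(url) as response:
            bean = json.loads(response.read().decode("utf-8"))["beans"][0]
        # DecomNodes is itself a JSON-encoded string keyed by DataNode host name.
        return json.loads(bean.get("DecomNodes", "{}"))

    for host, details in decommissioning_datanodes().items():
        print(host, details)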
Note
A decommissioned slave component may restart in the decommissioned
state.
Note
Restarting services enables Ambari to recognize and monitor the correct
number of components.
Deleting a slave component, such as a DataNode, does not automatically inform a master component, such as a NameNode, to remove the slave component from its exclusion list. Adding a deleted slave component back into the cluster presents the following issue: the added slave remains decommissioned from the master's perspective. Restart the master component as a workaround.
• Move from the host any master components, such as NameNode or ResourceManager,
running on the host.
3. Choose Delete.
If you have not completed prerequisite steps, a warning message similar to the following
one appears:
Maintenance Mode affects a service, component, or host object in the following two ways:
• Maintenance Mode suppresses alerts, warnings and status change indicators generated
for the object
Explicitly turning on Maintenance Mode for a service implicitly turns on Maintenance Mode for components and hosts that run the service. While Maintenance Mode On prevents bulk operations from being performed on the service, component, or host, you may explicitly start and stop a service, component, or host that has Maintenance Mode On.
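Maintenance Mode can also be toggled from a script against the REST API, which is useful when you automate maintenance windows. In the sketch below the server address, credentials, cluster, and service names are placeholders, and the request body follows the publicly documented Ambari API rather than anything defined in this guide.

    # Sketch: turn Maintenance Mode on (or off) for a service through the REST API.
    import base64
    import json
    import urllib.request

    AMBARI_URL = "http://your.ambari.server:8080"             # placeholder host
    AUTH = base64.b64encode(b"admin:admin").decode("ascii")   # placeholder credentials

    def set_service_maintenance(cluster, service, state):
        """PUT maintenance_state "ON" or "OFF" for a service (sketch only)."""
        body = json.dumps({
            "RequestInfo": {"context": "Set Maintenance Mode via REST"},
            "Body": {"ServiceInfo": {"maintenance_state": state}},
        }).encode("utf-8")
        request = urllib.request.Request(
            "%s/api/v1/clusters/%s/services/%s" % (AMBARI_URL, cluster, service),
            data=body,
            method="PUT",
            headers={"Authorization": "Basic " + AUTH,
                     "X-Requested-By": "ambari",
                     "Content-Type": "application/json"})
        with urllib.request.urlopen(request) as response:
            return response.status

    print(set_service_maintenance("MyCluster", "HDFS", "ON"))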
3. Choose OK to confirm.
Notice, on Services Summary that Maintenance Mode turns on for the NameNode and
SNameNode components.
3. Choose OK to confirm.
2.3.8.1.3. How to Turn On Maintenance Mode for a Host (alternative using filtering for
hosts)
1. Using Hosts, select c6403.ambari.apache.org.
2. In Actions -> Selected Hosts -> Hosts choose Turn On Maintenance Mode.
3. Choose OK to confirm.
Your list of Hosts now shows Maintenance Mode On for hosts c6401 and c6403.
• Hover your cursor over each Maintenance Mode icon appearing in the Hosts list.
• Notice that hosts c6401 and c6403 have Maintenance Mode On.
• Notice that on host c6401, Ganglia Monitor, HBase Master, HDFS client, NameNode, and ZooKeeper Server have Maintenance Mode turned On.
• Notice on host c6402, that HDFS client and Secondary NameNode have Maintenance
Mode On.
• DataNode is skipped from all Bulk Operations except Turn Maintenance Mode ON/OFF.
To achieve these goals, turn On Maintenance Mode explicitly for the host. Putting a host
in Maintenance Mode implicitly puts all components on that host in Maintenance Mode.
2. You want to test a service configuration change. You will stop, start, and restart the
service using a rolling restart to test whether restarting picks up the change.
You want:
To achieve these goals, turn on Maintenance Mode explicitly for the service. Putting a
service in Maintenance Mode implicitly turns on Maintenance Mode for all components
in the service.
You want:
• To ensure that no components start, stop, or restart due to host-level actions or bulk
operations.
To achieve these goals, turn On Maintenance Mode explicitly for the service. Putting a
service in Maintenance Mode implicitly turns on Maintenance Mode for all components
in the service.
• Prevent alerts generated by the component while you check its condition.
To achieve these goals, turn on Maintenance Mode explicitly for the host component.
Putting a host component in Maintenance Mode prevents host-level and service-level bulk
operations from starting or restarting the component. You can restart the component
explicitly while Maintenance Mode is on.
To review the Ambari Install Wizard procedure, see Installing, Configuring, and Deploying
the Cluster in the Ambari Installation Guide.
2.4. Administering Ambari
Use Admin options to manage the following administrative tasks:
To manage local users in Ambari Web, select Admin > Users in the left nav bar. You can
add Ambari Web users, grant a user administrator privileges, delete users, or change the
password for a user account. Ambari supports two user roles: User and Admin. A User can
view metrics, view service status and configuration, and browse job information. An Admin
can do all User tasks and the following ones: start and stop services, modify configurations,
and run service checks.
To change the password for a listed user, choose edit. Or, to add an Ambari Web user,
choose +Add Local User. In both cases, a variant of the user name pop-up appears.
To create a new Ambari Web user account, enter a unique Username and Password. Select
the Admin checkbox to add administrator privileges to the user account. Choose Create to
complete the user account creation.
To edit an existing Ambari Web user account, enter the old and new Passwords. Make
your changes, then choose Save.
To delete an existing Ambari Web user account, choose delete, then choose Yes to confirm
the delete action.
1. Set up Kerberos for your cluster. For more information on setting up Kerberos, see
Setting Up Kerberos for Use with Ambari.
a. Get Started: Read the overview of the procedure necessary to set up Kerberos,
supported by the Enable Security Wizard.
c. Create Principals and Keytabs: Use this step to check that all your information is
correct. Click Back to make any changes. Click Apply when you are satisfied with the
assignments.
Note
If you have a large cluster, you may want to go to the Create Principals
and Keytabs step first. Step through the wizard accepting the defaults
to get to the appropriate page. On the page, use the Download CSV
button to get a list of all the necessary principals and keytabs in CSV
form, which can be used to set up a script. The list includes hostname,
principal description, principal name, keytab user, keytab group, keytab
permissions, absolute keytab path, and keytab file name (a parsing sketch follows this procedure).
d. Save and Apply Configuration: This step displays a bar showing the progress of
integrating the Kerberos information into your Ambari Server.
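If you download the principals-and-keytabs CSV mentioned in the Create Principals and Keytabs step, a short script can turn it into a per-host checklist for your keytab distribution scripts. The sketch below assumes the columns appear in the order listed in the Note above and that the export has no header row; both are assumptions to check against the actual file.

    # Sketch: group the downloaded principals/keytabs CSV by host.
    import csv

    COLUMNS = ["host", "principal_description", "principal_name", "keytab_user",
               "keytab_group", "keytab_permissions", "keytab_path", "keytab_file"]

    def principals_by_host(csv_path):
        """Return {host: [row dicts]} from the Download CSV export (column order
        assumed to match the list in the Note above)."""
        hosts = {}
        with open(csv_path, newline="") as handle:
            for row in csv.reader(handle):
                if len(row) < len(COLUMNS):
                    continue   # skip blank or malformed lines
                record = dict(zip(COLUMNS, row))
                hosts.setdefault(record["host"], []).append(record)
        return hosts

    for host, entries in sorted(principals_by_host("principals.csv").items()):
        print(host, "needs", len(entries), "principals/keytabs")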
2. Make changes to the path shown in Base URL. For example, change the URL for redhat6
to http://public-repo-1.hortonworks.com/HDP/centos6/2.x/updates/2.1.2.0, as shown in
the following example:
3. Choose Save.
Ambari performs a validation check for each Base URL. The check confirms that Ambari can reach the URL and that the URL points to a valid software repository, that is, one where the associated repository metadata can be found.
4. If the validation check succeeds, the Base URLs are saved. If validation fails, a validation
error message displays. Choose one of the following options:
To manage access that your non-administrator users have to Jobs information, choose
Access.
3. Choose Save.
3. Using Nagios With Hadoop
• Detecting and repairing problems, and mitigating future issues, before they affect end-
users and customers
• Leveraging Nagios’ event monitoring capabilities to receive alerts for potential problem
areas
• Analyzing specific trends; for example: what is the CPU usage for a particular Hadoop
service weekdays between 2 p.m. and 5 p.m.
Note
Nagios is an optional component of Ambari. During cluster install, you can choose to install and configure Nagios. When selected, Ambari provides an out-of-the-box set of Nagios plug-ins specially designed for monitoring important aspects of your Hadoop cluster, based on your Stack selection.
• Host and System Information: Ambari monitors basic host and system information such
as CPU utilization, disk I/O bandwidth and operations per second, average memory and
swap space utilization, and average network latency.
• Service Information: Ambari monitors the health and performance status of each service
by presenting information generated by that service. Because services that run in master/
slave configurations (HDFS, MapReduce, and HBase) are fault tolerant in regard to
service slaves, master information is presented individually, whereas slave information is
presented largely in aggregate.
• OK
• Warning
• Critical
The thresholds for these alerts can be tuned using configuration files, and new alerts
can be added. For more details on Nagios architecture, see the Nagios Overview at the
nagios.org web site.
3.2. Installing Nagios
The Ambari Installation Wizard gives you the option of installing and configuring Nagios,
including the out-of-the-box plug-ins for Hadoop-specific alerts. The Nagios server, Nagios
plug-ins, and the web-based user interface are installed on the Nagios server host, as
specified during the installation procedure.
By default, the Nagios server runs as a user named “nagios” which is in a group also named
“nagios”. The user and group can be customized during the Ambari Cluster Install (Cluster
Install Wizard > Customize Services > Misc). Once Nagios is installed, use Ambari Web to
start and stop the Nagios server.
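Starting and stopping an Ambari-managed service such as Nagios can also be scripted against the REST API. In the sketch below the server address, credentials, and cluster name are placeholders; the state names come from the public Ambari API, where INSTALLED means stopped and STARTED means running, and should be verified against your Ambari version.

    # Sketch: stop or start an Ambari-managed service (for example NAGIOS) via REST.
    import base64
    import json
    import urllib.request

    AMBARI_URL = "http://your.ambari.server:8080"             # placeholder host
    AUTH = base64.b64encode(b"admin:admin").decode("ascii")   # placeholder credentials

    def set_service_state(cluster, service, state):
        """PUT the desired service state: "INSTALLED" (stopped) or "STARTED"."""
        body = json.dumps({
            "RequestInfo": {"context": "Change %s state via REST" % service},
            "Body": {"ServiceInfo": {"state": state}},
        }).encode("utf-8")
        request = urllib.request.Request(
            "%s/api/v1/clusters/%s/services/%s" % (AMBARI_URL, cluster, service),
            data=body,
            method="PUT",
            headers={"Authorization": "Basic " + AUTH,
                     "X-Requested-By": "ambari",
                     "Content-Type": "application/json"})
        with urllib.request.urlopen(request) as response:
            return response.status

    set_service_state("MyCluster", "NAGIOS", "INSTALLED")   # stop the Nagios server
    set_service_state("MyCluster", "NAGIOS", "STARTED")     # start it again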
Maximum number of check attempts: The maximum number of retry attempts. Usually, when the state of a service changes, this change is considered
3.5. Nagios Alerts For Hadoop Services
• Host-level Alerts
These alerts are related to a specific host and specific component running on that host.
These alerts check a component and system-level metrics to determine health of the host.
• Service-level Alerts
These alerts are related to a Hadoop Service and do not pertain to a specific host. These
alerts check one or more components of a service as well as system-level metrics to
determine overall health of a Hadoop Service.
3.5.1. HDFS Service Alerts
3.5.1.1. Blocks health
This service-level alert is triggered if the number of corrupt or missing blocks exceeds the
configured critical threshold. This alert uses the check_hdfs_blocks plug-in.
3.5.1.1.1. Potential causes
• Some DataNodes are down and the replicas that are missing blocks are only on those
DataNodes
• The corrupt/missing blocks are from files with a replication factor of 1. New replicas
cannot be created because the only replica of the block is missing
3.5.1.1.2. Possible remedies
• Identify the files associated with the missing or corrupt blocks by running the Hadoop
fsck command
• Delete the corrupt files and recover them from backup, if it exists
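The block counters this alert watches are also exposed by the NameNode's JMX servlet, which is handy for confirming an alert by hand before running hadoop fsck. In the sketch below, the NameNode host, the 50070 web port, and the CorruptBlocks/MissingBlocks attribute names on the FSNamesystem bean are assumptions to verify against your Hadoop version.

    # Sketch: read corrupt/missing block counters straight from the NameNode JMX servlet.
    import json
    import urllib.request

    NAMENODE_JMX = "http://namenode.example.com:50070/jmx"   # placeholder host/port

    def block_counters():
        """Return (corrupt, missing) block counts from the FSNamesystem bean."""
        url = NAMENODE_JMX + "?qry=Hadoop:service=NameNode,name=FSNamesystem"
        with urllib.request.urlopen(url) as response:
            bean = json.loads(response.read().decode("utf-8"))["beans"][0]
        return bean.get("CorruptBlocks", 0), bean.get("MissingBlocks", 0)

    corrupt, missing = block_counters()
    print("corrupt blocks:", corrupt, "missing blocks:", missing)
    if corrupt or missing:
        print("run 'hadoop fsck /' to identify the affected files")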
3.5.1.2. NameNode process
This host-level alert is triggered if the NameNode process cannot be confirmed to be up and
listening on the network for the configured critical threshold, given in seconds. It uses the Nagios check_tcp plug-in.
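A check_tcp-style probe is essentially a timed TCP connect. The sketch below reproduces that idea for the NameNode RPC port so you can see what the plug-in is testing; the host name and port are placeholders, and this is an illustration rather than a replacement for the Nagios plug-in.

    # Sketch: a timed TCP connect, similar in spirit to the Nagios check_tcp plug-in.
    import socket
    import time

    def tcp_check(host, port, timeout_seconds=5.0):
        """Return (reachable, elapsed_seconds) for a single TCP connect attempt."""
        start = time.time()
        try:
            with socket.create_connection((host, port), timeout=timeout_seconds):
                return True, time.time() - start
        except OSError:
            return False, time.time() - start

    ok, elapsed = tcp_check("namenode.example.com", 8020)   # placeholder host/port
    print("reachable" if ok else "NOT reachable", "in", round(elapsed, 3), "seconds")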
3.5.1.2.1. Potential causes
• The NameNode process is up and running but not listening on the correct network port
(default 8020)
• The Nagios server cannot connect to the HDFS master through the network.
3.5.1.2.2. Possible remedies
• Check for any errors in the logs (/var/log/hadoop/hdfs/) and restart the NameNode
host/process using the HMC Manage Services tab.
• Run the netstat -tuplpn command to check if the NameNode process is bound to the
correct network port
• Use ping to check the network connection between the Nagios server and the
NameNode
3.5.1.3. DataNode space
This host-level alert is triggered if storage capacity is full on the DataNode (90% critical).
It uses the check_datanode_storage.php plug-in which checks the DataNode JMX
Servlet for the Capacity and Remaining properties.
3.5.1.3.1. Potential causes
3.5.1.3.2. Possible remedies
• If the cluster still has storage, use Balancer to distribute the data to relatively less used DataNodes
• If the cluster is full, delete unnecessary data or add additional storage by adding either more DataNodes or more or larger disks to the DataNodes. After adding more storage, run Balancer
3.5.1.4. DataNode process
This host-level alert is triggered if the individual DataNode processes cannot be established
to be up and listening on the network for the configured critical threshold, given in
seconds. It uses the Nagios check_tcp plugin.
3.5.1.4.1. Potential causes
• DataNode process is down or not responding
• The DataNode is not down but is not listening to the correct network port/address
3.5.1.4.2. Possible remedies
• Check for dead DataNodes in Ambari Web.
• Check for any errors in the DataNode logs (/var/log/hadoop/hdfs) and restart the
DataNode, if necessary
• Run the netstat -tuplpn command to check if the DataNode process is bound to the
correct network port
• Use ping to check the network connection between the Nagios server and the
DataNode
3.5.1.5.1. Potential causes
• Unusually high CPU utilization: Can be caused by a very unusual job/query workload, but
this is generally the sign of an issue in the daemon.
3.5.1.5.2. Possible remedies
• Use the top command to determine which processes are consuming excess CPU.
3.5.1.6.1. Potential causes
• At least one of the multiple edit log directories is mounted over NFS and has become
unreachable
• The permissions on at least one of the multiple edit log directories have become read-only
3.5.1.6.2. Possible remedies
• Check permissions on all edit log directories
3.5.1.7. NameNode Web UI
This host-level alert is triggered if the NameNode Web UI is unreachable.
3.5.1.7.1. Potential causes
3.5.1.7.2. Possible remedies
• Check whether the Nagios Server can ping the NameNode server.
• Using a browser, check whether the Nagios Server can reach the NameNode Web UI.
3.5.1.8.1. Potential causes
3.5.1.8.2. Possible remedies
• If cluster still has storage, use Balancer to distribute the data to relatively less used
DataNodes
• If the cluster is full, delete unnecessary data or add additional storage by adding either
more DataNodes or more or larger disks to the DataNodes. After adding more storage, run Balancer
3.5.1.9.1. Potential causes
• DataNodes are not down but are not listening to the correct network port/address
3.5.1.9.2. Possible remedies
• Check for dead DataNodes in Ambari Web.
• Check for any errors in the DataNode logs (/var/log/hadoop/hdfs) and restart the
DataNode hosts/processes
• Run the netstat -tuplpn command to check if the DataNode process is bound to the
correct network port.
• Use ping to check the network connection between the Nagios server and the
DataNodes.
3.5.1.10.1. Potential causes
• A job or an application is performing too many NameNode operations.
3.5.1.10.2. Possible remedies
• Review the job or the application for potential bugs causing it to perform too many
NameNode operations.
3.5.1.11.1. Potential causes
• Cluster storage is full
3.5.1.11.2. Possible remedies
• Delete unnecessary data.
3.5.2. NameNode HA Alerts (Hadoop 2 only)
3.5.2.1. JournalNode process
This host-level alert is triggered if the individual JournalNode process cannot be established
to be up and listening on the network for the configured critical threshold, given in
seconds. It uses the Nagios check_tcp plug-in.
3.5.2.1.1. Potential causes
• The JournalNode process is down or not responding.
• The JournalNode is not down but is not listening to the correct network port/address.
3.5.2.1.2. Possible remedies
• Check if the JournalNode process is dead.
• Use ping to check the network connection between the Nagios server and the
JournalNode host.
3.5.2.2.1. Potential causes
• The Active, Standby or both NameNode processes are down.
3.5.2.2.2. Possible remedies
• On each host running NameNode, check for any errors in the logs (/var/log/hadoop/
hdfs/) and restart the NameNode host/process using Ambari Web.
• On each host running NameNode, run the netstat -tuplpn command to check if the
NameNode process is bound to the correct network port.
• Use ping to check the network connection between the Nagios server and the hosts
running NameNode.
3.5.3. YARN Alerts (Hadoop 2 only)
3.5.3.1. ResourceManager process
This host-level alert is triggered if the individual ResourceManager process cannot be
established to be up and listening on the network for the configured critical threshold,
given in seconds. It uses the Nagios check_tcp plug-in.
3.5.3.1.1. Potential causes
• The ResourceManager process is down or not responding.
• The ResourceManager is not down but is not listening to the correct network port/
address.
3.5.3.1.2. Possible remedies
• Check for a dead ResourceManager.
• Use ping to check the network connection between the Nagios Server and the
ResourceManager host.
3.5.3.2.1. Potential causes
• NodeManagers are down.
• NodeManagers are not down but are not listening to the correct network port/address.
3.5.3.2.2. Possible remedies
• Check for dead NodeManagers.
• Check for any errors in the NodeManager logs (/var/log/hadoop/yarn) and restart the NodeManager hosts/processes, as necessary.
• Use ping to check the network connection between the Nagios Server and the NodeManager hosts.
3.5.3.3. ResourceManager Web UI
This host-level alert is triggered if the ResourceManager Web UI is unreachable.
3.5.3.3.1. Potential causes
• The ResourceManager Web UI is unreachable from the Nagios Server.
3.5.3.3.2. Possible remedies
• Check if the ResourceManager process is running.
• Check whether the Nagios Server can ping the ResourceManager server.
• Using a browser, check whether the Nagios Server can reach the ResourceManager Web
UI.
3.5.3.4.1. Potential causes
• A job or an application is performing too many ResourceManager operations.
3.5.3.4.2. Possible remedies
• Review the job or the application for potential bugs causing it to perform too many
ResourceManager operations.
3.5.3.5.1. Potential causes
• Unusually high CPU utilization: Can be caused by a very unusual job/query workload, but
this is generally the sign of an issue in the daemon.
3.5.3.5.2. Possible remedies
• Use the top command to determine which processes are consuming excess CPU.
3.5.3.6. NodeManager process
This host-level alert is triggered if the NodeManager process cannot be established to be up
and listening on the network for the configured critical threshold, given in seconds. It uses
the Nagios check_tcp plug-in.
3.5.3.6.1. Potential causes
• NodeManager is not down but is not listening to the correct network port/address.
3.5.3.6.2. Possible remedies
• Check for any errors in the NodeManager logs (/var/log/hadoop/yarn) and restart
the NodeManager, if necessary.
• Use ping to check the network connection between the Nagios Server and the
NodeManager host.
3.5.3.7. NodeManager health
This host-level alert checks the node health property available from the NodeManager
component.
3.5.3.7.1. Potential causes
3.5.3.7.2. Possible remedies
• Check the NodeManager logs (/var/log/hadoop/yarn) for health check errors and restart the NodeManager, if necessary.
3.5.4. MapReduce2 Alerts (Hadoop 2 only)
3.5.4.1. HistoryServer Web UI
This host-level alert is triggered if the HistoryServer Web UI is unreachable.
3.5.4.1.1. Potential causes
3.5.4.1.2. Possible remedies
• Check whether the Nagios Server can ping the HistoryServer server.
• Using a browser, check whether the Nagios Server can reach the HistoryServer Web UI.
3.5.4.2.1. Potential causes
3.5.4.2.2. Possible remedies
• Review the job or the application for potential bugs causing it to perform too many
HistoryServer operations.
3.5.4.3.1. Potential causes
• Unusually high CPU utilization: Can be caused by a very unusual job/query workload, but
this is generally the sign of an issue in the daemon.
3.5.4.3.2. Possible remedies
• Use the top command to determine which processes are consuming excess CPU.
3.5.4.4. HistoryServer process
This host-level alert is triggered if the HistoryServer process cannot be established to be up
and listening on the network for the configured critical threshold, given in seconds. It uses
the Nagios check_tcp plug-in.
3.5.4.4.1. Potential causes
• HistoryServer is not down but is not listening to the correct network port/address.
3.5.4.4.2. Possible remedies
• Check for any errors in the HistoryServer logs (/var/log/hadoop/mapred) and restart
the HistoryServer, if necessary
• Use ping to check the network connection between the Nagios Server and the
HistoryServer host.
3.5.5. MapReduce Service Alerts (Hadoop 1 only)
3.5.5.1.1. Potential causes
3.5.5.1.2. Possible remedies
• Review the job or the application for potential bugs causing it to perform too many
JobTracker operations
3.5.5.2. JobTracker process
This host-level alert is triggered if the individual JobTracker process cannot be confirmed to
be up and listening on the network for the configured critical threshold, given in seconds. It
uses the Nagios check_tcp plug-in.
3.5.5.2.1. Potential causes
• JobTracker is not down but is not listening to the correct network port/address.
3.5.5.2.2. Possible remedies
• Check for any errors in the JobTracker logs (/var/log/hadoop/mapred) and restart
the JobTracker, if necessary
• Use ping to check the network connection between the Nagios Server and the
JobTracker host.
3.5.5.3. JobTracker Web UI
This host-level alert is triggered if the JobTracker Web UI is unreachable.
3.5.5.3.1. Potential causes
• The JobTracker Web UI is unreachable from the Nagios Server.
3.5.5.3.2. Possible remedies
• Check if the JobTracker process is running.
• Check whether the Nagios Server can ping the JobTracker server.
• Using a browser, check whether the Nagios Server can reach the JobTracker Web UI.
3.5.5.4.1. Potential causes
• Unusually high CPU utilization: Can be caused by a very unusual job/query workload, but
this is generally the sign of an issue in the daemon.
3.5.5.4.2. Possible remedies
• Use the top command to determine which processes are consuming excess CPU
3.5.5.5. HistoryServer Web UI
This host-level alert is triggered if the HistoryServer Web UI is unreachable.
3.5.5.5.1. Potential causes
• The HistoryServer Web UI is unreachable from the Nagios Server.
3.5.5.5.2. Possible remedies
• Check whether the HistoryServer process is running.
• Check whether the Nagios Server can ping the HistoryServer server.
• Using a browser, check whether the Nagios Server can reach the HistoryServer Web UI.
3.5.5.6. HistoryServer process
This host-level alert is triggered if the HistoryServer process cannot be established to be up
and listening on the network for the configured critical threshold, given in seconds. It uses
the Nagios check_tcp plug-in.
3.5.5.6.1. Potential causes
• The HistoryServer process is down or not responding.
• The HistoryServer is not down but is not listening to the correct network port/address.
3.5.5.6.2. Possible remedies
• Check for any errors in the HistoryServer logs (/var/log/hadoop/mapred) and restart
the HistoryServer, if necessary.
• Use ping to check the network connection between the Nagios Server and the
HistoryServer host.
3.5.6. HBase Service Alerts
3.5.6.1.1. Potential causes
• Misconfiguration or less-than-ideal configuration caused the RegionServers to crash
• The RegionServers shut themselves down because there were problems in the dependent services, ZooKeeper or HDFS
• GC paused the RegionServer for too long and the RegionServers lost contact with ZooKeeper
3.5.6.1.2. Possible remedies
• Check the dependent services to make sure they are operating correctly.
• If the failure was associated with a particular workload, try to understand the workload
better
3.5.6.2.1. Potential causes
• The HBase master process is down
• The HBase master has shut itself down because there were problems in the dependent
services, ZooKeeper or HDFS
• The Nagios server cannot connect to the HBase master through the network
3.5.6.2.2. Possible remedies
• Check the dependent services.
• Look at the master log files (usually /var/log/hbase/*.log) for further information
• Use ping to check the network connection between the Nagios server and the HBase master
3.5.6.3.1. Potential causes
• The HBase Master Web UI is unreachable from the Nagios Server.
3.5.6.3.2. Possible remedies
• Check if the Master process is running.
• Check whether the Nagios Server can ping the HBase Master server.
• Using a browser, check whether the Nagios Server can reach the HBase Master Web UI.
3.5.6.4.1. Potential causes
• Unusually high CPU utilization: Can be caused by a very unusual job/query workload, but
this is generally the sign of an issue in the daemon.
3.5.6.4.2. Possible remedies
• Use the top command to determine which processes are consuming excess CPU
3.5.6.5. RegionServer process
This host-level alert is triggered if the RegionServer processes cannot be confirmed to be up
and listening on the network for the configured critical threshold, given in seconds. It uses
the Nagios check_tcp plug-in.
3.5.6.5.1. Potential causes
• The RegionServer process is down on the host
• The RegionServer process is up and running but not listening on the correct network port
(default 60030).
• The Nagios server cannot connect to the RegionServer through the network.
3.5.6.5.2. Possible remedies
• Check for any errors in the logs (/var/log/hbase/) and restart the RegionServer
process using Ambari Web.
• Use ping to check the network connection between the Nagios Server and the
RegionServer.
3.5.7. Hive Alerts
These alerts are used to monitor the Hive service.
3.5.7.1. Hive-Metastore status
This host-level alert is triggered if the Hive Metastore process cannot be determined to be
up and listening on the network for the configured critical threshold, given in seconds. It
uses the Nagios check_tcp plug-in.
3.5.7.1.1. Potential causes
• The Hive Metastore service is down.
3.5.7.1.2. Possible remedies
• Using Ambari Web, stop the Hive service and then restart it.
• Use ping to check the network connection between the Nagios server and the Hive
Metastore server.
3.5.8. WebHCat Alerts
These alerts are used to monitor the WebHCat service.
3.5.8.1.1. Potential causes
• The WebHCat server is down.
3.5.8.1.2. Possible remedies
• Restart the WebHCat server using Ambari Web.
3.5.9. Oozie Alerts
These alerts are used to monitor the Oozie service.
3.5.9.1. Oozie status
This host-level alert is triggered if the Oozie server cannot be determined to be up and
responding to client requests.
3.5.9.1.1. Potential causes
• The Oozie server is down.
3.5.9.1.2. Possible remedies
• Restart the Oozie service using Ambari Web.
3.5.10. Ganglia Alerts
These alerts are used to monitor the Ganglia service.
3.5.10.1.1. Potential causes
• The Ganglia server process is down.
• The network connection is down between the Nagios and Ganglia servers
3.5.10.1.2. Possible remedies
• Check the Ganglia server (gmetad) related log in /var/log/messages for any errors.
• Slaves
• NameNode
• HBase Master
3.5.10.2.1. Potential causes
• A gmond process is down.
• The network connection is down between the Nagios and Ganglia servers.
3.5.10.2.2. Possible remedies
3.5.11. Nagios Alerts
These alerts are used to monitor the Nagios service.
3.5.11.1.1. Potential causes
• The Nagios server is hanging and thus not scheduling new alerts
• The file /var/nagios/status.dat does not have appropriate write permissions for
the Nagios user.
3.5.11.1.2. Possible remedies
3.5.12. ZooKeeper Alerts
These alerts are used to monitor the ZooKeeper service.
3.5.12.1.1. Potential causes
• The majority of your ZooKeeper servers are down and not responding.
3.5.12.1.2. Possible remedies
• Check the dependent services to make sure they are operating correctly.
• If the failure was associated with a particular workload, try to understand the workload
better.
3.5.12.2.1. Potential causes
• The ZooKeeper server process is down on the host.
• The ZooKeeper server process is up and running but not listening on the correct network
port (default 2181).
• The Nagios server cannot connect to the ZooKeeper server through the network.
3.5.12.2.2. Possible remedies
• Check for any errors in the ZooKeeper logs (/var/log/zookeeper/) and restart the ZooKeeper process using Ambari Web.
• Run the netstat -tuplpn command to check if the ZooKeeper server process is bound to the correct network port.
• Use ping to check the network connection between the Nagios server and the
ZooKeeper server.
3.5.13. Ambari Alerts
This alert is used to monitor the Ambari Agent service.
3.5.13.1.1. Potential causes
• The Ambari Agent process is down on the host.
• The Ambari Agent process is up and running but is not heartbeating to the Ambari Server.
• The Ambari Agent process is up and running but is unreachable through the network
from the Nagios Server.
• The Ambari Agent cannot connect to the Ambari Server through the network.
3.5.13.1.2. Possible remedies
• Check for any errors in the logs (/var/log/ambari-agent/ambari-agent.log)
and restart the Ambari Agent process.
• Use ping to check the network connection between the Ambari Agent host and the
Ambari Server.