Cloudera Installation
Important Notice
Cloudera, the Cloudera logo, Cloudera Impala, and any other product or service
names or slogans contained in this document are trademarks of Cloudera and its
suppliers or licensors, and may not be copied, imitated or used, in whole or in part,
without the prior written permission of Cloudera or the applicable trademark holder.
Hadoop and the Hadoop elephant logo are trademarks of the Apache Software
Foundation. All other trademarks, registered trademarks, product names and
company names or logos mentioned in this document are the property of their
respective owners. Reference to any products, services, processes or other
information, by trade name, trademark, manufacturer, supplier or otherwise does
not constitute or imply endorsement, sponsorship or recommendation thereof by
us.
Complying with all applicable copyright laws is the responsibility of the user. Without
limiting the rights under copyright, no part of this document may be reproduced,
stored in or introduced into a retrieval system, or transmitted in any form or by any
means (electronic, mechanical, photocopying, recording, or otherwise), or for any
purpose, without the express written permission of Cloudera.
Cloudera, Inc.
1001 Page Mill Road Bldg 2
Palo Alto, CA 94304
[email protected]
US: 1-888-789-1488
Intl: 1-650-362-0488
www.cloudera.com
Release Information
Version: 5.4.x
Date: May 20, 2015
Table of Contents
Requirements.............................................................................................................7
Cloudera Manager 5 Requirements and Supported Versions................................................................7
Supported Operating Systems................................................................................................................................7
Supported JDK Versions...........................................................................................................................................7
Supported Browsers.................................................................................................................................................8
Supported Databases...............................................................................................................................................8
Supported CDH and Managed Service Versions....................................................................................................8
Resource Requirements..........................................................................................................................................9
Networking and Security Requirements................................................................................................................9
Single User Mode Requirements..........................................................................................................................12
Permission Requirements........................................................................................................................15
Cloudera Navigator 2 Requirements and Supported Versions.............................................................17
Cloudera Manager Requirements.........................................................................................................................17
Supported Databases.............................................................................................................................................17
Supported Browsers...............................................................................................................................................18
Supported CDH and Managed Service Versions..................................................................................................18
CDH 5 Requirements and Supported Versions......................................................................................19
Supported Operating Systems..............................................................................................................................20
Supported Databases.............................................................................................................................................21
Supported JDK Versions.........................................................................................................................................21
Supported Internet Protocol..................................................................................................................................21
Supported Configurations with Virtualization and Cloud Platforms...................................................22
Microsoft Azure......................................................................................................................................................22
VMware....................................................................................................................................................................22
Ports............................................................................................................................................................22
Ports Used by Cloudera Manager and Cloudera Navigator...............................................................................23
Ports Used by Components of CDH 5...................................................................................................................26
Ports Used by Components of CDH 4...................................................................................................................32
Ports Used by Cloudera Impala.............................................................................................................................37
Ports Used by Cloudera Search.............................................................................................................................38
Ports Used by Third-Party Components..............................................................................................................38
Installation................................................................................................................40
Cloudera Manager Deployment...............................................................................................................40
Unmanaged Deployment..........................................................................................................................41
Java Development Kit Installation...........................................................................................................41
Installing the Oracle JDK........................................................................................................................................42
Installing Cloudera Manager, CDH, and Managed Services...................................................................42
Cloudera Manager Installation Software.............................................................................................................43
Cloudera Manager and Managed Service Data Stores.......................................................................................44
Managing Software Installation...........................................................................................................................80
Installation Path A - Automated Installation by Cloudera Manager................................................................98
Installation Path B - Manual Installation Using Cloudera Manager Packages.............................................106
Installation Path C - Manual Installation Using Cloudera Manager Tarballs................................................122
Installing Impala...................................................................................................................................................132
Installing Search...................................................................................................................................................132
Installing Spark.....................................................................................................................................................133
Installing Key Trustee KMS.................................................................................................................................133
Installing GPL Extras............................................................................................................................................134
Understanding Custom Installation Solutions..................................................................................................135
Deploying Clients..................................................................................................................................................156
Testing the Installation........................................................................................................................................157
Uninstalling Cloudera Manager and Managed Software.................................................................................158
Uninstalling CDH From a Single Host.................................................................................................................162
Installing Cloudera Navigator.................................................................................................................163
Installing and Deploying CDH Using the Command Line....................................................................164
Before You Install CDH 5 on a Cluster................................................................................................................164
Creating a Local Yum Repository........................................................................................................................165
Installing the Latest CDH 5 Release...................................................................................................................166
Installing an Earlier CDH 5 Release....................................................................................................................178
CDH 5 and MapReduce........................................................................................................................................181
Migrating from MapReduce 1 (MRv1) to MapReduce 2 (MRv2, YARN)..........................................................182
Tuning the Cluster for MapReduce v2 (YARN)...................................................................................................195
Deploying CDH 5 on a Cluster..............................................................................................................................200
Installing CDH 5 Components.............................................................................................................................226
Building RPMs from CDH Source RPMs.............................................................................................................436
Apache and Third-Party Licenses.......................................................................................................................437
Uninstalling CDH Components............................................................................................................................438
Viewing the Apache Hadoop Documentation...................................................................................................441
Upgrade...................................................................................................................442
Upgrading Cloudera Manager................................................................................................................442
Database Considerations for Cloudera Manager Upgrades............................................................................443
Upgrading Cloudera Manager 5 to the Latest Cloudera Manager..................................................................445
Upgrading Cloudera Manager 4 to Cloudera Manager 5..................................................................................457
Upgrading Cloudera Manager 3.7.x....................................................................................................................475
Re-Running the Cloudera Manager Upgrade Wizard.......................................................................................475
Reverting a Failed Cloudera Manager Upgrade................................................................................................476
Upgrading Cloudera Navigator...............................................................................................................478
Upgrading CDH and Managed Services Using Cloudera Manager.....................................................479
Configuring the CDH Version of a Cluster..........................................................................................................480
Performing a Rolling Upgrade on a CDH 5 Cluster............................................................................................480
Performing a Rolling Upgrade on a CDH 4 Cluster............................................................................................483
Upgrading to CDH Maintenance Releases.........................................................................................................485
Upgrading to CDH 5.4...........................................................................................................................................494
Upgrading to CDH 5.3...........................................................................................................................................509
Upgrading to CDH 5.2...........................................................................................................................................523
Upgrading to CDH 5.1...........................................................................................................................................536
Upgrading CDH 4 to CDH 5..................................................................................................................................546
Upgrading CDH 4...................................................................................................................................................563
Upgrading CDH 3...................................................................................................................................................572
Upgrading Unmanaged CDH Using the Command Line......................................................................572
Upgrading from CDH 4 to CDH 5.........................................................................................................................573
Upgrading from an Earlier CDH 5 Release to the Latest Release...................................................................590
Upgrading to Oracle JDK 1.7 before Upgrading to CDH 5....................................................................609
Upgrading to JDK 1.7 in a Cloudera Manager Deployment..............................................................................609
Upgrading to JDK 1.7 in an Unmanaged Deployment......................................................................................610
Upgrading to Oracle JDK 1.8...................................................................................................................610
Upgrading to JDK 1.8 in a Cloudera Manager Deployment..............................................................................610
Upgrading to JDK 1.8 in an Unmanaged Deployment Using the Command Line..........................................611
Requirements
This section describes the requirements for installing Cloudera Manager, Cloudera Navigator, and CDH 5.
• SLES - SUSE Linux Enterprise Server 11, 64-bit. Service Pack 2 or later is required for CDH 5, and Service Pack
1 or later is required for CDH 4. To use the embedded PostgreSQL database that is installed when you follow
Installation Path A - Automated Installation by Cloudera Manager on page 98, the Updates repository must
be active. The SUSE Linux Enterprise Software Development Kit 11 SP1 is required on hosts running the
Cloudera Manager Agents.
• Debian - Wheezy (7.0 and 7.1), Squeeze (6.0) (deprecated), 64-bit
• Ubuntu - Trusty (14.04), Precise (12.04), Lucid (10.04) (deprecated), 64-bit
Note:
• Debian Squeeze and Ubuntu Lucid are supported only for CDH 4.
• Using the same version of the same operating system on all cluster hosts is strongly recommended.
Supported JDK Versions
Oracle JDK 1.6.0_31 and 1.7.0_75 are supported when Cloudera Manager is managing both CDH 4.x and CDH 5.x
clusters; they can be installed during the installation and upgrade. For further information, see Java Development Kit Installation on page 41.
Supported Browsers
The Cloudera Manager Admin Console, which you use to install, configure, manage, and monitor services, supports
the following browsers:
• Mozilla Firefox 11 and higher.
• Google Chrome.
• Internet Explorer 9 and higher (use Native Mode with Internet Explorer 11).
• Safari 5 and higher.
Supported Databases
Cloudera Manager requires several databases. The Cloudera Manager Server stores information about configured
services, role assignments, configuration history, commands, users, and running processes in a database of its
own. You must also specify a database for the Activity Monitor and Reports Manager roles.
Important: When processes restart, the configuration for each of the services is redeployed using
information that is saved in the Cloudera Manager database. If this information is not available, your
cluster will not start or function correctly. You must therefore schedule and maintain regular backups
of the Cloudera Manager database in order to recover the cluster in the event of the loss of this
database. See Backing Up Databases on page 65.
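As a rough illustration only (assuming a MySQL-backed deployment and a Cloudera Manager database named scm, both of which are assumptions here; see Backing Up Databases on page 65 for the supported procedure), a backup might be taken with:
mysqldump -u root -p --databases scm > scm-server-db-backup.sql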
The database you use must be configured to support UTF8 character set encoding. The embedded PostgreSQL
database that is installed when you follow Installation Path A - Automated Installation by Cloudera Manager
on page 98 automatically provides UTF8 encoding. If you install a custom database, you may need to enable
UTF8 encoding. The commands for enabling UTF8 encoding are described in each database topic under Cloudera
Manager and Managed Service Data Stores on page 44.
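For illustration (the database name scm below is an assumption; the authoritative commands are in the database topics referenced above), UTF8 encoding is typically requested when the database is created:
MySQL:       CREATE DATABASE scm DEFAULT CHARACTER SET utf8;
PostgreSQL:  CREATE DATABASE scm ENCODING 'UTF8';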
After installing a database, upgrade to the latest patch version and apply any other appropriate updates. Available
updates may be specific to the operating system on which it is installed.
Cloudera Manager and its supporting services can use the following databases:
• MySQL - 5.5 and 5.6
• Oracle 11gR2
• PostgreSQL - 8.4, 9.2, and 9.3
Cloudera supports the shipped version of MySQL and PostgreSQL for each supported Linux distribution. Each
database is supported for all components in Cloudera Manager and CDH subject to the notes in CDH 4 Supported
Databases and CDH 5 Supported Databases.
Supported CDH and Managed Service Versions
Warning: Cloudera Manager 5 does not support CDH 3, and you cannot upgrade Cloudera Manager 4
to Cloudera Manager 5 if you have a cluster running CDH 3. Therefore, to upgrade CDH 3 clusters to
CDH 4 using Cloudera Manager, you must use Cloudera Manager 4.
• CDH 4 and CDH 5. The latest released versions of CDH 4 and CDH 5 are strongly recommended. For information
on CDH 4 requirements, see CDH 4 Requirements and Supported Versions. For information on CDH 5
requirements, see CDH 5 Requirements and Supported Versions on page 19.
• Cloudera Impala - Cloudera Impala is included with CDH 5. Cloudera Impala 1.2.1 with CDH 4.1.0 or later. For
more information on Cloudera Impala requirements with CDH 4, see Cloudera Impala Requirements.
• Cloudera Search - Cloudera Search is included with CDH 5. Cloudera Search 1.2.0 with CDH 4.6.0. For more
information on Cloudera Search requirements with CDH 4, see Cloudera Search Requirements.
• Apache Spark - 0.9.0 or later with CDH 4.4.0 or later.
• Apache Accumulo - 1.4.3 with CDH 4.3.0, 1.4.4 with CDH 4.5.0, and 1.6.0 with CDH 4.6.0.
For more information, see the Product Compatibility Matrix.
Resource Requirements
Cloudera Manager requires the following resources:
• Disk Space
– Cloudera Manager Server
– 5 GB on the partition hosting /var.
– 500 MB on the partition hosting /usr.
– For parcels, the space required depends on the number of parcels you download to the Cloudera
Manager Server and distribute to Agent hosts. You can download multiple parcels of the same product,
of different versions and builds. If you are managing multiple clusters, only one parcel of a
product/version/build/distribution is downloaded on the Cloudera Manager Server—not one per
cluster. In the local parcel repository on the Cloudera Manager Server, the approximate sizes of the
various parcels are as follows:
– CDH 4.6 - 700 MB per parcel; CDH 5 (which includes Impala and Search) - 1.5 GB per parcel (packed),
2 GB per parcel (unpacked)
– Cloudera Impala - 200 MB per parcel
– Cloudera Search - 400 MB per parcel
– Cloudera Management Service - The Host Monitor and Service Monitor databases are stored on the
partition hosting /var. Ensure that you have at least 20 GB available on this partition. For more information,
see Data Storage for Monitoring Data on page 67.
– Agents - On Agent hosts each unpacked parcel requires about three times the space of the downloaded
parcel on the Cloudera Manager Server. By default unpacked parcels are located in
/opt/cloudera/parcels.
• RAM - 4 GB is recommended for most cases and is required when using Oracle databases. 2 GB may be
sufficient for non-Oracle deployments with fewer than 100 hosts. However, to run the Cloudera Manager
Server on a machine with 2 GB of RAM, you must tune down its maximum heap size (by modifying -Xmx in
/etc/default/cloudera-scm-server; see the sketch after this list). Otherwise the kernel may kill the Server for
consuming too much RAM.
• Python - Cloudera Manager and CDH 4 require Python 2.4 or later, but Hue in CDH 5 and package installs of
CDH 5 require Python 2.6 or 2.7. All supported operating systems include Python version 2.4 or later.
• Perl - Cloudera Manager requires perl.
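For illustration, the -Xmx setting mentioned in the RAM item above typically appears inside a CMF_JAVA_OPTS line in /etc/default/cloudera-scm-server (the variable name and the remaining flags shown here are assumptions about that file's layout, not its verbatim contents). Lowering the heap for a 2 GB host might look like:
export CMF_JAVA_OPTS="-Xmx1G -XX:MaxPermSize=256m"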
Networking and Security Requirements
The /etc/hosts file on each cluster host must:
– Contain consistent information about hostnames and IP addresses across all hosts
– Not contain uppercase hostnames
– Not contain duplicate IP addresses
Also, do not use aliases, either in /etc/hosts or in configuring DNS. A properly formatted /etc/hosts file
should be similar to the following example:
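The example itself is not reproduced in this excerpt; a representative file (all hostnames and addresses below are placeholders) might read:
127.0.0.1       localhost.localdomain   localhost
192.168.1.1     cluster-01.example.com  cluster-01
192.168.1.2     cluster-02.example.com  cluster-02
192.168.1.3     cluster-03.example.com  cluster-03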
• In most cases, the Cloudera Manager Server must have SSH access to the cluster hosts when you run the
installation or upgrade wizard. You must log in using a root account or an account that has password-less
sudo permission. For authentication during the installation and upgrade procedures, you must either enter
the password or upload a public and private key pair for the root or sudo user account. If you want to use a
public and private key pair, the public key must be installed on the cluster hosts before you use Cloudera
Manager.
Cloudera Manager uses SSH only during the initial install or upgrade. Once the cluster is set up, you can
disable root SSH access or change the root password. Cloudera Manager does not save SSH credentials, and
all credential information is discarded when the installation is complete. For more information, see Permission
Requirements on page 15.
• If single user mode is not enabled, the Cloudera Manager Agent runs as root so that it can make sure the
required directories are created and that processes and files are owned by the appropriate user (for example,
the hdfs and mapred users).
• No blocking is done by Security-Enhanced Linux (SELinux).
• IPv6 must be disabled.
• No blocking by iptables or firewalls; port 7180 must be open because it is used to access Cloudera Manager
after installation. Cloudera Manager communicates using specific ports, which must be open.
• For RedHat and CentOS, the /etc/sysconfig/network file on each host must contain the hostname you
have just set (or verified) for that host.
• Cloudera Manager and CDH use several user accounts and groups to complete their tasks. The set of user
accounts and groups varies according to the components you choose to install. Do not delete these accounts
or groups and do not modify their permissions and rights. Ensure that no existing systems prevent these
accounts and groups from functioning. For example, if you have scripts that delete user accounts not in a
whitelist, add these accounts to the list of permitted accounts. Cloudera Manager, CDH, and managed services
create and use the following accounts and groups:
• Apache Flume (CDH 4, CDH 5) - User: flume; Groups: flume. The sink that writes to HDFS as this user must have write privileges.
• Apache HBase (CDH 4, CDH 5) - User: hbase; Groups: hbase. The Master and the RegionServer processes run as this user.
• HDFS (CDH 4, CDH 5) - User: hdfs; Groups: hdfs, hadoop. The NameNode and DataNodes run as this user, and the HDFS root directory as well as the directories used for edit logs should be owned by it.
• Apache Hive (CDH 4, CDH 5) - User: hive; Groups: hive. The HiveServer2 process and the Hive Metastore processes run as this user. A user must be defined for Hive access to its Metastore DB (e.g. MySQL or Postgres) but it can be any identifier and does not correspond to a Unix uid. This is javax.jdo.option.ConnectionUserName in hive-site.xml.
• Apache HCatalog (CDH 4.2 and higher, CDH 5) - User: hive; Groups: hive. The WebHCat service (for REST access to Hive functionality) runs as the hive user.
• HttpFS (CDH 4, CDH 5) - User: httpfs; Groups: httpfs. The HttpFS service runs as this user. See HttpFS Security Configuration for instructions on how to generate the merged httpfs-http.keytab file.
• Hue (CDH 4, CDH 5) - User: hue; Groups: hue. Hue services run as this user.
• Cloudera Impala (CDH 4.1 and higher, CDH 5) - User: impala; Groups: impala, hadoop, hdfs, hive. Impala services run as this user.
• Apache Kafka (Cloudera Distribution of Kafka 1.2.0) - User: kafka; Groups: kafka. Kafka services run as this user.
• Java KeyStore KMS (CDH 5.2.1 and higher) - User: kms; Groups: kms. The Java KeyStore KMS service runs as this user.
• Key Trustee KMS (CDH 5.3 and higher) - User: kms; Groups: kms. The Key Trustee KMS service runs as this user.
• Key Trustee Server (CDH 5.4 and higher) - User: keytrustee; Groups: keytrustee. The Key Trustee Server service runs as this user.
• Llama (CDH 5) - User: llama; Groups: llama. Llama runs as this user.
• Apache Mahout - No special users.
• MapReduce (CDH 4, CDH 5) - User: mapred; Groups: mapred, hadoop. Without Kerberos, the JobTracker and tasks run as this user. The LinuxTaskController binary is owned by this user for Kerberos.
• Apache Oozie (CDH 4, CDH 5) - User: oozie; Groups: oozie. The Oozie service runs as this user.
• Parquet - No special users.
• Apache Pig - No special users.
• Cloudera Search (CDH 4.3 and higher, CDH 5) - User: solr; Groups: solr. The Solr processes run as this user.
• Apache Spark (CDH 5) - User: spark; Groups: spark. The Spark History Server process runs as this user.
• Apache Sentry (incubating) (CDH 5.1 and higher) - User: sentry; Groups: sentry. The Sentry service runs as this user.
• Apache Sqoop (CDH 4, CDH 5) - User: sqoop; Groups: sqoop. This user is only for the Sqoop1 Metastore, a configuration option that is not recommended.
• Apache Sqoop2 (CDH 4.2 and higher, CDH 5) - User: sqoop2; Groups: sqoop, sqoop2. The Sqoop2 service runs as this user.
• Apache Whirr - No special users.
• YARN (CDH 4, CDH 5) - User: yarn; Groups: yarn, hadoop. Without Kerberos, all YARN services and applications run as this user. The LinuxContainerExecutor binary is owned by this user for Kerberos.
• Apache ZooKeeper (CDH 4, CDH 5) - User: zookeeper; Groups: zookeeper. The ZooKeeper processes run as this user. It is not configurable.
Limitations
• Switching between conventional and single user mode is not supported.
• Single user mode is supported for clusters running CDH 5.2 and higher.
• NFS Gateway is not supported in single user mode.
• For a single user username, create the process limits configuration file at
/etc/security/limits.d/username.conf with the following settings:
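The settings themselves are not included in this excerpt. Representative values (illustrative only; replace username with the configured single user) raise the open-file and process limits, for example:
username soft nofile 32768
username hard nofile 1048576
username soft nproc 65536
username hard nproc unlimited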
Configuration Steps Before Starting Cloudera Manager Agents in Installation Paths B and C
• If you manually install Agent packages, before starting the Agents, configure them to run as cloudera-scm
by editing the file /etc/default/cloudera-scm-agent and uncommenting the line:
USER="cloudera-scm"
or adding a new sudo configuration for the cloudera-scm group by running the command visudo and then
adding the following line:
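The line itself is not reproduced in this excerpt; a typical group entry (an assumption, not necessarily the verbatim original) grants passwordless sudo to the group:
%cloudera-scm ALL=(ALL) NOPASSWD: ALL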
• Sudo must be configured so that /usr/sbin is in the path when running sudo. One way to achieve this is
by adding the following configuration to sudoers:
1. Edit the /etc/sudoers file using the visudo command
2. Add this line to the configuration file:
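The line is missing from this excerpt; a common form (illustrative) that puts /usr/sbin on the sudo search path is:
Defaults secure_path = /sbin:/bin:/usr/sbin:/usr/bin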
• Roles that run on Tomcat require some directories to exist in non-configurable paths. The following directories
must be created and be writable by cloudera-scm:
– HDFS (HttpFS role) - /var/lib/hadoop-httpfs
– Oozie Server - /var/lib/oozie
– Sqoop 2 Server - /var/lib/sqoop2
– Solr Server - /var/lib/solr
• Cloudera recommends that you create a prefix directory (for example, /cm) owned by cloudera-scm under
which all other service directories will be placed. In single user mode, the Cloudera Manager Agent creates
directories under the prefix directory with the correct ownership. If hosts have additional volumes on them
that will be used for data directories Cloudera recommends creating a directory on each volume (for example,
/data0/cm and /data1/cm) that is writable by cloudera-scm.
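For example, the prefix and per-volume directories described above could be created like this (paths match the examples in the text; adjust to your volumes):
sudo mkdir -p /cm /data0/cm /data1/cm
sudo chown cloudera-scm:cloudera-scm /cm /data0/cm /data1/cm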
Configuration Steps Before Starting the Installation Wizard in Installation Paths B and C
Perform the following steps for the indicated scenarios:
• Path C - Do one of the following:
– Create and change the ownership of /var/lib/cloudera-scm-server to the single user.
– Set the Cloudera Manager Server local storage directory to one owned by the single user:
1. Go to Administration > Settings > Advanced.
2. Set the Cloudera Manager Server Local Data Storage Directory property to a directory owned by the
single user.
3. Click Save Changes to commit the changes.
• Path B and C when using already managed hosts - Configure single user mode:
Permission Requirements
The following sections describe the permission requirements for package-based installation and upgrades of
CDH with and without Cloudera Manager. The permission requirements are not controlled by Cloudera but result
from standard UNIX system requirements for the installation and management of packages and running services.
Important: Unless otherwise noted, when root and/or sudo access is required, using another system
(such as PowerBroker) that provides root/sudo privileges is acceptable.
Manually start/stop/restart the Cloudera Manager Agent process - If single user mode is not enabled, root and/or sudo access is required.
This permission requirement ensures that services managed by the
Cloudera Manager Agent assume the appropriate user (that is, the HDFS
service assumes the hdfs user) for correct privileges. Any action request
for a CDH service managed within Cloudera Manager does not require
root and/or sudo access, because the action is handled by the Cloudera
Manager Agent, which is already running under the root user.
Supported Databases
Cloudera Navigator, which stores audit reports, entity metadata, policies, and user authorization and audit
report metadata, supports the following databases:
• MySQL - 5.5 and 5.6
• Oracle 11gR2
• PostgreSQL - 8.4, 9.2, and 9.3
Supported Browsers
The Cloudera Navigator UI, which you use to create and view audit reports, search and update metadata, and
configure Cloudera Navigator user groups, supports the following browsers:
• Mozilla Firefox 24 and higher
• Google Chrome 36 and higher
• Internet Explorer 11
• Safari 5 and higher
Supported CDH and Managed Service Versions
Cloudera Navigator audits the following components. For each, the captured operations and notes are listed (for details, see Audit Events and Audit Reports), along with the minimum supported version.
• HDFS - Minimum supported version: CDH 4.0.0.
– Operations that access or modify a file's or directory's data or metadata
– Operations denied due to lack of privileges
• HBase - Minimum supported version: CDH 4.0.0.
– In CDH versions less than 4.2.0, for grant and revoke operations, the operation in log events is ADMIN.
– In simple authentication mode, if the HBase Secure RPC Engine property is false (the default), the username in log events is UNKNOWN. To see a meaningful user name:
1. Click the HBase service.
2. Click the Configuration tab.
3. Select Service-wide > Security.
4. Set the HBase Secure RPC Engine property to true.
5. Save the change and restart the service.
• Hive - Minimum supported version: CDH 4.2.0 (CDH 4.4.0 for operations denied due to lack of privileges).
– Operations (except grant, revoke, and metadata access only) sent to HiveServer2
– Operations denied due to lack of privileges
– Actions taken against Hive via the Hive CLI are not audited. Therefore, if you have enabled auditing, you should disable the Hive CLI to prevent actions against Hive that are not audited.
– In simple authentication mode, the username in log events is the username passed in the HiveServer2 connect command. If you do not pass a username in the connect command, the username in log events is anonymous.
• Hue - Minimum supported version: CDH 4.4.0.
– Operations (except grant, revoke, and metadata access only) sent to Beeswax Server
You do not directly configure the Hue service for auditing. Instead, when you configure the Hive service for auditing, operations sent to the Hive service through Beeswax appear in the Hue service audit log.
• Impala - Minimum supported version: Impala 1.2.1 with CDH 4.4.0.
– Queries denied due to lack of privileges
– Queries that pass analysis
• Sentry - Minimum supported version: CDH 5.1.0.
– Operations sent to the HiveServer2 and Hive Metastore Server roles and the Impala service
– Add and delete roles, assign roles to groups and remove roles from groups, create and delete privileges, grant and revoke privileges
– Operations denied due to lack of privileges
You do not directly configure the Sentry service for auditing. Instead, when you configure the Hive and Impala services for auditing, grant, revoke, and metadata operations appear in the Hive or Impala service audit logs.
• Solr - Minimum supported version: CDH 5.4.0.
– Index creation and deletion
– Schema and configuration file modification
– Index, service, document tag access
Note:
• CDH 5 provides only 64-bit packages.
• Cloudera has received reports that our RPMs work well on Fedora, but we have not tested this.
• If you are using an operating system that is not supported by Cloudera packages, you can also
download source tarballs from Downloads.
Supported Databases
• Oozie - MySQL: 5.5, 5.6; SQLite: not supported; PostgreSQL: 8.4, 9.2, 9.3 (see Note 2); Oracle: 11gR2; Derby: default (see Note 4)
Note:
1. MySQL 5.5 is supported on CDH 5.1. MySQL 5.6 is supported on CDH 5.1 and later. The InnoDB
storage engine must be enabled in the MySQL server.
2. PostgreSQL 9.2 is supported on CDH 5.1 and later. PostgreSQL 9.3 is supported on CDH 5.2 and
later.
3. For the purposes of transferring data only, Sqoop 1 supports MySQL 5.0 and above, PostgreSQL
8.4 and above, Oracle 10.2 and above, Teradata 13.10 and above, and Netezza TwinFin 5.0 and
above. The Sqoop metastore works only with HSQLDB (1.8.0 and higher 1.x versions; the metastore
does not work with any HSQLDB 2.x versions).
4. Sqoop 2 can transfer data to and from MySQL 5.0 and above, PostgreSQL 8.4 and above, Oracle
10.2 and above, and Microsoft SQL Server 2012 and above. The Sqoop 2 repository database is
supported only on Derby and PostgreSQL.
5. Derby is supported as shown in the table, but not always recommended. See the pages for individual
components in the Cloudera Installation and Upgrade guide for recommendations.
6. CDH 5 Hue requires the default MySQL version of the operating system on which it is being installed
(which is usually MySQL 5.1, 5.5 or 5.6).
Microsoft Azure
For information on deploying Cloudera software on a Microsoft Azure cloud infrastructure, see the Reference
architecture for deploying on Microsoft Azure.
The following limitations and restrictions apply to deploying on Microsoft Azure in the current release:
• Only the D-14 instance types are supported. Use of Virtual Hard Drives (VHDs) is encouraged to increase
storage density.
• Only Cloudera Manager 5.x and CDH 5.x are supported.
• The only supported operating system is CentOS 6.5.
• The following services are supported:
– MRv2 (YARN)
– HDFS
– ZooKeeper
– Sqoop1
– Oozie
– Hive
– Pig
– Crunch
– HBase
• The following services are not yet supported but are available for installation:
– Impala
– Spark
– Solr
VMware
For information on deploying Cloudera software on a VMware-based infrastructure, see the Reference architecture
for deploying on VMware.
The following limitations and restrictions apply to deploying on VMware in the current release:
• Use the part of Hadoop Virtual Extensions that has been implemented in HDFS: HADOOP-8468. This prevents
data loss when a physical node that hosts two or more DataNodes goes down.
• Isilon and shared storage are not supported.
Ports
Cloudera Manager, CDH components, managed services, and third-party components use the ports listed in the
tables that follow. Before you deploy Cloudera Manager, CDH, managed services, and third-party components,
make sure these ports are open on each system. If you are using a firewall, such as iptables, and cannot open
all the listed ports, you will need to disable the firewall completely to ensure full functionality.
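As an illustration only (assuming iptables on a RHEL-compatible host), the Cloudera Manager Admin Console port could be opened as follows:
sudo iptables -I INPUT -p tcp -m tcp --dport 7180 -j ACCEPT
sudo service iptables save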
Note:
In the tables in the subsections that follow, the Access Requirement for each port is usually either
"Internal" or "External." In this context, "Internal" means that the port is used only for communication
among the nodes (for example the JournalNode ports in an HA configuration); "External" means that
the port can be used for either internal or external communication (for example, ports used by the
Web UIs for the NodeManager and the JobHistory Server).
For further details, see the following table. All ports listed are TCP.
• HttpFS - 14000
• HttpFS - 14001
• Impala Daemon - Impala Daemon Frontend Port, 21050 (External): Used to transmit commands and receive results by applications, such as Business Intelligence tools, using JDBC and version 2.0 or higher of the Cloudera ODBC driver.
• Impala Daemon - Impala Daemon Backend Port, 22000 (Internal): Internal use only. Impala daemons use this port to communicate with each other.
• Impala Daemon - StateStoreSubscriber Service Port, 23000 (Internal): Internal use only. Impala daemons listen on this port for updates from the statestore daemon.
• Catalog Daemon - StateStoreSubscriber Service Port, 23020 (Internal): Internal use only. The catalog daemon listens on this port for updates from the statestore daemon.
• Impala Daemon - Impala Daemon HTTP Server Port, 25000 (External): Impala web interface for administrators to monitor and troubleshoot.
• Impala StateStore Daemon - StateStore HTTP Server Port, 25010 (External): StateStore web interface for administrators to monitor and troubleshoot.
• Impala Catalog Daemon - Catalog HTTP Server Port, 25020 (External): Catalog service web interface for administrators to monitor and troubleshoot. New in Impala 1.2 and higher.
• Impala StateStore Daemon - StateStore Service Port, 24000 (Internal): Internal use only. The statestore daemon listens on this port for registration/unregistration requests.
• Impala Catalog Daemon - StateStore Service Port, 26000 (Internal): Internal use only. The catalog service uses this port to communicate with the Impala daemons.
• Impala Daemon - Llama Callback Port, 28000 (Internal): Internal use only. Impala daemons use this port to communicate with Llama. New in CDH 5.0.0 and higher.
• Impala Llama ApplicationMaster - Llama Thrift Admin Port, 15002 (Internal): Internal use only. New in CDH 5.0.0 and higher.
• Impala Llama ApplicationMaster - Llama Thrift Port, 15000 (Internal): Internal use only. New in CDH 5.0.0 and higher.
• Impala Llama ApplicationMaster - Llama HTTP Port, 15001 (External): Llama service web interface for administrators to monitor and troubleshoot. New in CDH 5.0.0 and higher.
Installation
This section introduces options for installing Cloudera Manager, CDH, and managed services. You can install:
• Cloudera Manager, CDH, and managed services in a Cloudera Manager deployment. This is the recommended
method for installing CDH and managed services.
• CDH 5 into an unmanaged deployment.
Note: If you intend to deploy Cloudera Manager in a highly-available configuration, see Configuring
Cloudera Manager for High Availability With a Load Balancer before starting your installation.
The Cloudera Manager installation paths share some common phases, but the variant aspects of each path
support different user and cluster host requirements:
• Demonstration and proof of concept deployments - There are two installation options:
– Installation Path A - Automated Installation by Cloudera Manager on page 98 - Cloudera Manager
automates the installation of the Oracle JDK, Cloudera Manager Server, embedded PostgreSQL database,
and Cloudera Manager Agent, CDH, and managed service software on cluster hosts, and configures
databases for the Cloudera Manager Server and Hive Metastore and optionally for Cloudera Management
Service roles. This path is recommended for demonstration and proof of concept deployments, but is not
recommended for production deployments because it is not intended to scale and may require database
migration as your cluster grows. To use this method, server and cluster hosts must satisfy the following
requirements:
– Provide the ability to log in to the Cloudera Manager Server host using a root account or an account
that has password-less sudo permission.
– Allow the Cloudera Manager Server host to have uniform SSH access on the same port to all hosts.
See Networking and Security Requirements on page 9 for further information.
– All hosts must have access to standard package repositories and either archive.cloudera.com or
a local repository with the necessary installation files.
– Installation Path B - Manual Installation Using Cloudera Manager Packages on page 106 - you install the
Oracle JDK, Cloudera Manager Server, and embedded PostgreSQL database packages on the Cloudera
Manager Server host. You have two options for installing Oracle JDK, Cloudera Manager Agent, CDH, and
managed service software on cluster hosts: manually install it yourself or use Cloudera Manager to
automate installation. However, in order for Cloudera Manager to automate installation of Cloudera
Manager Agent packages or CDH and managed service software, cluster hosts must satisfy the following
requirements:
– Allow the Cloudera Manager Server host to have uniform SSH access on the same port to all hosts.
See Networking and Security Requirements on page 9 for further information.
– All hosts must have access to standard package repositories and either archive.cloudera.com or
a local repository with the necessary installation files.
• Production deployments - require you to first manually install and configure a production database for the
Cloudera Manager Server and Hive Metastore. There are two installation options:
– Installation Path B - Manual Installation Using Cloudera Manager Packages on page 106 - you install the
Oracle JDK and Cloudera Manager Server packages on the Cloudera Manager Server host. You have two
options for installing Oracle JDK, Cloudera Manager Agent, CDH, and managed service software on cluster
hosts: manually install it yourself or use Cloudera Manager to automate installation. However, in order
for Cloudera Manager to automate installation of Cloudera Manager Agent packages or CDH and managed
service software, cluster hosts must satisfy the following requirements:
– Allow the Cloudera Manager Server host to have uniform SSH access on the same port to all hosts.
See Networking and Security Requirements on page 9 for further information.
– All hosts must have access to standard package repositories and either archive.cloudera.com or
a local repository with the necessary installation files.
– Installation Path C - Manual Installation Using Cloudera Manager Tarballs on page 122 - you install the
Oracle JDK, Cloudera Manager Server, and Cloudera Manager Agent software as tarballs and use Cloudera
Manager to automate installation of CDH and managed service software as parcels.
Unmanaged Deployment
In an unmanaged deployment, you are responsible for managing all phases of the life cycle of CDH and managed
service components on each host: installation, configuration, and service life cycle operations such as start and
stop. This section describes alternatives for installing CDH 5 software in an unmanaged deployment.
• Command-line methods:
– Download and install the CDH 5 "1-click Install" package
– Add the CDH 5 repository
– Build your own CDH 5 repository
If you use one of these command-line methods, the first (downloading and installing the "1-click Install"
package) is recommended in most cases because it is simpler than building or adding a repository. See
Installing the Latest CDH 5 Release on page 166 for detailed instructions for each of these options.
• Tarball You can download a tarball from CDH downloads. Keep the following points in mind:
– Installing CDH 5 from a tarball installs YARN.
– In CDH 5, there is no separate tarball for MRv1. Instead, the MRv1 binaries, examples, etc., are delivered
in the Hadoop tarball. The scripts for running MRv1 are in the bin-mapreduce1 directory in the tarball,
and the MRv1 examples are in the examples-mapreduce1 directory.
Requirements
• Install a supported version:
– CDH 5 - Supported JDK Versions on page 21
Important:
• You cannot upgrade from JDK 1.7 to JDK 1.8 while upgrading to CDH 5.3. The cluster must already
be running CDH 5.3 when you upgrade to JDK 1.8.
• On SLES 11 platforms, do not install or try to use the IBM Java version bundled with the SLES
distribution. CDH does not run correctly with that version.
export JAVA_HOME=/usr/java/jdk1.7.0_nn
The six phases are grouped into three installation paths based on how the Cloudera Manager Server and database
software are installed on the Cloudera Manager Server and cluster hosts. The criteria for choosing an installation
path are discussed in Cloudera Manager Deployment on page 40.
• Installation paths B and C - Cloudera Manager package repositories for manually installing the Cloudera
Manager Server, Agent, and embedded database packages.
• Installation path B - The Cloudera Manager Installation wizard for automating installation of Cloudera
Manager Agent package.
• All installation paths - The Cloudera Manager Installation wizard for automating CDH and managed service
installation and configuration on the cluster hosts. Cloudera Manager provides two methods for installing
CDH and managed services: parcels and packages. Parcels simplify the installation process and allow you to
download, distribute, and activate new versions of CDH and managed services from within Cloudera Manager.
After you install Cloudera Manager and you connect to the Cloudera Manager Admin Console for the first
time, use the Cloudera Manager Installation wizard to:
1. Discover cluster hosts
2. Optionally install the Oracle JDK
3. Optionally install CDH, managed service, and Cloudera Manager Agent software on cluster hosts
4. Select services
5. Map service roles to hosts
6. Edit service configurations
7. Start services
If you abort the software installation process, the Installation wizard automatically reverts and rolls back the
installation process for any uninstalled components. (Installation that has completed successfully on a host is
not rolled back on that host.)
Required Databases
The Cloudera Manager Server, Oozie Server, Sqoop Server, Activity Monitor, Reports Manager, Hive Metastore
Server, Sentry Server, Cloudera Navigator Audit Server, and Cloudera Navigator Metadata Server all require
databases. The type of data contained in the databases and their estimated sizes are as follows:
• Cloudera Manager - Contains all the information about services you have configured and their role
assignments, all configuration history, commands, users, and running processes. This relatively small database
(<100 MB) is the most important to back up.
Important: When processes restart, the configuration for each of the services is redeployed using
information that is saved in the Cloudera Manager database. If this information is not available,
your cluster will not start or function correctly. You must therefore schedule and maintain regular
backups of the Cloudera Manager database in order to recover the cluster in the event of the loss
of this database. See Backing Up Databases on page 65.
• Oozie Server - Contains Oozie workflow, coordinator, and bundle data. Can grow very large.
• Sqoop Server - Contains entities such as the connector, driver, links and jobs. Relatively small.
• Activity Monitor - Contains information about past activities. In large clusters, this database can grow large.
Configuring an Activity Monitor database is only necessary if a MapReduce service is deployed.
• Reports Manager - Tracks disk utilization and processing activities over time. Medium-sized.
• Hive Metastore Server - Contains Hive metadata. Relatively small.
• Sentry Server - Contains authorization metadata. Relatively small.
• Cloudera Navigator Audit Server - Contains auditing information. In large clusters, this database can grow
large.
• Cloudera Navigator Metadata Server - Contains authorization, policies, and audit report metadata. Relatively
small.
The Cloudera Manager Service Host Monitor and Service Monitor roles have an internal datastore.
Cloudera Manager provides three installation paths:
• Path A automatically installs an embedded PostgreSQL database to meet the requirements of the services.
This path reduces the number of installation tasks to complete and choices to make. In Path A you can
optionally choose to create external databases for Oozie Server, Activity Monitor, Reports Manager, Hive
Metastore Server, Sentry Server, Cloudera Navigator Audit Server, and Cloudera Navigator Metadata Server.
If you choose to use PostgreSQL for Sqoop Server you must create an external database.
• Path B and Path C require you to create databases for the Cloudera Manager Server, Oozie Server, Activity
Monitor, Reports Manager, Hive Metastore Server, Sentry Server, Cloudera Navigator Audit Server, and Cloudera
Navigator Metadata Server. If you choose to use PostgreSQL for Sqoop Server you must create an external
database.
Using an external database requires more input and intervention as you install databases or gather information
about existing ones. These paths also provide greater flexibility in choosing database types and configurations.
Cloudera Manager supports deploying different types of databases in a single environment, but doing so can
create unexpected complications. Cloudera recommends choosing one supported database provider for all of
the Cloudera databases.
In most cases, you should install databases and services on the same host. For example, if you create the
database for Activity Monitor on myhost1, then you should typically assign the Activity Monitor role to myhost1.
You assign the Activity Monitor and Reports Manager roles in the Cloudera Manager wizard during the installation
or upgrade process. After completing the installation or upgrade process, you can also modify role assignments
in the Management services pages of Cloudera Manager. Although the database location is changeable, before
beginning an installation or upgrade, you should decide which hosts to use. The JDBC connector for your database
must be installed on the hosts where you assign the Activity Monitor and Reports Manager roles.
You can install the database and services on different hosts. Separating databases from services is more likely
in larger deployments and in cases where more sophisticated database administrators choose such a
configuration. For example, databases and services might be separated if your environment includes Oracle
databases that are managed separately by Oracle database administrators.
• Red Hat-compatible, if you have a yum repo configured:
$ sudo yum install cloudera-manager-server-db-2
• Red Hat-compatible, if you're transferring RPMs manually:
$ sudo yum --nogpgcheck localinstall cloudera-manager-server-db-2.noarch.rpm
• SLES:
$ sudo zypper install cloudera-manager-server-db-2
• Package install: /usr/share/cmf/schema/scm_prepare_database.sh
• Tarball install: <tarball root>/share/cmf/schema/scm_prepare_database.sh
• Package install: /etc/cloudera-scm-server/db.mgmt.properties
• Tarball install: <tarball root>/etc/cloudera-scm-server/db.mgmt.properties
Return to (Optional) Manually Install the Oracle JDK, Cloudera Manager Agent, and CDH and Managed Service
Packages on page 108.
scm_prepare_database.sh Syntax
Note: You can also run scm_prepare_database.sh without options to see the syntax.
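The usage line is not included in this excerpt; reconstructed from the parameter and option descriptions that follow, the general form is:
scm_prepare_database.sh [options] <database-type> <database-name> <username> [<password>]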
Parameters:
• database-type - One of the supported database types: mysql (MySQL), oracle (Oracle), or postgresql (PostgreSQL).
• database-name - The name of the Cloudera Manager Server database to create or use.
• username - The username for the Cloudera Manager Server database to create or use.
• password - The password for the Cloudera Manager Server database to create or use. If you do not specify the password on the command line, the script prompts you to enter it.
Table 5: Options
• -h or --host - The IP address or hostname of the host where the database is installed. The default is to use the local host.
• -P or --port - The port number to use to connect to the database. The default port is 3306 for MySQL, 5432 for PostgreSQL, and 1521 for Oracle. This option is used for a remote connection only.
• -u or --user - The admin username for the database application. For -u, no space occurs between the option and the provided value. If this option is supplied, the script creates a user and database for the Cloudera Manager Server; otherwise, it uses the user and database you created previously.
• -p or --password - The admin password for the database application. The default is no password. For -p, no space occurs between the option and the provided value.
• --scm-host - The hostname where the Cloudera Manager Server is installed. Omit if the Cloudera Manager Server and the database are installed on the same host.
• --config-path - The path to the Cloudera Manager Server configuration files. The default is /etc/cloudera-scm-server.
• --schema-path - The path to the Cloudera Manager schema files. The default is /usr/share/cmf/schema (the location of the script).
mysql> grant all on *.* to 'temp'@'%' identified by 'temp' with grant option;
Query OK, 0 rows affected (0.00 sec)
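The invocation that follows this grant is not shown here; under the assumptions of this example (a temp admin user on a remote MySQL host, with an illustrative hostname and the Cloudera Manager database and user both named scm), it would look something like:
$ sudo /usr/share/cmf/schema/scm_prepare_database.sh mysql -h myhost2.example.com -utemp -ptemp scm scm <scm-password>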
Example 3: Running the script when PostgreSQL is co-located with the Cloudera Manager Server
This example assumes that you have already created the Cloudera Management Server database and database
user, naming both scm.
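The command itself is not reproduced here; under those assumptions, a minimal invocation would be along these lines:
$ sudo /usr/share/cmf/schema/scm_prepare_database.sh postgresql scm scm <scm-password>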
External Databases for Oozie Server, Sqoop Server, Activity Monitor, Reports Manager, Hive
Metastore Server, Sentry Server, Cloudera Navigator Audit Server, and Cloudera Navigator
Metadata Server
You can configure Cloudera Manager to use an external database for Oozie Server, Sqoop Server, Activity Monitor,
Reports Manager, Hive Metastore Server, Sentry Server, Cloudera Navigator Audit Server, and Cloudera Navigator
Metadata Server. If you choose this option, you must create the databases before you run the Cloudera Manager
installation wizard. For more information, see the instructions in Configuring External Databases for Oozie on
page 62, Configuring an External Database for Sqoop on page 64, MySQL Database on page 55, Oracle Database
on page 59, and External PostgreSQL Database on page 51.
If you are using Installation Path B - Manual Installation Using Cloudera Manager Packages on page 106 and you
want to use an embedded PostgreSQL database for the Cloudera Management Server, use this procedure to
install and start the database:
1. Install the embedded PostgreSQL database packages:
• Red Hat-compatible, if you have a yum repo configured:
$ sudo yum install cloudera-manager-server-db-2
• Red Hat-compatible, if you're transferring RPMs manually:
$ sudo yum --nogpgcheck localinstall cloudera-manager-server-db-2.noarch.rpm
• SLES:
$ sudo zypper install cloudera-manager-server-db-2
To find information about the PostgreSQL database account that the Cloudera Manager Server uses, read the
/etc/cloudera-scm-server/db.properties file:
# cat /etc/cloudera-scm-server/db.properties
# Auto-generated by scm_prepare_database.sh
#
# Sat Oct 1 12:19:15 PDT 201
#
com.cloudera.cmf.db.type=postgresql
com.cloudera.cmf.db.host=localhost:7432
com.cloudera.cmf.db.name=scm
com.cloudera.cmf.db.user=scm
com.cloudera.cmf.db.password=TXqEESuhj5
# cat /var/lib/cloudera-scm-server-db/data/generated_password.txt
MnPwGeWaip
2. On the host on which the Cloudera Manager Server is running, log into PostgreSQL as the root user:
postgres=#
postgres=# \l
                                List of databases
   Name    |    Owner     | Encoding | Collation  |   Ctype    |        Access privileges
-----------+--------------+----------+------------+------------+-----------------------------------
 ...
 template1 | cloudera-scm | UTF8     | en_US.UTF8 | en_US.UTF8 | =c/"cloudera-scm"
                                                               : "cloudera-scm"=CTc/"cloudera-scm"
(9 rows)
4. Set the password for an owner using the \password command. For example, to set the password for the
amon owner, do the following:
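A minimal sketch of that command at the psql prompt:

postgres=# \password amon
Enter new password:
Enter it again: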
Note:
• If you already have a PostgreSQL database set up, you can skip to the section Configuring and
Starting the PostgreSQL Server on page 52 to verify that your PostgreSQL configurations meet
the requirements for Cloudera Manager.
• Make sure that the data directory, which by default is /var/lib/postgresql/data/, is on a
partition that has sufficient free space.
export LANGUAGE=en_US.UTF-8
export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8
locale-gen en_US.UTF-8
dpkg-reconfigure locales
• SLES
Note: This command will install PostgreSQL 9.1. If you want to install a different version, you
can use zypper search postgresql to search for available versions. You should install
version 8.4 or higher.
• Debian/Ubuntu
then the host line specifying md5 authentication shown above must be inserted before this ident line.
Failure to do so may cause an authentication error when running the scm_prepare_database.sh script.
You can modify the contents of the md5 line shown above to support different configurations. For example,
if you want to access PostgreSQL from a different host, replace 127.0.0.1 with your IP address and update
postgresql.conf, which is typically found in the same place as pg_hba.conf, to include:
listen_addresses = '*'
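For reference, the md5 host line referred to above typically looks like the following in pg_hba.conf (shown as an assumption based on a standard PostgreSQL configuration; adjust the address to match your environment):

host    all    all    127.0.0.1/32    md5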
3. Configure settings to ensure your system performs as expected. Update these settings in the
/var/lib/pgsql/data/postgresql.conf or /var/lib/postgresql/data/postgresql.conf file. Settings
vary based on cluster size and resources as follows:
• Small to mid-sized clusters - Consider the following settings as starting points; a sample postgresql.conf snippet applying them appears after this list. If resources are limited, consider reducing the buffer sizes and checkpoint segments further. Ongoing tuning may be required based on each host's resource utilization. For example, if the Cloudera Manager Server is running on the same host as other roles, the following values may be acceptable:
– shared_buffers - 256MB
– wal_buffers - 8MB
– checkpoint_segments - 16
– checkpoint_completion_target - 0.9
• Large clusters - Can contain up to 1000 hosts. Consider the following settings as starting points.
– max_connections - For large clusters, each database is typically hosted on a different host. In general,
allow each database on a host 100 maximum connections and then add 50 extra connections. You
may have to increase the system resources available to PostgreSQL, as described at Connection
Settings.
– shared_buffers - 1024 MB. This requires that the operating system can allocate sufficient shared
memory. See PostgreSQL information on Managing Kernel Resources for more information on setting
kernel resources.
– wal_buffers - 16 MB. This value is derived from the shared_buffers value. Setting wal_buffers
to be approximately 3% of shared_buffers up to a maximum of approximately 16 MB is sufficient in
most cases.
– checkpoint_segments - 128. The PostgreSQL Tuning Guide recommends values between 32 and 256
for write-intensive systems, such as this one.
– checkpoint_completion_target - 0.9. This setting is only available in PostgreSQL versions 8.3 and
later, which are highly recommended.
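A minimal postgresql.conf sketch that applies the small to mid-sized cluster starting points listed above; treat these as starting values and tune them for your hosts:

# Starting points for a small to mid-sized cluster (see the list above)
shared_buffers = 256MB
wal_buffers = 8MB
checkpoint_segments = 16
checkpoint_completion_target = 0.9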
Creating Databases for Activity Monitor, Reports Manager, Hive Metastore Server, Sentry Server, Cloudera
Navigator Audit Server, and Cloudera Navigator Metadata Server
Create databases and user accounts for components that require databases:
• If you are not using the Cloudera Manager installer, the Cloudera Manager Server.
• Cloudera Management Service roles:
– Activity Monitor (if using the MapReduce service)
– Reports Manager
• Each Hive metastore
• Sentry Server
• Cloudera Navigator Audit Server
• Cloudera Navigator Metadata Server
You can create these databases on the host where the Cloudera Manager Server will run, or on any other hosts
in the cluster. For performance reasons, you should install each database on the host on which the service runs,
as determined by the roles you assign during installation or upgrade. In larger deployments or in cases where
database administrators are managing the databases the services use, you can separate databases from services,
but use caution.
The database must be configured to support UTF-8 character set encoding.
Record the values you enter for database names, user names, and passwords. The Cloudera Manager installation
wizard requires this information to correctly connect to these databases.
1. Connect to PostgreSQL:
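The connection command is not shown above; a typical invocation on the database host (a sketch, assuming the postgres OS user owns the server):

$ sudo -u postgres psql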
2. If you are not using the Cloudera Manager installer, create a database for the Cloudera Manager Server. The
database name, user name, and password can be any value. Record the names chosen because you will need
them later when running the scm_prepare_database.sh script.
3. Create databases for Activity Monitor, Reports Manager, Hive Metastore Server, Sentry Server, Cloudera
Navigator Audit Server, and Cloudera Navigator Metadata Server:
where user, password, and databaseName can be any value. The examples shown match the default names
provided in the Cloudera Manager configuration settings:
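The SQL itself is not reproduced here. A minimal sketch for one such database, using the Activity Monitor default name amon as an example (the password is a placeholder); repeat for each role, substituting its user, password, and database name:

postgres=# CREATE ROLE amon LOGIN PASSWORD 'amon_password';
postgres=# CREATE DATABASE amon OWNER amon ENCODING 'UTF8';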
MySQL Database
To use a MySQL database, follow these procedures.
Installing the MySQL Server
Note:
• If you already have a MySQL database set up, you can skip to the section Configuring and Starting
the MySQL Server on page 55 to verify that your MySQL configurations meet the requirements
for Cloudera Manager.
• It is important that the datadir directory, which, by default, is /var/lib/mysql, is on a partition
that has sufficient free space.
OS Command
RHEL $ sudo yum install mysql-server
Note: Some SLES systems encounter errors when using the zypper install command to install MySQL. For more information on resolving this issue, see the Novell Knowledgebase topic, error running chkconfig.
After issuing the command to install MySQL, you may need to confirm that you want to complete the
installation.
Configuring and Starting the MySQL Server
1. Determine the version of MySQL.
2. Stop the MySQL server if it is running.
OS Command
RHEL $ sudo service mysqld stop
• The default settings in the MySQL installations in most distributions use conservative buffer sizes and
memory usage. Cloudera Management Service roles need high write throughput because they might
insert many records in the database. Cloudera recommends that you set the innodb_flush_method
property to O_DIRECT.
• Set the max_connections property according to the size of your cluster:
– Small clusters (fewer than 50 hosts) - You can store more than one database (for example, both the
Activity Monitor and Service Monitor) on the same host. If you do this, you should:
– Put each database on its own storage volume.
– Allow 100 maximum connections for each database and then add 50 extra connections. For example,
for two databases, set the maximum connections to 250. If you store five databases on one host
(the databases for Cloudera Manager Server, Activity Monitor, Reports Manager, Cloudera Navigator,
and Hive metastore), set the maximum connections to 550.
– Large clusters (more than 50 hosts) - Do not store more than one database on the same host. Use a
separate host for each database/host pair. The hosts need not be reserved exclusively for databases,
but each database should be on a separate host.
• Binary logging is not a requirement for Cloudera Manager installations. Binary logging provides benefits
such as MySQL replication or point-in-time incremental recovery after database restore. Examples of
this configuration follow. For more information, see The Binary Log.
Here is an option file with Cloudera recommended settings:
[mysqld]
transaction-isolation = READ-COMMITTED
# Disabling symbolic-links is recommended to prevent assorted security risks;
# to do so, uncomment this line:
# symbolic-links = 0
key_buffer = 16M
key_buffer_size = 32M
max_allowed_packet = 32M
thread_stack = 256K
thread_cache_size = 64
query_cache_limit = 8M
query_cache_size = 64M
query_cache_type = 1
max_connections = 550
# For MySQL version 5.1.8 or later. Comment out binlog_format for older versions.
binlog_format = mixed
read_buffer_size = 2M
read_rnd_buffer_size = 16M
sort_buffer_size = 8M
join_buffer_size = 8M
# InnoDB settings
innodb_file_per_table = 1
innodb_flush_log_at_trx_commit = 2
innodb_log_buffer_size = 64M
innodb_buffer_pool_size = 4G
innodb_thread_concurrency = 8
innodb_flush_method = O_DIRECT
innodb_log_file_size = 512M
[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid
6. If AppArmor is running on the host where MySQL is installed, you might need to configure AppArmor to allow MySQL to write to the binary log.
7. Ensure the MySQL server starts at boot.
OS Command
RHEL $ sudo /sbin/chkconfig mysqld on
$ sudo /sbin/chkconfig --list mysqld
mysqld 0:off 1:off 2:on 3:on 4:on 5:on 6:off
OS Command
RHEL $ sudo service mysqld start
9. Set the MySQL root password. In the following example, the current root password is blank. Press the Enter
key when you're prompted for the root password.
$ sudo /usr/bin/mysql_secure_installation
[...]
Enter current password for root (enter for none):
OK, successfully used password, moving on...
[...]
Set root password? [Y/n] y
New password:
Re-enter new password:
Remove anonymous users? [Y/n] Y
[...]
Disallow root login remotely? [Y/n] N
[...]
Remove test database and access to it [Y/n] Y
[...]
Reload privilege tables now? [Y/n] Y
All done!
Note: If you already have the JDBC driver installed on the hosts that need it, you can skip this section.
However, MySQL 5.6 requires a driver version 5.1.26 or higher.
Cloudera recommends that you assign all roles that require databases on the same host and install the driver
on that host. Locating all such roles on the same host is recommended but not required. If you install a role,
such as Activity Monitor, on one host and other roles on a separate host, you would install the JDBC driver on
each host running roles that access the database.
OS Command
RHEL 5 or 6 1. Download the MySQL JDBC driver from
http://www.mysql.com/downloads/connector/j/5.1.html.
2. Extract the JDBC driver JAR file from the downloaded file. For example:
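A sketch of the extract command, assuming the tar.gz distribution and the driver version shown in the copy step below:

$ tar zxvf mysql-connector-java-5.1.31.tar.gz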
3. Copy the JDBC driver, renamed, to the relevant host. For example:
$ sudo cp
mysql-connector-java-5.1.31/mysql-connector-java-5.1.31-bin.jar
/usr/share/java/mysql-connector-java.jar
If the target directory does not yet exist on this host, you can create
it before copying the JAR file. For example:
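A minimal sketch of that command:

$ sudo mkdir -p /usr/share/java/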
$ mysql -u root -p
Enter password:
2. Create databases for the Activity Monitor, Reports Manager, Hive Metastore Server, Sentry Server, Cloudera
Navigator Audit Server, and Cloudera Navigator Metadata Server:
database, user, and password can be any value. The examples match the default names provided in the
Cloudera Manager configuration settings:
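The SQL is not reproduced here. A minimal sketch for one database, using the Activity Monitor default name amon as an example (the password is a placeholder); repeat for each role, substituting its database, user, and password:

mysql> create database amon DEFAULT CHARACTER SET utf8;
Query OK, 1 row affected (0.00 sec)

mysql> grant all on amon.* TO 'amon'@'%' IDENTIFIED BY 'amon_password';
Query OK, 0 rows affected (0.00 sec)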
Oracle Database
To use an Oracle database, follow these procedures.
Collecting Oracle Database Information
To configure Cloudera Manager to work with Oracle databases, get the following information from your Oracle
DBA:
• Host name - The DNS name or the IP address of the host where the Oracle database is installed.
• SID - The name of the database that will store Cloudera Manager information.
• Username - A username for each schema that is storing information. You could have four unique usernames for the four schemas.
• Password - A password corresponding to each user name.
Configuring the Oracle Server
For example, if a host has two databases, you anticipate 250 maximum connections. If you anticipate a maximum
of 250 connections, plan for 280 sessions.
Once you know the number of sessions, you can determine the number of anticipated transactions using the
following formula:
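The formulas are not reproduced above; reconstructed from the worked numbers in this section (250 connections plan for 280 sessions, and 280 sessions plan for 308 transactions), they are:

sessions = (1.1 * maximum connections) + 5
transactions = 1.1 * sessions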
Continuing with the previous example, if you anticipate 280 sessions, you can plan for 308 transactions.
Work with your Oracle database administrator to apply these derived values to your system.
Using the sample values above, Oracle attributes would be set as follows:
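The ALTER SYSTEM statements themselves are omitted above. A sketch of how the derived values might be applied (the parameter names and scope are assumptions to confirm with your Oracle DBA):

SQL> alter system set processes=250 scope=spfile;
SQL> alter system set sessions=280 scope=spfile;
SQL> alter system set transactions=308 scope=spfile;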
2. Copy the appropriate JDBC JAR file to /usr/share/java/oracle-connector-java.jar for use with the
Cloudera Manager databases (for example, for the Activity Monitor, and so on), and for use with Hive.
Creating Databases for the Cloudera Manager Server, Activity Monitor, Reports Manager, Hive Metastore Server,
Sentry Server, Cloudera Navigator Audit Server, and Cloudera Navigator Metadata Server
Create databases and user accounts for components that require databases:
• If you are not using the Cloudera Manager installer, the Cloudera Manager Server.
• Cloudera Management Service roles:
– Activity Monitor (if using the MapReduce service)
– Reports Manager
• Each Hive metastore
• Sentry Server
• Cloudera Navigator Audit Server
• Cloudera Navigator Metadata Server
You can create these databases on the host where the Cloudera Manager Server will run, or on any other hosts
in the cluster. For performance reasons, you should install each database on the host on which the service runs,
as determined by the roles you assign during installation or upgrade. In larger deployments or in cases where
database administrators are managing the databases the services use, you can separate databases from services,
but use caution.
The database must be configured to support UTF-8 character set encoding.
Record the values you enter for database names, user names, and passwords. The Cloudera Manager installation
wizard requires this information to correctly connect to these databases.
1. Log into the Oracle client:
sqlplus system@localhost
Enter password: ******
where username and password are the credentials you specified in Preparing a Cloudera Manager Server
External Database on page 46.
3. Grant a space quota on the tablespace (the default tablespace is SYSTEM) where tables will be created:
4. Create databases for Activity Monitor, Reports Manager, Hive Metastore Server, Sentry Server, Cloudera Navigator Audit Server, and Cloudera Navigator Metadata Server. The database, user, and password can be any value; the examples match the default names provided in the Cloudera Manager configuration settings:
5. For each user in the table in the preceding step, create the user and grant it the required privileges:
6. Grant a space quota on the tablespace (the default tablespace is SYSTEM) where tables will be created:
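The SQL for steps 3 through 6 is not reproduced here. A minimal sketch for one user, with amon as an example; the password, tablespace, quota, and privilege shown are assumptions to adapt with your Oracle DBA:

SQL> create user amon identified by amon_password default tablespace users temporary tablespace temp;
SQL> grant create session to amon;
SQL> alter user amon quota 100m on users;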
For further information about Oracle privileges, see Authorization: Privileges, Roles, Profiles, and Resource
Limitations.
Return to Establish Your Cloudera Manager Repository Strategy on page 106.
$ psql -U postgres
Password for user postgres: *****
CREATE ROLE
postgres=# \q
MySQL
$ mysql -u root -p
Enter password: ******
mysql> exit
Bye
Note: You must manually download the MySQL JDBC driver JAR file.
Oracle
$ sqlplus system@localhost
SQL> create user oozie identified by oozie default tablespace users temporary tablespace
temp;
User created.
SQL> exit
Important:
Do not make the following grant:
Note: You must manually download the Oracle JDBC driver JAR file.
Note:
There is currently no recommended way to migrate data from an existing Derby database into the
new PostgreSQL database.
Use the procedure that follows to configure Sqoop 2 to use PostgreSQL instead of Apache Derby.
$ psql -U postgres
Password for user postgres: *****
postgres=# \q
Required Role:
1. Go to the Sqoop service.
2. Click the Configuration tab.
3. Select Scope > Sqoop 2 Server.
4. Select Category > Database.
5. Set the following properties:
• Sqoop Repository Database Type - postgresql
• Sqoop Repository Database Host - the hostname on which you installed the PostgreSQL server. If the
port is non-default for your database type, use host:port notation.
• Sqoop Repository Database Name, User, Password - the properties you specified in Create the Sqoop
User and Sqoop Database on page 65.
6. Click Save Changes to commit the changes.
7. Restart the service.
Backing Up Databases
Cloudera recommends that you schedule regular backups of the databases that Cloudera Manager uses to store
configuration, monitoring, and reporting data and for managed services that require a database:
• Cloudera Manager - Contains all the information about services you have configured and their role
assignments, all configuration history, commands, users, and running processes. This relatively small database
(<100 MB) is the most important to back up.
Important: When processes restart, the configuration for each of the services is redeployed using
information that is saved in the Cloudera Manager database. If this information is not available,
your cluster will not start or function correctly. You must therefore schedule and maintain regular
backups of the Cloudera Manager database in order to recover the cluster in the event of the loss
of this database. See Backing Up Databases on page 65.
• Oozie Server - Contains Oozie workflow, coordinator, and bundle data. Can grow very large.
• Sqoop Server - Contains entities such as the connector, driver, links and jobs. Relatively small.
• Activity Monitor - Contains information about past activities. In large clusters, this database can grow large.
Configuring an Activity Monitor database is only necessary if a MapReduce service is deployed.
• Reports Manager - Tracks disk utilization and processing activities over time. Medium-sized.
• Hive Metastore Server - Contains Hive metadata. Relatively small.
• Sentry Server - Contains authorization metadata. Relatively small.
• Cloudera Navigator Audit Server - Contains auditing information. In large clusters, this database can grow
large.
• Cloudera Navigator Metadata Server - Contains authorization, policies, and audit report metadata. Relatively
small.
Backing Up PostgreSQL Databases
To back up a PostgreSQL database, use the same procedure whether the database is embedded or external:
1. Log in to the host where the Cloudera Manager Server is installed.
2. Get the name, user, and password properties for the Cloudera Manager database from
/etc/cloudera-scm-server/db.properties:
com.cloudera.cmf.db.name=scm
com.cloudera.cmf.db.user=scm
com.cloudera.cmf.db.password=NnYfWIjlbk
3. Run the following command as root using the parameters from the preceding step:
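The command itself is not shown above. A sketch using pg_dump with the embedded database's default port 7432 and the scm database properties listed in the preceding step (adjust host, port, and database name for an external server); pg_dump prompts for the password from db.properties:

# pg_dump -h localhost -p 7432 -U scm scm > /tmp/scm_server_db_backup.$(date +%Y%m%d)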
For example, to back up the Activity Monitor database amon created in Creating Databases for Activity Monitor,
Reports Manager, Hive Metastore Server, Sentry Server, Cloudera Navigator Audit Server, and Cloudera Navigator
Metadata Server on page 58, on the local host as the root user, with the password amon_password:
To back up the sample Activity Monitor database amon on remote host myhost.example.com as the root user,
with the password amon_password:
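Sketches of those two commands; the amon database, user, and password are the sample values named above, and port 7432 assumes the embedded PostgreSQL server:

# pg_dump -h localhost -p 7432 -U amon amon > /tmp/amon_db_backup.$(date +%Y%m%d)

# pg_dump -h myhost.example.com -p 7432 -U amon amon > /tmp/amon_db_backup.$(date +%Y%m%d)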
The default value is small, so you should examine disk usage after several days of activity to determine how much space is needed. The Charts Library tab on the Cloudera Management Service page shows the current
disk space consumed and its rate of growth, categorized by the type of data stored. For example, you can compare
the space consumed by raw metric data to daily summaries of that data.
Viewing Host and Service Monitor Data Storage
The Cloudera Management Service page shows the current disk space consumed and its rate of growth,
categorized by the type of data stored. For example, you can compare the space consumed by raw metric data
to daily summaries of that data:
1. Select Clusters > Cloudera Management Service.
2. Click the Charts Library tab.
Data Granularity and Time-Series Metric Data
The Service Monitor and Host Monitor store time-series metric data in a variety of ways. When the data is
received, it is written as-is to the metric store. Over time, the raw data is summarized to and stored at various
data granularities. For example, after ten minutes, a summary point is written containing the average of the
metric over the period as well as the minimum, the maximum, the standard deviation, and a variety of other
statistics. This process continues, producing hourly, six-hourly, daily, and weekly summaries. This data
summarization procedure applies only to metric data. When the Impala query and YARN application monitoring
storage limit is reached, the oldest stored records are deleted.
The Service Monitor and Host Monitor internally manage the amount of overall storage space dedicated to each
data granularity level. When the limit for a level is reached, the oldest data points at that level are deleted. Metric
data for that time period remains available at the lower granularity levels. For example, when an hourly point
for a particular time is deleted to free up space, a daily point still exists covering that hour. Because each of
these data granularities consumes significantly less storage than the previous summary level, lower granularity
levels can be retained for longer periods of time. With the recommended amount of storage, weekly points can
often be retained indefinitely.
Some features, such as detailed display of health results, depend on the presence of raw data. Health history is maintained by the event store, subject to its retention policies.
Moving Monitoring Data on an Active Cluster
You can change where monitoring data is stored on a cluster.
ln -s /data/2/impala_data /data/1/service_monitor/impala
Required Recommended
Java Heap Size 256 MB 512 MB
Non-Java Memory 768 MB 1.5 GB
Required Recommended
Java Heap Size 1 GB 2 GB
Non-Java Memory 2 GB 4 GB
Required Recommended
Java Heap Size 2 GB 4 GB
Non-Java Memory 6 GB 12 GB
• The cluster experiences data loss due to filling storage locations to 100% of capacity. The resulting damage
from such an event can impact many other components.
There is a main theme here: you need to architect your data storage needs well in advance. You need to inform
your operations staff about your critical data storage locations for each host so that they can provision your
infrastructure adequately and back it up appropriately. Make sure to document the discovered requirements in
your build documentation and run books.
This topic describes both local disk storage and RDBMS storage, and these types of storage are labeled within
the discussions. This distinction is made both for storage planning and also to inform migration of roles from
one host to another, preparing backups, and other lifecycle management events. Note that storage locations
and types have changed for some roles between Cloudera Manager versions 4 and 5.
The following tables provide details about each individual Cloudera Management service with the goal of enabling
Cloudera Manager Administrators to make appropriate storage and lifecycle planning decisions.
Cloudera Manager Server
Storage Configuration Defaults, Minimum, or Maximum: There are no direct storage defaults relevant to this entity.
Where to Control Data Retention or Size: The size of the Cloudera Manager Server database varies depending on the number of managed hosts and the number of discrete commands that have been run in the cluster. To configure the size of the retained command results in the Cloudera Manager Administration Console, select Administration > Settings and edit the following property:
Command Eviction Age
Length of time after which inactive commands are evicted from the database.
Default is two years.
Sizing, Planning & Best Practices: The Cloudera Manager Server database is the most vital configuration store in a Cloudera Manager deployment. This database holds the configuration for clusters, services, roles, and other necessary information that defines a deployment of Cloudera Manager and its managed hosts. You should perform regular, verified, remotely-stored backups of the Cloudera Manager Server database.
Sizing, Planning, and Best Practices: The Activity Monitor only monitors MapReduce jobs, and does not monitor YARN applications. If you no longer use MapReduce (MRv1) in your cluster, the Activity Monitor is not required for Cloudera Manager 5 (or later) or CDH 5 (or later).
The amount of storage space needed for 14 days' worth of MapReduce activities can vary greatly and depends directly on the size of your cluster and the level of activity that uses MapReduce. It may be necessary to adjust and readjust the amount of storage as you determine the "stable state" and "burst state" of the MapReduce activity in your cluster.
For example, consider the following test cluster and usage:
• A simulated 1000-host cluster, each host with 32 slots
• Synthetic MapReduce jobs with 200 attempts (tasks) per activity (job)
Where to Control Data Retention or Size: Service Monitor data growth is controlled by configuring the maximum amount of storage space it may use.
To configure data retention in Cloudera Manager Administration Console:
1. Go the Cloudera Management Service.
2. Click the Configuration tab.
3. Select Scope > Service Monitor or Cloudera Management Service
(Service-Wide).
4. Select Category > Main.
5. Locate the propertyName property or search for it by typing its name
in the Search box.
Time-Series Storage
The approximate amount of disk space dedicated to storing time series
and health data. When the store has reached its maximum size, it
deletes older data to make room for newer data. The disk usage is
approximate because the store only begins deleting data once it
reaches the limit.
Note that Cloudera Manager stores time-series data at a number of
different data granularities, and these granularities have different
effective retention periods. The Service Monitor stores metric data not
only as raw data points but also as ten-minute, hourly, six-hourly,
daily, and weekly summary data points. Raw data consumes the bulk
of the allocated storage space and weekly summaries consume the
least. Raw data is retained for the shortest amount of time while
weekly summary points are unlikely to ever be deleted.
Select Cloudera Management Service > Charts Library tab in Cloudera
Manager for information about how space is consumed within the
Service Monitor. These pre-built charts also show information about
Sizing, Planning, and Best Practices: The Service Monitor gathers metrics about configured roles and services
in your cluster and also runs active health tests. These health tests run
regardless of idle and use periods, because they are always relevant. The
Service Monitor gathers metrics and health test results regardless of the
level of activity in the cluster. This data continues to grow, even in an idle
cluster.
Sizing, Planning, and Best Practices: The Host Monitor gathers metrics about host-level items of interest (for example, disk space usage, RAM, CPU usage, and swapping) and also
informs host health tests. The Host Monitor gathers metrics and health
test results regardless of the level of activity in the cluster. This data
continues to grow fairly linearly, even in an idle cluster.
Sizing, Planning, and Best Practices: The Event Server is a managed Lucene index that collects relevant events that happen within your cluster, such as results of health tests and log events that are created when a log entry matches a set of rules for identifying messages of interest, and makes them available for searching, filtering, and additional action. You can view and filter events on the Diagnostics > Events tab of the Cloudera Manager Administration Console. You can also poll this data using the Cloudera Manager API.
Where to Control Data Retention or Minimum/Maximum: The Reports Manager uses space in two main locations: one local on the host where Reports Manager runs, and the other in the RDBMS provided to it for its historical aggregation. The RDBMS is not required to be on the same host where the Reports Manager runs.
Sizing, Planning, and Best Practices: Reports Manager downloads the fsimage from the NameNode every 60 minutes (by default) and stores it locally to perform operations against, including indexing the HDFS filesystem structure represented in the fsimage. A larger fsimage, or deeper and more complex paths within HDFS, consumes more disk space.
Reports Manager has no control over the size of the fsimage. If your total
HDFS usage trends upward notably or you add excessively long paths in
HDFS, it may be necessary to revisit and adjust the amount of space
allocated to the Reports Manager for its local storage. Periodically monitor,
review and readjust the local storage allocation.
Cloudera Navigator
Sizing, Planning, and Best Practices: The size of the Navigator Audit Server database directly depends on the number of audit events the cluster's audited services generate. Normally the volume of HDFS audits exceeds the volume of other audits (all other components, such as MRv1, Hive, and Impala, read from HDFS, which generates additional audit events).
The average size of a discrete HDFS audit event is ~1 KB. For a busy cluster
of 50 hosts with ~100K audit events generated per hour, the Navigator
Audit Server database would consume ~2.5 GB per day. To retain 90 days
of audits at that level, plan for a database size of around 250 GB. If other
configured cluster services generate roughly the same amount of data
as the HDFS audits, plan for the Navigator Audit Server database to require
around 500 GB of storage for 90 days of data.
Notes:
• Individual Hive and Impala queries themselves can be very large. Since
the query itself is part of an audit event, such audit events consume
space in proportion to the length of the query.
• The amount of space required increases as activity on the cluster
increases. In some cases, Navigator Audit Server databases can exceed
1TB for 90 days of audit events. Benchmark your cluster periodically
and adjust accordingly.
Parcel Cache (/opt/cloudera/parcel-cache): Managed hosts running a Cloudera Manager Agent stage distributed parcels into this path (as .parcel files, unextracted). Do not manually manipulate this directory or its files.
Sizing and Planning: Provide sufficient space per host to hold all the parcels you distribute to each host.
Host Parcel Directory (/opt/cloudera/parcels): Managed cluster hosts running a Cloudera Manager Agent extract parcels from the /opt/cloudera/parcel-cache directory into this path upon parcel activation. Many critical system symlinks point to files in this path and you should never manually manipulate its contents.
Sizing and Planning: Provide sufficient space on each host to hold all the parcels you distribute to each host. Be aware that the typical CDH parcel size is slightly larger than 1 GB per parcel. If you maintain various versions of parcels staged before and after upgrading, be aware of the disk space implications.
You can configure Cloudera Manager to automatically remove older parcels
once they are no longer in use. As an administrator you can always
manually delete parcel versions not in use, but configuring these settings
can handle the deletion automatically, in case you forget.
To configure this behavior in the Cloudera Manager Administration Console,
select Administration > Settings > Parcels and configure the following
property:
Automatically Remove Old Parcels
This parameter controls whether parcels for old versions of an
activated product should be removed from a cluster when they are
no longer in use.
The default value is Disabled.
Number of Old Parcel Versions to Retain
If you enable Automatically Remove Old Parcels, this setting
specifies the number of old parcels to keep. Any old parcels beyond
this value are removed. If this property is set to zero, no old parcels
are retained.
The default value is 3.
Task Description
Activity Monitor (One-time): The Activity Monitor only works against a MapReduce (MR1) service, not
YARN. So if your deployment has fully migrated to YARN and no longer
uses a MapReduce (MR1) service, your Activity Monitor database is no
longer growing. If you have waited longer than the default Activity Monitor
retention period (14 days) to address this point, then the Activity Monitor
has already purged it all for you and your database is mostly empty. If
your deployment meets these conditions, consider cleaning up by dropping
the Activity Monitor database (again, only when you are satisfied that you
no longer need the data or have confirmed that it is no longer in use).
Service Monitor and Host Monitor (One-time): For those who used Cloudera Manager version 4.x and have now upgraded to version 5.x: In your pre-upgrade planning, you likely saw a warning in the Upgrade Guide advising that Cloudera Manager did this migration
work for you automatically. The Service Monitor and Host Monitor are
migrated from their previously-configured RDBMS into a dedicated time
series store used solely by each of these roles respectively. After this
happens, there is still legacy database connection information in the
configuration for these roles. This was used to allow for the initial
migration but is no longer being used for any active work.
After the above migration has taken place, the RDBMS databases
previously used by the Service Monitor and Host Monitor are no longer
used. Space occupied by these databases is now recoverable. If appropriate
in your environment (and you are satisfied that you have long-term
backups or do not need the data on disk any longer), you can drop those
databases using the documented recommendations.
Ongoing Space Reclamation: Cloudera Management Services are automatically rolling up, purging, or
otherwise consolidating aged data for you in the background. Configure
retention and purging limits per-role to control how and when this occurs.
These configurations are discussed per-entity above. Adjust the default
configurations to meet your space limitations or retention needs.
Conclusion
Keep this information in mind for planning and architecting the deployment of a cluster managed by Cloudera
Manager. If you already have a live cluster, this lifecycle and backup information can help you keep critical monitoring, auditing, and metadata sources safe and properly backed up.
Parcels
Required Role:
A parcel is a binary distribution format containing the program files, along with additional metadata used by
Cloudera Manager. There are a few notable differences between parcels and packages:
• Parcels are self-contained and installed in a versioned directory, which means that multiple versions of a
given parcel can be installed side-by-side. You can then designate one of these installed versions as the
active one. With packages, only one package can be installed at a time so there's no distinction between
what's installed and what's active.
• Parcels can be installed at any location in the filesystem and by default are installed in
/opt/cloudera/parcels. In contrast, packages are installed in /usr/lib.
Parcels are available for CDH 4.1.3 or later, and for Impala, Search, Spark, Accumulo, Kafka, Key Trustee KMS, and
Sqoop Connectors.
Advantages of Parcels
As a consequence of their unique properties, parcels offer a number of advantages over packages:
• CDH is distributed as a single object - In contrast to having a separate package for each part of CDH, when
using parcels there is just a single object to install. This is especially useful when managing a cluster that
isn't connected to the Internet.
• Internal consistency - All CDH components are matched so there isn't a danger of different parts coming
from different versions of CDH.
• Installation outside of /usr - In some environments, Hadoop administrators do not have privileges to install
system packages. In the past, these administrators had to fall back to CDH tarballs, which deprived them of
a lot of infrastructure that packages provide. With parcels, administrators can install to /opt or anywhere
else without having to step through all the additional manual steps of regular tarballs.
Note: With parcel software distribution, the path to the CDH libraries is
/opt/cloudera/parcels/CDH/lib instead of the usual /usr/lib. You should not link /usr/lib/
elements to parcel deployed paths, as such links may confuse scripts that distinguish between
the two paths.
• Installation of CDH without sudo - Parcel installation is handled by the Cloudera Manager Agent running as
root so it's possible to install CDH without needing sudo.
• Decouples distribution from activation - Due to side-by-side install capabilities, it is possible to stage a new
version of CDH across the cluster in advance of switching over to it. This allows the longest running part of
an upgrade to be done ahead of time without affecting cluster operations, consequently reducing the downtime
associated with upgrade.
• Rolling upgrades - These are only possible with parcels, due to their side-by-side nature. Packages require
shutting down the old process, upgrading the package, and then starting the new process. This can be hard
to recover from in the event of errors and requires extensive integration with the package management
system to function seamlessly. When a new version is staged side-by-side, switching to a new minor version
is simply a matter of changing which version of CDH is used when restarting each process. It then becomes
practical to do upgrades with rolling restarts, where service roles are restarted in the right order to switch
over to the new version with minimal service interruption. Your cluster can continue to run on the existing
installed components while you stage a new version across your cluster, without impacting your current
operations. Note that major version upgrades (for example, CDH 4 to CDH 5) require full service restarts due
to the substantial changes between the versions. Finally, you can upgrade individual parcels, or multiple
parcels at the same time.
• Easy downgrades - Reverting back to an older minor version can be as simple as upgrading. Note that some
CDH components may require explicit additional steps due to schema upgrades.
• Upgrade management - Cloudera Manager can fully manage all the steps involved in a CDH version upgrade.
In contrast, with packages, Cloudera Manager can only help with initial installation.
• Distributing additional components - Parcels are not limited to CDH. Cloudera Impala, Cloudera Search, LZO,
and add-on service parcels are also available.
• Compatibility with other distribution tools - If there are specific reasons to use other tools for download
and/or distribution, you can do so, and Cloudera Manager will work alongside your other tools. For example,
you can handle distribution with Puppet. Or, you can download the parcel to Cloudera Manager Server manually
(perhaps because your cluster has no Internet connectivity) and then have Cloudera Manager distribute the
parcel to the cluster.
Parcel Life Cycle
To enable upgrades and additions with minimal disruption, parcels participate in the following phases: download,
distribute, unpack, activate, deactivate, remove, and delete.
• Downloading a parcel copies the appropriate software to a local parcel repository on the Cloudera Manager
Server, where it is available for distribution to the other hosts in any of your clusters managed by this Cloudera
Manager Server. You can have multiple parcels for a given product downloaded to your Cloudera Manager
Server. Once a parcel has been downloaded to the Server, it will be available for distribution on all clusters
managed by the Server. A downloaded parcel will appear in the cluster-specific section for every cluster
managed by this Cloudera Manager Server.
• Distributing a parcel copies the parcel to the member hosts of a cluster. Distributing a parcel does not actually
upgrade the components running on your cluster; the current services continue to run unchanged. You can
have multiple parcels distributed on your cluster.
Note: The distribute process does not require Internet access; rather the Cloudera Manager Agent
on each cluster member downloads the parcels from the local parcel repository on the Cloudera
Manager Server.
• Unpacking a parcel extracts the files contained in the parcel archive file.
• Activating a parcel causes Cloudera Manager to link to the new components, ready to run the new version
upon the next restart. Activation does not automatically stop the current services or perform a restart —
you have the option to restart the service(s) after activation, or you can allow the system administrator to
determine the appropriate time to perform those operations.
• Deactivating a parcel causes Cloudera Manager to unlink from the parcel components. A parcel cannot be
deactivated while it is still in use on one or more hosts.
• Removing a parcel causes Cloudera Manager to remove the parcel components from the hosts.
• Deleting a parcel causes Cloudera Manager to remove the parcel components from the local parcel repository.
Cloudera Manager detects when new parcels are available. The parcel indicator in the Admin Console navigation bar indicates how many parcels are eligible for downloading or distribution. For example, CDH parcels older than the active one do not contribute to the count if you are already using the latest version. If no parcels are
eligible, or if all parcels have been activated, then the indicator will not have a number badge. You can configure
Cloudera Manager to download and distribute parcels automatically, if desired.
Important: If you plan to upgrade CDH you should follow the instructions in Upgrading CDH and
Managed Services Using Cloudera Manager on page 479 because steps in addition to activating the
parcel must be performed in order to successfully upgrade.
Parcel Locations
The default location for the local parcel directory on the Cloudera Manager Server host is
/opt/cloudera/parcel-repo. To change this location, follow the instructions in Configuring Cloudera Manager
Server Parcel Settings on page 88.
The default location for the distributed parcels on the managed hosts is /opt/cloudera/parcels. To change this location, set the parcel_dir property in the /etc/cloudera-scm-agent/config.ini file of the Cloudera Manager Agent and restart the Cloudera Manager Agent, or follow the instructions in Configuring the Host Parcel Directory on page 89.
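A minimal sketch of that setting in /etc/cloudera-scm-agent/config.ini, followed by the agent restart (the path shown is the default; substitute your own location):

parcel_dir=/opt/cloudera/parcels

$ sudo service cloudera-scm-agent restart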
Note: With parcel software distribution, the path to the CDH libraries is
/opt/cloudera/parcels/CDH/lib instead of the usual /usr/lib. You should not link /usr/lib/
elements to parcel deployed paths, as such links may confuse scripts that distinguish between the
two paths.
Managing Parcels
Through the Parcels page in Cloudera Manager, you can manage parcel installation and activation and determine
what parcel versions are running across your clusters. The Parcels page displays a list of parcels managed by
Cloudera Manager. Cloudera Manager displays the name, version, and status of each parcel and provides actions
on the parcel.
The Version column displays version information about the parcel. Click the icon to view the release notes for the parcel.
The Actions column contains buttons you can click to perform actions on the parcels such as download, distribute,
delete, deactivate, and remove from host.
Downloading a Parcel
1. Go to the Parcels page. Parcels that are available for download display the Available Remotely status and a
Download button.
If the parcel you want is not shown here — for example, you want to upgrade to a version of CDH that is not
the most current version — you can make additional remote parcel repositories available. You can also
configure the location of the local parcel repository and other settings. See Parcel Configuration Settings on
page 88.
If a parcel version is too new to be supported by the Cloudera Manager version, the parcel appears with a
red background and error message:
Such parcels are also listed when you select the Error status in the Error Status section of the Filters selector.
2. Click the Download button for the parcel you want to download. This initiates the download of the parcel from the
remote parcel repository to your local repository. The status changes to Downloading.
After a parcel has been downloaded, the parcel is removed from the Available Remotely page.
Note: The parcel download is done at the Cloudera Manager Server, so with multiple clusters, the
downloaded parcels are shown as available to all clusters managed by the Cloudera Manager Server.
However, distribution (to a specific cluster's hosts) must be selected on a cluster-by-cluster basis.
Distributing a Parcel
Parcels that have been downloaded can be distributed to the hosts in your cluster, available for activation.
From the Parcels page, choose ClusterName or All Clusters in the Location selector, and click the Distribute
button for the parcel you want to distribute. The status changes to Distributing. During distribution, you can
click the Details link in the Status column to view the Parcel Distribution Status page. Click the Cancel button
to cancel the distribution. When the Distribute action completes, the button changes to Activate and you can
click the Distributed status link to view the status page.
Distribution does not require Internet access; rather the Cloudera Manager Agent on each cluster member
downloads the parcel from the local parcel repository hosted on the Cloudera Manager Server.
If you have a large number of hosts to which the parcels should be distributed, you can control how many
concurrent uploads Cloudera Manager will perform. See Parcel Configuration Settings on page 88.
To delete a parcel that is ready to be distributed, click the triangle at the right end of the Distribute button and
select Delete. This action deletes the downloaded parcel from the local parcel repository.
Distributing parcels to the hosts in the cluster does not affect the current running services.
Activating a Parcel
Parcels that have been distributed to the hosts in a cluster are ready to be activated.
1. From the Parcels page, choose ClusterName or All Clusters in the Location selector, and click the Activate
button for the parcel you want to activate. This action updates Cloudera Manager to point to the new software,
ready to be run the next time a service is restarted.
2. A pop-up warns you that your currently running process will not be affected until you restart, and gives you
the option to perform a restart. If you do not want to restart at this time, click Close.
If you elect not to restart services as part of the Activation process, you can instead go to the Clusters tab and
restart your services at a later time. Until you restart services, the current software continues to run. This allows
you to restart your services at a time that is convenient based on your maintenance schedules or other
considerations.
Activating a new parcel also deactivates the previously active parcel (if any) for the product you have just
upgraded. However, until you restart the services, the previously active parcel displays a status of Still in use
and you cannot remove the parcel until it is no longer being used.
Note: Under some situations, additional upgrade steps may be necessary. In this case, instead of
Activate, the button will say Upgrade. This indicates that there are additional steps required to use
the parcel. When you click the Upgrade button, the upgrade wizard starts. See Upgrading CDH and
Managed Services Using Cloudera Manager on page 479.
Deactivating a Parcel
You can deactivate an active parcel; this will update Cloudera Manager to point to the previous software version,
ready to be run the next time a service is restarted. From the Parcels page, choose ClusterName or All Clusters
in the Location selector, and click the Deactivate button on an activated parcel.
To use the previous version of the software, restart your services.
Note: If you did your original installation from parcels, and there is only one version of your software
installed (that is, no packages, and no previous parcels have been activated and started) when you
attempt to restart after deactivating the current version, your roles will be stopped but will not be
able to restart.
Removing a Parcel
From the Parcels page, choose ClusterName or All Clusters in the Location selector, click the triangle to the right of the Activate button, and select Remove from Hosts.
Deleting a Parcel
From the Parcels page, choose ClusterName or All Clusters in the Location selector, click the triangle to the right of the Distribute button, and select Delete.
Troubleshooting
If you experience an error while performing parcel operations, click on the red 'X' icons on the parcel page to
display a message that will identify the source of the error.
If you have a parcel that is distributing but never completes, make sure you have enough free space in the parcel download directories, as Cloudera Manager will retry downloading and unpacking parcels even if there is insufficient space.
Viewing Parcel Usage
The Parcel Usage page shows you which parcels are in current use in your clusters. This is particularly useful in
a large deployment where it may be difficult to keep track of what versions are installed across the cluster,
especially if some hosts were not available when you performed an installation or upgrade, or were added later.
To display the Parcel Usage page:
1. Do one of the following:
• Click the parcel indicator in the top navigation bar.
• Click Hosts in the top navigation bar and click the Parcels tab.
2. Click the Parcel Usage button.
This page only shows the usage of parcels, not components that were installed as packages. If you select a
cluster running packages (for example, a CDH 4 cluster) the cluster is not displayed, and instead you will see a
message indicating the cluster is not running parcels. If you have individual hosts running components installed
as packages, they will appear as "empty."
You can also view just the hosts running only the active parcels, or just hosts running older parcels (not the
currently active parcels) or both.
The "host map" at the right shows each host in the cluster with the status of the parcels on that host. If the
host is actually running the processes from the currently activated parcels, the host is indicated in blue. A black
square indicates that a parcel has been activated, but that all the running processes are from an earlier version
of the software. This can happen, for example, if you have not restarted a service or role after activating a new
parcel.
Move the cursor over the icon to see the rack to which the hosts are assigned. Hosts on different racks are
displayed in separate rows.
To view the exact versions of the software running on a given host, you can click on the square representing the
host. This pops up a display showing the parcel versions installed on that host.
For CDH 4.4, Impala 1.1.1, and Solr 0.9.3 or later, it will list the roles running on the selected host that are part
of the listed parcel. Clicking a role takes you to the Cloudera Manager page for that role. It also shows whether
the parcel is Active or not.
If a host is running a mix of software versions, the square representing the host is shown by a four-square icon. When you move the cursor over that host, both the active and inactive components are shown. For example,
in the image below the older CDH parcel has been deactivated but only the HDFS service has been restarted.
2. Specify a property:
• Local Parcel Repository Path defines the path on the Cloudera Manager Server host where downloaded
parcels are stored.
• Remote Parcel Repository URLs is a list of repositories that Cloudera Manager should check for parcels.
Initially this points to the latest released CDH 4, CDH 5, Impala, and Solr repositories but you can add your
own repository locations to the list. You can use this mechanism to add Cloudera repositories that are
not listed by default, such as older versions of CDH, or the Sentry parcel for CDH 4.3. You can also use this
to add your own custom repositories. The locations of the Cloudera parcel repositories are
http://archive.cloudera.com/product/parcels/version, where product is cdh4, cdh5, gplextras5,
impala, search, and sentry, and version is a specific product version, latest, or the substitution variable
{latest_supported}. The substitution variable appears after the parcel for the CDH version with the
same major number as the Cloudera Manager version to enable substitution of the latest supported
maintenance version of CDH.
To add a parcel repository:
1. In the Remote Parcel Repository URLs list, click the plus sign to open an additional row.
2. Enter the path to the repository.
Required Role:
Managing software distribution using parcels offers many advantages over packages. To migrate from packages
to the same version parcel, perform the following steps. To upgrade to a different version, see Upgrading CDH
and Managed Services Using Cloudera Manager on page 479.
If your Cloudera Manager Server does not have Internet access, you can obtain the required parcel file(s) and
put them into a repository. See Creating and Using a Remote Parcel Repository on page 137 for more details.
3. When the download has completed, click Distribute for the version you downloaded.
4. When the parcel has been distributed and unpacked, the button will change to say Activate.
5. Click Activate.
Uninstall Packages
1. If your Hue service uses the embedded SQLite DB, back up /var/lib/hue/desktop.db to a location that is
not /var/lib/hue as this directory is removed when the packages are removed.
2. Uninstall the CDH packages on each host:
• Not including Impala and Search
3. Restart all the Cloudera Manager Agents to force an update of the symlinks to point to the newly installed
components on each host:
4. If your Hue service uses the embedded SQLite DB, restore the DB you backed up:
a. Stop the Hue service.
b. Copy the backup from the temporary location to the newly created Hue database directory
/opt/cloudera/parcels/CDH/share/hue/desktop.
c. Start the Hue service.
Required Role:
To migrate from a parcel to the same version packages, perform the following steps. To upgrade to a different
version, see Upgrading CDH and Managed Services Using Cloudera Manager on page 479.
Install Packages
1. Choose a repository strategy:
• Standard Cloudera repositories. For this method, ensure you have added the required repository information
to your systems.
• Internally hosted repositories. You might use internal repositories for environments where hosts do not
have access to the Internet. For information about preparing your environment, see Understanding Custom
Installation Solutions on page 135. When using an internal repository, you must copy the repo or list file
to the Cloudera Manager Server host and update the repository properties to point to internal repository
URLs.
2. Install packages:
CDH Version Procedure
CDH 5 • Red Hat
1. Download and install the "1-click Install" package.
a. Download the CDH 5 "1-click Install" package.
Click the entry in the table below that matches your Red Hat or CentOS system, choose
Save File, and save the file to a directory to which you have write access (for example,
your home directory).
• Red Hat/CentOS/Oracle 6
• Red Hat/CentOS/Oracle 6
Note: Installing these packages also installs all the other CDH packages required
for a full CDH 5 installation.
• SLES
1. Download and install the "1-click Install" package.
a. Download the CDH 5 "1-click Install" package.
Click this link, choose Save File, and save it to a directory to which you have write access
(for example, your home directory).
b. Install the RPM:
Note: Installing these packages also installs all the other CDH packages required
for a full CDH 5 installation.
a. Download the CDH 5 "1-click Install" package:
$ curl -s
http://archive.cloudera.com/cdh5/debian/wheezy/amd64/cdh/archive.key
| sudo apt-key add -
• Ubuntu Precise
$ curl -s
http://archive.cloudera.com/cdh5/ubuntu/precise/amd64/cdh/archive.key
| sudo apt-key add -
Note: Installing these packages also installs all the other CDH packages required
for a full CDH 5 installation.
• Red Hat/CentOS/Oracle 5
• Red Hat/CentOS 6
b. To install the hue-common package and all Hue applications on the Hue host, install the
hue meta-package:
• SLES
1. Run the following command:
3. Optionally add a repository key:
b. To install the hue-common package and all Hue applications on the Hue host, install the
hue meta-package:
b. Install the Solr Server on machines where you want Cloudera Search.
• Ubuntu or Debian
1. In the table at CDH Version and Packaging Information, click the entry that matches your
Ubuntu or Debian system.
2. Navigate to the list file (cloudera.list) for your system and save it in the
/etc/apt/sources.list.d/ directory. For example, to install CDH 4 for 64-bit Ubuntu
Lucid, your cloudera.list file should look like:
deb [arch=amd64] http://archive.cloudera.com/cdh4/ubuntu/lucid/amd64/cdh lucid-cdh4 contrib
deb-src http://archive.cloudera.com/cdh4/ubuntu/lucid/amd64/cdh lucid-cdh4 contrib
• Ubuntu Lucid
$ curl -s http://archive.cloudera.com/cdh4/ubuntu/lucid/amd64/cdh/archive.key | sudo apt-key add -
• Ubuntu Precise
$ curl -s http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key | sudo apt-key add -
• Debian Squeeze
$ curl -s http://archive.cloudera.com/cdh4/debian/squeeze/amd64/cdh/archive.key | sudo apt-key add -
b. To install the hue-common package and all Hue applications on the Hue host, install the
hue meta-package:
Deactivate Parcels
When you deactivate a parcel, Cloudera Manager points to the installed packages, ready to be run the next time
a service is restarted. To deactivate a parcel, click Actions on the activated parcel and select Deactivate.
Removing a Parcel
From the Parcels page, choose ClusterName or All Clusters in the Location selector, click the dropdown arrow to
the right of an Activate button, and select Remove from Hosts.
Deleting a Parcel
From the Parcels page, choose ClusterName or All Clusters in the Location selector, click the dropdown arrow to
the right of a Distribute button, and select Delete.
To install packages from the EPEL repository, download the appropriate repository rpm packages to your machine
and then install Python using yum. For example, use the following commands for RHEL 5 or CentOS 5:
$ su -c 'rpm -Uvh http://download.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm'
...
$ yum install python26
OS | File | Property
RHEL-compatible | /etc/yum.conf | proxy=http://server:port/
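For example, a minimal sketch of the proxy entry in the [main] section of /etc/yum.conf (the server name and port are placeholders):
[main]
proxy=http://proxy.example.com:3128/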
wget http://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin
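The downloaded binary must then be made executable and run with root privileges; a minimal sketch using the file name from the wget command above:
$ chmod u+x cloudera-manager-installer.bin
$ sudo ./cloudera-manager-installer.bin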
4. Read the Cloudera Manager README and then press Return or Enter to choose Next.
5. Read the Cloudera Express License and then press Return or Enter to choose Next. Use the arrow keys and
press Return or Enter to choose Yes to confirm you accept the license.
6. Read the Oracle Binary Code License Agreement and then press Return or Enter to choose Next.
7. Use the arrow keys and press Return or Enter to choose Yes to confirm you accept the Oracle Binary Code
License Agreement. The following occurs:
a. The installer installs the Oracle JDK and the Cloudera Manager repository files.
b. The installer installs the Cloudera Manager Server and embedded PostgreSQL packages.
c. The installer starts the Cloudera Manager Server and embedded PostgreSQL database.
8. When the installation completes, the installer displays the complete URL for the Cloudera Manager Admin
Console, including the port number, which is 7180 by default. Press Return or Enter to choose OK to continue.
9. Press Return or Enter to choose OK to exit the installer.
Note: If the installation is interrupted for some reason, you may need to clean up before you can
re-run it. See Uninstalling Cloudera Manager and Managed Software on page 158.
Use the Cloudera Manager Wizard for Software Installation and Configuration
The following instructions describe how to use the Cloudera Manager installation wizard to do an initial installation
and configuration. The wizard lets you:
• Select the version of Cloudera Manager to install
• Find the cluster hosts you specify via hostname and IP address ranges
• Connect to each host with SSH to install the Cloudera Manager Agent and other components
• Optionally install the Oracle JDK on the cluster hosts.
• Install CDH and managed service packages or parcels
• Configure CDH and managed services automatically and start the services
Important: All hosts in the cluster must have some way to access installation files via one of the
following methods:
• Internet access to allow the wizard to install software packages or parcels from
archive.cloudera.com.
• A custom internal repository that the host(s) can access. For example, for a Red Hat host, you
could set up a Yum repository. See Creating and Using a Remote Package Repository on page 139
for more information about this option.
• Cloudera Enterprise Data Hub Edition Trial, which does not require a license, but expires after 60 days
and cannot be renewed.
• Cloudera Enterprise with one of the following license types:
– Basic Edition
– Flex Edition
– Data Hub Edition
If you choose Cloudera Express or Cloudera Enterprise Data Hub Edition Trial, you can upgrade the license
at a later time. See Managing Licenses.
2. If you elect Cloudera Enterprise, install a license:
a. Click Upload License.
b. Click the document icon to the left of the Select a License File text field.
c. Navigate to the location of your license file, click the file, and click Open.
d. Click Upload.
Click Continue to proceed with the installation.
3. Information is displayed indicating what the CDH installation includes. At this point, you can access online
Help or the Support Portal. Click Continue to proceed with the installation.
4. To enable Cloudera Manager to automatically discover hosts on which to install CDH and managed services,
enter the cluster hostnames or IP addresses. You can also specify hostname and IP address ranges. For
example:
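The hostnames and addresses below are placeholders illustrating the accepted range syntax:
10.1.1.[1-4] (matches 10.1.1.1 through 10.1.1.4)
host[1-3].example.com (matches host1.example.com through host3.example.com)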
You can specify multiple addresses and address ranges by separating them by commas, semicolons, tabs,
or blank spaces, or by placing them on separate lines. Use this technique to make more specific searches
instead of searching overly wide ranges. The scan results will include all addresses scanned, but only scans
that reach hosts running SSH will be selected for inclusion in your cluster by default. If you don't know the
IP addresses of all of the hosts, you can enter an address range that spans over unused addresses and then
deselect the hosts that do not exist (and are not discovered) later in this procedure. However, keep in mind
that wider ranges will require more time to scan.
5. Click Search. Cloudera Manager identifies the hosts on your cluster to allow you to configure them for services.
If there are a large number of hosts on your cluster, wait a few moments to allow them to be discovered and
shown in the wizard. If the search is taking too long, you can stop the scan by clicking Abort Scan. To find
additional hosts, click New Search, add the host names or IP addresses and click Search again. Cloudera
Manager scans hosts by checking for network connectivity. If there are some hosts where you want to install
services that are not shown in the list, make sure you have network connectivity between the Cloudera
Manager Server host and those hosts. Common causes of loss of connectivity are firewalls and interference
from SELinux.
6. Verify that the number of hosts shown matches the number of hosts where you want to install services.
Deselect host entries that do not exist and deselect the hosts where you do not want to install services. Click
Continue. The Select Repository screen displays.
Choose Software Installation Method and Install Software
1. Select the repository type to use for the installation: parcels or packages.
• Use Parcels
1. Choose the parcels to install. The choices you see depend on the repositories you have chosen – a
repository may contain multiple parcels. Only the parcels for the latest supported service versions are
configured by default.
You can add additional parcels for previous versions by specifying custom repositories. For example,
you can find the locations of the previous CDH 4 parcels at
http://archive.cloudera.com/cdh4/parcels/. Or, if you are installing CDH 4.3 and want to use
policy-file authorization, you can add the Sentry parcel using this mechanism.
1. To specify the parcel directory, local parcel repository, add a parcel repository, or specify the
properties of a proxy server through which parcels are downloaded, click the More Options button
and do one or more of the following:
• Parcel Directory and Local Parcel Repository Path - Specify the location of parcels on cluster
hosts and the Cloudera Manager Server host.
• Parcel Repository - In the Remote Parcel Repository URLs field, click the button and enter
the URL of the repository. The URL you specify is added to the list of repositories listed in the
Configuring Cloudera Manager Server Parcel Settings on page 88 page and a parcel is added to
the list of parcels on the Select Repository page. If you have multiple repositories configured,
you will see all the unique parcels contained in all your repositories.
• Proxy Server - Specify the properties of a proxy server.
2. Click OK. Parcels available from the configured remote parcel repository URLs are displayed in the
parcels list. If you specify a URL for a parcel version too new to be supported by the Cloudera
Manager version, the parcel is not displayed in the parcel list.
• Use Packages
1. Select the major release of CDH to install.
2. Select the specific release of CDH to install. You can choose either the latest version, a specific version,
or use a custom repository. If you specify a custom repository for a CDH version too new to be supported
by the Cloudera Manager version, Cloudera Manager will install the packages but you will not be able
to create services using those packages and will have to manually uninstall those packages and
manually reinstall packages for a supported CDH version.
3. Select the specific releases of other services to install. You can choose either the latest version or use
a custom repository. Choose None if you do not want to install that service.
2. Select the release of Cloudera Manager Agent. You can choose either the version that matches the Cloudera
Manager Server you are currently using or specify a version in a custom repository. If you opted to use custom
repositories for installation files, you can provide a GPG key URL that applies for all repositories. Click Continue.
3. Select the Install Oracle Java SE Development Kit (JDK) checkbox to allow Cloudera Manager to install the
JDK on each cluster host, or leave it deselected if you have already installed the JDK. If your local laws permit
you to deploy unlimited strength encryption and you are running a secure cluster, select the Install Java
Unlimited Strength Encryption Policy Files checkbox. Click Continue.
4. (Optional) Select Single User Mode to configure the Cloudera Manager Agent and all service processes to run
as the same user. This mode requires extra configuration steps that must be done manually on all hosts in
the cluster. If you have not performed the steps, directory creation will fail in the installation wizard. In most
cases, you can create the directories but the steps performed by the installation wizard may have to be
continued manually. Click Continue.
5. Specify host installation properties:
• Select root or enter the user name for an account that has password-less sudo permission.
• Select an authentication method:
– If you choose password authentication, enter and confirm the password.
– If you choose public-key authentication, provide a passphrase and path to the required key files.
• You can specify an alternate SSH port. The default value is 22.
• You can specify the maximum number of host installations to run at once. The default value is 10.
Click Continue. Cloudera Manager performs the following:
• Parcels - installs the Oracle JDK and the Cloudera Manager Agent packages and starts the Agent. Click
Continue. During parcel installation, progress is indicated for the phases of the parcel installation process
in separate progress bars. If you are installing multiple parcels, you see progress bars for each parcel.
When the Continue button at the bottom of the screen turns blue, the installation process is completed.
• Packages - configures package repositories, installs the Oracle JDK, CDH and managed service and the
Cloudera Manager Agent packages, and starts the Agent. When the Continue button at the bottom of the
screen turns blue, the installation process is completed. If the installation has completed successfully on
some hosts but failed on others, you can click Continue if you want to skip installation on the failed hosts
and continue to the next screen to start configuring services on the successful hosts.
While packages are being installed, the status of installation on each host is displayed. You can click the
Details link for individual hosts to view detailed information about the installation and error messages if
installation fails on any hosts. If you click the Abort Installation button while installation is in progress, it
halts any pending or in-progress installations and rolls them back to a clean state.
The Abort Installation button does not affect host installations that have already completed successfully or
already failed.
6. Click Continue. The Host Inspector runs to validate the installation and provides a summary of what it finds,
including all the versions of the installed components. If the validation is successful, click Finish.
Add Services
1. In the first page of the Add Services wizard, choose the combination of services to install and whether to
install Cloudera Navigator:
• Click the radio button next to the combination of services to install:
CDH 4:
• Core Hadoop - HDFS, MapReduce, ZooKeeper, Oozie, Hive, and Hue
• Core with HBase
• Core with Impala
• All Services - HDFS, MapReduce, ZooKeeper, HBase, Impala, Oozie, Hive, Hue, and Sqoop
• Custom Services - Any combination of services.
CDH 5:
• Core Hadoop - HDFS, YARN (includes MapReduce 2), ZooKeeper, Oozie, Hive, Hue, and Sqoop
• Core with HBase
• Core with Impala
• Core with Search
• Core with Spark
• All Services - HDFS, YARN (includes MapReduce 2), ZooKeeper, Oozie, Hive, Hue, Sqoop, HBase, Impala, Solr, Spark, and Key-Value Store Indexer
• Custom Services - Any combination of services.
Note: You can create a YARN service in a CDH 4 cluster, but it is not considered production
ready.
– In a Cloudera Manager deployment of a CDH 5 cluster, the YARN service is the default MapReduce
computation framework. Choose Custom Services to install MapReduce, or use the Add Service
functionality to add MapReduce after installation completes.
Note: In CDH 5, the MapReduce service has been deprecated. However, the MapReduce
service is fully supported for backward compatibility through the CDH 5 lifecycle.
– The Flume service can be added only after your cluster has been set up.
• If you have chosen Data Hub Edition Trial or Cloudera Enterprise, optionally select the Include Cloudera
Navigator checkbox to enable Cloudera Navigator. See the Cloudera Navigator Documentation.
Click Continue.
2. Customize the assignment of role instances to hosts. The wizard evaluates the hardware configurations of
the hosts to determine the best hosts for each role. The wizard assigns all worker roles to the same set of
hosts to which the HDFS DataNode role is assigned. You can reassign role instances if necessary.
Click a field below a role to display a dialog containing a list of hosts. If you click a field containing multiple
hosts, you can also select All Hosts to assign the role to all hosts, or Custom to display the pageable hosts
dialog.
The following shortcuts for specifying hostname patterns are supported:
• Range of hostnames (without the domain portion)
• IP addresses
• Rack name
Click the View By Host button for an overview of the role assignment by hostname ranges.
3. When you are satisfied with the assignments, click Continue.
4. Configure database settings:
a. Choose the database type:
• Keep the default setting of Use Embedded Database to have Cloudera Manager create and configure
required databases. Record the auto-generated passwords.
Warning: Do not place DataNode data directories on NAS devices. When resizing an NAS, block
replicas can be deleted, which will result in reports of missing blocks.
$ su -c 'rpm -Uvh http://download.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm'
...
$ yum install python26
• Standard Cloudera repositories. For this method, ensure you have added the required repository information
to your systems. For Cloudera Manager repository locations and client repository files, see Cloudera Manager
Version and Download Information.
• Internally hosted repositories. You might use internal repositories for environments where hosts do not
have access to the Internet. For information about preparing your environment, see Understanding Custom
Installation Solutions on page 135. When using an internal repository, you must copy the repo or list file to
the Cloudera Manager Server host and update the repository properties to point to internal repository URLs.
Red Hat-compatible
1. Save the appropriate Cloudera Manager repo file (cloudera-manager.repo) for your system:
SLES
1. Run the following command:
Ubuntu or Debian
1. Save the appropriate Cloudera Manager list file (cloudera.list) for your system:
2. Copy the content of that file to the cloudera-manager.list file in the /etc/apt/sources.list.d/
directory.
3. Update your system package index by running:
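On Ubuntu or Debian this is the standard apt command:
$ sudo apt-get update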
OS | Command
RHEL | $ sudo yum install oracle-j2sdk1.7
OS | Command
RHEL, if you have a yum repo configured:
$ sudo yum install cloudera-manager-daemons cloudera-manager-server
RHEL, if you are manually transferring RPMs:
$ sudo yum --nogpgcheck localinstall cloudera-manager-daemons-*.rpm
$ sudo yum --nogpgcheck localinstall cloudera-manager-server-*.rpm
SLES:
$ sudo zypper install cloudera-manager-daemons cloudera-manager-server
(Optional) Manually Install the Oracle JDK, Cloudera Manager Agent, and CDH and Managed
Service Packages
You can use Cloudera Manager to install the Oracle JDK, Cloudera Manager Agent packages, CDH, and managed
service packages in Choose the Software Installation Type and Install Software on page 118, or you can install
them manually. To use Cloudera Manager to install the packages, you must meet the requirements described
in Cloudera Manager Deployment on page 40.
Important: If you are installing CDH and managed service software using packages and you want to
manually install Cloudera Manager Agent or CDH packages, you must manually install them both
following the procedures in this section; you cannot choose to install only one of them this way.
If you are going to use Cloudera Manager to install software, skip this section and go to Start the Cloudera
Manager Server on page 116. Otherwise, to manually install software, proceed with the steps in this section.
OS | Command
RHEL, if you have a yum repo configured:
$ sudo yum install cloudera-manager-agent cloudera-manager-daemons
RHEL, if you are manually transferring RPMs:
$ sudo yum --nogpgcheck localinstall cloudera-manager-agent-package.*.x86_64.rpm cloudera-manager-daemons
2. On every Cloudera Manager Agent host, configure the Cloudera Manager Agent to point to the Cloudera
Manager Server by setting the following properties in the /etc/cloudera-scm-agent/config.ini
configuration file:
Property Description
server_host Name of the host where Cloudera Manager Server is running.
server_port Port on the host where Cloudera Manager Server is running.
For more information on Agent configuration options, see Agent Configuration File.
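For example, a minimal config.ini sketch (the hostname is a placeholder; 7182 is the default port on which the Cloudera Manager Server listens for Agents):
[General]
server_host=cm-server.example.com
server_port=7182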
CDH Version | Procedure
CDH 5 • Red Hat
1. Download and install the "1-click Install" package.
a. Download the CDH 5 "1-click Install" package.
Click the entry in the table below that matches your Red Hat or CentOS system, choose
Save File, and save the file to a directory to which you have write access (for example,
your home directory).
OS Version | Click this Link
Red Hat/CentOS/Oracle 5 | link
Red Hat/CentOS/Oracle 6 | link
• Red Hat/CentOS/Oracle 6
Note: Installing these packages also installs all the other CDH packages required
for a full CDH 5 installation.
• SLES
1. Download and install the "1-click Install" package.
a. Download the CDH 5 "1-click Install" package.
Click this link, choose Save File, and save it to a directory to which you have write access
(for example, your home directory).
b. Install the RPM:
Note: Installing these packages also installs all the other CDH packages required
for a full CDH 5 installation.
$ curl -s http://archive.cloudera.com/cdh5/debian/wheezy/amd64/cdh/archive.key | sudo apt-key add -
• Ubuntu Precise
$ curl -s http://archive.cloudera.com/cdh5/ubuntu/precise/amd64/cdh/archive.key | sudo apt-key add -
Note: Installing these packages also installs all the other CDH packages required
for a full CDH 5 installation.
• Red Hat/CentOS 6
b. To install the hue-common package and all Hue applications on the Hue host, install the
hue meta-package:
b. Navigate to the repo file for your system and save it in the /etc/yum.repos.d/
directory.
c. Install Impala and the Impala Shell on Impala machines:
• SLES
1. Run the following command:
b. To install the hue-common package and all Hue applications on the Hue host, install the
hue meta-package:
b. Install Impala and the Impala Shell on Impala machines:
b. Install the Solr Server on machines where you want Cloudera Search.
• Ubuntu or Debian
1. In the table at CDH Version and Packaging Information, click the entry that matches your
Ubuntu or Debian system.
2. Navigate to the list file (cloudera.list) for your system and save it in the
/etc/apt/sources.list.d/ directory. For example, to install CDH 4 for 64-bit Ubuntu
Lucid, your cloudera.list file should look like:
deb [arch=amd64] http://archive.cloudera.com/cdh4/ubuntu/lucid/amd64/cdh lucid-cdh4 contrib
deb-src http://archive.cloudera.com/cdh4/ubuntu/lucid/amd64/cdh lucid-cdh4 contrib
$ curl -s http://archive.cloudera.com/cdh4/ubuntu/lucid/amd64/cdh/archive.key | sudo apt-key add -
• Ubuntu Precise
$ curl -s http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key | sudo apt-key add -
• Debian Squeeze
$ curl -s http://archive.cloudera.com/cdh4/debian/squeeze/amd64/cdh/archive.key | sudo apt-key add -
b. To install the hue-common package and all Hue applications on the Hue host, install the
hue meta-package:
Important: Following these instructions will install the required software to add the Key Trustee KMS
service to your cluster; this enables you to use an existing Cloudera Navigator Key Trustee Server as
the underlying keystore for HDFS Data At Rest Encryption. This does not install Cloudera Navigator
Key Trustee Server. Contact Cloudera Support for Key Trustee Server documentation or assistance
deploying Key Trustee Server.
2. Add the repository to your system, using the appropriate procedure for your operating system:
• RHEL-compatible
Download the repository and copy it to the /etc/yum.repos.d/ directory. Refresh the package index by
running sudo yum clean all.
• SLES
Add the repository to your system using the following command:
3. Install the keytrustee-keyprovider package, using the appropriate command for your operating system:
• RHEL-compatible
• SLES
• Ubuntu or Debian
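A minimal sketch of the install commands (assuming the repository added in step 2 is enabled on each host):
# RHEL-compatible:
$ sudo yum install keytrustee-keyprovider
# SLES:
$ sudo zypper install keytrustee-keyprovider
# Ubuntu or Debian:
$ sudo apt-get install keytrustee-keyprovider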
Important: When you start the Cloudera Manager Server and Agents, Cloudera Manager assumes
you are not already running HDFS and MapReduce. If these services are running:
1. Shut down HDFS and MapReduce. See Stopping Services (CDH 4) or Stopping Services (CDH 5) for
the commands to stop these services.
2. Configure the init scripts to not start on boot. Use commands similar to those shown in Configuring
init to Start Core Hadoop System Services (CDH 4) or Configuring init to Start Hadoop System
Services (CDH 5), but disable the start on boot (for example, $ sudo chkconfig
hadoop-hdfs-namenode off).
Contact Cloudera Support for help converting your existing Hadoop configurations for use with Cloudera
Manager.
If the Cloudera Manager Server does not start, see Troubleshooting Installation and Upgrade Problems on
page 612.
When the Agent starts, it contacts the Cloudera Manager Server. If communication fails between a Cloudera
Manager Agent and Cloudera Manager Server, see Troubleshooting Installation and Upgrade Problems on page
612.
When the Agent hosts reboot, cloudera-scm-agent starts automatically.
If you choose Cloudera Express or Cloudera Enterprise Data Hub Edition Trial, you can upgrade the license
at a later time. See Managing Licenses.
3. If you elect Cloudera Enterprise, install a license:
a. Click Upload License.
b. Click the document icon to the left of the Select a License File text field.
c. Navigate to the location of your license file, click the file, and click Open.
d. Click Upload.
You can specify multiple addresses and address ranges by separating them by commas, semicolons,
tabs, or blank spaces, or by placing them on separate lines. Use this technique to make more specific
searches instead of searching overly wide ranges. The scan results will include all addresses scanned,
but only scans that reach hosts running SSH will be selected for inclusion in your cluster by default.
If you don't know the IP addresses of all of the hosts, you can enter an address range that spans over
unused addresses and then deselect the hosts that do not exist (and are not discovered) later in this
procedure. However, keep in mind that wider ranges will require more time to scan.
2. Click Search. Cloudera Manager identifies the hosts on your cluster to allow you to configure them for
services. If there are a large number of hosts on your cluster, wait a few moments to allow them to
be discovered and shown in the wizard. If the search is taking too long, you can stop the scan by
clicking Abort Scan. To find additional hosts, click New Search, add the host names or IP addresses
and click Search again. Cloudera Manager scans hosts by checking for network connectivity. If there
are some hosts where you want to install services that are not shown in the list, make sure you have
network connectivity between the Cloudera Manager Server host and those hosts. Common causes
of loss of connectivity are firewalls and interference from SELinux.
3. Verify that the number of hosts shown matches the number of hosts where you want to install
services. Deselect host entries that do not exist and deselect the hosts where you do not want to
install services. Click Continue. The Select Repository screen displays.
• If you installed Cloudera Manager Agent packages in Install Cloudera Manager Agent Packages on page 109, choose
from among hosts with the packages installed:
1. Click the Currently Managed Hosts tab.
2. Choose the hosts to add to the cluster.
6. Click Continue.
http://archive.cloudera.com/cdh4/parcels/. Or, if you are installing CDH 4.3 and want to use
policy-file authorization, you can add the Sentry parcel using this mechanism.
1. To specify the parcel directory, specify the local parcel repository, add a parcel repository, or specify
the properties of a proxy server through which parcels are downloaded, click the More Options
button and do one or more of the following:
• Parcel Directory and Local Parcel Repository Path - Specify the location of parcels on cluster
hosts and the Cloudera Manager Server host. If you change the default value for Parcel Directory
and have already installed and started Cloudera Manager Agents, restart the Agents:
• Parcel Repository - In the Remote Parcel Repository URLs field, click the button and enter
the URL of the repository. The URL you specify is added to the list of repositories listed in the
Configuring Cloudera Manager Server Parcel Settings on page 88 page and a parcel is added to
the list of parcels on the Select Repository page. If you have multiple repositories configured,
you see all the unique parcels contained in all your repositories.
• Proxy Server - Specify the properties of a proxy server.
2. Click OK.
2. Select the release of Cloudera Manager Agent. You can choose either the version that matches the
Cloudera Manager Server you are currently using or specify a version in a custom repository. If you
opted to use custom repositories for installation files, you can provide a GPG key URL that applies for
all repositories. Click Continue.
• Use Packages - Do one of the following:
– If Cloudera Manager is installing the packages:
1. Click the package version.
2. Select the release of Cloudera Manager Agent. You can choose either the version that matches the
Cloudera Manager Server you are currently using or specify a version in a custom repository. If you
opted to use custom repositories for installation files, you can provide a GPG key URL that applies
for all repositories. Click Continue.
– If you manually installed packages in Install CDH and Managed Service Packages on page 109, select
the CDH version (CDH 4 or CDH 5) that matches the packages you installed manually.
2. Select the Install Oracle Java SE Development Kit (JDK) checkbox to allow Cloudera Manager to install the
JDK on each cluster host, or leave it deselected if you have already installed the JDK. If your local laws permit
you to deploy unlimited strength encryption and you are running a secure cluster, select the Install Java
Unlimited Strength Encryption Policy Files checkbox. Click Continue.
3. (Optional) Select Single User Mode to configure the Cloudera Manager Agent and all service processes to run
as the same user. This mode requires extra configuration steps that must be done manually on all hosts in
the cluster. If you have not performed the steps, directory creation will fail in the installation wizard. In most
cases, you can create the directories but the steps performed by the installation wizard may have to be
continued manually. Click Continue.
4. If you chose to have Cloudera Manager install software, specify host installation properties:
• Select root or enter the user name for an account that has password-less sudo permission.
• Select an authentication method:
– If you choose password authentication, enter and confirm the password.
– If you choose public-key authentication, provide a passphrase and path to the required key files.
• You can specify an alternate SSH port. The default value is 22.
• You can specify the maximum number of host installations to run at once. The default value is 10.
5. Click Continue. If you chose to have Cloudera Manager install software, Cloudera Manager installs the Oracle
JDK, the Cloudera Manager Agent packages, and the CDH and managed service parcels or packages. During parcel
installation, progress is indicated for the phases of the parcel installation process in separate progress bars.
If you are installing multiple parcels, you see progress bars for each parcel. When the Continue button at the
bottom of the screen turns blue, the installation process is completed.
6. Click Continue. The Host Inspector runs to validate the installation and provides a summary of what it finds,
including all the versions of the installed components. If the validation is successful, click Finish.
Add Services
Use the Cloudera Manager wizard to configure and start CDH and managed services.
1. In the first page of the Add Services wizard, choose the combination of services to install and whether to
install Cloudera Navigator:
• Click the radio button next to the combination of services to install:
CDH 4:
• Core Hadoop - HDFS, MapReduce, ZooKeeper, Oozie, Hive, and Hue
• Core with HBase
• Core with Impala
• All Services - HDFS, MapReduce, ZooKeeper, HBase, Impala, Oozie, Hive, Hue, and Sqoop
• Custom Services - Any combination of services.
CDH 5:
• Core Hadoop - HDFS, YARN (includes MapReduce 2), ZooKeeper, Oozie, Hive, Hue, and Sqoop
• Core with HBase
• Core with Impala
• Core with Search
• Core with Spark
• All Services - HDFS, YARN (includes MapReduce 2), ZooKeeper, Oozie, Hive, Hue, Sqoop, HBase, Impala, Solr, Spark, and Key-Value Store Indexer
• Custom Services - Any combination of services.
Note: You can create a YARN service in a CDH 4 cluster, but it is not considered production
ready.
– In a Cloudera Manager deployment of a CDH 5 cluster, the YARN service is the default MapReduce
computation framework. Choose Custom Services to install MapReduce, or use the Add Service
functionality to add MapReduce after installation completes.
Note: In CDH 5, the MapReduce service has been deprecated. However, the MapReduce
service is fully supported for backward compatibility through the CDH 5 lifecycle.
– The Flume service can be added only after your cluster has been set up.
• If you have chosen Data Hub Edition Trial or Cloudera Enterprise, optionally select the Include Cloudera
Navigator checkbox to enable Cloudera Navigator. See the Cloudera Navigator Documentation.
Click Continue.
2. Customize the assignment of role instances to hosts. The wizard evaluates the hardware configurations of
the hosts to determine the best hosts for each role. The wizard assigns all worker roles to the same set of
hosts to which the HDFS DataNode role is assigned. You can reassign role instances if necessary.
Click a field below a role to display a dialog containing a list of hosts. If you click a field containing multiple
hosts, you can also select All Hosts to assign the role to all hosts, or Custom to display the pageable hosts
dialog.
The following shortcuts for specifying hostname patterns are supported:
• Range of hostnames (without the domain portion)
• IP addresses
• Rack name
Click the View By Host button for an overview of the role assignment by hostname ranges.
3. When you are satisfied with the assignments, click Continue.
4. On the Database Setup page, configure settings for required databases:
a. Enter the database host, database type, database name, username, and password for the database that
you created when you set up the database.
b. Click Test Connection to confirm that Cloudera Manager can communicate with the database using the
information you have supplied. If the test succeeds in all cases, click Continue; otherwise, check and correct
the information you have provided for the database and then try the test again. (For some servers, if you
are using the embedded database, you will see a message saying the database will be created at a later
step in the installation process.) The Review Changes screen displays.
5. Review the configuration changes to be applied. Confirm the settings entered for file system paths. The file
paths required vary based on the services to be installed. If you chose to add the Sqoop service, indicate
whether to use the default Derby database or the embedded PostgreSQL database. If the latter, type the
database name, host, and user credentials that you specified when you created the database.
Warning: Do not place DataNode data directories on NAS devices. When resizing an NAS, block
replicas can be deleted, which will result in reports of missing blocks.
When you have a directory to which to extract the contents of the tarball, extract the contents. For example, to
copy a tar file to your home directory and extract the contents of all tar files to the /opt/ directory, use a
command similar to the following:
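A minimal sketch (the tarball name is a placeholder for the Cloudera Manager tarball you downloaded):
$ cp cloudera-manager*.tar.gz ~
$ sudo mkdir -p /opt/cloudera-manager
$ sudo tar xzf ~/cloudera-manager*.tar.gz -C /opt/cloudera-manager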
The files are extracted to a subdirectory named according to the Cloudera Manager version being extracted. For
example, files could extract to /opt/cloudera-manager/cm-5.0/. This full path is needed later and is referred
to as tarball_root directory.
Create Users
The Cloudera Manager Server and managed services require a user account to complete tasks. When installing
Cloudera Manager from tarballs, you must create this user account on all hosts manually. Because Cloudera
Manager Server and managed services are configured to use the user account cloudera-scm by default, creating
a user with this name is the simplest approach. This user account is used automatically after installation is
complete.
To create user cloudera-scm, use a command such as the following:
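A minimal sketch (the --home path is a placeholder; see the note that follows):
$ sudo useradd --system --home=/opt/cm-5.0/run/cloudera-scm-server --no-create-home --shell=/bin/false --comment "Cloudera SCM User" cloudera-scm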
Ensure the --home argument path matches your environment. This argument varies according to where you
place the tarball, and the version number varies among releases. For example, the --home location could be
/opt/cm-5.0/run/cloudera-scm-server.
Property Description
server_host Name of the host where Cloudera Manager Server is running.
server_port Port on the host where Cloudera Manager Server is running.
• By default, a tarball installation has a var subdirectory where state is stored. In a non-tarball installation,
state is stored in /var. Cloudera recommends that you reconfigure the tarball installation to use an external
directory as the /var equivalent (/var or any other directory outside the tarball) so that when you upgrade
Cloudera Manager, the new tarball installation can access this state. Configure the installation to use an
external directory for storing state by editing tarball_root/etc/default/cloudera-scm-agent and setting
the CMF_VAR variable to the location of the /var equivalent. If you do not reuse the state directory between
different tarball installations, duplicate Cloudera Manager Agent entries can occur in the Cloudera Manager
database.
If you are using a custom username and custom directories for Cloudera Manager, you must create these
directories on the Cloudera Manager Server host and assign ownership of these directories to the custom
username. The Cloudera Manager installer makes no changes to any directories that already exist. Cloudera Manager
cannot write to any existing directories for which it does not have proper permissions, and if you do not change
ownership, Cloudera Management Service roles may not perform as expected. To resolve these issues, do one
of the following:
mkdir /var/cm_logs/cloudera-scm-headlamp
chown cloudera-scm /var/cm_logs/cloudera-scm-headlamp
Note: The configuration property for the Cloudera Manager Server Local Data Storage Directory
(default value is: /var/lib/cloudera-scm-server) is located on a different page:
1. Select Administration > Settings.
2. Type directory in the Search box.
3. Enter the directory path in the Cloudera Manager Server Local Data Storage Directory
property.
2. Change the directory ownership to be the username you are using to run Cloudera Manager:
where username and groupname are the user and group names (respectively) you are using to run Cloudera
Manager. For example, if you use the default username cloudera-scm, you would run the command:
4. Change the directory ownership to be the username you are using to run Cloudera Manager:
where username and groupname are the user and group names (respectively) you are using to run Cloudera
Manager. For example, if you use the default username cloudera-scm, you would run the command:
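For example, a minimal sketch (the directory is a placeholder for whichever custom directory you created; substitute your own path):
$ sudo chown -R cloudera-scm:cloudera-scm /var/cm_logs/cloudera-scm-headlamp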
Important: When you start the Cloudera Manager Server and Agents, Cloudera Manager assumes
you are not already running HDFS and MapReduce. If these services are running:
1. Shut down HDFS and MapReduce. See Stopping Services (CDH 4) or Stopping Services (CDH 5) for
the commands to stop these services.
2. Configure the init scripts to not start on boot. Use commands similar to those shown in Configuring
init to Start Core Hadoop System Services (CDH 4) or Configuring init to Start Hadoop System
Services (CDH 5), but disable the start on boot (for example, $ sudo chkconfig
hadoop-hdfs-namenode off).
Contact Cloudera Support for help converting your existing Hadoop configurations for use with Cloudera
Manager.
The way in which you start the Cloudera Manager Server varies according to what account you want the Server
to run under:
• As root:
• As another user. If you run as another user, ensure the user you created for Cloudera Manager owns the
location to which you extracted the tarball including the newly created database files. If you followed the
earlier examples and created the directory /opt/cloudera-manager and the user cloudera-scm, you could
use the following command to change ownership of the directory:
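A minimal sketch, assuming the /opt/cloudera-manager directory and cloudera-scm user from the earlier examples:
$ sudo chown -R cloudera-scm:cloudera-scm /opt/cloudera-manager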
Once you have established ownership of directory locations, you can start Cloudera Manager Server using
the user account you chose. For example, you might run the Cloudera Manager Server as cloudera-service.
In this case, you have the following options:
– Run the following command:
– Edit the configuration files so the script internally changes the user. Then run the script as root:
USER=cloudera-service
GROUP=cloudera-service
$ cp tarball_root/etc/init.d/cloudera-scm-server
/etc/init.d/cloudera-scm-server
$ chkconfig cloudera-scm-server on
• Debian/Ubuntu
$ cp tarball_root/etc/init.d/cloudera-scm-server
/etc/init.d/cloudera-scm-server
$ update-rc.d cloudera-scm-server defaults
2. On the Cloudera Manager Server host, open the /etc/init.d/cloudera-scm-server file and change
the value of CMF_DEFAULTS from ${CMF_DEFAULTS:-/etc/default} to tarball_root/etc/default.
If the Cloudera Manager Server does not start, see Troubleshooting Installation and Upgrade Problems on page
612.
– Edit the configuration files so the script internally changes the user, and then run the script as root:
1. Remove the following line from tarball_root/etc/default/cloudera-scm-agent:
USER=cloudera-scm
GROUP=cloudera-scm
$ cp tarball_root/etc/init.d/cloudera-scm-agent /etc/init.d/cloudera-scm-agent
$ chkconfig cloudera-scm-agent on
• Debian/Ubuntu
$ cp tarball_root/etc/init.d/cloudera-scm-agent /etc/init.d/cloudera-scm-agent
$ update-rc.d cloudera-scm-agent defaults
2. On each Agent, open the tarball_root/etc/init.d/cloudera-scm-agent file and change the value
of CMF_DEFAULTS from ${CMF_DEFAULTS:-/etc/default} to tarball_root/etc/default.
Install Dependencies
When you install with tarballs and parcels, some services may require additional dependencies that are not
provided by Cloudera. On each host, install the required packages:
• Red Hat-compatible
– chkconfig
– python (2.6 required for CDH 5)
– bind-utils
– psmisc
– libxslt
– zlib
– sqlite
– cyrus-sasl-plain
– cyrus-sasl-gssapi
– fuse
– portmap
– fuse-libs
– redhat-lsb
• SLES
– chkconfig
– python (2.6 required for CDH 5)
– bind-utils
– psmisc
– libxslt
– zlib
– sqlite
– cyrus-sasl-plain
– cyrus-sasl-gssapi
– fuse
– portmap
– python-xml
– libfuse2
• Debian/Ubuntu
– lsb-base
– psmisc
– bash
– libsasl2-modules
– libsasl2-modules-gssapi-mit
– zlib1g
– libxslt1.1
– libsqlite3-0
– libfuse2
– fuse-utils or fuse
– rpcbind
If you choose Cloudera Express or Cloudera Enterprise Data Hub Edition Trial, you can upgrade the license
at a later time. See Managing Licenses.
3. If you elect Cloudera Enterprise, install a license:
a. Click Upload License.
b. Click the document icon to the left of the Select a License File text field.
c. Navigate to the location of your license file, click the file, and click Open.
d. Click Upload.
Click Continue to proceed with the installation.
4. Information is displayed indicating what the CDH installation includes. At this point, you can access online
Help or the Support Portal. Click Continue to proceed with the installation.
5. Click the Currently Managed Hosts tab.
6. Choose the hosts to add to the cluster.
7. Click Continue.
• Parcel Repository - In the Remote Parcel Repository URLs field, click the button and enter
the URL of the repository. The URL you specify is added to the list of repositories listed in the
Configuring Cloudera Manager Server Parcel Settings on page 88 page and a parcel is added to
the list of parcels on the Select Repository page. If you have multiple repositories configured,
you see all the unique parcels contained in all your repositories.
• Proxy Server - Specify the properties of a proxy server.
2. Click OK.
b. Select the release of Cloudera Manager Agent. You can choose either the version that matches the
Cloudera Manager Server you are currently using or specify a version in a custom repository. If you
opted to use custom repositories for installation files, you can provide a GPG key URL that applies for
all repositories. Click Continue.
b. Click Continue. Cloudera Manager installs the CDH and managed service parcels. During parcel installation,
progress is indicated for the phases of the parcel installation process in separate progress bars. If you
are installing multiple parcels, you see progress bars for each parcel. When the Continue button at the
bottom of the screen turns blue, the installation process is completed. Click Continue.
2. Click Continue. The Host Inspector runs to validate the installation and provides a summary of what it finds,
including all the versions of the installed components. If the validation is successful, click Finish.
Add Services
Use the Cloudera Manager wizard to configure and start CDH and managed services.
1. In the first page of the Add Services wizard, choose the combination of services to install and whether to
install Cloudera Navigator:
• Click the radio button next to the combination of services to install:
CDH 4:
• Core Hadoop - HDFS, MapReduce, ZooKeeper, Oozie, Hive, and Hue
• Core with HBase
• Core with Impala
• All Services - HDFS, MapReduce, ZooKeeper, HBase, Impala, Oozie, Hive, Hue, and Sqoop
• Custom Services - Any combination of services.
CDH 5:
• Core Hadoop - HDFS, YARN (includes MapReduce 2), ZooKeeper, Oozie, Hive, Hue, and Sqoop
• Core with HBase
• Core with Impala
• Core with Search
• Core with Spark
• All Services - HDFS, YARN (includes MapReduce 2), ZooKeeper, Oozie, Hive, Hue, Sqoop, HBase, Impala, Solr, Spark, and Key-Value Store Indexer
• Custom Services - Any combination of services.
Note: You can create a YARN service in a CDH 4 cluster, but it is not considered production
ready.
– In a Cloudera Manager deployment of a CDH 5 cluster, the YARN service is the default MapReduce
computation framework. Choose Custom Services to install MapReduce, or use the Add Service
functionality to add MapReduce after installation completes.
Note: In CDH 5, the MapReduce service has been deprecated. However, the MapReduce
service is fully supported for backward compatibility through the CDH 5 lifecycle.
– The Flume service can be added only after your cluster has been set up.
• If you have chosen Data Hub Edition Trial or Cloudera Enterprise, optionally select the Include Cloudera
Navigator checkbox to enable Cloudera Navigator. See the Cloudera Navigator Documentation.
Click Continue.
2. Customize the assignment of role instances to hosts. The wizard evaluates the hardware configurations of
the hosts to determine the best hosts for each role. The wizard assigns all worker roles to the same set of
hosts to which the HDFS DataNode role is assigned. You can reassign role instances if necessary.
Click a field below a role to display a dialog containing a list of hosts. If you click a field containing multiple
hosts, you can also select All Hosts to assign the role to all hosts, or Custom to display the pageable hosts
dialog.
The following shortcuts for specifying hostname patterns are supported:
• Range of hostnames (without the domain portion)
• IP addresses
• Rack name
Click the View By Host button for an overview of the role assignment by hostname ranges.
3. When you are satisfied with the assignments, click Continue.
4. On the Database Setup page, configure settings for required databases:
a. Enter the database host, database type, database name, username, and password for the database that
you created when you set up the database.
b. Click Test Connection to confirm that Cloudera Manager can communicate with the database using the
information you have supplied. If the test succeeds in all cases, click Continue; otherwise, check and correct
the information you have provided for the database and then try the test again. (For some servers, if you
are using the embedded database, you will see a message saying the database will be created at a later
step in the installation process.) The Review Changes screen displays.
5. Review the configuration changes to be applied. Confirm the settings entered for file system paths. The file
paths required vary based on the services to be installed. If you chose to add the Sqoop service, indicate
whether to use the default Derby database or the embedded PostgreSQL database. If the latter, type the
database name, host, and user credentials that you specified when you created the database.
Warning: Do not place DataNode data directories on NAS devices. When resizing an NAS, block
replicas can be deleted, which will result in reports of missing blocks.
Installing Impala
Cloudera Impala is included with CDH 5. In a parcel-based configuration, it is part of the CDH parcel rather than
a separate parcel. Starting with CDH 5.4 (corresponding to Impala 2.2 in the Impala versioning scheme), new
releases of Impala are available only on CDH 5, not CDH 4.
Although these installation instructions primarily focus on CDH 5, you can also manage CDH 4 clusters using
Cloudera Manager 5. In CDH 4, Impala has packages and parcels that you download and install separately from
CDH. To use Cloudera Impala with CDH 4, you must install both CDH and Impala on the hosts that will run Impala.
Note:
• See Supported CDH and Managed Service Versions on page 8 for supported versions.
• Before proceeding, review the installation options described in Cloudera Manager Deployment on
page 40.
Installing Search
Cloudera Search is provided by the Solr service. The Solr service is included with CDH 5. To use Cloudera Search
with CDH 4, you must install both CDH and Search on the hosts that will run Search.
Note:
• See Supported CDH and Managed Service Versions on page 8 for supported versions.
• Before proceeding, review the installation options described in Cloudera Manager Deployment on
page 40.
Installing Spark
Apache Spark is included with CDH 5. To use Apache Spark with CDH 4, you must install both CDH and Spark on
the hosts that will run Spark.
Note:
• See Supported CDH and Managed Service Versions on page 8 for supported versions.
• Before proceeding, review the installation options described in Cloudera Manager Deployment on
page 40.
Note: The KMS (Navigator Key Trustee) service in Cloudera Manager 5.3.x is renamed to Key Trustee
KMS in Cloudera Manager 5.4.x.
Key Trustee KMS is a custom Key Management Service (KMS) that uses Cloudera Navigator Key Trustee Server
as the underlying keystore, rather than the file-based Java KeyStore (JKS) used by the default Hadoop KMS.
To use the Key Trustee KMS service, you must first install the Key Trustee KMS binaries.
Note:
• See Supported CDH and Managed Service Versions on page 8 for supported versions.
• Before proceeding, review the installation options described in Cloudera Manager Deployment on
page 40.
Important: Following these instructions will install the required software to add the Key Trustee KMS
service to your cluster; this enables you to use an existing Cloudera Navigator Key Trustee Server as
the underlying keystore for HDFS Data At Rest Encryption. This does not install Cloudera Navigator
Key Trustee Server. Contact Cloudera Support for Key Trustee Server documentation or assistance
deploying Key Trustee Server.
If you have upgraded Cloudera Manager from a version that did not support Key Trustee KMS, the Key Trustee
KMS binaries are not installed automatically. (Upgrading Cloudera Manager does not automatically upgrade
CDH or other managed services). You can add the Key Trustee KMS binaries using parcels; go to the Hosts tab,
and select the Parcels tab. You should see at least one Key Trustee KMS parcel named KEYTRUSTEE available
for download. See Parcels on page 80 for detailed instructions on using parcels to install or upgrade the Key
Trustee KMS. If you do not see any Key Trustee KMS parcels available, click the Edit Settings button on the
Parcels page to go to the Parcel configuration settings and verify that the Key Trustee parcel repo URL
(http://archive.cloudera.com/navigator-keytrustee5/parcels/latest/) has been configured in the Parcels
configuration page. See Parcel Configuration Settings on page 88 for more details.
If your cluster is installed using packages, see (Optional) Install Key Trustee KMS on page 115 for instructions on
how to install the required software.
To create the repository URL, append the version directory to the URL (CDH 4)
http://archive.cloudera.com/gplextras/parcels/ or (CDH 5)
http://archive.cloudera.com/gplextras5/parcels/ respectively. For example:
http://archive.cloudera.com/gplextras5/parcels/5.0.2.
2. Download, distribute, and activate the parcel.
3. If not already installed, on all cluster hosts, install the lzo package on RHEL or the liblzo2-2 package on
SLES, Debian, or Ubuntu:
RedHat:
Debian or Ubuntu:
SLES:
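A minimal sketch of the install commands, using the package names given above:
# RHEL-compatible:
$ sudo yum install lzo
# Debian or Ubuntu:
$ sudo apt-get install liblzo2-2
# SLES:
$ sudo zypper install liblzo2-2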
In both of these cases, using a custom repository solution allows you to meet the needs of your organization,
whether that means installing older versions of Cloudera software or installing any version of Cloudera software
on hosts that are disconnected from the Internet.
Understanding Parcels
Parcels are a packaging format that facilitates upgrading software from within Cloudera Manager. You can
download, distribute, and activate a new software version all from within Cloudera Manager. Cloudera Manager
downloads a parcel to a local directory. Once the parcel is downloaded to the Cloudera Manager Server host, an
Internet connection is no longer needed to deploy the parcel. Parcels are available for CDH 4.1.3 and onwards.
For detailed information about parcels, see Parcels on page 80.
If your Cloudera Manager Server does not have Internet access, you can obtain the required parcel files and put
them into a parcel repository. See Creating and Using a Remote Parcel Repository on page 137.
Package Repositories
Package management tools operate on package repositories.
The .repo files contain pointers to one or more repositories. There are similar pointers inside configuration
files for zypper and apt-get. In the following snippet from CentOS-Base.repo, there are two repositories
defined: one named Base and one named Updates. The mirrorlist parameter points to a website that has a
list of places where this repository can be downloaded.
# ...
[base]
name=CentOS-$releasever - Base
mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=os
#baseurl=http://mirror.centos.org/centos/$releasever/os/$basearch/
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-5
#released updates
[updates]
name=CentOS-$releasever - Updates
mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=updates
#baseurl=http://mirror.centos.org/centos/$releasever/updates/$basearch/
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-5
# ...
Listing Repositories
You can list the repositories you have enabled. The command varies according to operating system:
• RedHat/CentOS - yum repolist
• SLES - zypper repos
• Debian/Ubuntu - apt-get does not include a command to display sources, but you can determine sources
by reviewing the contents of /etc/apt/sources.list and any files contained in
/etc/apt/sources.list.d/.
The following shows an example of what you might find on a CentOS system in repolist:
OS | Command
RHEL | [root@localhost yum.repos.d]$ yum install httpd
RHEL | [root@localhost tmp]$ service httpd start
Starting httpd: [ OK ]
2. Move the .parcel and manifest.json files to the web server directory, and modify file permissions. For
example, you might use the following commands:
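A minimal sketch (assumes an Apache document root of /var/www/html with a cdh4.6 subdirectory, matching the URL in the next sentence; the parcel filename is a placeholder):
$ sudo mkdir -p /var/www/html/cdh4.6
$ sudo mv CDH-4.6.0-*.parcel manifest.json /var/www/html/cdh4.6/
$ sudo chmod -R ugo+rX /var/www/html/cdh4.6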
After moving the files and changing permissions, visit http://hostname:80/cdh4.6/ to verify that you
can access the parcel. Apache may have been configured to not show indexes, which is also acceptable.
Creating a Temporary Remote Repository
You can quickly create a temporary remote repository to deploy a parcel once. It is convenient to perform this on
the same host that runs Cloudera Manager, or a gateway role. In this example, python SimpleHTTPServer is used
from a directory of your choosing.
1. Download the patched .parcel and manifest.json files as provided in a secure link from Cloudera Support.
2. Copy the .parcel and manifest.json files to a location of your choosing on your server. This is the directory
from which the python SimpleHTTPServer will serve the files. For example:
$ mkdir /tmp/parcel
$ cp /home/user/Downloads/patchparcel/CDH-4.6.0.p234.parcel /tmp/parcel/
$ cp /home/user/Downloads/patchparcel/manifest.json /tmp/parcel/
3. Determine a port that your system is not listening on (for example, port 8900).
4. Change to the directory containing the .parcel and manifest.json files.
$ cd /tmp/parcel
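5. Serve that directory over HTTP. One simple option is Python's built-in SimpleHTTPServer module (Python 2), using the port chosen in step 3:

$ python -m SimpleHTTPServer 8900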
6. Confirm you can get to this hosted parcel directory by going to http://server:8900 in your browser. You
should see links for the hosted files.
Configuring the Cloudera Manager Server to Use the Parcel URL
1. Use one of the following methods to open the parcel settings page:
• Navigation bar
1. Click the parcel indicator in the top navigation bar.
2. Click the Edit Settings button.
• Menu
1. Select Administration > Settings.
2. Select the Parcels category.
2. In the Remote Parcel Repository URLs list, click the plus sign (+) to open an additional row.
3. Enter the path to the parcel. For example, http://hostname:port/cdh4.6/.
4. Click Save Changes to commit the changes.
To host a package repository, install a web server (Apache httpd in this example) on the serving host:
• RHEL:
[root@localhost yum.repos.d]$ yum install httpd
Then start the web server:
• RHEL:
[root@localhost tmp]$ service httpd start
Starting httpd: [ OK ]
After moving files and changing permissions, visit http://hostname:port/cm to verify that you see an
index of files. Apache may have been configured to not show indexes, which is also acceptable.
Creating a Temporary Remote Repository
You can quickly create a temporary remote repository to deploy a package once. It is convenient to perform this
on the same host that runs Cloudera Manager, or a gateway role. In this example, Python's built-in SimpleHTTPServer
module serves files from a directory of your choosing.
1. Download the repository tarball for your OS distribution.
2. Unpack the tarball and modify file permissions. For example, you might use the following commands:
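The tarball and directory names below are placeholders; substitute the file you downloaded and the directory you want to serve:

$ mkdir -p /tmp/cm
$ tar xzf <downloaded-tarball>.tar.gz -C /tmp/cm
$ chmod -R ugo+rX /tmp/cm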
3. Determine a port that your system is not listening on (for example, port 8900).
4. Change to the directory containing the files.
$ cd /tmp/cm
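5. Serve that directory over HTTP, for example with Python's built-in SimpleHTTPServer module (Python 2), using the port chosen in step 3:

$ python -m SimpleHTTPServer 8900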
6. Confirm you can get to this hosted package directory by going to http://server:8900/cm in your browser.
You should see links for the hosted files.
• RHEL - Create files on client systems with the following information and format, where hostname is the name of the web server:

[myrepo]
name=myrepo
baseurl=http://hostname/cm/5
enabled=1
gpgcheck=0

See man yum.conf for more details. Put that file into /etc/yum.repos.d/myrepo.repo on all of your hosts to enable them to find the packages that you are hosting.

• SLES - Use the zypper utility to update client system repo information by issuing the following command:

$ zypper addrepo http://hostname/cm alias

• Ubuntu or Debian - Add a new list file to /etc/apt/sources.list.d/ on client systems. For example, you might create the file /etc/apt/sources.list.d/my-private-cloudera-repo.list. In that file, create an entry to your newly created repository. For example:

$ cat /etc/apt/sources.list.d/my-private-cloudera-repo.list
deb http://hostname/cm cloudera

After adding your .list file, ensure apt-get uses the latest information by issuing the following command:

$ sudo apt-get update
After completing these steps, you have established the environment necessary to install a previous version of
Cloudera Manager or install Cloudera Manager to hosts that are not connected to the Internet. Proceed with
the installation process, being sure to target the newly created repository with your package management tool.
$ su -c 'rpm -Uvh
http://download.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm'
...
$ yum install python26
2. Edit the file to change the second-to-last element to specify the version of Cloudera Manager you want
to install. For example, on Ubuntu Lucid, to install Cloudera Manager version 5.0.1, change:

deb http://archive.cloudera.com/cm5/ubuntu/lucid/amd64/cm lucid-cm5 contrib

to:

deb http://archive.cloudera.com/cm5/ubuntu/lucid/amd64/cm lucid-cm5.0.1 contrib
3. Save the edited file in the directory /etc/apt/sources.list.d/.
• RHEL:
$ sudo yum install oracle-j2sdk1.7

• RHEL, if you have a yum repo configured:
$ sudo yum install cloudera-manager-daemons cloudera-manager-server
• RHEL, if you are manually transferring RPMs:
$ sudo yum --nogpgcheck localinstall cloudera-manager-daemons-*.rpm
$ sudo yum --nogpgcheck localinstall cloudera-manager-server-*.rpm
• SLES:
$ sudo zypper install cloudera-manager-daemons cloudera-manager-server
Important: If you are installing CDH and managed service software using packages and you want to
manually install Cloudera Manager Agent or CDH packages, you must manually install them both
following the procedures in this section; you cannot choose to install only one of them this way.
If you are going to use Cloudera Manager to install software, skip this section and go to Start the Cloudera
Manager Server on page 116. Otherwise, to manually install software, proceed with the steps in this section.
• RHEL, if you have a yum repo configured:
$ sudo yum install cloudera-manager-agent cloudera-manager-daemons
• RHEL, if you are manually transferring RPMs:
$ sudo yum --nogpgcheck localinstall cloudera-manager-agent-package.*.x86_64.rpm cloudera-manager-daemons
2. On every Cloudera Manager Agent host, configure the Cloudera Manager Agent to point to the Cloudera
Manager Server by setting the following properties in the /etc/cloudera-scm-agent/config.ini
configuration file:
Property Description
server_host Name of the host where Cloudera Manager Server is running.
server_port Port on the host where Cloudera Manager Server is running.
For more information on Agent configuration options, see Agent Configuration File.
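For illustration, a minimal excerpt of /etc/cloudera-scm-agent/config.ini might look like the following (the hostname is a placeholder; 7182 is the default server port):

[General]
# Hostname of the Cloudera Manager Server (placeholder value)
server_host=cm-server.example.com
# Port on which the Cloudera Manager Server listens for Agents
server_port=7182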
CDH Version: CDH 5
Procedure:
• Red Hat
1. Download and install the "1-click Install" package.
a. Download the CDH 5 "1-click Install" package.
Click the entry in the table below that matches your Red Hat or CentOS system, choose
Save File, and save the file to a directory to which you have write access (for example,
your home directory).
• Red Hat/CentOS/Oracle 5
• Red Hat/CentOS/Oracle 6
Note: Installing these packages also installs all the other CDH packages required
for a full CDH 5 installation.
• SLES
1. Download and install the "1-click Install" package.
a. Download the CDH 5 "1-click Install" package.
Click this link, choose Save File, and save it to a directory to which you have write access
(for example, your home directory).
b. Install the RPM:
Note: Installing these packages also installs all the other CDH packages required
for a full CDH 5 installation.
• Debian Wheezy
$ curl -s
http://archive.cloudera.com/cdh5/debian/wheezy/amd64/cdh/archive.key
| sudo apt-key add -
• Ubuntu Precise
$ curl -s
http://archive.cloudera.com/cdh5/ubuntu/precise/amd64/cdh/archive.key
| sudo apt-key add -
Note: Installing these packages also installs all the other CDH packages required
for a full CDH 5 installation.
• Red Hat/CentOS 6
b. To install the hue-common package and all Hue applications on the Hue host, install the
hue meta-package:
5. (Requires CDH 4.2 or later) Install Impala
a. In the table at Cloudera Impala Version and Download Information, click the entry that
matches your Red Hat or CentOS system.
b. Navigate to the repo file for your system and save it in the /etc/yum.repos.d/
directory.
c. Install Impala and the Impala Shell on Impala machines:
• SLES
1. Run the following command:
b. To install the hue-common package and all Hue applications on the Hue host, install the
hue meta-package:
a. Run the following command:
b. Install the Solr Server on machines where you want Cloudera Search.
• Ubuntu or Debian
1. In the table at CDH Version and Packaging Information, click the entry that matches your
Ubuntu or Debian system.
2. Navigate to the list file (cloudera.list) for your system and save it in the
/etc/apt/sources.list.d/ directory. For example, to install CDH 4 for 64-bit Ubuntu
Lucid, your cloudera.list file should look like:
deb [arch=amd64]
http://archive.cloudera.com/cdh4/ubuntu/lucid/amd64/cdh lucid-cdh4
contrib
deb-src http://archive.cloudera.com/cdh4/ubuntu/lucid/amd64/cdh
lucid-cdh4 contrib
$ curl -s
http://archive.cloudera.com/cdh4/ubuntu/lucid/amd64/cdh/archive.key
| sudo apt-key add -
• Ubuntu Precise
$ curl -s
http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key
| sudo apt-key add -
• Debian Squeeze
$ curl -s
http://archive.cloudera.com/cdh4/debian/squeeze/amd64/cdh/archive.key
| sudo apt-key add -
a. Install CDH 4 packages:
b. To install the hue-common package and all Hue applications on the Hue host, install the
hue meta-package:
Important: Following these instructions will install the required software to add the Key Trustee KMS
service to your cluster; this enables you to use an existing Cloudera Navigator Key Trustee Server as
the underlying keystore for HDFS Data At Rest Encryption. This does not install Cloudera Navigator
Key Trustee Server. Contact Cloudera Support for Key Trustee Server documentation or assistance
deploying Key Trustee Server.
2. Add the repository to your system, using the appropriate procedure for your operating system:
• RHEL-compatible
Download the repository and copy it to the /etc/yum.repos.d/ directory. Refresh the package index by
running sudo yum clean all.
• SLES
Add the repository to your system using the following command:
3. Install the keytrustee-keyprovider package, using the appropriate command for your operating system:
• RHEL-compatible
• SLES
• Ubuntu or Debian
Important: When you start the Cloudera Manager Server and Agents, Cloudera Manager assumes
you are not already running HDFS and MapReduce. If these services are running:
1. Shut down HDFS and MapReduce. See Stopping Services (CDH 4) or Stopping Services (CDH 5) for
the commands to stop these services.
2. Configure the init scripts to not start on boot. Use commands similar to those shown in Configuring
init to Start Core Hadoop System Services (CDH 4) or Configuring init to Start Hadoop System
Services (CDH 5), but disable the start on boot (for example, $ sudo chkconfig
hadoop-hdfs-namenode off).
Contact Cloudera Support for help converting your existing Hadoop configurations for use with Cloudera
Manager.
If the Cloudera Manager Server does not start, see Troubleshooting Installation and Upgrade Problems on
page 612.
(Optional) Start the Cloudera Manager Agents
If you are going to use Cloudera Manager to install Cloudera Manager Agent packages, skip this section and go
to Start and Log into the Cloudera Manager Admin Console on page 117. Otherwise, run this command on each
Agent host:
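The Agent is managed as a standard system service; for example:

$ sudo service cloudera-scm-agent start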
When the Agent starts, it contacts the Cloudera Manager Server. If communication fails between a Cloudera
Manager Agent and Cloudera Manager Server, see Troubleshooting Installation and Upgrade Problems on page
612.
When the Agent hosts reboot, cloudera-scm-agent starts automatically.
Start and Log into the Cloudera Manager Admin Console
The Cloudera Manager Server URL takes the following form http://Server host:port, where Server host is
the fully qualified domain name or IP address of the host where the Cloudera Manager Server is installed, and
port is the port configured for the Cloudera Manager Server. The default port is 7180.
1. Wait several minutes for the Cloudera Manager Server to complete its startup. To observe the startup process,
run tail -f /var/log/cloudera-scm-server/cloudera-scm-server.log on the Cloudera Manager
Server host. If the Cloudera Manager Server does not start, see Troubleshooting Installation and Upgrade
Problems on page 612.
2. In a web browser, enter http://Server host:7180, where Server host is the fully qualified domain name
or IP address of the host where the Cloudera Manager Server is running. The login screen for Cloudera
Manager Admin Console displays.
3. Log into Cloudera Manager Admin Console. The default credentials are: Username: admin Password: admin.
Cloudera Manager does not support changing the admin username for the installed account. You can change
the password using Cloudera Manager after you run the installation wizard. Although you cannot change
the admin username, you can add a new user, assign administrative privileges to the new user, and then
delete the default admin account.
Choose Cloudera Manager Edition and Hosts
Choose which edition of Cloudera Manager you are using and which hosts will run CDH and managed services.
1. When you start the Cloudera Manager Admin Console, the install wizard starts up. Click Continue to get
started.
2. Choose which edition to install:
• Cloudera Express, which does not require a license, but provides a limited set of features.
• Cloudera Enterprise Data Hub Edition Trial, which does not require a license, but expires after 60 days
and cannot be renewed.
• Cloudera Enterprise with one of the following license types:
– Basic Edition
– Flex Edition
– Data Hub Edition
If you choose Cloudera Express or Cloudera Enterprise Data Hub Edition Trial, you can upgrade the license
at a later time. See Managing Licenses.
3. If you elect Cloudera Enterprise, install a license:
a. Click Upload License.
b. Click the document icon to the left of the Select a License File text field.
c. Navigate to the location of your license file, click the file, and click Open.
d. Click Upload.
You can specify multiple addresses and address ranges by separating them by commas, semicolons,
tabs, or blank spaces, or by placing them on separate lines. Use this technique to make more specific
searches instead of searching overly wide ranges. The scan results will include all addresses scanned,
but only scans that reach hosts running SSH will be selected for inclusion in your cluster by default.
If you don't know the IP addresses of all of the hosts, you can enter an address range that spans over
unused addresses and then deselect the hosts that do not exist (and are not discovered) later in this
procedure. However, keep in mind that wider ranges will require more time to scan.
2. Click Search. Cloudera Manager identifies the hosts on your cluster to allow you to configure them for
services. If there are a large number of hosts on your cluster, wait a few moments to allow them to
be discovered and shown in the wizard. If the search is taking too long, you can stop the scan by
clicking Abort Scan. To find additional hosts, click New Search, add the host names or IP addresses
and click Search again. Cloudera Manager scans hosts by checking for network connectivity. If there
are some hosts where you want to install services that are not shown in the list, make sure you have
network connectivity between the Cloudera Manager Server host and those hosts. Common causes
of loss of connectivity are firewalls and interference from SELinux.
3. Verify that the number of hosts shown matches the number of hosts where you want to install
services. Deselect host entries that do not exist and deselect the hosts where you do not want to
install services. Click Continue. The Select Repository screen displays.
• If you installed Cloudera Agent packages in Install Cloudera Manager Agent Packages on page 109, choose
from among hosts with the packages installed:
1. Click the Currently Managed Hosts tab.
2. Choose the hosts to add to the cluster.
6. Click Continue.
Choose the Software Installation Type and Install Software
Choose a software installation type (parcels or packages) and install the software if not previously installed.
1. Choose the software installation type and CDH and managed service version:
• Use Parcels
1. Choose the parcels to install. The choices depend on the repositories you have chosen; a repository
can contain multiple parcels. Only the parcels for the latest supported service versions are configured
by default.
You can add additional parcels for previous versions by specifying custom repositories. For example,
you can find the locations of the previous CDH 4 parcels at
http://archive.cloudera.com/cdh4/parcels/. Or, if you are installing CDH 4.3 and want to use
policy-file authorization, you can add the Sentry parcel using this mechanism.
1. To specify the parcel directory, specify the local parcel repository, add a parcel repository, or specify
the properties of a proxy server through which parcels are downloaded, click the More Options
button and do one or more of the following:
• Parcel Directory and Local Parcel Repository Path - Specify the location of parcels on cluster
hosts and the Cloudera Manager Server host. If you change the default value for Parcel Directory
and have already installed and started Cloudera Manager Agents, restart the Agents:
• Parcel Repository - In the Remote Parcel Repository URLs field, click the button and enter
the URL of the repository. The URL you specify is added to the list of repositories listed in the
Configuring Cloudera Manager Server Parcel Settings on page 88 page and a parcel is added to
the list of parcels on the Select Repository page. If you have multiple repositories configured,
you see all the unique parcels contained in all your repositories.
• Proxy Server - Specify the properties of a proxy server.
2. Click OK.
2. Select the release of Cloudera Manager Agent. You can choose either the version that matches the
Cloudera Manager Server you are currently using or specify a version in a custom repository. If you
opted to use custom repositories for installation files, you can provide a GPG key URL that applies for
all repositories. Click Continue.
• Use Packages - Do one of the following:
– If Cloudera Manager is installing the packages:
1. Click the package version.
2. Select the release of Cloudera Manager Agent. You can choose either the version that matches the
Cloudera Manager Server you are currently using or specify a version in a custom repository. If you
opted to use custom repositories for installation files, you can provide a GPG key URL that applies
for all repositories. Click Continue.
– If you manually installed packages in Install CDH and Managed Service Packages on page 109, select
the CDH version (CDH 4 or CDH 5) that matches the packages you installed manually.
2. Select the Install Oracle Java SE Development Kit (JDK) checkbox to allow Cloudera Manager to install the
JDK on each cluster host, or leave it deselected if you have already installed the JDK. If Cloudera Manager is
installing the JDK, your local laws permit you to deploy unlimited strength encryption, and you are running a
secure cluster, also select the Install Java Unlimited Strength Encryption Policy Files checkbox. Click Continue.
3. (Optional) Select Single User Mode to configure the Cloudera Manager Agent and all service processes to run
as the same user. This mode requires extra configuration steps that must be done manually on all hosts in
the cluster. If you have not performed the steps, directory creation will fail in the installation wizard. In most
cases, you can create the directories but the steps performed by the installation wizard may have to be
continued manually. Click Continue.
4. If you chose to have Cloudera Manager install software, specify host installation properties:
• Select root or enter the user name for an account that has password-less sudo permission.
• Select an authentication method:
– If you choose password authentication, enter and confirm the password.
– If you choose public-key authentication, provide a passphrase and path to the required key files.
• You can specify an alternate SSH port. The default value is 22.
• You can specify the maximum number of host installations to run at once. The default value is 10.
5. Click Continue. If you chose to have Cloudera Manager install software, Cloudera Manager installs the Oracle
JDK, Cloudera Manager Agent, packages and CDH and managed service parcels or packages. During parcel
installation, progress is indicated for the phases of the parcel installation process in separate progress bars.
If you are installing multiple parcels, you see progress bars for each parcel. When the Continue button at the
bottom of the screen turns blue, the installation process is completed.
6. Click Continue. The Host Inspector runs to validate the installation and provides a summary of what it finds,
including all the versions of the installed components. If the validation is successful, click Finish.
Add Services
Use the Cloudera Manager wizard to configure and start CDH and managed services.
1. In the first page of the Add Services wizard, choose the combination of services to install and whether to
install Cloudera Navigator:
• Click the radio button next to the combination of services to install:
CDH 4:
• Core Hadoop - HDFS, MapReduce, ZooKeeper, Oozie, Hive, and Hue
• Core with HBase
• Core with Impala
• All Services - HDFS, MapReduce, ZooKeeper, HBase, Impala, Oozie, Hive, Hue, and Sqoop
• Custom Services - Any combination of services.

CDH 5:
• Core Hadoop - HDFS, YARN (includes MapReduce 2), ZooKeeper, Oozie, Hive, Hue, and Sqoop
• Core with HBase
• Core with Impala
• Core with Search
• Core with Spark
• All Services - HDFS, YARN (includes MapReduce 2), ZooKeeper, Oozie, Hive, Hue, Sqoop, HBase, Impala, Solr, Spark, and Key-Value Store Indexer
• Custom Services - Any combination of services.
Note: You can create a YARN service in a CDH 4 cluster, but it is not considered production
ready.
– In a Cloudera Manager deployment of a CDH 5 cluster, the YARN service is the default MapReduce
computation framework. Choose Custom Services to install MapReduce, or use the Add Service
functionality to add MapReduce after installation completes.
Note: In CDH 5, the MapReduce service has been deprecated. However, the MapReduce
service is fully supported for backward compatibility through the CDH 5 lifecycle.
– The Flume service can be added only after your cluster has been set up.
• If you have chosen Data Hub Edition Trial or Cloudera Enterprise, optionally select the Include Cloudera
Navigator checkbox to enable Cloudera Navigator. See the Cloudera Navigator Documentation.
Click Continue.
2. Customize the assignment of role instances to hosts. The wizard evaluates the hardware configurations of
the hosts to determine the best hosts for each role. The wizard assigns all worker roles to the same set of
hosts to which the HDFS DataNode role is assigned. You can reassign role instances if necessary.
Click a field below a role to display a dialog containing a list of hosts. If you click a field containing multiple
hosts, you can also select All Hosts to assign the role to all hosts, or Custom to display the pageable hosts
dialog.
The following shortcuts for specifying hostname patterns are supported:
• Range of hostnames (without the domain portion)
• IP addresses
• Rack name
Click the View By Host button for an overview of the role assignment by hostname ranges.
3. When you are satisfied with the assignments, click Continue.
4. On the Database Setup page, configure settings for required databases:
a. Enter the database host, database type, database name, username, and password for the database that
you created when you set up the database.
b. Click Test Connection to confirm that Cloudera Manager can communicate with the database using the
information you have supplied. If the test succeeds in all cases, click Continue; otherwise, check and correct
the information you have provided for the database and then try the test again. (For some servers, if you
are using the embedded database, you will see a message saying the database will be created at a later
step in the installation process.) The Review Changes screen displays.
5. Review the configuration changes to be applied. Confirm the settings entered for file system paths. The file
paths required vary based on the services to be installed. If you chose to add the Sqoop service, indicate
whether to use the default Derby database or the embedded PostgreSQL database. If the latter, type the
database name, host, and user credentials that you specified when you created the database.
Warning: Do not place DataNode data directories on NAS devices. When resizing a NAS, block
replicas can be deleted, which results in reports of missing blocks.
Deploying Clients
Client configuration files are generated automatically by Cloudera Manager based on the services you install.
Cloudera Manager deploys these configurations automatically at the end of the installation workflow. You can
also download the client configuration files to deploy them manually.
If you modify the configuration of your cluster, you may need to redeploy the client configuration files. If a service's
status is "Client configuration redeployment required," you need to redeploy those files.
See Client Configuration Files for information on downloading client configuration files, or redeploying them
through Cloudera Manager.
On the left side of the screen is a list of services currently running with their status information. All the services
should be running with Good Health. You can click each service to view more detailed information about it. You
can also test your installation by checking each host's heartbeats, running a MapReduce job, or interacting with
the cluster using an existing Hue application.
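For example, you can run the Hadoop PiEstimator job from the examples JAR (a sketch only; the path below assumes a parcel-based CDH 5 installation and may differ for a package install):

$ sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 100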
or create and run the WordCount v1.0 application described in Hadoop Tutorial.
3. Depending on whether your cluster is configured to run MapReduce jobs on the YARN or MapReduce service,
view the results of running the job by selecting one of the following from the top navigation bar in the Cloudera
Manager Admin Console:
b. Click Stop in the confirmation screen. The Command Details window shows the progress of stopping
services. When All services successfully stopped appears, the task is complete and you can close the
Command Details window.
c. On the Home page, click to the right of the Cloudera Management Service entry and select Stop. The
Command Details window shows the progress of stopping services. When All services successfully
stopped appears, the task is complete and you can close the Command Details window.
2. Stop the Cloudera Management Service.
Deactivate and Remove Parcels
If you installed using packages, skip this step and go to Uninstall the Cloudera Manager Server on page 159; you
will remove packages in Uninstall Cloudera Manager Agent and Managed Software on page 160. If you installed
using parcels remove them as follows:
1. Click the parcel indicator in the main navigation bar.
2. For each activated parcel, select Actions > Deactivate. When this action has completed, the parcel button
changes to Activate.
3. For each activated parcel, select Actions > Remove from Hosts. When this action has completed, the parcel
button changes to Distribute.
4. For each activated parcel, select Actions > Delete. This removes the parcel from the local parcel repository.
There may be multiple parcels that have been downloaded and distributed, but that are not active. If this is the
case, you should also remove those parcels from any hosts onto which they have been distributed, and delete
the parcels from the local repository.
Delete the Cluster
On the Home page, click the drop-down list next to the cluster you want to delete and select Delete.
Uninstall the Cloudera Manager Server
The commands for uninstalling the Cloudera Manager Server depend on the method you used to install it. Refer
to steps below that correspond to the method you used to install the Cloudera Manager Server.
• If you used the cloudera-manager-installer.bin file - Run the following command on the Cloudera Manager
Server host:
$ sudo /usr/share/cmf/uninstall-cloudera-manager.sh
Note: If the uninstall-cloudera-manager.sh script is not installed on your cluster, use the following
instructions to uninstall the Cloudera Manager Server.
• If you did not use the cloudera-manager-installer.bin file - If you installed the Cloudera Manager Server
using a different installation method such as Puppet, run the following commands on the Cloudera Manager
Server host.
1. Stop the Cloudera Manager Server and its database:
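For example, using the standard service names installed by the Cloudera Manager packages (omit the second command if you did not use the embedded database):

$ sudo service cloudera-scm-server stop
$ sudo service cloudera-scm-server-db stop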
2. Uninstall the Cloudera Manager Server and its database. The process described here also removes the embedded
PostgreSQL database software, if you installed that option. If you did not use the embedded PostgreSQL
database, omit the cloudera-manager-server-db steps.
Red Hat systems:
SLES systems:
Debian/Ubuntu systems:
2. Uninstall software:
SLES
Debian/Ubuntu
$ for u in cloudera-scm flume hadoop hdfs hbase hive httpfs hue impala llama mapred
oozie solr spark sqoop sqoop2 yarn zookeeper; do sudo kill $(ps -u $u -o pid=); done
Note: This step should not be necessary if you stopped all the services and the Cloudera Manager
Agent correctly.
$ sudo rm /tmp/.scm_prepare_node.lock
Run the following command on each data drive on all Agent hosts (adjust the paths for the data drives on each
host):
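The directory names below are illustrative; substitute the data directories configured on each drive:

$ sudo rm -Rf data_drive_path/dfs data_drive_path/mapred data_drive_path/yarn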
Note: For additional information about uninstalling CDH, including clean-up of CDH files, see the
entry on Uninstalling CDH Components in the CDH4 Installation Guide or Cloudera Installation and
Upgrade.
Related Information
• Cloudera Navigator 2 Overview
• Upgrading Cloudera Navigator on page 478
• Cloudera Navigator Administration
• Cloudera Data Management
• Configuring Authentication in Cloudera Navigator
• Configuring SSL for Cloudera Navigator
• Cloudera Navigator User Roles
Important:
• When starting, stopping and restarting CDH components, always use the service (8) command
rather than running scripts in /etc/init.d directly. This is important because service sets the
current working directory to / and removes most environment variables (passing only LANG and
TERM), to create a predictable environment for the service. If you run the scripts in /etc/init.d,
locally-set environment variables could produce unpredictable results. If you install CDH from
RPMs, service will be installed as part of the Linux Standard Base (LSB).
• Upgrading from CDH 4: If you are upgrading from CDH 4, you must first uninstall CDH 4, then install
CDH 5; see Upgrading from CDH 4 to CDH 5 on page 573.
• On SLES 11 platforms, do not install or try to use the IBM Java version bundled with the SLES
distribution; Hadoop will not run correctly with that version. Install the Oracle JDK following
directions under Java Development Kit Installation.
• If you are migrating from MapReduce v1 (MRv1) to MapReduce v2 (MRv2, YARN), see Migrating
from MapReduce 1 (MRv1) to MapReduce 2 (MRv2, YARN) on page 182 for important information
and instructions.
Before you install CDH 5 on a cluster, there are some important steps you need to do to prepare your system:
1. Verify you are using a supported operating system for CDH 5. See CDH 5 Requirements and Supported
Versions on page 19.
2. If you haven't already done so, install the Oracle Java Development Kit. For instructions and recommendations,
see Java Development Kit Installation.
Scheduler Defaults
Note the following differences between MRv1 (MapReduce) and MRv2 (YARN).
• MRv1 (MapReduce v1):
– Cloudera Manager and CDH 5 set the default to FIFO.
FIFO is set as the default for backward-compatibility purposes, but Cloudera recommends Fair Scheduler.
Capacity Scheduler is also available.
• MRv2 (YARN):
– Cloudera Manager and CDH 5 set the default to Fair Scheduler.
Cloudera recommends Fair Scheduler because Impala and Llama are optimized for it. FIFO and Capacity
Scheduler are also available.
High Availability
In CDH 5 you can configure high availability both for the NameNode and the JobTracker or Resource Manager.
• For more information and instructions on setting up a new HA configuration, see High Availability.
Important:
If you decide to configure HA for the NameNode, do not install hadoop-hdfs-secondarynamenode.
After completing the HDFS HA software configuration, follow the installation instructions under
Deploying HDFS High Availability.
• To upgrade an existing configuration, follow the instructions under Upgrading to CDH 5 on page 574.
Important:
• If you use Cloudera Manager, do not use these command-line instructions.
• This information applies specifically to CDH 5.4.x. If you use an earlier version of CDH, see the
documentation for that version located at Cloudera Documentation.
This section explains how to set up a local yum repository to install CDH on the machines in your cluster. There
are a number of reasons you might want to do this, for example:
• The computers in your cluster may not have Internet access. You can still use yum to do an installation on
those machines by creating a local yum repository.
• You may want to keep a stable local repository to ensure that any new installations (or re-installations on
existing cluster members) use exactly the same bits.
• Using a local repository may be the most efficient way to distribute the software to the cluster members.
To set up your own internal mirror, follow the steps below. You need an internet connection for the steps that
require you to download packages and create the repository itself. You will also need an internet connection in
order to download updated RPMs to your local repository.
1. Click the entry in the table below that matches your RHEL or CentOS system, navigate to the repo file for
your system and save it in the /etc/yum.repos.d/ directory.
2. Install a web server such as apache/lighttpd on the machine which will serve the RPMs. The default
configuration should work. HTTP access must be allowed to pass through any firewalls between this server
and the internet connection.
3. On the host running the web server, install the yum-utils and createrepo RPM packages if they are not
already installed. The yum-utils package includes the reposync command, which is required to create the
local Yum repository.
4. On the same computer as in the previous steps, download the yum repository into a temporary location. On
RHEL/CentOS 6, you can use a command such as:
reposync -r cloudera-cdh5
You can replace cloudera-cdh5 with any alphanumeric string. It will be the name of your local repository, used in the header
of the repo file other systems will use to connect to your repository. You can now disconnect your server
from the internet.
5. Put all the RPMs into a directory served by your web server, such as /var/www/html/cdh/5/RPMS/noarch/
(or x86_64 or i386 instead of noarch). The directory structure 5/RPMS/noarch is required. Make sure you
can remotely access the files in the directory via HTTP, using a URL similar to
http://<yourwebserver>/cdh/5/RPMS/.
6. On your web server, issue the following command from the 5/ subdirectory of your RPM directory:
createrepo .
This will create or update the metadata required by the yum command to recognize the directory as a repository.
The command creates a new directory called repodata. If necessary, adjust the permissions of files and
directories in your entire repository directory to be readable by the web server user.
7. Edit the repo file you downloaded in step 1 and replace the line starting with baseurl= or mirrorlist= with
baseurl=http://<yourwebserver>/cdh/5/, using the URL from step 5. Save the file back to
/etc/yum.repos.d/.
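For illustration only, taking the RHEL 6 repo file shown later in this guide as a starting point, the edited file might look like this (only the baseurl line has changed):

[cloudera-cdh5]
name=Cloudera's Distribution for Hadoop, Version 5
baseurl=http://<yourwebserver>/cdh/5/
gpgkey = http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck = 1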
8. While disconnected from the internet, issue the following commands to install CDH from your local yum
repository.
Example:
yum update
yum install hadoop
Once you have confirmed that your internal mirror works, you can distribute this modified repo file to any system
which can connect to your repository server. Those systems can now install CDH from your local repository
without internet access. Follow the instructions under Installing the Latest CDH 5 Release on page 166, starting
at Step 2 (you have already done Step 1).
Important:
• If you use Cloudera Manager, do not use these command-line instructions.
• This information applies specifically to CDH 5.4.x. If you use an earlier version of CDH, see the
documentation for that version located at Cloudera Documentation.
Note: Cloudera recommends that you use this automated method if possible.
Note: The instructions in this Installation Guide are tailored for a package installation, as described
in the sections that follow, and do not cover installation or deployment from tarballs.
Note:
If you are migrating from MapReduce v1 (MRv1) to MapReduce v2 (MRv2, YARN), see Migrating from
MapReduce 1 (MRv1) to MapReduce 2 (MRv2, YARN) on page 182 for important information and
instructions.
High Availability
In CDH 5 you can configure high availability both for the NameNode and the JobTracker or Resource Manager.
• For more information and instructions on setting up a new HA configuration, see High Availability.
Important:
If you decide to configure HA for the NameNode, do not install hadoop-hdfs-secondarynamenode.
After completing the HDFS HA software configuration, follow the installation instructions under
Deploying HDFS High Availability.
• To upgrade an existing configuration, follow the instructions under Upgrading to CDH 5 on page 574.
Note:
Use only one of the three methods.
Now continue with Step 2: Optionally Add a Repository Key on page 173, and then choose Step 3: Install CDH 5
with YARN on page 173, or Step 4: Install CDH 5 with MRv1 on page 175; or do both steps if you want to install
both implementations.
This ensures that the system repositories contain the latest software (it does not actually install
anything).
Now continue with Step 2: Optionally Add a Repository Key on page 173, and then choose Step 3: Install CDH 5
with YARN on page 173, or Step 4: Install CDH 5 with MRv1 on page 175; or do both steps if you want to install
both implementations.
This ensures that the system repositories contain the latest software (it does not actually install
anything).
On SLES Systems
Use one of the following methods to download the CDH 5 repository or package on SLES systems.
Note:
Use only one of the three methods.
Click this link, choose Save File, and save it to a directory to which you have write access (for example, your
home directory).
2. Install the RPM:
Now continue with Step 2: Optionally Add a Repository Key on page 173, and then choose Step 3: Install CDH 5
with YARN on page 173, or Step 4: Install CDH 5 with MRv1 on page 175; or do both steps if you want to install
both implementations.
OR: To add the CDH 5 repository:
1. Run the following command:
Now continue with Step 2: Optionally Add a Repository Key on page 173, and then choose Step 3: Install CDH 5
with YARN on page 173, or Step 4: Install CDH 5 with MRv1 on page 175; or do both steps if you want to install
both implementations.
This ensures that the system repositories contain the latest software (it does not actually install
anything).
Note:
• Use only one of the three methods.
• There is an extra step if you are adding a repository on Ubuntu Trusty, as described below.
• Unless you are adding a repository on Ubuntu Trusty, don't forget to run apt-get update after
downloading, adding, or building the repository.
This ensures that the system repositories contain the latest software (it does not actually install
anything).
Now continue with Step 2: Optionally Add a Repository Key on page 173, and then choose Step 3: Install CDH 5
with YARN on page 173, or Step 4: Install CDH 5 with MRv1 on page 175; or do both steps if you want to install
both implementations.
OR: To add the CDH 5 repository:
• Download the appropriate cloudera.list file by issuing one of the following commands. You can use
another HTTP client if wget is not available, but the syntax may be different.
• Debian Wheezy:
$ sudo wget 'http://archive.cloudera.com/cdh5/debian/wheezy/amd64/cdh/cloudera.list' \
    -O /etc/apt/sources.list.d/cloudera.list
This ensures that the system repositories contain the latest software (it does not actually install
anything).
Package: *
Pin: release o=Cloudera, l=Cloudera
Pin-Priority: 501
Note:
You do not need to run apt-get update after creating this file.
Now continue with Step 2: Optionally Add a Repository Key on page 173, and then choose Step 3: Install CDH 5
with YARN on page 173, or Step 4: Install CDH 5 with MRv1 on page 175; or do both steps if you want to install
both implementations.
OR: To build a Debian repository:
If you want to create your own apt repository, create a mirror of the CDH Debian directory and then create an
apt repository from the mirror.
This ensures that the system repositories contain the latest software (it does not actually install
anything).
Now continue with Step 2: Optionally Add a Repository Key on page 173, and then choose Step 3: Install CDH 5
with YARN on page 173, or Step 4: Install CDH 5 with MRv1 on page 175; or do both steps if you want to install
both implementations.
Step 2: Optionally Add a Repository Key
Before installing YARN or MRv1: (Optionally) add a repository key on each system in the cluster. Add the Cloudera
Public GPG Key to your repository by executing one of the following commands:
• For Red Hat/CentOS/Oracle 5 systems:
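For Red Hat/CentOS/Oracle 5, the key can be imported with rpm --import (the URL below is assumed by analogy with the redhat/6 repository URLs used elsewhere in this guide):

$ sudo rpm --import http://archive.cloudera.com/cdh5/redhat/5/x86_64/cdh/RPM-GPG-KEY-cloudera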
• Debian Wheezy:
$ wget http://archive.cloudera.com/cdh5/debian/wheezy/amd64/cdh/archive.key -O archive.key
$ sudo apt-key add archive.key
This key enables you to verify that you are downloading genuine packages.
Step 3: Install CDH 5 with YARN
Note:
Skip this step if you intend to use only MRv1. Directions for installing MRv1 are in Step 4.
Note:
If you decide to configure HA for the NameNode, do not install hadoop-hdfs-secondarynamenode.
After completing the HA software configuration, follow the installation instructions under Deploying
HDFS High Availability.
Important:
Cloudera recommends that you install (or update) and start a ZooKeeper cluster before proceeding.
This is a requirement if you are deploying high availability (HA) for the NameNode.
Red Hat/CentOS compatible:
• ResourceManager host: sudo yum clean all; sudo yum install hadoop-yarn-resourcemanager
• NameNode host(s): sudo yum clean all; sudo yum install hadoop-hdfs-namenode
• Secondary NameNode host (if used): sudo yum clean all; sudo yum install hadoop-hdfs-secondarynamenode
• All worker hosts: sudo yum clean all; sudo yum install hadoop-yarn-nodemanager hadoop-hdfs-datanode hadoop-mapreduce
• One host in the cluster (MapReduce JobHistory Server and YARN web proxy): sudo yum clean all; sudo yum install hadoop-mapreduce-historyserver hadoop-yarn-proxyserver
• All client hosts: sudo yum clean all; sudo yum install hadoop-client
Note:
The hadoop-yarn and hadoop-hdfs packages are installed on each system automatically as
dependencies of the other packages.
Note:
If you are also installing YARN, you can skip any packages you have already installed in Step 3: Install
CDH 5 with YARN on page 173.
Skip this step and go to Step 3: Install CDH 5 with YARN on page 173 if you intend to use only YARN.
Important:
Cloudera recommends that you install (or update) and start a ZooKeeper cluster before proceeding.
This is a requirement if you are deploying high availability (HA) for the NameNode or JobTracker.
Follow instructions under ZooKeeper Installation. Make sure you create the myid file in the data directory, as
instructed, if you are starting a ZooKeeper ensemble after a fresh install.
Next, install packages.
Install each type of daemon package on the appropriate system(s), as follows.
Note:
On Ubuntu systems, Ubuntu may try to start the service immediately after you install it. This should
fail harmlessly, but if you want to prevent it, there is advice here.
Red Hat/CentOS compatible:
• JobTracker host: sudo yum clean all; sudo yum install hadoop-0.20-mapreduce-jobtracker
• NameNode host(s): sudo yum clean all; sudo yum install hadoop-hdfs-namenode
• Secondary NameNode host (if used): sudo yum clean all; sudo yum install hadoop-hdfs-secondarynamenode
• All worker hosts: sudo yum clean all; sudo yum install hadoop-0.20-mapreduce-tasktracker hadoop-hdfs-datanode
• All client hosts: sudo yum clean all; sudo yum install hadoop-client
Note:
If you are upgrading to a new version of LZO, rather than installing it for the first time, you must first
remove the old version; for example, on a RHEL system:
1. Add the repository on each host in the cluster. Follow the instructions for your OS version:
Important: Make sure you do not let the file name default to
cloudera.list, as that will overwrite your existing cloudera.list.
3. Continue with installing and deploying CDH. As part of the deployment, you will need to do some additional
configuration for LZO, as shown under Configuring LZO on page 213.
Important: Make sure you do this configuration after you have copied the default configuration
files to a custom location and set alternatives to point to it.
Warning:
Do not attempt to use these instructions to roll your cluster back to a previous release. Use them
only to expand an existing cluster that you do not want to upgrade to the latest release, or to create
a new cluster running a version of CDH 5 that is earlier than the current CDH 5 release.
On Red Hat-compatible systems
The file should look like this when you open it for editing:
[cloudera-cdh5]
name=Cloudera's Distribution for Hadoop, Version 5
baseurl=http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/5/
gpgkey = http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck = 1
To install an earlier release, change the end of the baseurl line to point at that release. For example, for release 5.0.1:
baseurl=http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/5.0.1/
The file should now look like:
[cloudera-cdh5]
name=Cloudera's Distribution for Hadoop, Version 5
baseurl=http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/5.0.1/
gpgkey = http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck = 1
On SLES systems
The file should look like this when you open it for editing:
[cloudera-cdh5]
name=Cloudera's Distribution for Hadoop, Version 5
baseurl=http://archive.cloudera.com/cdh5/sles/11/x86_64/cdh/5/
gpgkey = http://archive.cloudera.com/cdh5/sles/11/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck = 1
To install an earlier release, change the end of the baseurl line to point at that release. For example, for release 5.0.1:
baseurl=http://archive.cloudera.com/cdh5/sles/11/x86_64/cdh/5.0.1/
The file should now look like:
[cloudera-cdh5]
name=Cloudera's Distribution for Hadoop, Version 5
baseurl=http://archive.cloudera.com/cdh5/sles/11/x86_64/cdh/5.0.1/
gpgkey = http://archive.cloudera.com/cdh5/sles/11/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck = 1
Important: MapReduce MRv1 and YARN share a common set of configuration files, so it is safe to
configure both of them. Cloudera does not recommend running MapReduce MRv1 and YARN daemons
on the same hosts at the same time. If you want to easily switch between MapReduce MRv1 and
YARN, consider using Cloudera Manager features for managing these services.
In YARN, the ResourceManager effectively replaces the JobTracker, and NodeManagers run on worker hosts instead of TaskTracker daemons. The per-application ApplicationMaster
is, in effect, a framework-specific library and negotiates resources from the ResourceManager and works with
the NodeManagers to execute and monitor the tasks. For details of this architecture, see Apache Hadoop NextGen
MapReduce (YARN).
See also Migrating from MapReduce 1 (MRv1) to MapReduce 2 (MRv2, YARN) on page 182.
Introduction
MapReduce 2, or Next Generation MapReduce, is a long needed upgrade to the way that scheduling, resource
management, and execution occur in Hadoop. At their core, the improvements separate cluster resource
management capabilities from MapReduce-specific logic. They enable Hadoop to share resources dynamically
between MapReduce and other parallel processing frameworks, such as Impala, allow more sensible and
finer-grained resource configuration for better cluster utilization, and permit it to scale to accommodate more
and larger jobs.
This document provides a guide to both the architectural and user-facing changes, so that both cluster operators
and MapReduce programmers can easily make the transition.
The new architecture has its advantages. First, by breaking up the JobTracker into a few different services, it
avoids many of the scaling issues faced by MapReduce in Hadoop 1. More importantly, it makes it possible to
run frameworks other than MapReduce on a Hadoop cluster. For example, Impala can also run on YARN and
share resources with MapReduce.
Requesting Resources
A MapReduce job submission includes the amount of resources to reserve for each map and reduce task. As in
MapReduce 1, the amount of memory requested is controlled by the mapreduce.map.memory.mb and
mapreduce.reduce.memory.mb properties.
MapReduce 2 adds additional parameters that control how much processing power to reserve for each task as
well. The mapreduce.map.cpu.vcores and mapreduce.reduce.cpu.vcores properties express how much
parallelism a map or reduce task can take advantage of. These should remain at their default value of 1 unless
your code is explicitly spawning extra compute-intensive threads.
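For illustration, a job that spawns extra compute threads per task could request larger containers and more vcores at submission time (a sketch only: my-job.jar and MyDriver are hypothetical, the values are illustrative, and passing -D generic options this way assumes the driver uses ToolRunner):

$ hadoop jar my-job.jar MyDriver \
    -D mapreduce.map.memory.mb=2048 \
    -D mapreduce.map.cpu.vcores=2 \
    -D mapreduce.reduce.memory.mb=4096 \
    input_dir output_dir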
Note:
As of CDH 5.4.0, configuring MapReduce jobs is simpler than before: instead of having to set both the
heap size (mapreduce.map.java.opts or mapreduce.reduce.java.opts) and the container size
(mapreduce.map.memory.mb or mapreduce.reduce.memory.mb), you can now choose to set only
one of them; the other is inferred from mapreduce.job.heap.memory-mb.ratio. If you don't specify
either of them, container size defaults to 1 GB and the heap size is inferred.
The impact on user jobs is as follows: for jobs that don't set heap size, this increases the JVM size
from 200 MB to a default 820 MB. This should be okay for most jobs, but streaming tasks might need
more memory because their Java process takes their total usage over the container size. Even in that
case, this would likely happen only for those tasks relying on aggressive GC to keep the heap under
200 MB.
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
• mapred-site.xml configuration
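As a minimal sketch (not the full configuration referenced above), the essential mapred-site.xml setting for MRv2 is to direct MapReduce jobs to YARN:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>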
See Deploying MapReduce v2 (YARN) on a Cluster on page 215 for instructions for a full deployment.
Resource Configuration
One of the larger changes in MRv2 is the way that resources are managed. In MRv1, each host was configured
with a fixed number of map slots and a fixed number of reduce slots. Under YARN, there is no distinction between
resources available for maps and resources available for reduces - all resources are available for both. Second,
the notion of slots has been discarded, and resources are now configured in terms of amounts of memory (in
megabytes) and CPU (in “virtual cores”, which are described below). Resource configuration is an inherently
difficult topic, and the added flexibility that YARN provides in this regard also comes with added complexity.
Cloudera Manager will pick sensible values automatically, but if you are setting up your cluster manually or just
interested in the details, read on.
Configuring Memory Settings for YARN and MRv2
The memory configuration for YARN and MRv2 memory is important to get the best performance from your
cluster. Several different settings are involved. The table below shows the default settings, as well as the settings
that Cloudera recommends, for each configuration option. See Managing MapReduce and YARN for more
configuration specifics; and, for detailed tuning advice with sample configurations, see Tuning the Cluster for
MapReduce v2 (YARN) on page 195.
Resource Requests
From the perspective of a developer requesting resource allocations for a job’s tasks, nothing needs to be
changed. Map and reduce task memory requests still work and, additionally, tasks that will use multiple threads
can request more than 1 core with the mapreduce.map.cpu.vcores and mapreduce.reduce.cpu.vcores
properties.
Scheduler Configuration
Cloudera recommends using the Fair Scheduler in MRv2. (FIFO and Capacity Scheduler are also available.) Fair
Scheduler allocation files require changes in light of the new way that resources work. The minMaps, maxMaps,
minReduces, and maxReduces queue properties have been replaced with a minResources property and a
maxResources property. Instead of taking a number of slots, these properties take a value like “1024 MB, 3 vcores”. By
default, the MRv2 Fair Scheduler will attempt to equalize memory allocations in the same way it attempted to
equalize slot allocations in MRv1. The MRv2 Fair Scheduler contains a number of new features including
hierarchical queues and fairness based on multiple resources.
Administration Commands
The jobtracker and tasktracker commands, which start the JobTracker and TaskTracker, are no longer
supported because these services no longer exist. They are replaced with “yarn resourcemanager” and “yarn
nodemanager”, which start the ResourceManager and NodeManager respectively. “hadoop mradmin” is no
longer supported. Instead, “yarn rmadmin” should be used. The new admin commands mimic the functionality
of the MRv1 names, allowing nodes, queues, and ACLs to be refreshed while the ResourceManager is running.
Security
The following section outlines the additional changes needed to migrate a secure cluster.
New YARN Kerberos service principals should be created for the ResourceManager and NodeManager, using
the pattern used for other Hadoop services, that is, yarn@HOST. The mapred principal should still be used for
the JobHistory Server. If you are using Cloudera Manager to configure security, this will be taken care of
automatically.
As in MRv1, a configuration must be set to have the user that submits a job own its task processes. The equivalent
of MRv1’s LinuxTaskController is the LinuxContainerExecutor. In a secure setup, NodeManager configurations
should set yarn.nodemanager.container-executor.class to
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor. Properties set in the
taskcontroller.cfg configuration file should be migrated to their analogous properties in the
container-executor.cfg file.
In secure setups, configuring hadoop-policy.xml allows administrators to set up access control lists on internal
protocols. The following is a table of MRv1 options and their MRv2 equivalents:
MRv1 MRv2
security.job.submission.protocol.acl security.applicationclient.protocol.acl
security.admin.operations.protocol.acl security.resourcemanager-administration.protocol.acl
Queue access control lists (ACLs) are now placed in the Fair Scheduler configuration file instead of the JobTracker
configuration. A list of users and groups that can submit jobs to a queue can be placed in aclSubmitApps in
the queue’s configuration. The queue administration ACL is no longer supported, but will be in a future release.
Ports
The following is a list of default ports used by MRv2 and YARN, as well as the configuration properties used to
configure them.
Note:
You can set yarn.resourcemanager.hostname.id for each ResourceManager instead of setting
the ResourceManager values; this will cause YARN to use the default ports on those hosts.
High Availability
YARN supports ResourceManager HA to make a YARN cluster highly-available; the underlying architecture of
active-standby pair is similar to JobTracker HA in MRv1. A major improvement over MRv1 is: in YARN, the
completed tasks of in-flight MapReduce jobs are not re-run on recovery after the ResourceManager is restarted
or failed over. Further, the configuration and setup has also been simplified. The main differences are:
1. Failover controller has been moved from a separate ZKFC daemon to be a part of the ResourceManager
itself. So, there is no need to run an additional daemon.
2. Clients, applications, and NodeManagers do not require configuring a proxy-provider to talk to the active
ResourceManager.
Below is a table with HA-related configurations used in MRv1 and their equivalents in YARN:
MRv1 YARN Comment
mapred.ha.automatic-failover.enabled yarn.resourcemanager.ha.automatic-failover.enabled Enable automatic failover
mapred.ha.zkfc.port yarn.resourcemanager.ha.automatic-failover.port
mapred.job.tracker yarn.resourcemanager.cluster.id Cluster name
The next step is to look at all the service configurations placed in mapred-site.xml and replace them with their
corresponding YARN configuration. Configurations starting with yarn should be placed inside yarn-site.xml,
not mapred-site.xml. Refer to the Resource Configuration section above for best practices on how to convert
TaskTracker slot capacities (mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum) to NodeManager resource capacities
(yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores), as well as how
to convert configurations in the Fair Scheduler allocations file, fair-scheduler.xml.
Finally, you can start the ResourceManager, NodeManagers and the JobHistoryServer.
Web UI
In MRv1, the JobTracker Web UI served detailed information about the state of the cluster and the jobs (recent
and current) running on it. It also contained the job history page, which served information from disk about older
jobs.
The MRv2 Web UI provides the same information structured in the same way, but has been revamped with a
new look and feel. The ResourceManager’s UI, which includes information about running applications and the
state of the cluster, is now located by default at <ResourceManager host>:8088. The JobHistory UI is now located
by default at <JobHistoryServer host>:19888. Jobs can be searched and viewed there just as they could in
MapReduce 1.
Because the ResourceManager is meant to be agnostic to many of the concepts in MapReduce, it cannot host
job information directly. Instead, it proxies to a Web UI that can. If the job is running, this proxy is the relevant
MapReduce Application Master; if the job has completed, then this proxy is the JobHistoryServer. Thus, the user
experience is similar to that of MapReduce 1, but the information is now coming from different places.
Miscellaneous Properties
MRv1 Comment
mapreduce.tasktracker.group
mapred.child.ulimit
mapred.tasktracker.dns.interface
mapred.tasktracker.dns.nameserver
mapred.tasktracker.instrumentation NodeManager does not accept instrumentation
mapred.job.reuse.jvm.num.tasks JVM reuse no longer supported
mapreduce.job.jvm.numtasks JVM reuse no longer supported
mapred.task.tracker.report.address No need for this, as containers do not use IPC with
NodeManagers, and ApplicationMaster ports are
chosen at runtime
mapreduce.task.tmp.dir No longer configurable. Now always tmp/ (under
container's local dir)
mapred.child.tmp No longer configurable. Now always tmp/ (under
container's local dir)
mapred.temp.dir
mapred.jobtracker.instrumentation ResourceManager does not accept instrumentation
mapred.jobtracker.plugins ResourceManager does not accept plugins
mapred.task.cache.level
mapred.queue.names These go in the scheduler-specific configuration files
mapred.system.dir
mapreduce.tasktracker.cache.local.numberdirectories
mapreduce.reduce.input.limit
io.sort.record.percent Tuned automatically (MAPREDUCE-64)
mapred.cluster.map.memory.mb Not necessary; MRv2 uses resources instead of slots
mapred.cluster.reduce.memory.mb Not necessary; MRv2 uses resources instead of slots
mapred.max.tracker.blacklists
mapred.jobtracker.maxtasks.per.job Related configurations go in scheduler-specific
configuration files
mapred.jobtracker.taskScheduler.maxRunningTasksPerJob Related configurations go in scheduler-specific
configuration files
io.map.index.skip
mapred.user.jobconf.limit
mapred.local.dir.minspacestart
mapred.local.dir.minspacekill
hadoop.rpc.socket.factory.class.JobSubmissionProtocol
mapreduce.tasktracker.outofband.heartbeat Always on
mapred.jobtracker.job.history.block.size
You can now configure YARN to use the remaining resources for its supervisory processes and task containers.
Start with the NodeManager, which has the following settings:
Hadoop is a disk I/O-centric platform by design. The number of independent physical drives (“spindles”) dedicated
to DataNode use limits how much concurrent processing a node can sustain. As a result, the number of vcores
allocated to the NodeManager should be the lesser of either:
• (total vcores) – (number of vcores reserved for non-YARN use), or
• 2 x (number of physical disks used for DataNode storage)
The amount of RAM allotted to a NodeManager for spawning containers should be the node's physical RAM
minus all non-YARN memory demand. So yarn.nodemanager.resource.memory-mb =
total memory on the node - (sum of all memory allocations to other processes such as DataNode, NodeManager,
RegionServer etc.) For the example node, assuming the DataNode has 10 physical drives, the calculation is:
Property Value
yarn.nodemanager.resource.cpu-vcores min(24 – 6, 2 x 10) = 18
yarn.nodemanager.resource.memory-mb 137,830 MB
If a NodeManager has 50 GB or more RAM available for containers, consider increasing the minimum allocation
to 2 GB. The default memory increment is 512 MB. For minimum memory of 1 GB, a container that requires 1.2
GB receives 1.5 GB. You can set maximum memory allocation equal to yarn.nodemanager.resource.memory-mb.
The default minimum and increment value for vcores is 1. Because application tasks are not commonly
multithreaded, you generally do not need to change this value. The maximum value is usually equal to
yarn.nodemanager.resource.cpu-vcores. Reduce this value to limit the number of containers running
concurrently on one node.
The example leaves more than 50 GB RAM available for containers, which accommodates the following settings:
Property Value
yarn.scheduler.minimum-allocation-mb 2,048 MB
yarn.scheduler.maximum-allocation-mb 137,830 MB
yarn.scheduler.maximum-allocation-vcores 18
The settings for mapreduce.[map | reduce].java.opts.max.heap specify the default memory allotted for
mapper and reducer heap size, respectively. The mapreduce.[map | reduce].memory.mb settings specify the
memory allotted to their containers, and the value assigned should allow overhead beyond the task heap size.
Cloudera recommends applying a factor of 1.2 to the mapreduce.[map | reduce].java.opts.max.heap
setting. The optimal value depends on the actual tasks. Cloudera also recommends setting
mapreduce.map.memory.mb to 1–2 GB and setting mapreduce.reduce.memory.mb to twice the mapper value.
The ApplicationMaster heap size is 1 GB by default, and can be increased if your jobs contain many concurrent
tasks. Using these guides, size the example worker node as follows:
Property Value
mapreduce.map.memory.mb 2048 MB
mapreduce.reduce.memory.mb 4096 MB
mapreduce.map.java.opts.max.heap 0.8 x 2,048 = 1,638 MB
mapreduce.reduce.java.opts.max.heap 0.8 x 4,096 = 3,277 MB
Defining Containers
With YARN worker resources configured, you can determine how many containers best support a MapReduce
application, based on job type and system resources. For example, a CPU-bound workload such as a Monte Carlo
simulation requires very little data but complex, iterative processing. The ratio of concurrent containers per spindle
is likely greater than for an ETL workload, which tends to be I/O-bound. For applications that use a lot of memory
in the map or reduce phase, the number of containers that can be scheduled is limited by RAM available to the
container and the RAM required by the task. Other applications may be limited based on vcores not in use by
other YARN applications or the rules employed by dynamic resource pools (if used).
To calculate the number of containers for mappers and reducers based on actual system constraints, start with
the following formulas:
Property Value
mapreduce.job.maps MIN(yarn.nodemanager.resource.memory-mb /
mapreduce.map.memory.mb, yarn.nodemanager.resource.cpu-vcores /
mapreduce.map.cpu.vcores, number of physical drives x workload factor)
x number of worker nodes
mapreduce.job.reduces MIN(yarn.nodemanager.resource.memory-mb /
mapreduce.reduce.memory.mb, yarn.nodemanager.resource.cpu-vcores
/ mapreduce.reduce.cpu.vcores, # of physical drives x workload factor) x
# of worker nodes
The workload factor can be set to 2.0 for most workloads. Consider a higher setting for CPU-bound workloads.
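As a hedged worked example, the following sketch applies these formulas to the example node described above
(18 vcores, 137,830 MB, ten drives, workload factor 2); the ten-node cluster size is an assumption, and the
per-task vcore values of 1 reflect the defaults for mapreduce.[map | reduce].cpu.vcores.
#!/bin/sh
NODES=10 # assumed cluster size (illustrative)
NM_MB=137830; NM_VCORES=18; DRIVES=10; FACTOR=2 # example node from this section
MAP_MB=2048; RED_MB=4096; MAP_VCORES=1; RED_VCORES=1 # per-task settings from the example
min3() { printf '%s\n' "$1" "$2" "$3" | sort -n | head -1; } # smallest of three values
MAPS=$(( $(min3 $((NM_MB / MAP_MB)) $((NM_VCORES / MAP_VCORES)) $((DRIVES * FACTOR))) * NODES ))
REDUCES=$(( $(min3 $((NM_MB / RED_MB)) $((NM_VCORES / RED_VCORES)) $((DRIVES * FACTOR))) * NODES ))
echo "mapreduce.job.maps=$MAPS mapreduce.job.reduces=$REDUCES" # prints 180 for each in this example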
Many other factors can influence the performance of a MapReduce application, including:
• Configured rack awareness
• Skewed or imbalanced data
• Network throughput
• Co-tenancy demand (other services or applications using the cluster)
• Dynamic resource pooling
You may also have to maximize or minimize cluster utilization for your workload or to meet Service Level
Agreements (SLAs). To find the best resource configuration for an application, try various container and
gateway/client settings and record the results.
For example, the following TeraGen/TeraSort script supports throughput testing with a 10-GB data load and a
loop of varying YARN container and gateway/client settings. You can observe which configuration yields the
best results.
#!/bin/sh
HADOOP_PATH=/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce
for i in 2 4 8 16 32 64 # Number of mapper containers to test
do
for j in 2 4 8 16 32 64 # Number of reducer containers to test
do
for k in 1024 2048 # Container memory for mappers/reducers to test
do
MAP_MB=`echo "($k*0.8)/1" | bc` # JVM heap size for mappers
RED_MB=`echo "($k*0.8)/1" | bc` # JVM heap size for reducers
done
done
done
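The script above only computes the heap sizes for each combination; it does not show the benchmark jobs
themselves. The following is a hedged sketch of what the innermost loop body might run, assuming the standard
examples JAR under $HADOOP_PATH, a row count of 100,000,000 (100-byte rows, roughly 10 GB), and illustrative
HDFS output paths; adjust all of these to your environment.
# Illustrative loop body; paths, JAR name, and row count are assumptions
hadoop fs -rm -r -skipTrash /user/$USER/teragen /user/$USER/terasort 2>/dev/null
hadoop jar $HADOOP_PATH/hadoop-examples.jar teragen \
-Dmapreduce.job.maps=$i -Dmapreduce.map.memory.mb=$k \
-Dmapreduce.map.java.opts=-Xmx${MAP_MB}m \
100000000 /user/$USER/teragen
hadoop jar $HADOOP_PATH/hadoop-examples.jar terasort \
-Dmapreduce.job.reduces=$j -Dmapreduce.reduce.memory.mb=$k \
-Dmapreduce.reduce.java.opts=-Xmx${RED_MB}m \
/user/$USER/teragen /user/$USER/terasort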
Note: Do the tasks in this section after installing the latest version of CDH; see Installing the Latest
CDH 5 Release on page 166.
Configuring Dependencies
This section explains the tasks you must perform before deploying CDH on a cluster.
Enabling NTP
CDH requires that you configure the Network Time Protocol (NTP) service on each machine in your cluster. To
start NTP and configure it to run automatically on reboot, perform the following steps on each node in your
cluster.
1. Install NTP.
• For RHEL, CentOS, and Oracle:
sudo yum install ntp
• For SLES:
sudo zypper install ntp
2. Open the /etc/ntp.conf file and add NTP servers, as in the following example.
server 0.pool.ntp.org
server 1.pool.ntp.org
server 2.pool.ntp.org
3. Configure the NTP service to run at reboot:
chkconfig ntpd on
4. Synchronize the system clock with your NTP server, then update the hardware clock:
ntpdate -u <your_ntp_server>
hwclock --systohc
Important:
• If you use Cloudera Manager, do not use these command-line instructions.
• This information applies specifically to CDH 5.4.x. If you use an earlier version of CDH, see the
documentation for that version located at Cloudera Documentation.
To ensure that the members of the cluster can communicate with each other, do the following on every system.
Important:
CDH requires IPv4. IPv6 is not supported.
1. Set the hostname of each system to a unique name (not localhost). For example:
sudo hostname myhost-1
Note: This is a temporary measure only. The hostname set by hostname does not survive across
reboots.
2. Make sure the /etc/hosts file on each system contains the IP addresses and fully-qualified domain names
(FQDN) of all the members of the cluster.
Important:
• The canonical name of each host in /etc/hosts must be the FQDN (for example
myhost-1.mynet.myco.com), not the unqualified hostname (for example myhost-1). The
canonical name is the first entry after the IP address.
• Do not use aliases, either in /etc/hosts or in configuring DNS.
If you are using DNS, storing this information in /etc/hosts is not required, but it is good practice.
3. Make sure the /etc/sysconfig/network file on each system contains the hostname you have just set (or
verified) for that system, for example myhost-1.
4. Check that this system is consistently identified to the network:
a. Run uname -a and check that the hostname matches the output of the hostname command.
b. Run /sbin/ifconfig and note the value of inet addr in the eth0 entry, for example:
$ /sbin/ifconfig
eth0 Link encap:Ethernet HWaddr 00:0C:29:A4:E8:97
inet addr:172.29.82.176 Bcast:172.29.87.255 Mask:255.255.248.0
...
c. Run host -v -t A `hostname` and make sure that hostname matches the output of the hostname
command, and has the same IP address as reported by ifconfig for eth0; for example:
$ host -v -t A `hostname`
Trying "myhost.mynet.myco.com"
...
;; ANSWER SECTION:
myhost.mynet.myco.com. 60 IN A 172.29.82.176
5. For MRv1: make sure conf/core-site.xml and conf/mapred-site.xml, respectively, have the hostnames
– not the IP addresses – of the NameNode and the JobTracker. These can be FQDNs (for example
myhost-1.mynet.myco.com), or unqualified hostnames (for example myhost-1). See Customizing
Configuration Files and Deploying MapReduce v1 (MRv1) on a Cluster.
6. For YARN: make sure conf/core-site.xml and conf/yarn-site.xml, respectively, have the hostnames
– not the IP addresses – of the NameNode, the ResourceManager, and the ResourceManager Scheduler. See
Customizing Configuration Files and Deploying MapReduce v2 (YARN) on a Cluster.
7. Make sure that components that depend on a client-server relationship – Oozie, HBase, ZooKeeper – are
configured according to the instructions on their installation pages:
• Oozie Installation
• HBase Installation
• ZooKeeper Installation
Disabling SELinux
Security-Enhanced Linux (SELinux) allows you to set access control through policies. You must disable SELinux
on each host before you deploy CDH on your cluster.
To disable SELinux, perform the following steps on each host.
1. Check the SELinux state.
getenforce
If the output is either permissive or disabled, you can skip this task and go to Disabling the Firewall on
page 202. If the output is enforcing, continue to the next step.
2. Open the /etc/selinux/config file (in some systems, the /etc/sysconfig/selinux file).
3. Change the line SELINUX=enforcing to SELINUX=permissive.
4. Save and close the file.
5. Restart your system or run the following command to disable SELinux immediately:
setenforce 0
Disabling the Firewall
To disable the firewall on each host in your cluster, perform the following steps.
1. Save the existing firewall rules so that you can restore them later if needed, for example:
sudo iptables-save > ~/firewall.rules
2. Disable iptables.
• For RHEL, CentOS, Oracle, and Debian:
sudo chkconfig iptables off
and
sudo /etc/init.d/iptables stop
• For SLES:
sudo chkconfig SuSEfirewall2_setup off
and
sudo rcSuSEfirewall2 stop
• For Ubuntu:
sudo service ufw stop
Important:
For instructions for configuring High Availability (HA) for the NameNode, see HDFS High Availability.
For instructions on using HDFS Access Control Lists (ACLs), see HDFS Extended ACLs.
Proceed as follows to deploy HDFS on a cluster. Do this for all clusters, whether you are deploying MRv1 or
YARN:
Important:
• If you use Cloudera Manager, do not use these command-line instructions.
• This information applies specifically to CDH 5.4.x. If you use an earlier version of CDH, see the
documentation for that version located at Cloudera Documentation.
You can call this configuration anything you like; in this example, it's called my_cluster.
1. Copy the default configuration to your custom directory:
sudo cp -r /etc/hadoop/conf.empty /etc/hadoop/conf.my_cluster
Important:
When performing the configuration tasks in this section, and when you go on to deploy MRv1 or
YARN, edit the configuration files in this custom directory. Do not create your custom configuration
in the default directory /etc/hadoop/conf.empty.
2. CDH uses the alternatives setting to determine which Hadoop configuration to use. Set alternatives
to point to your custom directory, as follows.
To manually set the configuration on Red Hat-compatible systems:
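A hedged sketch of these commands follows; the alternative name hadoop-conf reflects the CDH packaging
convention, and the priority of 50 matches the description below. On Ubuntu and SLES systems, use
update-alternatives instead of alternatives.
sudo alternatives --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.my_cluster 50
sudo alternatives --set hadoop-conf /etc/hadoop/conf.my_cluster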
Because the configuration in /etc/hadoop/conf.my_cluster has the highest priority (50), that is the one CDH
will use. For more information on alternatives, see the update-alternatives(8) man page on Ubuntu and
SLES systems or the alternatives(8) man page on Red Hat-compatible systems.
Customizing Configuration Files
The following tables show the most important properties that you must configure for your cluster.
Note:
For information on other important configuration properties, and the configuration files, see the
Apache Cluster Setup page.
Sample Configuration
core-site.xml:
<property>
<name>fs.defaultFS</name>
<value>hdfs://namenode-host.company.com:8020</value>
</property>
hdfs-site.xml:
<property>
<name>dfs.permissions.superusergroup</name>
<value>hadoop</value>
</property>
Note:
dfs.data.dir and dfs.name.dir are deprecated; you should use dfs.datanode.data.dir and
dfs.namenode.name.dir instead, though dfs.data.dir and dfs.name.dir will still work.
Sample configuration:
hdfs-site.xml on the NameNode:
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///data/1/dfs/nn,file:///nfsmount/dfs/nn</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///data/1/dfs/dn,file:///data/2/dfs/dn,file:///data/3/dfs/dn,file:///data/4/dfs/dn</value>
</property>
After specifying these directories as shown above, you must create the directories and assign the correct file
permissions to them on each node in your cluster.
In the following instructions, local path examples are used to represent Hadoop parameters. Change the path
examples to match your configuration.
Local directories:
Important:
If you are using High Availability (HA), you should not configure these directories on an NFS mount;
configure them on local storage.
3. Configure the owner of the dfs.name.dir or dfs.namenode.name.dir directory, and of the dfs.data.dir
or dfs.datanode.data.dir directory, to be the hdfs user:
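A hedged sketch of creating the directories from the sample configuration above and assigning them to the
hdfs user and group:
sudo mkdir -p /data/1/dfs/nn /nfsmount/dfs/nn
sudo mkdir -p /data/1/dfs/dn /data/2/dfs/dn /data/3/dfs/dn /data/4/dfs/dn
sudo chown -R hdfs:hdfs /data/1/dfs/nn /nfsmount/dfs/nn /data/1/dfs/dn /data/2/dfs/dn /data/3/dfs/dn /data/4/dfs/dn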
Note:
For a list of the users created when you install CDH, see Hadoop Users in Cloudera Manager and
CDH.
Here is a summary of the correct owner and permissions of the local directories:
Footnote: 1 The Hadoop daemons automatically set the correct permissions for you on dfs.data.dir or
dfs.datanode.data.dir. But in the case of dfs.name.dir or dfs.namenode.name.dir, permissions are
currently incorrectly set to the file-system default, usually drwxr-xr-x (755). Use the chmod command to
reset permissions for these dfs.name.dir or dfs.namenode.name.dir directories to drwx------ (700); for
example:
$ sudo chmod 700 /data/1/dfs/nn /nfsmount/dfs/nn
or
$ sudo chmod go-rwx /data/1/dfs/nn /nfsmount/dfs/nn
Note:
If you specified nonexistent directories for the dfs.data.dir or dfs.datanode.data.dir property
in the hdfs-site.xml file, CDH 5 will shut down. (In previous releases, CDH silently ignored
nonexistent directories for dfs.data.dir.)
Note:
It is important that dfs.datanode.failed.volumes.tolerated not be configured to tolerate too
many directory failures, as the DataNode will perform poorly if it has few functioning data directories.
Important:
• Make sure you format the NameNode as user hdfs.
• If you are re-formatting the NameNode, keep in mind that this invalidates the DataNode storage
locations, so you should remove the data under those locations after the NameNode is formatted.
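A minimal sketch of the format command, assuming Kerberos is not enabled (see the note that follows):
sudo -u hdfs hdfs namenode -format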
Note:
If Kerberos is enabled, do not use commands in the form sudo -u <user> hadoop <command>; they
will fail with a security error. Instead, use the following commands: $ kinit <user> (if you are using
a password) or $ kinit -kt <keytab> <principal> (if you are using a keytab) and then, for each
command executed by this user, $ <command>
Note: Respond with an upper-case Y; if you use lower case, the process will abort.
tcp,soft,intr,timeo=10,retrans=10
These options configure a soft mount over TCP; transactions will be retried ten times (retrans=10) at 1-second
intervals (timeo=10) before being deemed to have failed.
Example:
mount -t nfs -o tcp,soft,intr,timeo=10,retrans=10 <server>:<export> <mount_point>
where <server> is the remote host, <export> is the exported file system, and <mount_point> is the local
mount point.
Note:
Cloudera recommends similar settings for shared HA mounts, as in the example that follows.
Note that in the HA case timeo should be set to 50 (five seconds), rather than 10 (1 second), and retrans should
be set to 12, giving an overall timeout of 60 seconds.
For more information, see the man pages for mount and nfs.
Important:
The Secondary NameNode does not provide failover or High Availability (HA). If you intend to configure
HA for the NameNode, skip this section: do not install or configure the Secondary NameNode (the
Standby NameNode performs checkpointing). After completing the HA software configuration, follow
the installation instructions under Deploying HDFS High Availability.
In non-HA deployments, configure a Secondary NameNode that will periodically merge the EditLog with the
FSImage, creating a new FSImage which incorporates the changes which were in the EditLog. This reduces the
amount of disk space consumed by the EditLog on the NameNode, and also reduces the restart time for the
Primary NameNode.
A standard Hadoop cluster (not a Hadoop Federation or HA configuration), can have only one Primary NameNode
plus one Secondary NameNode. On production systems, the Secondary NameNode should run on a different
machine from the Primary NameNode to improve scalability (because the Secondary NameNode does not
compete with the NameNode for memory and other resources to create the system snapshot) and durability
(because the copy of the metadata is on a separate machine that is available if the NameNode hardware fails).
<property>
<name>dfs.namenode.http-address</name>
<value><namenode.host.address>:50070</value>
<description>
The address and the base port on which the dfs NameNode Web UI will listen.
</description>
</property>
Note:
• dfs.http.address is deprecated; use dfs.namenode.http-address.
• In most cases, you should set dfs.namenode.http-address to a routable IP address with
port 50070. However, in some cases such as Amazon EC2, when the NameNode should bind
to multiple local addresses, you may want to set dfs.namenode.http-address to
0.0.0.0:50070 on the NameNode machine only, and set it to a real, routable address on the
Secondary NameNode machine. The different addresses are needed in this case because HDFS
uses dfs.namenode.http-address for two different purposes: it defines both the address
the NameNode binds to, and the address the Secondary NameNode connects to for
checkpointing. Using 0.0.0.0 on the NameNode allows the NameNode to bind to all its local
addresses, while using the externally-routable address on the Secondary NameNode provides
the Secondary NameNode with a real address to connect to.
See http://archive.cloudera.com/cdh5/cdh/5/hadoop/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
for details.
Enabling Trash
The Hadoop trash feature helps prevent accidental deletion of files and directories. If trash is enabled and a file
or directory is deleted using the Hadoop shell, the file is moved to the .Trash directory in the user's home
directory instead of being deleted. Deleted files are initially moved to the Current sub-directory of the .Trash
directory, and their original path is preserved. If trash checkpointing is enabled, the Current directory is periodically
renamed using a timestamp. Files in .Trash are permanently removed after a user-configurable time delay.
Files and directories in the trash can be restored simply by moving them to a location outside the .Trash
directory.
Important:
• The trash feature is disabled by default. Cloudera recommends that you enable it on all production
clusters.
• The trash feature works by default only for files and directories deleted using the Hadoop shell.
Files or directories deleted programmatically using other interfaces (WebHDFS or the Java APIs,
for example) are not moved to trash, even if trash is enabled, unless the program has implemented
a call to the trash functionality. (Hue, for example, implements trash as of CDH 4.4.)
Users can bypass trash when deleting files using the shell by specifying the -skipTrash option
to the hadoop fs -rm -r command. This can be useful when it is necessary to delete files that
are too large for the user's quota.
For example, to enable trash so that files deleted using the Hadoop shell are not deleted for 24 hours, set the
value of the fs.trash.interval property in the server's core-site.xml file to a value of 1440.
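For reference, a minimal core-site.xml sketch of that setting:
<property>
<name>fs.trash.interval</name>
<value>1440</value>
</property>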
Note:
The period during which a file remains in the trash starts when the file is moved to the trash, not
when the file is last modified.
Note: Keep in mind that if usage is markedly imbalanced among a given DataNode's storage volumes
when you enable storage balancing, throughput on that DataNode will be affected initially, as writes
are disproportionately directed to the under-utilized volumes.
Enabling WebHDFS
Note:
To configure HttpFs instead, see HttpFS Installation on page 319.
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
By default, WebHDFS accepts usernames that match the following pattern:
^[A-Za-z_][A-Za-z0-9._-]*[$]?$
You can override the default username pattern by setting the dfs.webhdfs.user.provider.user.pattern
property in hdfs-site.xml. For example, to allow numerical usernames, the property can be set as follows:
<property>
<name>dfs.webhdfs.user.provider.user.pattern</name>
<value>^[A-Za-z0-9_][A-Za-z0-9._-]*[$]?$</value>
</property>
Important: The username pattern should be compliant with the requirements of the operating system
in use. Hence, Cloudera recommends you use the default pattern and avoid modifying the
dfs.webhdfs.user.provider.user.pattern property when possible.
Note:
• To use WebHDFS in a secure cluster, you must set additional properties to configure secure
WebHDFS. For instructions, see the Cloudera Security guide.
• When you use WebHDFS in a high-availability (HA) configuration, you must supply the value of
dfs.nameservices in the WebHDFS URI, rather than the address of a particular NameNode; for
example:
hdfs dfs -ls webhdfs://nameservice1/, not webhdfs://<particular NameNode host>:<port>/
Configuring LZO
If you have installed LZO, configure it as follows.
To configure LZO:
Set the following property in core-site.xml.
Note:
If you copy and paste the value string, make sure you remove the line-breaks and carriage returns,
which are included below because of page-width constraints.
<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,
com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
$ scp -r /etc/hadoop/conf.my_cluster
myuser@myCDHnode-<n>.mycompany.com:/etc/hadoop/conf.my_cluster
For more information on alternatives, see the update-alternatives(8) man page on Ubuntu and SLES
systems or the alternatives(8) man page on Red Hat-compatible systems.
Start HDFS
Start HDFS on each node in the cluster, as follows:
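A hedged sketch of the start command; the init-script names beginning with hadoop-hdfs- follow the CDH
packaging convention:
for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done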
Note:
This starts all the CDH services installed on the node. This is normally what you want, but you can
start services individually if you prefer.
Important:
If you do not create /tmp properly, with the right permissions as shown below, you may have problems
with CDH components later. Specifically, if you don't create /tmp yourself, another process may create
it automatically with restrictive permissions that will prevent your other applications from using it.
Create the /tmp directory after HDFS is up and running, and set its permissions to 1777 (drwxrwxrwt), as follows:
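A hedged sketch of those commands, assuming Kerberos is not enabled (see the note that follows):
sudo -u hdfs hadoop fs -mkdir /tmp
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp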
Note:
If Kerberos is enabled, do not use commands in the form sudo -u <user> hadoop <command>; they
will fail with a security error. Instead, use the following commands: $ kinit <user> (if you are using
a password) or $ kinit -kt <keytab> <principal> (if you are using a keytab) and then, for each
command executed by this user, $ <command>
Important:
• If you use Cloudera Manager, do not use these command-line instructions.
• This information applies specifically to CDH 5.4.x. If you use an earlier version of CDH, see the
documentation for that version located at Cloudera Documentation.
This section describes configuration tasks for YARN clusters only, and is specifically tailored for administrators
who have installed YARN from packages.
Important:
Do the following tasks after you have configured and deployed HDFS:
to execute and monitor the tasks. For details of the new architecture, see Apache Hadoop NextGen MapReduce
(YARN).
See also Selecting Appropriate JAR files for your Jobs on page 183.
Important:
Make sure you are not trying to run MRv1 and YARN on the same set of nodes at the same time. This
is not recommended, especially in a cluster that is not managed by Cloudera Manager; it will degrade
performance and may result in an unstable cluster deployment.
• If you have installed YARN from packages, follow the instructions below to deploy it. (To deploy
MRv1 instead, see Deploying MapReduce v1 (MRv1) on a Cluster.)
• If you have installed CDH 5 from tarballs, the default deployment is YARN. Keep in mind that the
instructions on this page are tailored for a deployment following installation from packages.
Note:
Edit these files in the custom directory you created when you copied the Hadoop configuration. When
you have finished, you will push this configuration to all the nodes in the cluster; see Step 5.
Sample Configuration:
mapred-site.xml:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
In addition, enable log aggregation by setting yarn.log.aggregation-enable to true in yarn-site.xml (shown
in the sample configuration below).
Next, you need to specify, create, and assign the correct permissions to the local directories where you want the
YARN daemons to store data.
You specify the directories by configuring the following two properties in the yarn-site.xml file on all cluster
nodes:
The two properties are yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs, shown in the sample yarn-site.xml below.
<property>
<name>yarn.resourcemanager.hostname</name>
<value>resourcemanager.company.com</value>
</property>
<property>
<description>Classpath for typical applications.</description>
<name>yarn.application.classpath</name>
<value>
$HADOOP_CONF_DIR,
$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,
$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,
$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*
</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>file:///data/1/yarn/local,file:///data/2/yarn/local,file:///data/3/yarn/local</value>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>file:///data/1/yarn/logs,file:///data/2/yarn/logs,file:///data/3/yarn/logs</value>
</property>
<property>
<name>yarn.log.aggregation-enable</name>
<value>true</value>
</property>
<property>
<description>Where to aggregate logs</description>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>hdfs://<namenode-host.company.com>:8020/var/log/hadoop-yarn/apps</value>
</property>
After specifying these directories in the yarn-site.xml file, you must create the directories and assign the
correct file permissions to them on each node in your cluster.
In the following instructions, local path examples are used to represent Hadoop parameters. Change the path
examples to match your configuration.
To configure local storage directories for use by YARN:
1. Create the yarn.nodemanager.local-dirs local directories:
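A hedged sketch of creating the yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs directories from
the sample configuration above; the yarn:yarn ownership is an assumption based on common CDH defaults.
sudo mkdir -p /data/1/yarn/local /data/2/yarn/local /data/3/yarn/local
sudo mkdir -p /data/1/yarn/logs /data/2/yarn/logs /data/3/yarn/logs
sudo chown -R yarn:yarn /data/1/yarn/local /data/2/yarn/local /data/3/yarn/local
sudo chown -R yarn:yarn /data/1/yarn/logs /data/2/yarn/logs /data/3/yarn/logs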
Here is a summary of the correct owner and permissions of the local directories:
In addition, make sure proxying is enabled for the mapred user; configure the following properties in
core-site.xml:
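A hedged sketch of the standard Hadoop proxyuser settings for the mapred user; the wildcard values are
illustrative, and you can restrict them to specific hosts and groups as appropriate.
<property>
<name>hadoop.proxyuser.mapred.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.mapred.hosts</name>
<value>*</value>
</property>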
<property>
<name>yarn.app.mapreduce.am.staging-dir</name>
<value>/user</value>
</property>
2. Once HDFS is up and running, you will create this directory and a history subdirectory under it (see Step 8).
Alternatively, you can do the following:
1. Configure mapreduce.jobhistory.intermediate-done-dir and mapreduce.jobhistory.done-dir in
mapred-site.xml.
2. Create these two directories.
3. Set permissions on mapreduce.jobhistory.intermediate-done-dir to 1777.
4. Set permissions on mapreduce.jobhistory.done-dir to 750.
If you configure mapreduce.jobhistory.intermediate-done-dir and mapreduce.jobhistory.done-dir
as above, you can skip Step 8.
Step 5: If Necessary, Deploy your Custom Configuration to your Entire Cluster
Deploy the configuration on page 214 if you have not already done so.
Step 6: If Necessary, Start HDFS on Every Node in the Cluster
Start HDFS on page 213 if you have not already done so.
Important:
If you do not create /tmp properly, with the right permissions as shown below, you may have problems
with CDH components later. Specifically, if you don't create /tmp yourself, another process may create
it automatically with restrictive permissions that will prevent your other applications from using it.
Step 8: Create the history Directory and Set Permissions and Owner
This is a subdirectory of the staging directory you configured in Step 4. In this example we're using /user/history.
Create it and set permissions as follows:
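A hedged sketch of those commands, assuming the mapred:hadoop ownership commonly used for the MapReduce
JobHistory directories:
sudo -u hdfs hadoop fs -mkdir -p /user/history
sudo -u hdfs hadoop fs -chmod -R 1777 /user/history
sudo -u hdfs hadoop fs -chown mapred:hadoop /user/history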
Note:
See also Step 2.
Note:
You need to create this directory because it is the parent of /var/log/hadoop-yarn/apps which is
explicitly configured in yarn-site.xml.
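A hedged sketch of creating that directory; the yarn:mapred ownership is an assumption based on common CDH
defaults.
sudo -u hdfs hadoop fs -mkdir -p /var/log/hadoop-yarn
sudo -u hdfs hadoop fs -chown yarn:mapred /var/log/hadoop-yarn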
Note:
Make sure you always start ResourceManager before starting NodeManager services.
On each NodeManager system (typically the same ones where DataNode service runs):
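A hedged sketch of the start commands, using the CDH service names; run each command on the hosts where
that role is deployed.
sudo service hadoop-yarn-resourcemanager start # on the ResourceManager host
sudo service hadoop-yarn-nodemanager start # on each NodeManager host
sudo service hadoop-mapreduce-historyserver start # on the JobHistoryServer host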
Important:
• If you use Cloudera Manager, do not use these command-line instructions.
• This information applies specifically to CDH 5.4.x. If you use an earlier version of CDH, see the
documentation for that version located at Cloudera Documentation.
This section describes configuration and startup tasks for MRv1 clusters only.
Important: Make sure you are not trying to run MRv1 and YARN on the same set of nodes at the
same time. This is not recommended; it will degrade performance and may result in an unstable
cluster deployment.
• Follow the instructions on this page to deploy MapReduce v1 (MRv1).
• If you have installed YARN and want to deploy it instead of MRv1, follow these instructions instead
of the ones below.
• If you have installed CDH 5 from tarballs, the default deployment is YARN.
Important: Do these tasks after you have configured and deployed HDFS:
Note: Edit these files in the custom directory you created when you copied the Hadoop configuration.
Note: For instructions on configuring a highly available JobTracker, see MapReduce (MRv1) JobTracker
High Availability; you need to configure mapred.job.tracker differently in that case, and you must
not use the port number.
Sample configuration:
mapred-site.xml:
<property>
<name>mapred.job.tracker</name>
<value>jobtracker-host.company.com:8021</value>
</property>
Sample configuration:
mapred-site.xml on each TaskTracker:
<property>
<name>mapred.local.dir</name>
<value>/data/1/mapred/local,/data/2/mapred/local,/data/3/mapred/local</value>
</property>
After specifying these directories in the mapred-site.xml file, you must create the directories and assign the
correct file permissions to them on each node in your cluster.
To configure local storage directories for use by MapReduce:
In the following instructions, local path examples are used to represent Hadoop parameters. The
mapred.local.dir parameter is represented by the /data/1/mapred/local, /data/2/mapred/local,
/data/3/mapred/local, and /data/4/mapred/local path examples. Change the path examples to match
your configuration.
1. Create the mapred.local.dir local directories:
Owner Permissions
mapred:hadoop drwxr-xr-x
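A hedged sketch of creating these directories with the ownership shown above; the paths are the examples used
in this section.
sudo mkdir -p /data/1/mapred/local /data/2/mapred/local /data/3/mapred/local /data/4/mapred/local
sudo chown -R mapred:hadoop /data/1/mapred/local /data/2/mapred/local /data/3/mapred/local /data/4/mapred/local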
configured number of directory failures). Here is an example health script that exits if the DataNode process is
not running:
#!/bin/bash
if ! jps | grep -q DataNode ; then
echo ERROR: datanode not up
fi
In practice, the dfs.data.dir and mapred.local.dir are often configured on the same set of disks, so a disk
failure will result in the failure of both a dfs.data.dir and mapred.local.dir.
See the section titled "Configuring the Node Health Check Script" in the Apache cluster setup documentation for
further details.
Step 4: Configure JobTracker Recovery
JobTracker recovery means that jobs that are running when JobTracker fails (for example, because of a system
crash or hardware failure) are re-run when the JobTracker is restarted. Any jobs that were running at the time
of the failure will be re-run from the beginning automatically.
A recovered job will have the following properties:
• It will have the same job ID as when it was submitted.
• It will run under the same user as the original job.
• It will write to the same output directory as the original job, overwriting any previous output.
• It will show as RUNNING on the JobTracker web page after you restart the JobTracker.
Enabling JobTracker Recovery
By default JobTracker recovery is off, but you can enable it by setting the property
mapreduce.jobtracker.restart.recover to true in mapred-site.xml.
Important:
If you do not create /tmp properly, with the right permissions as shown below, you may have problems
with CDH components later. Specifically, if you don't create /tmp yourself, another process may create
it automatically with restrictive permissions that will prevent your other applications from using it.
Important:
If you create the mapred.system.dir directory in a different location, specify that path in the
conf/mapred-site.xml file.
When starting up, MapReduce sets the permissions for the mapred.system.dir directory to drwx------,
assuming the user mapred owns that directory.
Step 11: Start MapReduce
To start MapReduce, start the TaskTracker and JobTracker services.
On each TaskTracker system:
$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce
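A hedged sketch of the start commands, assuming the standard CDH MRv1 service names:
sudo service hadoop-0.20-mapreduce-tasktracker start # on each TaskTracker system
sudo service hadoop-0.20-mapreduce-jobtracker start # on the JobTracker system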
Important:
• If you use Cloudera Manager, do not use these command-line instructions.
• This information applies specifically to CDH 5.4.x. If you use an earlier version of CDH, see the
documentation for that version located at Cloudera Documentation.
Important:
Make sure you are not trying to run MRv1 and YARN on the same set of nodes at the same time. This
is not recommended; it will degrade your performance and may result in an unstable MapReduce
cluster deployment.
To start the Hadoop daemons at boot time and on restarts, enable their init scripts on the systems on which
the services will run, using the chkconfig tool. See Configuring init to Start Core Hadoop System Services.
Non-core services can also be started at boot time; after you install the non-core components, see Configuring
init to Start Non-core Hadoop System Services for instructions.
Important:
• If you use Cloudera Manager, do not use these command-line instructions.
• This information applies specifically to CDH 5.4.x. If you use an earlier version of CDH, see the
documentation for that version located at Cloudera Documentation.
CDH 5 Components
Use the following sections to install or upgrade CDH 5 components:
• Crunch Installation on page 227
• Flume Installation on page 228
• HBase Installation on page 239
• HCatalog Installation on page 271
• Hive Installation on page 291
• HttpFS Installation on page 319
• Hue Installation on page 322
• Impala Installation on page 277
• KMS Installation on page 351
• Mahout Installation on page 352
• Oozie Installation on page 355
• Pig Installation on page 374
• Search Installation on page 378
• Sentry Installation on page 391
• Snappy Installation on page 392
• Spark Installation on page 394
Crunch Installation
The Apache Crunch™ project develops and supports Java APIs that simplify the process of creating data pipelines
on top of Apache Hadoop. The Crunch APIs are modeled after FlumeJava (PDF), which is the library that Google
uses for building data pipelines on top of their own implementation of MapReduce.
The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its
goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and
efficient to run. Running on top of Hadoop MapReduce and Apache Spark, the Apache Crunch library is a simple
Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce. The APIs
are especially useful when processing data that does not fit naturally into the relational model, such as time series,
serialized object formats like protocol buffers or Avro records, and HBase rows and columns. For Scala users,
there is the Scrunch API, which is built on top of the Java APIs and includes a REPL (read-eval-print loop) for
creating MapReduce pipelines.
The following sections describe how to install Crunch:
• Crunch Prerequisites on page 227
• Crunch Packaging on page 227
• Installing and Upgrading Crunch on page 227
• Crunch Documentation on page 228
Crunch Prerequisites
• An operating system supported by CDH 5
• Oracle JDK
Crunch Packaging
The packaging options for installing Crunch are:
• RPM packages
• Debian packages
There are two Crunch packages:
• crunch: provides all the functionality of Crunch, allowing users to create data pipelines over execution engines
such as MapReduce and Spark.
• crunch-doc: the documentation package.
Note: Crunch is also available as a parcel, included with the CDH 5 parcel. If you install CDH 5 with
Cloudera Manager, Crunch will be installed automatically.
Crunch Documentation
For more information about Crunch, see the following documentation:
• Getting Started with Crunch
• Apache Crunch User Guide
Flume Installation
Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving
large amounts of log data from many different sources to a centralized data store.
The following sections provide more information and instructions:
• Upgrading Flume on page 228
• Packaging
• Installing a Tarball
• Installing Packages
• Configuring Flume
• Verifying the Installation
• Running Flume
• Files Installed by the Packages
• Supported Sources, Sinks, and Channels
• Using an On-disk Encrypted File Channel
• Apache Flume Documentation
Note:
To install Flume using Cloudera Manager, see Managing Flume.
Upgrading Flume
Use the instructions that follow to upgrade Flume.
On SLES systems:
On Ubuntu systems:
You must uninstall Flume 0.9.x and then install Flume 1.x, as follows.
Step 1: Remove Flume 0.9.x from your cluster.
1. Stop the Flume Node processes on each node where they are running:
On SLES systems:
On Ubuntu systems:
Flume Packaging
There are currently three packaging options available for installing Flume:
• Tarball (.tar.gz)
• RPM packages
• Debian packages
Note:
The tarball does not come with any scripts suitable for running Flume as a service or daemon. This
makes the tarball distribution appropriate for ad hoc installations and preliminary testing, but a more
complete installation is provided by the binary RPM and Debian packages.
$ cd /usr/local/lib
$ sudo tar -zxvf <path_to_flume-ng-(Flume_version)-cdh(CDH_version).tar.gz>
$ sudo mv flume-ng-(Flume_version)-cdh(CDH_version) flume-ng
For example,
$ cd /usr/local/lib
$ sudo tar -zxvf <path_to_flume-ng-1.4.0-cdh5.0.0.tar.gz>
$ sudo mv flume-ng-1.4.0-cdh5.0.0 flume-ng
2. To complete the configuration of a tarball installation, you must set your PATH variable to include the bin/
subdirectory of the directory where you installed Flume. For example:
$ export PATH=/usr/local/lib/flume-ng/bin:$PATH
You may also want to enable automatic start-up on boot. To do this, install the Flume agent.
To install the Flume agent so Flume starts automatically on boot on Ubuntu and other Debian systems:
$ sudo apt-get install flume-ng-agent
To install the Flume agent so Flume starts automatically on boot on Red Hat-compatible systems:
$ sudo yum install flume-ng-agent
To install the Flume agent so Flume starts automatically on boot on SLES systems:
$ sudo zypper install flume-ng-agent
Flume Configuration
Flume 1.x provides a template configuration file for flume.conf called conf/flume-conf.properties.template
and a template for flume-env.sh called conf/flume-env.sh.template.
1. Copy the Flume template property file conf/flume-conf.properties.template to conf/flume.conf,
then edit it as appropriate.
This is where you define your sources, sinks, and channels, and the flow within an agent. By default, the
properties file is configured to work out of the box using a sequence generator source, a logger sink, and a
memory channel. For information on configuring agent flows in Flume 1.x, as well as more details about the
supported sources, sinks and channels, see the documents listed under Viewing the Flume Documentation.
2. Optionally, copy the template flume-env.sh file conf/flume-env.sh.template to conf/flume-env.sh.
The flume-ng executable looks for a file named flume-env.sh in the conf directory, and sources it if it finds
it. Some use cases for using flume-env.sh are to specify a bigger heap size for the flume agent, or to specify
debugging or profiling options via JAVA_OPTS when developing your own custom Flume NG components
such as sources and sinks. If you do not make any changes to this file, then you need not perform the copy
as it is effectively empty by default.
$ flume-ng help
commands:
help display this help text
agent run a Flume agent
avro-client run an avro Flume client
version show Flume version info
global options:
--conf,-c <conf> use configs in <conf> directory
--classpath,-C <cp> append to the classpath
--dryrun,-d do not actually start Flume, just print the command
--Dproperty=value sets a JDK system property value
agent options:
--conf-file,-f <file> specify a config file (required)
--name,-n <name> the name of this agent (required)
--help,-h display help text
avro-client options:
--rpcProps,-P <file> RPC client properties file with server connection params
--host,-H <host> hostname to which events will be sent (required)
--port,-p <port> port of the avro source (required)
--dirname <dir> directory to stream to avro source
--filename,-F <file> text file to stream to avro source [default: std input]
--headerFile,-R <file> headerFile containing headers as key/value pairs on each new
line
--help,-h display help text
Note:
If Flume is not found and you installed Flume from a tarball, make sure that $FLUME_HOME/bin is in
your $PATH.
Running Flume
If Flume is installed via an RPM or Debian package, you can use the following commands to start, stop, and
restart the Flume agent via init scripts:
You can also run the agent in the foreground directly by using the flume-ng agent command:
For example:
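A hedged sketch of both forms follows; the agent name agent1 and the configuration paths are illustrative and
must match the agent defined in your flume.conf.
sudo service flume-ng-agent start # likewise stop and restart
/usr/bin/flume-ng agent --conf /etc/flume-ng/conf --conf-file /etc/flume-ng/conf/flume.conf --name agent1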
/etc/flume-ng/conf/flume-env.sh.template Template of the user-customizable environment file. If you want
to modify this file, copy it first and modify the copy.
Important:
Flume on-disk encryption operates with a maximum strength of 128-bit AES encryption unless the
JCE unlimited encryption cryptography policy files are installed. Please see this Oracle document for
information about enabling strong cryptography with JDK 1.6:
http://www.oracle.com/technetwork/java/javase/downloads/jce-6-download-429243.html
Consult your security organization for guidance on the acceptable strength of your encryption keys.
Cloudera has tested with AES-128, AES-192, and AES-256.
keytool -genseckey -alias key-1 -keyalg AES -keysize 128 -validity 9000 \
-keystore test.keystore -storetype jceks \
-storepass keyStorePassword
The command to generate a 128-bit key that uses a different password from that used by the key store is:
keytool -genseckey -alias key-1 -keypass keyPassword -keyalg AES -keysize 128 \
-validity 9000 -keystore test.keystore -storetype jceks \
-storepass keyStorePassword
The key store and password files can be stored anywhere on the file system; both files should have flume as
the owner and 0600 permissions.
Please note that -keysize controls the strength of the AES encryption key, in bits; 128, 192, and 256 are the
allowed values.
Configuration
Flume on-disk encryption is enabled by setting parameters in the /etc/flume-ng/conf/flume.conf file.
Basic Configuration
The first example is a basic configuration with an alias called key-0 that uses the same password as the key
store:
agent.channels.ch-0.type = file
agent.channels.ch-0.capacity = 10000
agent.channels.ch-0.encryption.cipherProvider = AESCTRNOPADDING
agent.channels.ch-0.encryption.activeKey = key-0
agent.channels.ch-0.encryption.keyProvider = JCEKSFILE
agent.channels.ch-0.encryption.keyProvider.keyStoreFile = /path/to/my.keystore
agent.channels.ch-0.encryption.keyProvider.keyStorePasswordFile =
/path/to/my.keystore.password
agent.channels.ch-0.encryption.keyProvider.keys = key-0
In the next example, key-0 uses its own password which may be different from the key store password:
agent.channels.ch-0.type = file
agent.channels.ch-0.capacity = 10000
agent.channels.ch-0.encryption.cipherProvider = AESCTRNOPADDING
agent.channels.ch-0.encryption.activeKey = key-0
agent.channels.ch-0.encryption.keyProvider = JCEKSFILE
agent.channels.ch-0.encryption.keyProvider.keyStoreFile = /path/to/my.keystore
agent.channels.ch-0.encryption.keyProvider.keyStorePasswordFile =
/path/to/my.keystore.password
agent.channels.ch-0.encryption.keyProvider.keys = key-0
agent.channels.ch-0.encryption.keyProvider.keys.key-0.passwordFile =
/path/to/key-0.password
The following example shows how to change the active key to key-1, so that new data is encrypted with key-1
while data already written with key-0 can still be read:
agent.channels.ch-0.type = file
agent.channels.ch-0.capacity = 10000
agent.channels.ch-0.encryption.cipherProvider = AESCTRNOPADDING
agent.channels.ch-0.encryption.activeKey = key-1
agent.channels.ch-0.encryption.keyProvider = JCEKSFILE
agent.channels.ch-0.encryption.keyProvider.keyStoreFile = /path/to/my.keystore
agent.channels.ch-0.encryption.keyProvider.keyStorePasswordFile =
/path/to/my.keystore.password
agent.channels.ch-0.encryption.keyProvider.keys = key-0 key-1
The same scenario except that key-0 and key-1 have their own passwords is shown here:
agent.channels.ch-0.type = file
agent.channels.ch-0.capacity = 10000
agent.channels.ch-0.encryption.cipherProvider = AESCTRNOPADDING
agent.channels.ch-0.encryption.activeKey = key-1
agent.channels.ch-0.encryption.keyProvider = JCEKSFILE
agent.channels.ch-0.encryption.keyProvider.keyStoreFile = /path/to/my.keystore
agent.channels.ch-0.encryption.keyProvider.keyStorePasswordFile =
/path/to/my.keystore.password
agent.channels.ch-0.encryption.keyProvider.keys = key-0 key-1
agent.channels.ch-0.encryption.keyProvider.keys.key-0.passwordFile =
/path/to/key-0.password
agent.channels.ch-0.encryption.keyProvider.keys.key-1.passwordFile =
/path/to/key-1.password
Troubleshooting
If the unlimited strength JCE policy files are not installed, an error similar to the following is printed in the
flume.log:
HBase Installation
Apache HBase provides large-scale tabular storage for Hadoop using the Hadoop Distributed File System (HDFS).
Cloudera recommends installing HBase in a standalone mode before you try to run it on a whole cluster.
Next Steps
After installing and configuring HBase, check out the following topics about using HBase:
• Importing Data Into HBase
• Writing Data to HBase
• Reading Data from HBase
New Features and Changes for HBase in CDH 5
CDH 5.0.x and 5.1.x each include major upgrades to HBase. Each of these upgrades provides exciting new features,
as well as things to keep in mind when upgrading from a previous version.
For new features and changes introduced in older CDH 5 releases, skip to CDH 5.1 HBase Changes or CDH 5.0.x
HBase Changes.
CDH 5.4 HBase Changes
CDH 5.4 introduces HBase 1.0, which represents a major upgrade to HBase. This upgrade introduces new features
and moves some features which were previously marked as experimental to fully supported status. This overview
provides information about the most important features, how to use them, and where to find out more
information. Cloudera appreciates your feedback about these features.
Highly-Available Read Replicas
CDH 5.4 introduces highly-available read replicas. Using read replicas, clients can request, on a per-read basis,
a read result using a new consistency model, timeline consistency, rather than strong consistency. The read
request is sent to the RegionServer serving the region, but also to any RegionServers hosting replicas of the
region. The client receives the read from the fastest RegionServer to respond, and receives an indication of
whether the response was from the primary RegionServer or from a replica. See HBase Read Replicas for more
details.
MultiWAL Support
CDH 5.4 introduces support for writing multiple write-ahead logs (MultiWAL) on a given RegionServer, allowing
you to increase throughput when a region writes to the WAL. See Configuring MultiWAL Support.
Medium-Object (MOB) Storage
CDH 5.4 introduces a mechanism for storing objects between 100 KB and 10 MB in a default configuration, or
medium objects, directly in HBase. Storing objects up to 50 MB is possible with additional configuration. Previously,
storing these medium objects directly in HBase could degrade performance due to write amplification caused
by splits and compactions.
MOB storage requires HFile V3.
doAs Impersonation for the Thrift Gateway
Prior to CDH 5.4, the Thrift gateway could be configured to authenticate to HBase on behalf of the client as a
static user. A new mechanism, doAs Impersonation, allows the client to authenticate as any HBase user on a
per-call basis for a higher level of security and flexibility. See Configure doAs Impersonation for the HBase Thrift
Gateway.
Namespace Create Authorization
Prior to CDH 5.4, only global admins could create namespaces. Now, a Namespace Create authorization can be
assigned to a user, who can then create namespaces.
Authorization to List Namespaces and Tables
Prior to CDH 5.4, authorization checks were not performed on list namespace and list table operations, so you
could list the names of any tables or namespaces, regardless of your authorization. In CDH 5.4, you are not able
to list namespaces or tables you do not have authorization to access.
Crunch API Changes for HBase
In CDH 5.4, Apache Crunch adds the following API changes for HBase:
• HBaseTypes.cells() was added to support serializing HBase Cell objects.
• Each method of HFileUtils now supports PCollection<C extends Cell>, which includes both
PCollection<KeyValue> and PCollection<Cell>, on their method signatures.
• HFileTarget, HBaseTarget, and HBaseSourceTarget each support any subclass of Cell as an output type.
HFileSource and HBaseSourceTarget still return KeyValue as the input type for backward-compatibility
with existing Crunch pipelines.
ZooKeeper 3.4 Is Required
HBase 1.0 requires ZooKeeper 3.4.
HBase API Changes for CDH 5.4
CDH 5.4.0 introduces HBase 1.0, which includes some major changes to the HBase APIs. Besides the changes
listed above, some APIs have been deprecated in favor of new public APIs.
• The HConnection API is deprecated in favor of Connection.
• The HConnectionManager API is deprecated in favor of ConnectionFactory.
• The HTable API is deprecated in favor of Table.
• The HBaseAdmin API is deprecated in favor of Admin.
SlabCache, which was marked as deprecated in CDH 5.2, has been removed in CDH 5.3. To configure the
BlockCache, see Configuring the HBase BlockCache on page 260.
checkAndMutate(RowMutations) API
CDH 5.3 provides checkAndMutate(RowMutations), in addition to existing support for atomic checkAndPut as
well as checkAndDelete operations on individual rows (HBASE-11796).
CDH 5.2 HBase Changes
CDH 5.2 introduces HBase 0.98.6, which represents a minor upgrade to HBase. This upgrade introduces new
features and moves some features which were previously marked as experimental to fully supported status.
This overview provides information about the most important features, how to use them, and where to find out
more information. Cloudera appreciates your feedback about these features.
JAVA_HOME must be set in your environment.
HBase now requires JAVA_HOME to be set in your environment. If it is not set, HBase will fail to start and an error
will be logged. If you use Cloudera Manager, this is set automatically. If you use CDH without Cloudera Manager,
JAVA_HOME should be set up as part of the overall installation. See Java Development Kit Installation on page
41 for instructions on setting JAVA_HOME, as well as other JDK-specific instructions.
The default value for hbase.hstore.flusher.count has increased from 1 to 2.
The default value for hbase.hstore.flusher.count has been increased from one thread to two. This new
configuration can improve performance when writing to HBase under some workloads. However, for high IO
workloads two flusher threads can create additional contention when writing to HDFS. If after upgrading to CDH
5.2, you see an increase in flush times or performance degradation, lowering this value to 1 is recommended.
Use the RegionServer's advanced configuration snippet for hbase-site.xml if you use Cloudera Manager, or
edit the file directly otherwise.
The default value for hbase.hregion.memstore.block.multiplier has increased from 2 to 4.
The default value for hbase.hregion.memstore.block.multiplier has increased from 2 to 4, in order to
improve both throughput and latency. If you experience performance degradation due to this change, change
the value setting to 2, using the RegionServer's advanced configuration snippet for hbase-site.xml if you use
Cloudera Manager, or by editing the file directly otherwise.
SlabCache is deprecated, and BucketCache is now the default block cache.
CDH 5.1 provided full support for the BucketCache block cache. CDH 5.2 deprecates usage of SlabCache in favor
of BucketCache. To configure BucketCache, see BucketCache Block Cache on page 245.
Changed Syntax of user_permissions Shell Command
The pattern-matching behavior for the user_permissions HBase Shell command has changed. Previously,
either of the following two commands would return permissions of all known users in HBase:
The first variant is no longer supported. The second variant is the only supported operation and also supports
passing in other Java regular expressions.
New Properties for IPC Configuration
If the Hadoop configuration is read after the HBase configuration, Hadoop's settings can override HBase's settings
if the names of the settings are the same. To avoid the risk of override, HBase has renamed the following settings
(by prepending 'hbase.') so that you can set them independent of your setting for Hadoop. If you do not use the
HBase-specific variants, the Hadoop settings will be used. If you have not experienced issues with your
configuration, there is no need to change it.
Hadoop setting HBase-specific setting
ipc.server.max.callqueue.size hbase.ipc.server.max.callqueue.size
ipc.server.max.callqueue.length hbase.ipc.server.max.callqueue.length
ipc.server.read.threadpool.size hbase.ipc.server.read.threadpool.size
ipc.server.tcpkeepalive hbase.ipc.server.tcpkeepalive
ipc.server.tcpnodelay hbase.ipc.server.tcpnodelay
ipc.client.call.purge.timeout hbase.ipc.client.call.purge.timeout
ipc.client.connection.maxidletime hbase.ipc.client.connection.maxidletime
ipc.client.idlethreshold hbase.ipc.client.idlethreshold
ipc.client.kill.max hbase.ipc.client.kill.max
Experimental Features
Warning: These features are still considered experimental. Experimental features are not supported
and Cloudera does not recommend using them in production environments or with important data.
Visibility Labels
You can now specify a list of visibility labels, such as CONFIDENTIAL, TOPSECRET, or PUBLIC, at the cell level.
You can associate users with these labels to enforce visibility of HBase data. These labels can be grouped into
complex expressions using logical operators &, |, and ! (AND, OR, NOT). A given user is associated with a set of
visibility labels, and the policy for associating the labels is pluggable. A coprocessor,
org.apache.hadoop.hbase.security.visibility.DefaultScanLabelGenerator, checks for visibility labels
on cells that would be returned by a Get or Scan and drops the cells that a user is not authorized to see, before
returning the results. The same coprocessor saves visibility labels as tags, in the HFiles alongside the cell data,
when a Put operation includes visibility labels. You can specify custom implementations of ScanLabelGenerator
by setting the property hbase.regionserver.scan.visibility.label.generator.class to a
comma-separated list of classes in hbase-site.xml. To edit the configuration, use an Advanced Configuration
Snippet if you use Cloudera Manager, or edit the file directly otherwise.
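For example, the hbase-site.xml entry for the default generator (both names are taken from the text above) would look like this:
<property>
<name>hbase.regionserver.scan.visibility.label.generator.class</name>
<value>org.apache.hadoop.hbase.security.visibility.DefaultScanLabelGenerator</value>
</property>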
No labels are configured by default. You can add a label to the system using either the
VisibilityClient#addLabels() API or the add_label shell command. Similar APIs and shell commands are
provided for deleting labels and assigning them to users. Only a user with superuser access (the hbase.superuser
access level) can perform these operations.
To assign a visibility label to a cell, you can label the cell using the API method
Mutation#setCellVisibility(new CellVisibility(<labelExp>));. An API is provided for managing
visibility labels, and you can also perform many of the operations using HBase Shell.
Previously, visibility labels could not contain the symbols &, |, !, ( and ), but this is no longer the case.
For more information about visibility labels, see the Visibility Labels section of the Apache HBase Reference
Guide.
If you use visibility labels along with access controls, you must ensure that the Access Controller is loaded before
the Visibility Controller in the list of coprocessors. This is the default configuration. See HBASE-11275.
Visibility labels are an experimental feature introduced in CDH 5.1, and still experimental in CDH 5.2.
Transparent Server-Side Encryption
Transparent server-side encryption can now be enabled for both HFiles and write-ahead logs (WALs), to protect
their contents at rest. To configure transparent encryption, first create an encryption key, then configure the
appropriate settings in hbase-site.xml . To edit the configuration, use an Advanced Configuration Snippet if
you use Cloudera Manager, or edit the file directly otherwise. See the Transparent Encryption section in the
Apache HBase Reference Guide for more information.
Transparent server-side encryption is an experimental feature introduced in CDH 5.1, and still experimental in
CDH 5.2.
Stripe Compaction
Stripe compaction is a compaction scheme that segregates the data inside a region by row key, creating "stripes"
of data which are visible within the region but transparent to normal operations. This striping improves read
performance in common scenarios and greatly reduces variability by avoiding large and/or inefficient compactions.
Configuration guidelines and more information are available at Stripe Compaction.
To configure stripe compaction for a single table from within the HBase shell, use the following syntax.
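The table-level form is presumably analogous to the column-family form shown next:
alter <table>, CONFIGURATION => {<setting> => <value>}
Example: alter 'logs', CONFIGURATION => {'hbase.store.stripe.fixed.count' => 10}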
To configure stripe compaction for a column family from within the HBase shell, use the following syntax.
alter <table>, {NAME => <column family>, CONFIGURATION => {<setting> => <value>}}
Example: alter 'logs', {NAME => 'blobs', CONFIGURATION =>
{'hbase.store.stripe.fixed.count' => 10}}
Stripe compaction is an experimental feature in CDH 5.1, and still experimental in CDH 5.2.
Distributed Log Replay
After a RegionServer fails, its failed region is assigned to another RegionServer, which is marked as "recovering"
in ZooKeeper. A SplitLogWorker directly replays edits from the WAL of the failed RegionServer to the region at
its new location. When a region is in the "recovering" state, it can accept writes but not reads (including Append and
Increment), region splits, or merges. Distributed Log Replay extends the distributed log splitting framework. It
works by directly replaying WAL edits to another RegionServer instead of creating recovered.edits files.
Distributed log replay provides the following advantages over using the current distributed log splitting
functionality on its own.
• It eliminates the overhead of writing and reading a large number of recovered.edits files. It is not unusual
for thousands of recovered.edits files to be created and written concurrently during a RegionServer recovery.
Many small random writes can degrade overall system performance.
• It allows writes even when a region is in recovering state. It only takes seconds for a recovering region to
accept writes again.
To enable distributed log replay, set hbase.master.distributed.log.replay to true in hbase-site.xml.
To edit the configuration, use an Advanced Configuration Snippet if you use Cloudera Manager, or edit the file
directly otherwise. You must also enable HFile version 3, as shown in the sketch below. Distributed log replay is unsafe for rolling upgrades.
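A minimal hbase-site.xml sketch for an unmanaged deployment, using the two property names given in this section:
<property>
<name>hbase.master.distributed.log.replay</name>
<value>true</value>
</property>
<property>
<name>hfile.format.version</name>
<value>3</value>
</property>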
Distributed log replay is an experimental feature in CDH 5.1, and still experimental in CDH 5.2.
CDH 5.1 HBase Changes
CDH 5.1 introduces HBase 0.98, which represents a major upgrade to HBase. This upgrade introduces several
new features, including a section of features which are considered experimental and should not be used in a
production environment. This overview provides information about the most important features, how to use
them, and where to find out more information. Cloudera appreciates your feedback about these features.
In addition to HBase 0.98, Cloudera has pulled in changes from HBASE-10883, HBASE-10964, HBASE-10823,
HBASE-10916, and HBASE-11275. Implications of these changes are detailed below and in the Release Notes.
BucketCache Block Cache
A new offheap BlockCache implementation, BucketCache, was introduced as an experimental feature in CDH 5
Beta 1, and is now fully supported in CDH 5.1. BucketCache can be used in either of the following two
configurations:
• As a CombinedBlockCache with both onheap and offheap caches.
• As an L2 cache for the default onheap LruBlockCache.
BucketCache requires less garbage-collection than SlabCache, which is the other offheap cache implementation
in HBase. It also has many optional configuration settings for fine-tuning. All available settings are documented
in the API documentation for CombinedBlockCache. Following is a simple example configuration.
1. First, edit hbase-env.sh and set -XX:MaxDirectMemorySize to the total size of the desired onheap plus
offheap, in this case, 5 GB (but expressed as 5G). To edit the configuration, use an Advanced Configuration
Snippet if you use Cloudera Manager, or edit the file directly otherwise.
-XX:MaxDirectMemorySize=5G
2. Next, add the following configuration to hbase-site.xml. To edit the configuration, use an Advanced
Configuration Snippet if you use Cloudera Manager, or edit the file directly otherwise. This configuration uses
80% of the -XX:MaxDirectMemorySize (4 GB) for offheap, and the remainder (1 GB) for onheap.
<property>
<name>hbase.bucketcache.ioengine</name>
<value>offheap</value>
</property>
<property>
<name>hbase.bucketcache.percentage.in.combinedcache</name>
<value>0.8</value>
</property>
<property>
<name>hbase.bucketcache.size</name>
<value>5120</value>
</property>
3. Restart or perform a rolling restart of your cluster for the configuration to take effect.
Access Control for EXEC Permissions
A new access control level has been added to check whether a given user has EXEC permission. This can be
specified at the level of the cluster, table, row, or cell.
To use EXEC permissions, perform the following procedure.
• Install the AccessController coprocessor either as a system coprocessor or on a table as a table coprocessor.
• Set the hbase.security.exec.permission.checks configuration setting in hbase-site.xml to true, as shown in
the sketch below. To edit the configuration, use an Advanced Configuration Snippet if you use Cloudera Manager,
or edit the file directly otherwise.
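A minimal hbase-site.xml sketch of that setting:
<property>
<name>hbase.security.exec.permission.checks</name>
<value>true</value>
</property>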
For more information on setting and revoking security permissions, see the Access Control section of the Apache
HBase Reference Guide.
Reverse Scan API
A reverse scan API has been introduced. This allows you to scan a table in reverse. Previously, if you wanted to
be able to access your data in either direction, you needed to store the data in two separate tables, each ordered
differently. This feature was implemented in HBASE-4811.
To use the reverse scan feature, use the new Scan.setReversed(boolean reversed) API. If you specify a
startRow and stopRow for a reverse scan, the startRow must be lexicographically after the stopRow. See
the Scan API documentation for more information.
MapReduce Over Snapshots
You can now run a MapReduce job over a snapshot from HBase, rather than being limited to live data. This
provides the ability to separate your client-side work load from your live cluster if you need to run
resource-intensive MapReduce jobs and can tolerate using potentially-stale data. You can either run the
MapReduce job on the snapshot within HBase, or export the snapshot and run the MapReduce job against the
exported file.
Running a MapReduce job on an exported file outside of the scope of HBase relies on the permissions of the
underlying filesystem and server, and bypasses ACLs, visibility labels, and encryption that may otherwise be
provided by your HBase cluster.
A new API, TableSnapshotInputFormat, is provided. For more information, see TableSnapshotInputFormat.
MapReduce over snapshots was introduced in CDH 5.0.
Stateless Streaming Scanner over REST
A new stateless streaming scanner is available over the REST API. Using this scanner, clients do not need to
restart a scan if the REST server experiences a transient failure. All query parameters are specified during the
REST request. Query parameters include startrow, endrow, columns, starttime, endtime, maxversions,
batchtime, and limit. Following are a few examples of using the stateless streaming scanner.
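For example, a stateless scan over a whole table might look like the following sketch (the host, port, table name, and exact endpoint form are illustrative assumptions; see the API documentation for the precise syntax):
$ curl -H "Accept: application/json" \
"http://<rest-server-host>:8080/mytable/*?startrow=row1&endrow=row100&maxversions=1&limit=25"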
For full details about the stateless streaming scanner, see the API documentation for this feature.
Delete Methods of Put Class Now Use Constructor Timestamps
The Delete() methods of the Put class of the HBase Client API previously ignored the constructor's timestamp,
and used the value of HConstants.LATEST_TIMESTAMP. This behavior was different from the behavior of the
add() methods. The Delete() methods now use the timestamp from the constructor, creating consistency in
behavior across the Put class. See HBASE-10964.
Experimental Features
Warning: These features are still considered experimental. Experimental features are not supported
and Cloudera does not recommend using them in production environments or with important data.
Visibility Labels
You can now specify a list of visibility labels, such as CONFIDENTIAL, TOPSECRET, or PUBLIC, at the cell level.
You can associate users with these labels to enforce visibility of HBase data. These labels can be grouped into
complex expressions using logical operators &, |, and ! (AND, OR, NOT). A given user is associated with a set of
visibility labels, and the policy for associating the labels is pluggable. A coprocessor,
org.apache.hadoop.hbase.security.visibility.DefaultScanLabelGenerator, checks for visibility labels
on cells that would be returned by a Get or Scan and drops the cells that a user is not authorized to see, before
returning the results. The same coprocessor saves visibility labels as tags, in the HFiles alongside the cell data,
when a Put operation includes visibility labels. You can specify custom implementations of ScanLabelGenerator
by setting the property hbase.regionserver.scan.visibility.label.generator.class to a
comma-separated list of classes.
No labels are configured by default. You can add a label to the system using either the
VisibilityClient#addLabels() API or the add_label shell command. Similar APIs and shell commands are
provided for deleting labels and assigning them to users. Only a user with superuser access (the hbase.superuser
access level) can perform these operations.
To assign a visibility label to a cell, you can label the cell using the API method
Mutation#setCellVisibility(new CellVisibility(<labelExp>));.
Visibility labels and request authorizations cannot contain the symbols &, |, !, ( and ) because they are reserved
for constructing visibility expressions. See HBASE-10883.
For more information about visibility labels, see the Visibility Labels section of the Apache HBase Reference
Guide.
If you use visibility labels along with access controls, you must ensure that the Access Controller is loaded before
the Visibility Controller in the list of coprocessors. This is the default configuration. See HBASE-11275.
To use per-cell access controls or visibility labels, you must use HFile version 3. To enable HFile version
3, add the following to hbase-site.xml, using an Advanced Configuration Snippet if you use Cloudera Manager, or edit
the file directly if your deployment is unmanaged. Changes take effect after the next major compaction.
<property>
<name>hfile.format.version</name>
<value>3</value>
</property>
without needing any source code modifications. This cannot be guaranteed however, since with the conversion
to Protocol Buffers (ProtoBufs), some relatively obscure APIs have been removed. Rudimentary efforts have
also been made to preserve recompile compatibility with advanced APIs such as Filters and Coprocessors. These
advanced APIs are still evolving and our guarantees for API compatibility are weaker here.
For information about changes to custom filters, see Custom Filters.
As of 0.96, the User API has been marked as stable, and every attempt will be made to maintain compatibility with it
in future versions. A version of the javadoc that only contains the User API can be found here.
HBase Metrics Changes
HBase provides a metrics framework based on JMX beans. Between HBase 0.94 and 0.96, the metrics framework
underwent many changes. Some beans were added and removed, some metrics were moved from one bean to
another, and some metrics were renamed or removed. Click here to download the CSV spreadsheet which
provides a mapping.
Custom Filters
If you used custom filters written for HBase 0.94, you need to recompile those filters for HBase 0.96. The custom
filter must be altered to fit the newer interface, which uses protocol buffers. Specifically, two new methods,
toByteArray(…) and parseFrom(…), are detailed in the Filter API. Use these methods
instead of the old write(…) and readFields(…) methods, so that protocol buffer serialization is used. To see
what changes were required to port one of HBase's own custom filters, see the Git commit that represented
porting the SingleColumnValueFilter filter.
Checksums
In CDH 4, HBase relied on HDFS checksums to protect against data corruption. When you upgrade to CDH 5,
HBase checksums are now turned on by default. With this configuration, HBase reads data and then verifies
the checksums. Checksum verification inside HDFS will be switched off. If the HBase-checksum verification fails,
then the HDFS checksums are used instead for verifying data that is being read from storage. Once you turn on
HBase checksums, you will not be able to roll back to an earlier HBase version.
You should see a modest performance gain after setting hbase.regionserver.checksum.verify to true for
data that is not already present in the RegionServer's block cache.
To enable or disable checksums, modify the following configuration properties in hbase-site.xml. To edit the
configuration, use an Advanced Configuration Snippet if you use Cloudera Manager, or edit the file directly
otherwise.
<property>
<name>hbase.regionserver.checksum.verify</name>
<value>true</value>
<description>
If set to true, HBase will read data and then verify checksums for
hfile blocks. Checksum verification inside HDFS will be switched off.
If the hbase-checksum verification fails, then it will switch back to
using HDFS checksums.
</description>
</property>
The default value for the hbase.hstore.checksum.algorithm property has also changed to CRC32. Previously,
Cloudera advised setting it to NULL because of performance issues that have since been resolved.
<property>
<name>hbase.hstore.checksum.algorithm</name>
<value>CRC32</value>
<description>
Name of an algorithm that is used to compute checksums. Possible values
are NULL, CRC32, CRC32C.
</description>
</property>
Upgrading HBase
Note: To see which version of HBase is shipping in CDH 5, check the Version and Packaging Information.
For important information on new and changed components, see the CDH 5 Release Notes.
Important: Before you start, make sure you have read and understood the previous section, New
Features and Changes for HBase in CDH 5 on page 240, and check the Known Issues in CDH 5 and
Incompatible Changes for HBase.
Tables Processed:
hdfs://localhost:41020/myHBase/.META.
hdfs://localhost:41020/myHBase/usertable
hdfs://localhost:41020/myHBase/TestTable
hdfs://localhost:41020/myHBase/t
Count of HFileV1: 2
HFileV1:
hdfs://localhost:41020/myHBase/usertable
/fa02dac1f38d03577bd0f7e666f12812/family/249450144068442524
hdfs://localhost:41020/myHBase/usertable
/ecdd3eaee2d2fcf8184ac025555bb2af/family/249450144068442512
hdfs://localhost:41020/myHBase/usertable/fa02dac1f38d03577bd0f7e666f12812/family/1
Count of Regions with HFileV1: 2
Regions to Major Compact:
hdfs://localhost:41020/myHBase/usertable/fa02dac1f38d03577bd0f7e666f12812
hdfs://localhost:41020/myHBase/usertable/ecdd3eaee2d2fcf8184ac025555bb2af
In the example above, you can see that the script has detected two HFile v1 files, one corrupt file and the
regions to major compact.
By default, the script scans the root directory, as defined by hbase.rootdir. To scan a specific directory,
use the --dir option. For example, the following command scans the /myHBase/testTable directory.
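Presumably this is the same upgrade check tool invoked with the --dir option, along the lines of:
$ /usr/lib/hbase/bin/hbase upgrade -check --dir /myHBase/testTable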
2. Trigger a major compaction on each of the reported regions. This major compaction rewrites the files from
HFile v1 to HFile v2 format. To run the major compaction, start HBase Shell and issue the major_compact
command.
$ /usr/lib/hbase/bin/hbase shell
hbase> major_compact 'usertable'
You can also do this in a single step by using the echo shell built-in command.
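For example, following the echo 'list' | hbase shell pattern used elsewhere in this guide:
$ echo "major_compact 'usertable'" | /usr/lib/hbase/bin/hbase shell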
3. Once all the HFileV1 files have been rewritten, running the upgrade script with the -check option again will
return a "No HFile v1 found" message. It is then safe to proceed with the upgrade.
Step 2: Gracefully shut down CDH 4 HBase cluster
Shut down your CDH 4 HBase cluster before you run the upgrade script in -execute mode.
To shut down HBase gracefully:
1. Stop the REST and Thrift server and clients, then stop the cluster.
a. Stop the Thrift server and clients:
b. Stop the cluster by shutting down the master and the region servers:
a. Use the following command on the master node:
Step 3: Uninstall the old version of HBase and replace it with the new version.
Warning:
If you are upgrading an Ubuntu or Debian system from CDH3u3 or earlier, you must use apt-get
purge (rather than apt-get remove) to make sure the re-install succeeds, but be aware that
apt-get purge removes all your configuration data. If you have modified any configuration files,
DO NOT PROCEED before backing them up.
2. Follow the instructions for installing the new version of HBase at HBase Installation on page 239.
Step 4: Run the HBase upgrade script in -execute mode
Important: Before you proceed with Step 4, upgrade your CDH 4 cluster to CDH 5. See Upgrading to
CDH 5 on page 574 for instructions.
This step executes the actual upgrade process. It has a verification step which checks whether or not the Master,
RegionServer and backup Master znodes have expired. If not, the upgrade is aborted. This ensures no upgrade
occurs while an HBase process is still running. If your upgrade is aborted even after shutting down the HBase
cluster, retry after some time to let the znodes expire. Default znode expiry time is 300 seconds.
As mentioned earlier, ZooKeeper and HDFS should be available. If ZooKeeper is managed by HBase, then use
the following command to start ZooKeeper.
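Assuming the HBase-bundled daemon script, that command would be similar to:
$ /usr/lib/hbase/bin/hbase-daemon.sh start zookeeper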
….
Successfully completed Znode upgrade
Starting Log splitting
…
Successfully completed Log splitting
The output of the -execute command can either return a success message as in the example above, or, in case
of a clean shutdown where no log splitting is required, the command would return a "No log directories to split,
returning" message. Either of those messages indicates your upgrade was successful.
Warning: Do not move datafiles manually, as this can cause data corruption that requires manual
intervention to fix.
In order to prevent upgrade failures because of unexpired znodes, is there a way to check/force this before an
upgrade?
The upgrade script "executes" the upgrade when it is run with the -execute option. As part of the first step, it
checks for any live HBase processes (RegionServer, Master and backup Master), by looking at their znodes. If
any such znode is still up, it aborts the upgrade and prompts the user to stop such processes, and wait until
their znodes have expired. This can be considered an inbuilt check.
The -check option has a different use case: To check for HFile v1 files. This option is to be run on live CDH 4
clusters to detect HFile v1 and major compact any regions with such files.
Important: Rolling upgrade is not supported between a CDH 5 Beta release and a CDH 5 GA release.
Cloudera recommends using Cloudera Manager if you need to do rolling upgrades.
b. Stop the cluster by shutting down the master and the region servers:
• Use the following command on the master node:
Note: You may want to take this opportunity to upgrade ZooKeeper, but you do not have to upgrade
Zookeeper before upgrading HBase; the new version of HBase will run with the older version of
Zookeeper. For instructions on upgrading ZooKeeper, see Upgrading ZooKeeper from an Earlier CDH
5 Release on page 424.
To install the new version of HBase, follow directions in the next section, HBase Installation on page 239.
Installing HBase
To install HBase on Red Hat-compatible systems:
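Assuming the CDH package name hbase, the command is:
$ sudo yum install hbase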
Note: See also Starting HBase in Standalone Mode on page 264, Configuring HBase in
Pseudo-Distributed Mode on page 266, and Deploying HBase on a Cluster on page 268 for more
information on configuring HBase for different modes.
$ dpkg -L hbase
You can see that the HBase package has been configured to conform to the Linux Filesystem Hierarchy Standard.
(To learn more, run man hier).
You are now ready to enable the server daemons you want to use with Hadoop. You can also enable Java-based
client access by adding the JAR files in /usr/lib/hbase/ and /usr/lib/hbase/lib/ to your Java class path.
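A minimal sketch of setting the class path from a shell (which JARs you actually need depends on your client application):
$ export CLASSPATH="$CLASSPATH:$(echo /usr/lib/hbase/*.jar /usr/lib/hbase/lib/*.jar | tr ' ' ':')"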
Configuration Settings for HBase
This section contains information on configuring the Linux host and HDFS for HBase.
Using DNS with HBase
HBase uses the local hostname to report its IP address. Both forward and reverse DNS resolving should work.
If your server has multiple interfaces, HBase uses the interface that the primary hostname resolves to. If this
is insufficient, you can set hbase.regionserver.dns.interface in the hbase-site.xml file to indicate the
primary interface. To work properly, this setting requires that your cluster configuration is consistent and every
host has the same network interface configuration. As an alternative, you can set
hbase.regionserver.dns.nameserver in the hbase-site.xml file to use a different DNS name server than
the system-wide default.
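For example, to bind to the first Ethernet interface (the interface name here is only an illustration):
<property>
<name>hbase.regionserver.dns.interface</name>
<value>eth0</value>
</property>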
Using the Network Time Protocol (NTP) with HBase
The clocks on cluster members must be synchronized for your cluster to function correctly. Some skew is
tolerable, but excessive skew could generate odd behaviors. Run NTP or another clock synchronization mechanism
on your cluster. If you experience problems querying data or unusual cluster operations, verify the system time.
For more information about NTP, see the NTP website.
Setting User Limits for HBase
Because HBase is a database, it opens many files at the same time. The default setting of 1024 for the maximum
number of open files on most Unix-like systems is insufficient. Any significant amount of loading will result in
failures and cause error messages such as java.io.IOException...(Too many open files) to be logged
in the HBase or HDFS log files. For more information about this issue, see the Apache HBase Book. You may
also notice errors such as:
Another setting you should configure is the number of processes a user is permitted to start. The default number
of processes is typically 1024. Consider raising this value if you experience OutOfMemoryException errors.
Note:
• Only the root user can edit this file.
• If this change does not take effect, check other configuration files in the
/etc/security/limits.d/ directory for lines containing the hdfs or hbase user and the nofile
value. Such entries may be overriding the entries in /etc/security/limits.conf.
To apply the changes in /etc/security/limits.conf on Ubuntu and Debian systems, add the following line
in the /etc/pam.d/common-session file:
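The line in question is the standard PAM limits entry:
session required pam_limits.so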
For more information on the ulimit command or per-user operating system limits, refer to the documentation
for your operating system.
Using dfs.datanode.max.transfer.threads with HBase
A Hadoop HDFS DataNode has an upper bound on the number of files that it can serve at any one time. The
upper bound is controlled by the dfs.datanode.max.transfer.threads property (the property is spelled in
the code exactly as shown here). Before loading, make sure you have configured the value for
dfs.datanode.max.transfer.threads in the conf/hdfs-site.xml file (by default found in
/etc/hadoop/conf/hdfs-site.xml) to at least 4096 as shown below:
<property>
<name>dfs.datanode.max.transfer.threads</name>
<value>4096</value>
</property>
Restart HDFS after changing the value for dfs.datanode.max.transfer.threads. If the value is not set to
an appropriate value, strange failures can occur and an error message about exceeding the number of transfer
threads will be added to the DataNode logs. Other error messages about missing blocks are also logged, such
as:
org.apache.hadoop.ipc.RemoteException(java.io.IOException):
/tmp/output/f/5237a8430561409bb641507f0c531448 can't be moved into an encryption zone.
You can also choose to only encrypt specific column families, which encrypts individual HFiles while leaving
others unencrypted, using HBase Transparent Encryption at Rest. This provides a balance of data security and
performance.
Using Hedged Reads
Important:
• If you use Cloudera Manager, do not use these command-line instructions.
• This information applies specifically to CDH 5.4.x. If you use an earlier version of CDH, see the
documentation for that version located at Cloudera Documentation.
Note:
To enable hedged reads for HBase, edit the hbase-site.xml file on each server. Set
dfs.client.hedged.read.threadpool.size to the number of threads to dedicate to running
hedged threads, and set the dfs.client.hedged.read.threshold.millis configuration property
to the number of milliseconds to wait before starting a second read against a different block replica.
Set dfs.client.hedged.read.threadpool.size to 0 or remove it from the configuration to disable
the feature. After changing these properties, restart your cluster.
The following is an example configuration for hedged reads for HBase.
<property>
<name>dfs.client.hedged.read.threadpool.size</name>
<value>20</value> <!-- 20 threads -->
</property>
<property>
<name>dfs.client.hedged.read.threshold.millis</name>
<value>10</value> <!-- 10 milliseconds -->
</property>
Troubleshooting HBase
See Troubleshooting HBase.
Configuring the BlockCache
See Configuring the HBase BlockCache on page 260.
Configuring the HBase BlockCache
In the default configuration, HBase uses a single on-heap cache. If you configure the off-heap BucketCache,
the on-heap cache is used for Bloom filters and indexes, and the off-heap BucketCache is used to cache data
blocks. This is referred to as the Combined Blockcache configuration. The Combined BlockCache allows you to
use a larger in-memory cache while reducing the negative impact of garbage collection in the heap, because
HBase manages the BucketCache, rather than relying on the garbage collector.
Contents of the BlockCache
In order to size the BlockCache correctly, you need to understand what HBase places into it.
• Your data: Each time a Get or Scan operation occurs, the result is added to the BlockCache if it was not already
cached there. If you use the BucketCache, data blocks are always cached in the BucketCache.
• Row keys: When a value is loaded into the cache, its row key is also cached. This is one reason that it is
important to make your row keys as small as possible. A larger row key takes up more space in the cache.
• hbase:meta: The hbase:meta catalog table keeps track of which RegionServer is serving which regions. It
can consume several megabytes of cache if you have a large number of regions. It is given in-memory access
priority, which means HBase attempts to keep it in the cache as long as possible.
• Indexes of HFiles: HBase stores its data in HDFS in a format called HFile. These HFiles contain indexes which
allow HBase to seek for data within them without needing to open the entire HFile. The size of an index is a
function of the block size, the size of your row keys, and the amount of data you are storing. For big data sets,
the size can exceed 1 GB per RegionServer, although it is unlikely that the entire index will be in the cache at
the same time. If you use the BucketCache, indexes are always cached on-heap.
• Bloom filters: If you use Bloom filters, they are stored in the BlockCache. If you use the BucketCache, Bloom
filters are always cached on-heap.
The sum of the sizes of these objects is highly dependent on your usage patterns and the characteristics of your
data. For this reason, the HBase Web UI and Cloudera Manager each expose several metrics to help you size
and tune the BlockCache.
Deciding Whether To Use the BucketCache
The HBase team has published the results of exhaustive BlockCache testing, which revealed the following
guidelines.
• If the result of a Get or Scan typically fits completely in the heap, the default configuration, which uses the
on-heap LruBlockCache, is the best choice, as the L2 cache will not provide much benefit. If the eviction
rate is low, garbage collection can be 50% less than that of the BucketCache, and throughput can be at least
20% higher.
• Otherwise, if your cache is experiencing a consistently high eviction rate, use the BucketCache, which causes
30-50% of the garbage collection of LruBlockCache when the eviction rate is high.
• BucketCache using file mode on solid-state disks has a better garbage-collection profile but lower throughput
than BucketCache using off-heap memory.
Bypassing the BlockCache
If the data needed for a specific but atypical operation does not all fit in memory, using the BlockCache can be
counter-productive, because data that you are still using may be evicted, or even if other data is not evicted,
excess garbage collection can adversely affect performance. For this type of operation, you may decide to bypass
the BlockCache. To bypass the BlockCache for a given Scan or Get, use the setCacheBlocks(false) method.
In addition, you can prevent a specific column family's contents from being cached, by setting its BLOCKCACHE
configuration to false. Use the following syntax in HBase Shell:
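A sketch of that command (the table and column family names are placeholders):
hbase> alter 'myTable', {NAME => 'myCF', BLOCKCACHE => 'false'}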
To use the Java API to configure a column family for in-memory access, use the
HColumnDescriptor.setInMemory(true) method.
allows 1% of heap to be available as a "working area" for evicting items from the cache. If you use the BucketCache,
the on-heap LruBlockCache only stores indexes and Bloom filters, and data blocks are cached in the off-heap
BucketCache.
number of RegionServers * heap size * hfile.block.cache.size * 0.99
To tune the size of the LruBlockCache, you can add RegionServers or increase the total Java heap on a given
RegionServer to increase it, or you can tune hfile.block.cache.size to reduce it. Reducing it will cause cache
evictions to happen more often, but will reduce the time it takes to perform a cycle of garbage collection. Increasing
the heap will cause garbage collection to take longer but happen less frequently.
About the off-heap BucketCache
If the BucketCache is enabled, it stores data blocks, leaving the on-heap cache free for storing indexes and Bloom
filters. The physical location of the BucketCache storage can be either in memory (off-heap) or in a file stored
in a fast disk.
• Off-heap: This is the default configuration.
• File-based: You can use the file-based storage mode to store the BucketCache on an SSD or FusionIO device.
Starting in CDH 5.4 (HBase 1.0), it is possible to configure a column family to keep its data blocks in the L1 cache
instead of the BucketCache, using the HColumnDescriptor.cacheDataInL1(true) method or by using the
following syntax in HBase Shell:
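A sketch of the shell form, assuming the column-family attribute corresponding to cacheDataInL1() is named CACHE_DATA_IN_L1 (the table and column family names are placeholders):
hbase> alter 'myTable', {NAME => 'myCF', CACHE_DATA_IN_L1 => 'true'}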
Important:
• If you use Cloudera Manager, do not use these command-line instructions.
• This information applies specifically to CDH 5.4.x. If you use an earlier version of CDH, see the
documentation for that version located at Cloudera Documentation.
1. First, verify the RegionServer's off-heap size, and if necessary, tune it by editing the hbase-env.sh file and
adding a line like the following:
HBASE_OFFHEAPSIZE=5G
Set it to a value which will accommodate your desired L2 cache size, in addition to space reserved for cache
management.
2. Next, configure the properties in Table 30: BucketCache Configuration Properties on page 262 as appropriate,
using the example below as a model.
<property>
<name>hbase.bucketcache.ioengine</name>
<value>offheap</value>
</property>
<property>
<name>hfile.block.cache.size</name>
<value>0.2</value>
</property>
<property>
<name>hbase.bucketcache.size</name>
<value>4196</value>
</property>
Note:
You can skip this section if you are already running HBase in distributed or pseudo-distributed mode.
By default, HBase ships configured for standalone mode. In this mode of operation, a single JVM hosts the HBase
Master, an HBase Region Server, and a ZooKeeper quorum peer. HBase stores your data in a location on the
local filesystem, rather than using HDFS. Standalone mode is only appropriate for initial testing.
Important:
If you have configured High Availability for the NameNode (HA), you cannot deploy HBase in standalone
mode without modifying the default configuration, because both the standalone HBase process and
ZooKeeper (required by HA) will try to bind to port 2181. You can configure a different port for
ZooKeeper, but in most cases it makes more sense to deploy HBase in distributed mode in an HA
cluster.
In order to run HBase in standalone mode, you must install the HBase Master package.
Installing the HBase Master
To install the HBase Master on Red Hat-compatible systems:
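Assuming the CDH package name hbase-master, the command is:
$ sudo yum install hbase-master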
• On Ubuntu systems (using Debian packages) the HBase Master starts when the HBase package is installed.
To verify that the standalone installation is operational, visit http://localhost:60010. The list of RegionServers
at the bottom of the page should include one entry for your local machine.
Note:
Although you have only started the master process, in standalone mode this same process is also
internally running a region server and a ZooKeeper peer. In the next section, you will break out these
components into separate JVMs.
If you see this message when you start the HBase standalone master:
you will need to stop the hadoop-zookeeper-server (or zookeeper-server) or uninstall the
hadoop-zookeeper-server (or zookeeper) package.
See also Accessing HBase by using the HBase Shell on page 269, Using MapReduce with HBase on page 270 and
Troubleshooting on page 271.
Installing and Starting the HBase Thrift Server
To install Thrift on Red Hat-compatible systems:
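Assuming the CDH package name hbase-thrift, the command is:
$ sudo yum install hbase-thrift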
You can now use the service command to start the Thrift server:
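Assuming the service name matches the package name, hbase-thrift:
$ sudo service hbase-thrift start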
You can use the service command to run an init.d script, /etc/init.d/hbase-rest, to start the REST
server; for example:
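Using the init script named above:
$ sudo service hbase-rest start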
The script starts the server by default on port 8080. This is a commonly used port and so may conflict with other
applications running on the same host.
If you need to change the port for the REST server, configure it in hbase-site.xml, for example:
<property>
<name>hbase.rest.port</name>
<value>60050</value>
</property>
Note:
You can use HBASE_REST_OPTS in hbase-env.sh to pass other settings (such as heap size and GC
parameters) to the REST server JVM.
Note: You can skip this section if you are already running HBase in distributed mode, or if you intend
to use only standalone mode.
Pseudo-distributed mode differs from standalone mode in that each of the component processes runs in a
separate JVM. It differs from distributed mode in that each of the separate processes runs on the same server,
rather than multiple servers in a cluster. This section also assumes you wish to store your HBase data in HDFS
rather than on the local filesystem.
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.rootdir</name>
<value>hdfs://myhost:8020/hbase</value>
</property>
Note: If Kerberos is enabled, do not use commands in the form sudo -u <user> hadoop <command>;
they will fail with a security error. Instead, use the following commands: $ kinit <user> (if you are
using a password) or $ kinit -kt <keytab> <principal> (if you are using a keytab) and then,
for each command executed by this user, $ <command>
$ sudo jps
32694 Jps
30674 HRegionServer
29496 HMaster
28781 DataNode
28422 NameNode
30348 QuorumPeerMain
You should also be able to navigate to http://localhost:60010 and verify that the local RegionServer has
registered with the Master.
Installing and Starting the HBase Thrift Server
The HBase Thrift Server is an alternative gateway for accessing the HBase server. Thrift mirrors most of the
HBase client APIs while enabling popular programming languages to interact with HBase. The Thrift Server is
multiplatform and more performant than REST in many situations. Thrift can be run collocated with the
region servers, but it should not be collocated with the NameNode or the JobTracker. For more information about
Thrift, visit http://incubator.apache.org/thrift/.
To enable the HBase Thrift Server on Red Hat-compatible systems:
See also Accessing HBase by using the HBase Shell on page 269, Using MapReduce with HBase on page 270 and
Troubleshooting on page 271.
Deploying HBase on a Cluster
After you have HBase running in pseudo-distributed mode, the same configuration can be extended to running
on a distributed cluster.
<property>
<name>hbase.zookeeper.quorum</name>
<value>mymasternode</value>
</property>
The hbase.zookeeper.quorum property is a comma-separated list of hosts on which ZooKeeper servers are
running. If one of the ZooKeeper servers is down, HBase will use another from the list. By default, the ZooKeeper
service is bound to port 2181. To change the port, add the hbase.zookeeper.property.clientPort property
to hbase-site.xml and set the value to the port you want ZooKeeper to use. For more information, see this
chapter of the Apache HBase Reference Guide.
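For example, to move ZooKeeper to port 2222 (the port number here is only an illustration):
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2222</value>
</property>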
To start the cluster, start the services in the following order:
1. The ZooKeeper Quorum Peer
2. The HBase Master
3. Each of the HBase RegionServers
After the cluster is fully started, you can view the HBase Master web interface on port 60010 and verify that
each of the slave nodes has registered properly with the master.
See also Accessing HBase by using the HBase Shell on page 269, Using MapReduce with HBase on page 270 and
Troubleshooting on page 271. For instructions on improving the performance of local reads, see Improving
Performance.
Accessing HBase by using the HBase Shell
After you have started HBase, you can access the database interactively by using the HBase Shell, a
command interpreter for HBase written in Ruby.
$ hbase shell
#!/bin/bash
echo 'list' | hbase shell -n
status=$?
if [ $status -ne 0 ]; then
echo "The command may have failed."
fi
Successful HBase Shell commands return an exit status of 0. However, an exit status other than 0 does not
necessarily indicate a failure, but should be interpreted as unknown. For example, a command may succeed,
but while waiting for the response, the client may lose connectivity. In that case, the client has no way to know
the outcome of the command. In the case of a non-zero exit status, your script should check to be sure the
command actually failed before taking further action.
You can also write Ruby scripts for use with HBase Shell. Example Ruby scripts are included in the
hbase-examples/src/main/ruby/ directory.
For merging regions that are not adjacent, passing true as the third parameter will force the merge.
TableMapReduceUtil.addDependencyJars(job);
This distributes the JAR files to the cluster along with your job and adds them to the job's classpath, so that you
do not need to edit the MapReduce configuration.
You can find more information about addDependencyJars in the documentation listed under Viewing the HBase
Documentation on page 271.
When getting a Configuration object for an HBase MapReduce job, instantiate it using the
HBaseConfiguration.create() method.
Troubleshooting
The Cloudera HBase packages have been configured to place logs in /var/log/hbase. Cloudera recommends
tailing the .log files in this directory when you start HBase to check for any error messages or failures.
Table Creation Fails after Installing LZO
If you install LZO after starting the Region Server, you will not be able to create a table with LZO compression
until you re-start the Region Server.
Why this happens
When the Region Server starts, it runs CompressionTest and caches the results. When you try to create a table
with a given form of compression, it refers to those results. You have installed LZO since starting the Region
Server, so the cached results, which pre-date LZO, cause the create to fail.
What to do
Restart the Region Server. Now table creation with LZO will succeed.
Thrift Server Crashes after Receiving Invalid Data
The Thrift server may crash if it receives a large amount of invalid data, due to a buffer overrun.
Why this happens
The Thrift server allocates memory to check the validity of data it receives. If it receives a large amount of invalid
data, it may need to allocate more memory than is available. This is due to a limitation in the Thrift library itself.
What to do
To prevent the possibility of crashes due to buffer overruns, use the framed and compact transport protocols.
These protocols are disabled by default, because they may require changes to your client code. The two options
to add to your hbase-site.xml are hbase.regionserver.thrift.framed and
hbase.regionserver.thrift.compact. Set each of these to true, as in the XML below. You can also specify
the maximum frame size, using the hbase.regionserver.thrift.framed.max_frame_size_in_mb option.
<property>
<name>hbase.regionserver.thrift.framed</name>
<value>true</value>
</property>
<property>
<name>hbase.regionserver.thrift.framed.max_frame_size_in_mb</name>
<value>2</value>
</property>
<property>
<name>hbase.regionserver.thrift.compact</name>
<value>true</value>
</property>
HCatalog Installation
As of CDH 5, HCatalog is part of Apache Hive.
HCatalog provides table data access for CDH components such as Pig, Sqoop, and MapReduce. Table definitions
are maintained in the Hive metastore, which HCatalog requires. HCatalog makes the same table information
available to Hive, Pig, MapReduce, and REST clients. This page explains how to install and configure HCatalog
for REST access and for MapReduce and Pig access. For Sqoop, see the section on Sqoop-HCatalog integration
in the Sqoop User Guide.
Use the sections that follow to install, configure and use HCatalog:
• Prerequisites
• Installing and Upgrading the HCatalog RPM or Debian Packages on page 272
• Host Configuration Changes
• Starting and Stopping the WebHCat REST Server
• Accessing Table Data with the Command-line API
• Accessing Table Data with MapReduce
• Accessing Table Data with Pig
• Accessing Table Data with REST
• Apache HCatalog Documentation
You can use HCatalog to import data to HBase. See Importing Data Into HBase.
For more information, see the HCatalog documentation.
HCatalog Prerequisites
• An operating system supported by CDH 5
• Oracle JDK
• The Hive metastore and its database. The Hive metastore must be running in remote mode (as a service).
Installing and Upgrading the HCatalog RPM or Debian Packages
Installing the HCatalog RPM or Debian packages is more convenient than installing the HCatalog tarball because
the packages:
• Handle dependencies
• Provide for easy upgrades
• Automatically install resources to conventional locations
HCatalog comprises the following packages:
• hive-hcatalog - HCatalog wrapper for accessing the Hive metastore, libraries for MapReduce and Pig, and
a command-line program
• hive-webhcat - A REST API server for HCatalog
• hive-webhcat-server - Installs hive-webhcat and a server init script
Note:
If you have already performed the steps to uninstall CDH 4 and all components, as described under
Upgrading from CDH 4 to CDH 5 on page 573, you can skip Step 1 below and proceed with installing
the new CDH 5 version of HCatalog.
Important:
If you have installed the hive-hcatalog-server package in the past, you must remove it before
you proceed; otherwise the upgrade will fail.
Follow instructions under Installing the WebHCat REST Server on page 274 and Installing HCatalog for Use with
Pig and MapReduce on page 274.
Note:
It is not necessary to install WebHCat if you will not be using the REST API. Pig and MapReduce do
not need it.
To install the WebHCat REST server components on an Ubuntu or other Debian system:
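Using the hive-webhcat-server package named above:
$ sudo apt-get install hive-webhcat-server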
Note:
• You can change the default port 50111 by creating or editing the following file and restarting
WebHCat:
/etc/webhcat/conf/webhcat-site.xml
<configuration>
<property>
<name>templeton.port</name>
<value>50111</value>
<description>The HTTP port for the main server.</description>
</property>
</configuration>
• To uninstall WebHCat you must remove two packages: hive-webhcat-server and hive-webhcat.
<property>
<name>hive.metastore.uris</name>
<value>thrift://<hostname>:9083</value>
</property>
where <hostname> is the host where the HCatalog server components are running, for example
hive.examples.com.
# Create a table
$ hcat -e "create table groups(name string,placeholder string,id int) row format
delimited fields terminated by ':' stored as textfile"
OK
See the HCatalog documentation for information on using the HCatalog command-line application.
Accessing Table Data with MapReduce
You can download an example of a MapReduce program that reads from the groups table (consisting of data
from /etc/group), extracts the first and third columns, and inserts them into the groupids table. Proceed as
follows.
1. Download the program from https://github.com/cloudera/hcatalog-examples.git.
2. Build the example JAR file:
$ cd hcatalog-examples
$ mvn package
3. Load data from the local file system into the groups table:
$ hive -e "load data local inpath '/etc/group' overwrite into table groups"
4. Set up the environment that is needed for copying the required JAR files to HDFS, for example:
$ export HCAT_HOME=/usr/lib/hive-hcatalog
$ export HIVE_HOME=/usr/lib/hive
$ HIVE_VERSION=0.11.0-cdh5.0.0
$ HCATJAR=$HCAT_HOME/share/hcatalog/hcatalog-core-$HIVE_VERSION.jar
$ HCATPIGJAR=$HCAT_HOME/share/hcatalog/hcatalog-pig-adapter-$HIVE_VERSION.jar
$ export
HADOOP_CLASSPATH=$HCATJAR:$HCATPIGJAR:$HIVE_HOME/lib/hive-exec-$HIVE_VERSION.jar\
:$HIVE_HOME/lib/hive-metastore-$HIVE_VERSION.jar:$HIVE_HOME/lib/jdo-api-*.jar:$HIVE_HOME/lib/libfb303-*.jar\
:$HIVE_HOME/lib/libthrift-*.jar:$HIVE_HOME/lib/slf4j-api-*.jar:$HIVE_HOME/conf:/etc/hadoop/conf
$ LIBJARS=`echo $HADOOP_CLASSPATH | sed -e 's/:/,/g'`
$ export LIBJARS=$LIBJARS,$HIVE_HOME/lib/antlr-runtime-*.jar
Note: You can find current version numbers for CDH dependencies in CDH's root pom.xml file for
the current release (for example, cdh-root-5.0.0.pom).
Output:
http://<SERVERHOST>:50111/templeton/v1/ddl/database/?user.name=hive
http://<SERVERHOST>:50111/templeton/v1/ddl/database/default/table/?user.name=hive
http://<SERVERHOST>:50111/templeton/v1/ddl/database/default/table/groups?user.name=hive
Example output:
{"columns":[{"name":"name","type":"string"},{"name":"placeholder","type":"string"},{"name":"id","type":"int"}],"database":"default","table":"grouptable"}
Impala Installation
Cloudera Impala is an open-source add-on to the Cloudera Enterprise Core that returns rapid responses to
queries.
Note:
Under CDH 5, Impala is included as part of the CDH installation and no separate steps are needed.
Before doing the installation, ensure that you have all necessary prerequisites. See Cloudera Impala Requirements
on page 278 for details.
Cloudera Impala Requirements
To perform as expected, Impala depends on the availability of the software, hardware, and configurations
described in the following sections.
Product Compatibility Matrix
The ultimate source of truth about compatibility between various versions of CDH, Cloudera Manager, and various
CDH components is the online Product Compatibility Matrix.
For Impala, see the Impala compatibility matrix page.
Supported Operating Systems
The relevant supported operating systems and versions for Impala are the same as for the corresponding CDH
4 and CDH 5 platforms. For details, see the Supported Operating Systems page for CDH 4 or CDH 5.
Hive Metastore and Related Configuration
Impala can interoperate with data stored in Hive, and uses the same infrastructure as Hive for tracking metadata
about schema objects such as tables and columns. The following components are prerequisites for Impala:
• MySQL or PostgreSQL, to act as a metastore database for both Impala and Hive.
Note:
Installing and configuring a Hive metastore is an Impala requirement. Impala does not work without
the metastore database. For the process of installing and configuring the metastore, see Impala
Installation on page 277.
Always configure a Hive metastore service rather than connecting directly to the metastore
database. The Hive metastore service is required to interoperate between possibly different levels
of metastore APIs used by CDH and Impala, and avoids known issues with connecting directly to
the metastore database. The Hive metastore service is set up for you by default if you install
through Cloudera Manager 4.5 or higher.
A summary of the metastore installation process is as follows:
• Install a MySQL or PostgreSQL database. Start the database if it is not started after installation.
• Download the MySQL connector or the PostgreSQL connector and place it in the
/usr/share/java/ directory.
• Use the appropriate command line tool for your database to create the metastore database.
• Use the appropriate command line tool for your database to grant privileges for the metastore
database to the hive user.
• Modify hive-site.xml to include information matching your particular database: its URL, user
name, and password. You will copy the hive-site.xml file to the Impala Configuration Directory
later in the Impala installation process.
• Optional: Hive. Although only the Hive metastore database is required for Impala to function, you might
install Hive on some client machines to create and load data into tables that use certain file formats. See
How Impala Works with Hadoop File Formats for details. Hive does not need to be installed on the same
data nodes as Impala; it just needs access to the same metastore database.
Java Dependencies
Although Impala is primarily written in C++, it does use Java to communicate with various Hadoop components:
• The officially supported JVM for Impala is the Oracle JVM. Other JVMs might cause issues, typically resulting
in a failure at impalad startup. In particular, the JamVM used by default on certain levels of Ubuntu systems
can cause impalad to fail to start.
• Internally, the impalad daemon relies on the JAVA_HOME environment variable to locate the system Java
libraries. Make sure the impalad service is not run from an environment with an incorrect setting for this
variable.
• All Java dependencies are packaged in the impala-dependencies.jar file, which is located at
/usr/lib/impala/lib/. These map to everything that is built under fe/target/dependency.
In the majority of cases, this automatic detection works correctly. If you need to explicitly set the hostname, do
so by setting the --hostname flag.
Hardware Requirements
During join operations, portions of data from each joined table are loaded into memory. Data sets can be very
large, so ensure your hardware has sufficient memory to accommodate the joins you anticipate completing.
While requirements vary according to data set size, the following is generally recommended:
• CPU - Impala version 2.2 and higher uses the SSSE3 instruction set, which is included in newer processors.
Note: This required level of processor is the same as in Impala version 1.x. The Impala 2.0 and 2.1
releases had a stricter requirement for the SSE4.1 instruction set, which has now been relaxed.
• Memory - 128 GB or more recommended, ideally 256 GB or more. If the intermediate results during query
processing on a particular node exceed the amount of memory available to Impala on that node, the query
writes temporary work data to disk, which can lead to long query times. Note that because the work is
parallelized, and intermediate results for aggregate queries are typically smaller than the original data, Impala
can query and join tables that are much larger than the memory available on an individual node.
• Storage - DataNodes with 12 or more disks each. I/O speeds are often the limiting factor for disk performance
with Impala. Ensure that you have sufficient disk space to store the data Impala will be querying.
User Account Requirements
Impala creates and uses a user and group named impala. Do not delete this account or group and do not modify
the account's or group's permissions and rights. Ensure no existing systems obstruct the functioning of these
accounts and groups. For example, if you have scripts that delete user accounts not in a white-list, add these
accounts to the list of permitted accounts.
For the resource management feature to work (in combination with CDH 5 and the YARN and Llama components),
the impala user must be a member of the hdfs group. This setup is performed automatically during a new
install, but not when upgrading from earlier Impala releases to Impala 1.2. If you are upgrading a node to CDH
5 that already had Impala 1.1 or 1.0 installed, manually add the impala user to the hdfs group.
For correct file deletion during DROP TABLE operations, Impala must be able to move files to the HDFS trashcan.
You might need to create an HDFS directory /user/impala, writeable by the impala user, so that the trashcan
can be created. Otherwise, data files might remain behind after a DROP TABLE statement.
Impala should not run as root. Best Impala performance is achieved using direct reads, but root is not permitted
to use direct reads. Therefore, running Impala as root negatively affects performance.
By default, any user can connect to Impala and access all the associated databases and tables. You can enable
authorization and authentication based on the Linux OS user who connects to the Impala server, and the
associated groups for that user. See Overview of Impala Security for details. These security features do not change
the underlying file permission requirements; the impala user still needs to be able to access the data files.
Installing Impala without Cloudera Manager
Before installing Impala manually, make sure all applicable nodes have the appropriate hardware configuration,
levels of operating system and CDH, and any other software prerequisites. See Cloudera Impala Requirements
on page 278 for details.
You can install Impala across many hosts or on one host:
• Installing Impala across multiple machines creates a distributed configuration. For best performance, install
Impala on all DataNodes.
• Installing Impala on a single machine produces a pseudo-distributed cluster.
To install Impala on a host:
1. Install CDH as described in the Installation section of the CDH 4 Installation Guide or the CDH 5 Installation
Guide.
2. Install the Hive metastore somewhere in your cluster, as described in the Hive Installation topic in the CDH
4 Installation Guide or the CDH 5 Installation Guide. As part of this process, you configure the Hive metastore
to use an external database as a metastore. Impala uses this same database for its own table metadata.
You can choose either a MySQL or PostgreSQL database as the metastore. The process for configuring each
type of database is described in the CDH Installation Guide.
Cloudera recommends setting up a Hive metastore service rather than connecting directly to the metastore
database; this configuration is required when running Impala under CDH 4.1. Make sure the
/etc/impala/conf/hive-site.xml file contains the following setting, substituting the appropriate host
name for metastore_server_host:
<property>
<name>hive.metastore.uris</name>
<value>thrift://metastore_server_host:9083</value>
</property>
<property>
<name>hive.metastore.client.socket.timeout</name>
<value>3600</value>
<description>MetaStore Client socket timeout in seconds</description>
</property>
3. (Optional) If you installed the full Hive component on any host, you can verify that the metastore is configured
properly by starting the Hive console and querying for the list of available tables. Once you confirm that the
console starts, exit the console to continue the installation:
$ hive
Hive history file=/tmp/root/hive_job_log_root_201207272011_678722950.txt
hive> show tables;
table1
table2
hive> quit;
$
4. Confirm that your package management command is aware of the Impala repository settings, as described
in Cloudera Impala Requirements on page 278. (For CDH 4, this is a different repository than for CDH.) You
might need to download a repo or list file into a system directory underneath /etc.
5. Use one of the following sets of commands to install the Impala package:
For RHEL, Oracle Linux, or CentOS systems:
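For example, assuming the standard CDH package names for the Impala daemons (substitute zypper install on
SLES systems, or apt-get install on Ubuntu and Debian systems):
$ sudo yum install impala             # Binaries for the Impala daemons
$ sudo yum install impala-server      # Service start/stop script for impalad
$ sudo yum install impala-state-store # Service start/stop script for the statestore
$ sudo yum install impala-catalog     # Service start/stop script for the catalog service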
Note: Cloudera recommends that you not install Impala on any HDFS NameNode. Installing Impala
on NameNodes provides no additional data locality, and executing queries with such a configuration
might cause memory contention and negatively impact the HDFS NameNode.
6. Copy the client hive-site.xml, core-site.xml, hdfs-site.xml, and hbase-site.xml configuration files
to the Impala configuration directory, which defaults to /etc/impala/conf. Create this directory if it does
not already exist.
7. Use one of the following commands to install impala-shell on the machines from which you want to issue
queries. You can install impala-shell on any supported machine that can connect to DataNodes that are
running impalad.
For RHEL/CentOS systems:
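For example, assuming the impala-shell package name used by the CDH repositories:
$ sudo yum install impala-shell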
Note:
• Each version of CDH 5 has an associated version of Impala. When you upgrade from CDH 4 to CDH
5, you get whichever version of Impala comes with the associated level of CDH. Depending on the
version of Impala you were running on CDH 4, this could install a lower level of Impala on CDH 5.
For example, if you upgrade to CDH 5.0 from CDH 4 plus Impala 1.4, the CDH 5.0 installation comes
with Impala 1.3. Always check the associated level of Impala before upgrading to a specific version
of CDH 5. Where practical, upgrade from CDH 4 to the latest CDH 5, which also has the latest
Impala.
• When you upgrade Impala, also upgrade Cloudera Manager if necessary:
– Users running Impala on CDH 5 must upgrade to Cloudera Manager 5.0.0 or higher.
– Users running Impala on CDH 4 must upgrade to Cloudera Manager 4.8 or higher. Cloudera
Manager 4.8 includes management support for the Impala catalog service, and is the minimum
Cloudera Manager version you can use.
– Cloudera Manager is continually updated with configuration settings for features introduced
in the latest Impala releases.
• If you are upgrading from CDH 5 beta to CDH 5.0 production, make sure you are using the
appropriate CDH 5 repositories shown on the CDH version and packaging page, then follow the
procedures throughout the rest of this section.
• Every time you upgrade to a new major or minor Impala release, see Cloudera Impala Incompatible
Changes in the Release Notes for any changes needed in your source code, startup scripts, and
so on.
• Also check Cloudera Impala Known Issues in the Release Notes for any issues or limitations that
require workarounds.
• For the resource management feature to work (in combination with CDH 5 and the YARN and
Llama components), the impala user must be a member of the hdfs group. This setup is performed
automatically during a new install, but not when upgrading from earlier Impala releases to Impala
1.2. If you are upgrading a node to CDH 5 that already had Impala 1.1 or 1.0 installed, manually
add the impala user to the hdfs group.
Important: In CDH 5, there is not a separate Impala parcel; Impala is part of the main CDH 5 parcel.
Each level of CDH 5 has a corresponding version of Impala, and you upgrade Impala by upgrading
CDH. See the CDH 5 upgrade instructions and choose the instructions for parcels. The remainder of
this section only covers parcel upgrades for Impala under CDH 4.
If you originally installed Impala using packages, stop the Impala services and then remove the packages using
one of the following commands:
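For example, on a RHEL-compatible system, assuming the package names used when Impala was installed from
packages:
$ sudo yum remove impala impala-server impala-state-store impala-catalog impala-shell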
3. Go to the Hosts > Parcels tab. You should see a parcel with a newer version of Impala that you can upgrade
to.
4. Click Download, then Distribute. (The button changes as each step completes.)
5. Click Activate.
6. When prompted, click Restart to restart the Impala service.
5. Use one of the following sets of commands to update Impala shell on each node on which it is installed:
For RHEL, Oracle Linux, or CentOS systems:
2. Check if there are new recommended or required configuration settings to put into place in the configuration
files, typically under /etc/impala/conf. See Post-Installation Configuration for Impala for settings related
to performance and scalability.
3. Use one of the following sets of commands to update Impala on each Impala node in your cluster:
For RHEL, Oracle Linux, or CentOS systems:
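For example, assuming the same package names used at install time:
$ sudo yum update impala impala-server impala-state-store impala-catalog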
4. Use one of the following sets of commands to update Impala shell on each node on which it is installed:
For RHEL, Oracle Linux, or CentOS systems:
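For example, assuming the impala-shell package installed earlier:
$ sudo yum update impala-shell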
5. Depending on which release of Impala you are upgrading from, you might find that the symbolic links
/etc/impala/conf and /usr/lib/impala/sbin are missing. If so, see Known Issues in the Current
Production Release (Impala 2.2.x / CDH 5.4.x) for the procedure to work around this problem.
6. Restart Impala services:
a. Restart the Impala state store service on the desired nodes in your cluster. Expect to see a process named
statestored if the service started successfully.
Restart the state store service before the Impala server service to avoid “Not connected” errors when
you run impala-shell.
b. Restart the Impala catalog service on whichever host it runs on in your cluster. Expect to see a process
named catalogd if the service started successfully.
c. Restart the Impala daemon service on each node in your cluster. Expect to see a process named impalad
if the service started successfully.
Note:
If the services did not start successfully (even though the sudo service command might display
[OK]), check for errors in the Impala log file, typically in /var/log/impala.
Starting Impala
To begin using Cloudera Impala:
1. Set any necessary configuration options for the Impala services. See Modifying Impala Startup Options on
page 287 for details.
2. Start one instance of the Impala statestore. The statestore helps Impala to distribute work efficiently, and
to continue running in the event of availability problems for other Impala nodes. If the statestore becomes
unavailable, Impala continues to function.
3. Start one instance of the Impala catalog service.
4. Start the main Impala service on one or more DataNodes, ideally on all DataNodes to maximize local processing
and avoid network traffic due to remote reads.
Once Impala is running, you can conduct interactive experiments using the instructions in Impala Tutorials and
try Using the Impala Shell (impala-shell Command).
Starting Impala through Cloudera Manager
If you installed Impala with Cloudera Manager, use Cloudera Manager to start and stop services. The Cloudera
Manager GUI is a convenient way to check that all services are running, to set configuration options using form
fields in a browser, and to spot potential issues such as low disk space before they become serious. Cloudera
Manager automatically starts all the Impala-related services as a group, in the correct order. See the Cloudera
Manager Documentation for details.
Note:
Currently, Impala UDFs and UDAs are not persisted in the metastore database. Information about
these functions is held in the memory of the catalogd daemon. You must reload them by running
the CREATE FUNCTION statements again each time you restart the catalogd daemon.
Start the Impala service on each data node using a command similar to the following:
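A minimal sketch, assuming the init scripts installed by the Impala packages (the statestore and catalog services
should already be running, as noted earlier):
$ sudo service impala-server start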
The /etc/default/impala defaults file includes information about many resources used by Impala. Most of the defaults included in this file
should be effective in most cases. For example, typically you would not change the definition of the CLASSPATH
variable, but you would always set the address used by the statestore server. Some of the content you might
modify includes:
IMPALA_STATE_STORE_HOST=127.0.0.1
IMPALA_STATE_STORE_PORT=24000
IMPALA_BACKEND_PORT=22000
IMPALA_LOG_DIR=/var/log/impala
IMPALA_CATALOG_SERVICE_HOST=...
IMPALA_STATE_STORE_HOST=...
export IMPALA_STATE_STORE_ARGS=${IMPALA_STATE_STORE_ARGS:- \
-log_dir=${IMPALA_LOG_DIR} -state_store_port=${IMPALA_STATE_STORE_PORT}}
IMPALA_SERVER_ARGS=" \
-log_dir=${IMPALA_LOG_DIR} \
-catalog_service_host=${IMPALA_CATALOG_SERVICE_HOST} \
-state_store_port=${IMPALA_STATE_STORE_PORT} \
-use_statestore \
-state_store_host=${IMPALA_STATE_STORE_HOST} \
-be_port=${IMPALA_BACKEND_PORT}"
export ENABLE_CORE_DUMPS=${ENABLE_COREDUMPS:-false}
To use alternate values, edit the defaults file, then restart all the Impala-related services so that the changes
take effect (restart commands are sketched after this list). Settings you might change include:
• Statestore address. Update the value of the IMPALA_STATE_STORE_HOST variable so that each impalad daemon
can reach the statestore host. For example, if the statestore is hosted on a machine with an IP address of
192.168.0.27, change:
IMPALA_STATE_STORE_HOST=127.0.0.1
to:
IMPALA_STATE_STORE_HOST=192.168.0.27
• Catalog server address (including both the hostname and the port number). Update the value of the
IMPALA_CATALOG_SERVICE_HOST variable. Cloudera recommends the catalog server be on the same host
as the statestore. In that recommended configuration, the impalad daemon cannot refer to the catalog
server using the loopback address. If the catalog service is hosted on a machine with an IP address of
192.168.0.27, add the following line:
IMPALA_CATALOG_SERVICE_HOST=192.168.0.27:26000
The /etc/default/impala defaults file currently does not define an IMPALA_CATALOG_ARGS environment
variable, but if you add one it will be recognized by the service startup/shutdown script. Add a definition for
this variable to /etc/default/impala and add the option -catalog_service_host=hostname. If the port
is different than the default 26000, also add the option -catalog_service_port=port.
• Memory limits. You can limit the amount of memory available to Impala. For example, to allow Impala to use
no more than 70% of system memory, change:
export IMPALA_SERVER_ARGS=${IMPALA_SERVER_ARGS:- \
-log_dir=${IMPALA_LOG_DIR} \
-state_store_port=${IMPALA_STATE_STORE_PORT} \
-use_statestore -state_store_host=${IMPALA_STATE_STORE_HOST} \
-be_port=${IMPALA_BACKEND_PORT}}
to:
export IMPALA_SERVER_ARGS=${IMPALA_SERVER_ARGS:- \
-log_dir=${IMPALA_LOG_DIR} -state_store_port=${IMPALA_STATE_STORE_PORT} \
-use_statestore -state_store_host=${IMPALA_STATE_STORE_HOST} \
-be_port=${IMPALA_BACKEND_PORT} -mem_limit=70%}
You can specify the memory limit using absolute notation such as 500m or 2G, or as a percentage of physical
memory such as 60%.
Note: Queries that exceed the specified memory limit are aborted. Percentage limits are based
on the physical memory of the machine and do not consider cgroups.
• Core dump enablement. To enable core dumps, change:
export ENABLE_CORE_DUMPS=${ENABLE_COREDUMPS:-false}
to:
export ENABLE_CORE_DUMPS=${ENABLE_COREDUMPS:-true}
Note: The location of core dump files may vary according to your operating system configuration.
Other security settings may prevent Impala from writing core dumps even when this option is
enabled.
• Authorization using the open source Sentry plugin. Specify the -server_name and
-authorization_policy_file options as part of the IMPALA_SERVER_ARGS and IMPALA_STATE_STORE_ARGS
settings to enable the core Impala support for authentication. See Starting the impalad Daemon with Sentry
Authorization Enabled for details.
• Auditing for successful or blocked Impala queries, another aspect of security. Specify the
-audit_event_log_dir=directory_path option and optionally the
-max_audit_event_log_file_size=number_of_queries and -abort_on_failed_audit_event options
as part of the IMPALA_SERVER_ARGS settings, for each Impala node, to enable and customize auditing. See
Auditing Impala Operations for details.
• Password protection for the Impala web UI, which listens on port 25000 by default. This feature involves
adding some or all of the --webserver_password_file, --webserver_authentication_domain, and
--webserver_certificate_file options to the IMPALA_SERVER_ARGS and IMPALA_STATE_STORE_ARGS
settings. See Security Guidelines for Impala for details.
• Another setting you might add to IMPALA_SERVER_ARGS is:
-default_query_options='option=value;option=value;...'
These options control the behavior of queries performed by this impalad instance. The option values you
specify here override the default values for Impala query options, as shown by the SET statement in
impala-shell.
• Options for resource management, in conjunction with the YARN and Llama components. These options
include -enable_rm, -llama_host, -llama_port, -llama_callback_port, and -cgroup_hierarchy_path.
Additional options to help fine-tune the resource estimates are -rm_always_use_defaults,
-rm_default_memory=size, and -rm_default_cpu_cores. For details about these options, see impalad
Startup Options for Resource Management. See Integrated Resource Management with YARN for information
about resource management in general, and The Llama Daemon for information about the Llama daemon.
• During troubleshooting, Cloudera Support might direct you to change other values, particularly for
IMPALA_SERVER_ARGS, to work around issues or gather debugging information.
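After editing /etc/default/impala, restart the Impala services so the new settings take effect; a minimal sketch,
assuming the init scripts installed by the Impala packages:
$ sudo service impala-state-store restart
$ sudo service impala-catalog restart
$ sudo service impala-server restart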
The following startup options for impalad enable resource management and customize its parameters for your
cluster configuration:
• -enable_rm: Whether to enable resource management or not, either true or false. The default is false.
None of the other resource management options have any effect unless -enable_rm is turned on.
• -llama_host: Hostname or IP address of the Llama service that Impala should connect to. The default is
127.0.0.1.
• -llama_port: Port of the Llama service that Impala should connect to. The default is 15000.
• -llama_callback_port: Port that Impala should start its Llama callback service on. Llama reports when
resources are granted or preempted through that service.
• -cgroup_hierarchy_path: Path where YARN and Llama will create cgroups for granted resources. Impala
assumes that the cgroup for an allocated container is created in the path 'cgroup_hierarchy_path +
container_id'.
• -rm_always_use_defaults: If this Boolean option is enabled, Impala ignores computed estimates and
always obtains the default memory and CPU allocation from Llama at the start of the query. These default
estimates are approximately 2 CPUs and 4 GB of memory, possibly varying slightly depending on cluster size,
workload, and so on. Cloudera recommends enabling -rm_always_use_defaults whenever resource
management is used, and relying on these default values (that is, leaving out the two following options).
• -rm_default_memory=size: Optionally sets the default estimate for memory usage for each query. You
can use suffixes such as M and G for megabytes and gigabytes, the same as with the MEM_LIMIT query
option. Only has an effect when -rm_always_use_defaults is also enabled.
• -rm_default_cpu_cores: Optionally sets the default estimate for number of virtual CPU cores for each
query. Only has an effect when -rm_always_use_defaults is also enabled.
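For example, a sketch of an IMPALA_SERVER_ARGS setting in /etc/default/impala that enables resource
management with the default resource estimates (the Llama host name and cgroup path are illustrative):
export IMPALA_SERVER_ARGS=${IMPALA_SERVER_ARGS:- \
-log_dir=${IMPALA_LOG_DIR} -state_store_port=${IMPALA_STATE_STORE_PORT} \
-use_statestore -state_store_host=${IMPALA_STATE_STORE_HOST} \
-be_port=${IMPALA_BACKEND_PORT} \
-enable_rm=true -llama_host=llama-host.example.com -llama_port=15000 \
-cgroup_hierarchy_path=/sys/fs/cgroup/cpu/impala -rm_always_use_defaults}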
Note:
These startup options for the impalad daemon are different from the command-line options for the
impala-shell command. For the impala-shell options, see impala-shell Configuration Options.
The statestored daemon implements the Impala statestore service, which monitors the availability of Impala
services across the cluster, and handles situations such as nodes becoming unavailable or becoming available
again.
Startup Options for catalogd Daemon
The catalogd daemon implements the Impala catalog service, which broadcasts metadata changes to all the
Impala nodes when Impala creates a table, inserts data, or performs other kinds of DDL and DML operations.
By default, the metadata loading and caching on startup happens asynchronously, so Impala can begin accepting
requests promptly. To enable the original behavior, where Impala waited until all metadata was loaded before
accepting any requests, set the catalogd configuration option --load_catalog_in_background=false.
Hive Installation
Warning: HiveServer1 is deprecated in CDH 5.3, and will be removed in a future release of CDH. Users
of HiveServer1 should upgrade to HiveServer2 as soon as possible.
About Hive
Apache Hive is a powerful data warehousing application built on top of Hadoop; it enables you to access your
data using Hive QL, a language that is similar to SQL.
Note:
As of CDH 5, Hive includes HCatalog, but you still need to install HCatalog separately if you want to
use it; see HCatalog Installation on page 271.
Install Hive on your client machine(s) from which you submit jobs; you do not need to install it on the nodes in
your Hadoop cluster.
HiveServer2
You need to deploy HiveServer2, an improved version of HiveServer that supports a Thrift API tailored for JDBC
and ODBC clients, Kerberos authentication, and multi-client concurrency. The CLI for HiveServer2 is Beeline.
Important:
The original HiveServer and command-line interface (CLI) are deprecated; use HiveServer2 and Beeline.
Upgrading Hive
Upgrade Hive on all the hosts on which it is running: servers and clients.
Note: To see which version of Hive is shipping in CDH 5, check the Version and Packaging Information.
For important information on new and changed components, see the CDH 5 Release Notes.
Warning:
Make sure you have read and understood all incompatible changes and known issues before you
upgrade Hive.
Note:
If you have already performed the steps to uninstall CDH 4 and all components, as described under
Upgrading from CDH 4 to CDH 5, you can skip Step 1 below and proceed with installing the new CDH
5 version of Hive.
Warning:
You must make sure no Hive processes are running. If Hive processes are running during the upgrade,
the new version will not work correctly.
1. Exit the Hive console and make sure no Hive scripts are running.
2. Stop any HiveServer processes that are running. If HiveServer is running as a daemon, use the following
command to stop it:
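A sketch, assuming the init script name installed by the CDH HiveServer1 package:
$ sudo service hive-server stop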
If the metastore is running from the command line, stop it with <CTRL>-c.
4. Remove Hive:
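For example, on a RHEL-compatible system (substitute zypper or apt-get on SLES or Debian/Ubuntu systems):
$ sudo yum remove hive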
Step 2: Install the new Hive version on all hosts (Hive servers and clients)
See Installing Hive.
Important:
• Cloudera strongly encourages you to make a backup copy of your metastore database before
running the upgrade scripts. You will need this backup copy if you run into problems during the
upgrade or need to downgrade to a previous version.
• You must upgrade the metastore schema to the version corresponding to the new version of Hive
before starting Hive after the upgrade. Failure to do so may result in metastore corruption.
• To run a script, you must first cd to the directory that script is in: that is
/usr/lib/hive/scripts/metastore/upgrade/<database>.
As of CDH 5, there are now two ways to do this. You could either use Hive's schematool or use the schema
upgrade scripts provided with the Hive package.
Using schematool (Recommended):
The Hive distribution includes an offline tool for Hive metastore schema manipulation called schematool. This
tool can be used to initialize the metastore schema for the current Hive version. It can also upgrade the schema
from an older version to the current one.
To upgrade the schema, use the upgradeSchemaFrom option to specify the version of the schema you are
currently using (see table below) and the compulsory dbType option to specify the database you are using. The
example that follows shows an upgrade from Hive 0.10.0 (CDH 4) for an installation using the Derby database.
Possible values for the dbType option are mysql, postgres, derby or oracle. The following table lists the Hive
versions corresponding to the older CDH releases.
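A sketch of that invocation, using the option names listed under schematool later in this section:
$ schematool -dbType derby -upgradeSchemaFrom 0.10.0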
See Using the Hive Schema Tool for more details on how to use schematool.
Using Schema Upgrade Scripts:
Run the appropriate schema upgrade script(s); they are in /usr/lib/hive/scripts/metastore/upgrade/.
Start with the script for your database and Hive version, and run all subsequent scripts.
For example, if you are currently running Hive 0.10 with MySQL, and upgrading to Hive 0.13.1, start with the
script for Hive 0.10 to 0.11 for MySQL, then run the script for Hive 0.11 to 0.12 for MySQL, then run the script for
Hive 0.12 to 0.13.1.
For more information about upgrading the schema, see the README in
/usr/lib/hive/scripts/metastore/upgrade/.
Important:
• If you are currently running Hive under MRv1, check for the following property and value in
/etc/mapred/conf/mapred-site.xml:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
Remove this property before you proceed; otherwise Hive queries spawned from MapReduce jobs
will fail with a null pointer exception (NPE).
• If you have installed the hive-hcatalog-server package in the past, you must remove it before
you proceed; otherwise the upgrade will fail.
• If you are upgrading Hive from CDH 5.0.5 to CDH 5.4, 5.3 or 5.2 on Debian 7.0, and a Sentry version
later than 5.0.4 and earlier than 5.1.0 is installed, you must upgrade Sentry before upgrading Hive;
otherwise the upgrade will fail. See Apache Hive Known Issues for more details.
• CDH 5.2 and later clients cannot communicate with CDH 5.1 and earlier servers. This means that
you must upgrade the server before the clients.
Warning:
You must make sure no Hive processes are running. If Hive processes are running during the upgrade,
the new version will not work correctly.
Step 2: Install the new Hive version on all hosts (Hive servers and clients)
See Installing Hive on page 298.
Important:
• Cloudera strongly encourages you to make a backup copy of your metastore database before
running the upgrade scripts. You will need this backup copy if you run into problems during the
upgrade or need to downgrade to a previous version.
• You must upgrade the metastore schema to the version corresponding to the new version of Hive
before starting Hive after the upgrade. Failure to do so may result in metastore corruption.
• To run a script, you must first cd to the directory that script is in: that is
/usr/lib/hive/scripts/metastore/upgrade/<database>.
As of CDH 5, there are now two ways to do this. You could either use Hive's schematool or use the schema
upgrade scripts provided with the Hive package.
Using schematool (Recommended):
The Hive distribution includes an offline tool for Hive metastore schema manipulation called schematool. This
tool can be used to initialize the metastore schema for the current Hive version. It can also upgrade the schema
from an older version to the current one.
To upgrade the schema, use the upgradeSchemaFrom option to specify the version of the schema you are
currently using (see table below) and the compulsory dbType option to specify the database you are using. The
example that follows shows an upgrade from Hive 0.10.0 (CDH 4) for an installation using the Derby database.
Possible values for the dbType option are mysql, postgres, derby or oracle. The following table lists the Hive
versions corresponding to the older CDH releases.
See Using the Hive Schema Tool for more details on how to use schematool.
Using Schema Upgrade Scripts:
Run the appropriate schema upgrade script(s); they are in /usr/lib/hive/scripts/metastore/upgrade/.
Start with the script for your database and Hive version, and run all subsequent scripts.
For example, if you are currently running Hive 0.10 with MySQL, and upgrading to Hive 0.13.1, start with the
script for Hive 0.10 to 0.11 for MySQL, then run the script for Hive 0.11 to 0.12 for MySQL, then run the script for
Hive 0.12 to 0.13.1.
For more information about upgrading the schema, see the README in
/usr/lib/hive/scripts/metastore/upgrade/.
If you fail to upgrade the metastore schema before starting the new version of Hive, the metastore fails with an
error such as:
Hive Schema version 0.13.0 does not match metastore's schema version 0.12.0
Metastore is not upgraded or corrupt.
Installing Hive
Install the appropriate Hive packages using the appropriate command for your distribution.
OS Command
RHEL-compatible $ sudo yum install <pkg1> <pkg2> ...
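For example, to install the Hive metastore and HiveServer2 packages on a RHEL-compatible host, assuming the
package names used by CDH 5 (substitute zypper install on SLES, or apt-get install on Ubuntu and Debian):
$ sudo yum install hive hive-metastore hive-server2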
In addition, workstations running the Beeline CLI should use a heap size of at least 2 GB.
else
export HADOOP_OPTS="$HADOOP_OPTS -XX:NewRatio=12 -Xmx12288m -Xms10m
-XX:MaxHeapFreeRatio=40 -XX:MinHeapFreeRatio=15 -XX:-UseGCOverheadLimit"
fi
fi
export HADOOP_HEAPSIZE=2048
You can choose whether to use the Concurrent Collector or the New Parallel Collector for garbage collection, by
passing -XX:+UseParNewGC or -XX:+UseConcMarkSweepGC in the HADOOP_OPTS lines above, and you can tune
the garbage collection overhead limit by setting -XX:-UseGCOverheadLimit. To enable the garbage collection
overhead limit, remove the setting or change it to -XX:+UseGCOverheadLimit.
Configuration for WebHCat
If you want to use WebHCat, you need to set the PYTHON_CMD variable in /etc/default/hive-webhcat-server
after installing Hive; for example:
export PYTHON_CMD=/usr/bin/python
Note: HiveServer in the discussion that follows refers to HiveServer1 or HiveServer2, whichever you
are using.
Embedded Mode
Cloudera recommends using this mode for experimental purposes only.
This is the default metastore deployment mode for CDH. In this mode the metastore uses a Derby database,
and both the database and the metastore service run embedded in the main HiveServer process. Both are
started for you when you start the HiveServer process. This mode requires the least amount of effort to configure,
but it can support only one active user at a time and is not certified for production use.
Local Mode
In this mode the Hive metastore service runs in the same process as the main HiveServer process, but the
metastore database runs in a separate process, and can be on a separate host. The embedded metastore service
communicates with the metastore database over JDBC.
Remote Mode
Cloudera recommends that you use this mode.
In this mode the Hive metastore service runs in its own JVM process; HiveServer2, HCatalog, Cloudera Impala™,
and other processes communicate with it via the Thrift network API (configured via the hive.metastore.uris
property). The metastore service communicates with the metastore database over JDBC (configured via the
javax.jdo.option.ConnectionURL property). The database, the HiveServer process, and the metastore service
can all be on the same host, but running the HiveServer process on a separate host provides better availability
and scalability.
The main advantage of Remote mode over Local mode is that Remote mode does not require the administrator
to share JDBC login information for the metastore database with each Hive user. HCatalog requires this mode.
Supported Metastore Databases
See Supported Databases on page 21 for up-to-date information on supported databases. Cloudera strongly
encourages you to use MySQL because it is the most popular with the rest of the Hive user community, and so
receives more testing than the other options.
Metastore Memory Requirements
For information on configuring heap for Hive MetaStore, as well as HiveServer2 and Hive clients, see Configuring
Heap Size and Garbage Collection for Hive Components on page 299.
Configuring the Metastore Database
This section describes how to configure Hive to use a remote database, with examples for MySQL and PostgreSQL.
The configuration properties for the Hive metastore are documented on the Hive Metastore documentation
page, which also includes a pointer to the E/R diagram for the Hive metastore.
Note: For information about additional configuration that may be needed in a secure cluster, see
Hive Authentication.
After using the command to install MySQL, you may need to respond to prompts to confirm that you do want
to complete the installation. After installation completes, start the mysql daemon.
On Red Hat systems
On the Hive Metastore server host, install mysql-connector-java and symbolically link the file into the
/usr/lib/hive/lib/ directory.
$ sudo cp mysql-connector-java-version/mysql-connector-java-version-bin.jar
/usr/lib/hive/lib/
Note: At the time of publication, version was 5.1.31, but the version may have changed by the
time you read this. If you are using MySQL version 5.6, you must use version 5.1.26 or later of the
driver.
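If you install the connector from the mysql-connector-java package rather than copying a downloaded JAR, a
symbolic link to the packaged JAR is typically sufficient (the /usr/share/java path assumes the stock RPM layout):
$ sudo yum install mysql-connector-java
$ sudo ln -s /usr/share/java/mysql-connector-java.jar /usr/lib/hive/lib/mysql-connector-java.jar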
Configure MySQL to use a strong password and to start at boot. Note that in the following procedure, your
current root password is blank. Press the Enter key when you're prompted for the root password.
To set the MySQL root password:
$ sudo /usr/bin/mysql_secure_installation
[...]
Enter current password for root (enter for none):
OK, successfully used password, moving on...
[...]
Set root password? [Y/n] y
New password:
Re-enter new password:
Remove anonymous users? [Y/n] Y
[...]
Disallow root login remotely? [Y/n] N
[...]
Remove test database and access to it [Y/n] Y
[...]
Reload privilege tables now? [Y/n] Y
All done!
• On SLES systems:
• On Debian/Ubuntu systems:
Note:
If the metastore service will run on the host where the database is installed, replace
'metastorehost' in the CREATE USER example with 'localhost'. Similarly, the value of
javax.jdo.option.ConnectionURL in /etc/hive/conf/hive-site.xml (discussed in the next
step) must be jdbc:mysql://localhost/metastore. For more information on adding MySQL
users, see http://dev.mysql.com/doc/refman/5.5/en/adding-users.html.
Create the initial database schema. Cloudera recommends using the Hive schema tool to do this.
If for some reason you decide not to use the schema tool, you can use the hive-schema-0.12.0.mysql.sql
file instead; that file is located in the /usr/lib/hive/scripts/metastore/upgrade/mysql directory.
Proceed as follows if you decide to use hive-schema-0.12.0.mysql.sql.
Example using hive-schema-0.12.0.mysql.sql
Note:
Do this only if you are not using the Hive schema tool.
$ mysql -u root -p
Enter password:
mysql> CREATE DATABASE metastore;
mysql> USE metastore;
mysql> SOURCE
/usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-0.12.0.mysql.sql;
You also need a MySQL user account for Hive to use to access the metastore. It is very important to prevent
this user account from creating or altering tables in the metastore database schema.
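A sketch of creating such an account and granting it only DML privileges on the metastore tables (the host, user
name, and password are illustrative; use 'localhost' if the metastore service runs on the database host, as noted
above):
mysql> CREATE USER 'hive'@'metastorehost' IDENTIFIED BY 'mypassword';
mysql> REVOKE ALL PRIVILEGES, GRANT OPTION FROM 'hive'@'metastorehost';
mysql> GRANT SELECT,INSERT,UPDATE,DELETE,LOCK TABLES,EXECUTE ON metastore.* TO 'hive'@'metastorehost';
mysql> FLUSH PRIVILEGES;
mysql> quit;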
Important: To prevent users from inadvertently corrupting the metastore schema when they use
older or newer versions of Hive, set the hive.metastore.schema.verification property to
true in /usr/lib/hive/conf/hive-site.xml on the metastore host.
Example
Note:
The hive.metastore.local property is no longer supported as of Hive 0.10; setting
hive.metastore.uris is sufficient to indicate that you are using a remote metastore.
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://myhost/metastore</value>
<description>the URL of the MySQL database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>mypassword</value>
</property>
<property>
<name>datanucleus.autoCreateSchema</name>
<value>false</value>
</property>
<property>
<name>datanucleus.fixedDatastore</name>
<value>true</value>
</property>
<property>
<name>datanucleus.autoStartMechanism</name>
<value>SchemaTable</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://<n.n.n.n>:9083</value>
<description>IP address (or fully-qualified domain name) and port of the metastore
host</description>
</property>
<property>
<name>hive.metastore.schema.verification</name>
<value>true</value>
</property>
After using the command to install PostgreSQL, you may need to respond to prompts to confirm that you
do want to complete the installation. In order to finish installation on Red Hat compatible systems, you need
to initialize the database. Please note that this operation is not needed on Ubuntu and SLES systems as it's
done automatically on first start:
To initialize database files on Red Hat compatible systems
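For example (the service name assumes the distribution's packaged PostgreSQL):
$ sudo service postgresql initdb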
To ensure that your PostgreSQL server will be accessible over the network, you need to do some additional
configuration.
First you need to edit the postgresql.conf file. Set the listen_addresses property to *, to make sure
that the PostgreSQL server starts listening on all your network interfaces. Also make sure that the
standard_conforming_strings property is set to off.
You can check that you have the correct values as follows:
On Red-Hat-compatible systems:
On SLES systems:
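A typical check (the data directory path varies by distribution; /var/lib/pgsql/data is common on
Red-Hat-compatible systems):
$ sudo cat /var/lib/pgsql/data/postgresql.conf | grep -e listen -e standard_conforming_strings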
You also need to configure authentication for your network in pg_hba.conf. You need to make sure that
the PostgreSQL user that you will create later in this procedure will have access to the server from a remote
host. To do this, add a new line into pg_hba.conf that has the following information:
The following example allows all users to connect from all hosts to all your databases:
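A typical pg_hba.conf line for this (the address range and authentication method are illustrative):
host    all    all    0.0.0.0/0    md5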
Note:
This configuration is applicable only for a network listener. Using this configuration won't open all
your databases to the entire world; the user must still supply a password to authenticate himself,
and privilege restrictions configured in PostgreSQL will still be applied.
After completing the installation and configuration, you can start the database server:
Start PostgreSQL Server
Use the chkconfig utility to ensure that your PostgreSQL server will start at boot time. For example:
chkconfig postgresql on
You can then use the chkconfig utility to verify that the PostgreSQL server will be started at boot time, for example:
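A typical invocation (the exact output varies by distribution):
$ chkconfig --list postgresql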
$ wget http://jdbc.postgresql.org/download/postgresql-9.2-1002.jdbc4.jar
$ mv postgresql-9.2-1002.jdbc4.jar /usr/lib/hive/lib/
Note:
You may need to use a different version if you have a different version of Postgres. You can check
the version as follows:
Now you need to grant permission for all metastore tables to user hiveuser. PostgreSQL does not have
statements to grant the permissions for all tables at once; you'll need to grant the permissions one table at
a time. You could automate the task with the following SQL script:
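A sketch of such a script, run in psql as the postgres superuser against the metastore database (the temporary
file path is illustrative):
$ sudo -u postgres psql
postgres=# \c metastore
metastore=# \pset tuples_only on
metastore=# \o /tmp/grant-privs
metastore=# SELECT 'GRANT SELECT,INSERT,UPDATE,DELETE ON "' || schemaname || '"."' || tablename || '" TO hiveuser ;'
metastore-# FROM pg_tables
metastore-# WHERE tableowner = CURRENT_USER and schemaname = 'public';
metastore=# \o
metastore=# \pset tuples_only off
metastore=# \i /tmp/grant-privs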
Note:
If you are running these commands interactively and are still in the Postgres session initiated at
the beginning of this step, you do not need to repeat sudo -u postgres psql.
You can verify the connection from the machine where you'll be running the metastore service as follows:
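For example (host and database names as used in this section):
$ psql -h myhost -U hiveuser -d metastore
metastore=#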
You can use the same hive-site.xml on all hosts (client, metastore, HiveServer); hive.metastore.uris is the only
property that must be configured on all of them; the others are used only on the metastore host.
Given a PostgreSQL database running on host myhost under the user account hive with the password
mypassword, you would set configuration properties as follows.
Note:
• The instructions in this section assume you are using Remote mode, and that the PostgreSQL
database is installed on a separate host from the metastore server.
• The hive.metastore.local property is no longer supported as of Hive 0.10; setting
hive.metastore.uris is sufficient to indicate that you are using a remote metastore.
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:postgresql://myhost/metastore</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.postgresql.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hiveuser</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>mypassword</value>
</property>
<property>
<name>datanucleus.autoCreateSchema</name>
<value>false</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://<n.n.n.n>:9083</value>
<description>IP address (or fully-qualified domain name) and port of the metastore
host</description>
</property>
<property>
<name>hive.metastore.schema.verification</name>
<value>true</value>
</property>
The Oracle database is not part of any Linux distribution and must be purchased, downloaded and installed
separately. You can use the Express edition, which can be downloaded free from the Oracle website.
2. Install the Oracle JDBC Driver
You must download the Oracle JDBC Driver from the Oracle website and put the ojdbc6.jar file into the
/usr/lib/hive/lib/ directory.
Note:
These URLs were correct at the time of publication, but the Oracle site is restructured frequently.
Connect as the newly created hiveuser user and load the initial schema, as in the following example (use
the appropriate script for the current release in /usr/lib/hive/scripts/metastore/upgrade/oracle/):
$ sqlplus hiveuser
SQL> @/usr/lib/hive/scripts/metastore/upgrade/oracle/hive-schema-0.12.0.oracle.sql
Connect back as an administrator and remove the power privileges from user hiveuser. Then grant limited
access to all the tables:
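A sketch of those statements, run as a DBA user (the schema owner name HIVEUSER and the privilege set are
assumptions based on the schema loaded above):
revoke connect, resource from hiveuser;
grant create session to hiveuser;
BEGIN
  FOR R IN (SELECT owner, table_name FROM all_tables WHERE owner = 'HIVEUSER') LOOP
    EXECUTE IMMEDIATE 'grant SELECT,INSERT,UPDATE,DELETE on ' || R.owner || '.' || R.table_name || ' to hiveuser';
  END LOOP;
END;
/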
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:oracle:thin:@//myhost/xe</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>oracle.jdbc.OracleDriver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hiveuser</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>mypassword</value>
</property>
<property>
<name>datanucleus.autoCreateSchema</name>
<value>false</value>
</property>
<property>
<name>datanucleus.fixedDatastore</name>
<value>true</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://<n.n.n.n>:9083</value>
<description>IP address (or fully-qualified domain name) and port of the metastore
host</description>
</property>
<property>
<name>hive.metastore.schema.verification</name>
<value>true</value>
</property>
Configuring HiveServer2
You must make the following configuration changes before using HiveServer2. Failure to do so may result in
unpredictable behavior.
Warning: HiveServer1 is deprecated in CDH 5.3, and will be removed in a future release of CDH. Users
of HiveServer1 should upgrade to HiveServer2 as soon as possible.
For information on configuring heap for HiveServer2, as well as Hive MetaStore and Hive clients, see Configuring
Heap Size and Garbage Collection for Hive Components on page 299.
Table Lock Manager (Required)
You must properly configure and enable Hive's Table Lock Manager. This requires installing ZooKeeper and
setting up a ZooKeeper ensemble; see ZooKeeper Installation.
Important:
Failure to do this will prevent HiveServer2 from handling concurrent query requests and may result
in data corruption.
Enable the lock manager by setting properties in /etc/hive/conf/hive-site.xml as follows (substitute your
actual ZooKeeper node names for those in the example):
<property>
<name>hive.support.concurrency</name>
<description>Enable Hive's Table Lock Manager Service</description>
<value>true</value>
</property>
<property>
<name>hive.zookeeper.quorum</name>
<description>Zookeeper quorum used by Hive's Table Lock Manager</description>
<value>zk1.myco.com,zk2.myco.com,zk3.myco.com</value>
</property>
Important:
Enabling the Table Lock Manager without specifying a list of valid Zookeeper quorum nodes will result
in unpredictable behavior. Make sure that both properties are properly configured.
(The above settings are also needed if you are still using HiveServer1. HiveServer1 is deprecated; migrate to
HiveServer2 as soon as possible.)
hive.zookeeper.client.port
If ZooKeeper is not using the default value forClientPort, you need to sethive.zookeeper.client.portin
/etc/hive/conf/hive-site.xml to the same value that ZooKeeper is using. Check
/etc/zookeeper/conf/zoo.cfg to find the value forClientPort. IfClientPortis set to any value other than 2181
(the default), sethive.zookeeper.client.portto the same value. For example, ifClientPortis set to 2222,
sethive.zookeeper.client.portto 2222 as well:
<property>
<name>hive.zookeeper.client.port</name>
<value>2222</value>
<description>
The port at which the clients will connect.
</description>
</property>
JDBC driver
The connection URL format and the driver class are different for HiveServer2 and HiveServer1:
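For reference, the usual values are (the HiveServer2 values match the Beeline examples later in this section; the
HiveServer1 values are the legacy equivalents):
HiveServer2: connection URL jdbc:hive2://<host>:<port>, driver class org.apache.hive.jdbc.HiveDriver
HiveServer1: connection URL jdbc:hive://<host>:<port>, driver class org.apache.hadoop.hive.jdbc.HiveDriver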
Authentication
HiveServer2 can be configured to authenticate all connections; by default, it allows any client to connect.
HiveServer2 supports either Kerberos or LDAP authentication; configure this in the
hive.server2.authentication property in the hive-site.xml file. You can also configure Pluggable
Authentication, which allows you to use a custom authentication provider for HiveServer2; and HiveServer2
Impersonation, which allows users to execute queries and access HDFS files as the connected user rather than
the super user who started the HiveServer2 daemon. For more information, see Hive Security Configuration.
Important:
Cloudera strongly recommends running HiveServer2 instead of the original HiveServer (HiveServer1)
package; HiveServer1 is deprecated.
HiveServer2 and HiveServer1 can be run concurrently on the same system, sharing the same data sets. This
allows you to run HiveServer1 to support, for example, Perl or Python scripts that use the native HiveServer1
Thrift bindings.
Both HiveServer2 and HiveServer1 bind to port 10000 by default, so at least one of them must be configured to
use a different port. You can set the port for HiveServer2 in hive-site.xml by means of the
hive.server2.thrift.port property. For example:
<property>
<name>hive.server2.thrift.port</name>
<value>10001</value>
<description>TCP port number to listen on, default 10000</description>
</property>
You can also specify the port (and the host IP address in the case of HiveServer2) by setting these environment
variables:
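The variable names below are the ones commonly read by the Hive service scripts (confirm against your
hive-env.sh):
HiveServer2: HIVE_SERVER2_THRIFT_BIND_HOST (host) and HIVE_SERVER2_THRIFT_PORT (port)
HiveServer1: HIVE_PORT (port)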
Important:
If you are running the metastore in Remote mode, you must start the metastore before starting
HiveServer2.
Important:
Cloudera recommends setting permissions on the Hive warehouse directory to 1777, making it
accessible to all users, with the sticky bit set. This allows users to create and access their tables, but
prevents them from deleting tables they don't own.
In addition, each user submitting queries must have an HDFS home directory. /tmp (on the local file system)
must be world-writable, as Hive makes extensive use of it.
HiveServer2 Impersonation allows users to execute queries and access HDFS files as the connected user.
If you do not enable impersonation, HiveServer2 by default executes all Hive tasks as the user ID that starts the
Hive server; for clusters that use Kerberos authentication, this is the ID that maps to the Kerberos principal used
with HiveServer2. Setting permissions to 1777, as recommended above, allows this user access to the Hive
warehouse directory.
You can change this default behavior by setting hive.metastore.execute.setugi to true on both the server
and client. This setting causes the metastore server to use the client's user and group permissions.
Starting, Stopping, and Using HiveServer2
HiveServer2 is an improved version of HiveServer that supports Kerberos authentication and multi-client
concurrency. Cloudera recommends HiveServer2.
Warning:
If you are running the metastore in Remote mode, you must start the Hive metastore before you
start HiveServer2. HiveServer2 tries to communicate with the metastore as part of its initialization
bootstrap. If it is unable to do this, it fails with an error.
To start HiveServer2:
To stop HiveServer2:
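A sketch, assuming the init script installed by the hive-server2 package:
$ sudo service hive-server2 start
$ sudo service hive-server2 stop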
To confirm that HiveServer2 is working, start the beeline CLI and use it to execute a SHOW TABLES query on
the HiveServer2 process:
$ /usr/lib/hive/bin/beeline
beeline> !connect jdbc:hive2://localhost:10000 username password
org.apache.hive.jdbc.HiveDriver
0: jdbc:hive2://localhost:10000> SHOW TABLES;
show tables;
+-----------+
| tab_name |
+-----------+
+-----------+
No rows selected (0.238 seconds)
0: jdbc:hive2://localhost:10000>
Note:
Cloudera does not currently support using the Thrift HTTP protocol to connect Beeline to HiveServer2
(meaning that you cannot set hive.server2.transport.mode=http). Use the Thrift TCP protocol.
Use the following commands to start beeline and connect to a running HiveServer2 process. In this example
the HiveServer2 process is running on localhost at port 10000:
$ beeline
beeline> !connect jdbc:hive2://localhost:10000 username password
org.apache.hive.jdbc.HiveDriver
0: jdbc:hive2://localhost:10000>
Note:
If you are using HiveServer2 on a cluster that does not have Kerberos security enabled, then the
password is arbitrary in the command for starting Beeline.
If you are using HiveServer2 on a cluster that does have Kerberos security enabled, see HiveServer2
Security Configuration.
As of CDH 5.2, there are still some Hive CLI features that are not available with Beeline. For example:
• Beeline does not show query logs like the Hive CLI
• When adding JARs to HiveServer2 with Beeline, the JARs must be on the HiveServer2 host.
At present the best source for documentation on Beeline is the original SQLLine documentation.
Starting HiveServer1 and the Hive Console
Important:
Because of concurrency and security issues, HiveServer1 is deprecated in CDH 5 and will be removed
in a future release. Cloudera recommends you migrate to Beeline and HiveServer2 as soon as possible.
The Hive Console is not needed if you are using Beeline with HiveServer2.
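To start the HiveServer1 daemon itself, a sketch assuming the init script installed by the hive-server package:
$ sudo service hive-server start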
To start the Hive console:
$ hive
hive>
To confirm that Hive is working, issue the show tables; command to list the Hive tables; be sure to use a
semi-colon after the command:
...
Caused by: MetaException(message:Version information not found in metastore. )
at
org.apache.hadoop.hive.metastore.ObjectStore.checkSchema(ObjectStore.java:5638)
...
To suppress the schema check and allow the metastore to implicitly modify the schema, you need to set the
hive.metastore.schema.verification configuration property to false in hive-site.xml.
Using schematool
The Hive distribution now includes an offline tool for Hive metastore schema manipulation called schematool.
This tool can be used to initialize the metastore schema for the current Hive version. It can also handle upgrading
schema from an older version to the current one. The tool will try to find the current schema from the metastore
if available. However, this applies only to future upgrades. If you are upgrading from existing
CDH releases like CDH 4 or CDH 3, you should specify the schema version of the existing metastore as a command
line option to the tool.
The schematool figures out the SQL scripts required to initialize or upgrade the schema and then executes
those scripts against the backend database. The metastore database connection information such as JDBC URL,
JDBC driver and database credentials are extracted from the Hive configuration. You can provide alternate
database credentials if needed.
The following options are available as part of the schematool package.
$ schematool -help
usage: schemaTool
-dbType <databaseType> Metastore database type
-dryRun List SQL scripts (no execute)
-help Print this message
-info Show config and schema details
-initSchema Schema initialization
-initSchemaTo <initTo> Schema initialization to a version
-passWord <password> Override config file password
-upgradeSchema Schema upgrade
-upgradeSchemaFrom <upgradeFrom> Schema upgrade from a version
-userName <user> Override config file user name
-verbose Only print SQL statements
The dbType option should always be specified and can be one of the following:
derby|mysql|postgres|oracle
Usage Examples
• Initialize your metastore to the current schema for a new Hive setup using the initSchema option.
• If you attempt to get schema information from older metastores that did not store version information, the
tool will report an error as follows.
• You can upgrade schema from a CDH 4 release by specifying the upgradeSchemaFrom option.
• If you want to find out all the required scripts for a schema upgrade, use the dryRun option.
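Sketches of these invocations, using derby as the dbType (substitute mysql, postgres, or oracle as appropriate):
$ schematool -dbType derby -initSchema
$ schematool -dbType derby -info
$ schematool -dbType derby -upgradeSchemaFrom 0.10.0
$ schematool -dbType derby -upgradeSchemaFrom 0.10.0 -dryRun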
Note:
The CDH 5.2 Hive JDBC driver is not wire-compatible with the CDH 5.1 version of HiveServer2. Make
sure you upgrade Hive clients and all other Hive hosts in tandem: the server first, and then the clients.
1. Install the package (it is included in CDH packaging). Use one of the following commands, depending on the
target operating system:
• On Red-Hat-compatible systems:
• On SLES systems:
• For each user who will be submitting MapReduce jobs using MapReduce v2 (YARN), or running Pig, Hive, or
Sqoop in a YARN installation, set the HADOOP_MAPRED_HOME environment variable as follows:
$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
• For each user who will be submitting MapReduce jobs using MapReduce v1 (MRv1), or running Pig, Hive, or
Sqoop in an MRv1 installation, set the HADOOP_MAPRED_HOME environment variable as follows:
$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce
For example, you might partition a 100 TB table into 10,000 partitions, each 10 GB in size. In addition, do not use more than 10,000 partitions per
table. Having too many small partitions puts significant strain on the Hive MetaStore and does not improve
performance.
Hive Queries Fail with "Too many counters" Error
Explanation
Hive operations use various counters while executing MapReduce jobs. These per-operator counters are enabled
by the configuration setting hive.task.progress. This is disabled by default; if it is enabled, Hive may create
a large number of counters (4 counters per operator, plus another 20).
Note:
If dynamic partitioning is enabled, Hive implicitly enables the counters during data load.
By default, CDH restricts the number of MapReduce counters to 120. Hive queries that require more counters
will fail with the "Too many counters" error.
What To Do
If you run into this error, set mapreduce.job.counters.max in mapred-site.xml to a higher value.
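For example (the value 500 is illustrative; size it to your largest query):
<property>
<name>mapreduce.job.counters.max</name>
<value>500</value>
</property>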
Viewing the Hive Documentation
For additional Hive documentation, see the Apache Hive wiki.
To view Cloudera's video tutorial about using Hive, see Introduction to Apache Hive.
HttpFS Installation
HttpFS is a service that provides HTTP access to HDFS, including a REST API that supports all HDFS file system
operations (both read and write). You can use HttpFS to:
• Read and write data in HDFS using HTTP utilities (such as curl or wget) and HTTP libraries from languages
other than Java (such as Perl).
• Transfer data between HDFS clusters running different versions of Hadoop (overcoming RPC versioning
issues), for example using Hadoop DistCp.
• Read and write data in HDFS in a cluster behind a firewall. (The HttpFS server acts as a gateway and is the
only system that is allowed to send and receive data through the firewall).
HttpFS supports Hadoop pseudo-authentication, HTTP SPNEGO Kerberos, and additional authentication
mechanisms via a plugin API. HttpFS also supports Hadoop proxy user functionality.
The webhdfs client file system implementation can access HttpFS via the Hadoop filesystem command (hadoop
fs), by using Hadoop DistCp, and from Java applications using the Hadoop file system Java API.
The HttpFS HTTP REST API is interoperable with the WebHDFS REST HTTP API.
For more information about HttpFS, see Hadoop HDFS over HTTP.
HttpFS Packaging
There are two packaging options for installing HttpFS:
• The hadoop-httpfs RPM package
• The hadoop-httpfs Debian package
You can also download a Hadoop tarball, which includes HttpFS, from here.
HttpFS Prerequisites
Prerequisites for installing HttpFS are:
• An operating system supported by CDH 5
• Java: see Java Development Kit Installation for details
Note:
To see which version of HttpFS is shipping in CDH 5, check the Version and Packaging Information.
For important information on new and changed components, see the CDH 5 Release Notes. CDH 5
Hadoop works with the CDH 5 version of HttpFS.
Installing HttpFS
HttpFS is distributed in the hadoop-httpfs package. To install it, use your preferred package manager application.
Install the package on the system that will run the HttpFS server.
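For example, using the hadoop-httpfs package named above:
On RHEL-compatible systems:
$ sudo yum install hadoop-httpfs
On SLES systems:
$ sudo zypper install hadoop-httpfs
On Ubuntu or Debian systems:
$ sudo apt-get install hadoop-httpfs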
Note:
Installing the httpfs package creates an httpfs service configured to start HttpFS at system startup
time.
You are now ready to configure HttpFS. See the next section.
Configuring HttpFS
When you install HttpFS from an RPM or Debian package, HttpFS creates all configuration, documentation, and
runtime files in the standard Unix directories, as follows.
Binaries /usr/lib/hadoop-httpfs/
Configuration /etc/hadoop-httpfs/conf/
Data /var/lib/hadoop-httpfs/
Logs /var/log/hadoop-httpfs/
temp /var/tmp/hadoop-httpfs/
Configure the cluster to allow the httpfs user to act as a proxy user by adding the following properties to the
cluster's core-site.xml file, then restart Hadoop so the change takes effect:
<property>
<name>hadoop.proxyuser.httpfs.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.httpfs.groups</name>
<value>*</value>
</property>
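To start the server, a sketch assuming the init script created by the hadoop-httpfs package (see the note above
about the service being registered at install time):
$ sudo service hadoop-httpfs start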
If you see the message Server httpfs started!, status NORMAL in the httpfs.log log file, the system
has started successfully.
Note:
By default, HttpFS server runs on port 14000 and its URL is
http://<HTTPFS_HOSTNAME>:14000/webhdfs/v1.
$ curl "http://localhost:14000/webhdfs/v1?op=gethomedirectory&user.name=babu"
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Set-Cookie:
hadoop.auth="u=babu&p=babu&t=simple&e=1332977755010&s=JVfT4T785K4jeeLNWXK68rc/0xI=";
Version=1; Path=/
Content-Type: application/json
Transfer-Encoding: chunked
Date: Wed, 28 Mar 2012 13:35:55 GMT
{"Path":"\/user\/babu"}
See the WebHDFS REST API web page for complete documentation of the API.
Hue Installation
Hue is a suite of applications that provide web-based access to CDH components and a platform for building
custom applications.
The following figure illustrates how Hue works. Hue Server is a "container" web application that sits in between
your CDH installation and the browser. It hosts the Hue applications and communicates with various servers
that interface with CDH components.
The Hue Server uses a database to manage session, authentication, and Hue application data. For example, the
Job Designer application stores job designs in the database.
In a CDH cluster, the Hue Server runs on a special node. For optimal performance, this should be one of the
nodes within your cluster, though it can be a remote node as long as there are no overly restrictive firewalls.
For small clusters of less than 10 nodes, you can use your existing master node as the Hue Server. In a
pseudo-distributed installation, the Hue Server runs on the same machine as the rest of your CDH services.
Follow the instructions in the following sections to upgrade, install, configure, and administer Hue.
• Supported Browsers
• Upgrading Hue on page 324
• Installing Hue on page 325
• Configuring CDH Components for Hue on page 327
• Hue Configuration on page 331
• Administering Hue on page 342
• Hue User Guide
Supported Browsers
The Hue UI is supported on the following browsers:
• Windows: Chrome, Firefox 17+, Internet Explorer 9+, Safari 5+
• Linux: Chrome, Firefox 17+
Note:
To see which version of Hue is shipping in CDH 5, check the Version and Packaging Information. For
important information on new and changed components, see the CDH 5 Release Notes.
• On SLES systems:
use hue;
show create table auth_user;
3. Search for the "ENGINE=" line and confirm that its value matches the one for the "default-storage-engine"
above.
If the default engines do not match, Hue will display a warning on its start-up page
(http://$HUE_HOST:$HUE_PORT/about). Work with your database administrator to convert the current
Hue MySQL tables to the engine in use by MySQL, as noted by the "default-storage-engine" property.
Warning:
You must stop Hue. If Hue is running during the upgrade, the new version will not work correctly.
$ su -c 'rpm -Uvh
http://download.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm'
...
$ yum install python26
You must install the hue-common package on the machine where you will run the Hue Server. In addition, if you
will be using Hue with MRv1, you must install the hue-plugins package on the system where you are running
the JobTracker. (In pseudo-distributed mode, these will all be the same system.)
The hue meta-package installs the hue-common package and all the Hue applications; you also need to install
hue-server, which contains the Hue start and stop scripts.
Note: If you do not know which system your JobTracker is on, install the hue-plugins package on
every node in the cluster.
On RHEL systems:
• On the Hue Server machine, install the hue package:
• For MRv1: on the system that hosts the JobTracker, if different from the Hue server machine, install the
hue-plugins package:
On SLES systems:
• On the Hue Server machine, install the hue package:
• For MRv1: on the system that hosts the JobTracker, if different from the Hue server machine, install the
hue-plugins package:
On Ubuntu or Debian systems:
• On the Hue Server machine, install the hue package:
• For MRv1: on the system that hosts the JobTracker, if different from the Hue server machine, install the
hue-plugins package. (Example commands for all three operating systems follow this list.)
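For example, typical install commands (assuming the Cloudera package repositories are already configured) are:
On RHEL systems:
$ sudo yum install hue
$ sudo yum install hue-plugins
On SLES systems:
$ sudo zypper install hue
$ sudo zypper install hue-plugins
On Ubuntu or Debian systems:
$ sudo apt-get install hue
$ sudo apt-get install hue-plugins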
Important: For all operating systems, restart the Hue service once installation is complete. See
Starting and Stopping the Hue Server on page 342.
Hue Dependencies
The following table shows the components that are dependencies for the different Hue applications:
a. Add the following property in hdfs-site.xml to enable WebHDFS in the NameNode and DataNodes:
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
<property>
<name>hadoop.proxyuser.httpfs.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.httpfs.groups</name>
<value>*</value>
</property>
[hadoop]
[[hdfs_clusters]]
[[[default]]]
# Use WebHdfs/HttpFs as the communication mechanism.
WebHDFS:
...
webhdfs_url=http://FQDN:50070/webhdfs/v1/
HttpFS:
...
webhdfs_url=http://FQDN:14000/webhdfs/v1/
Note: If the webhdfs_url is uncommented and explicitly set to the empty value, Hue falls back
to using the Thrift plugin used in Hue 1.x. This is not recommended.
MRv1 Configuration
Hue communicates with the JobTracker via the Hue plugin, which is a .jar file that should be placed in your
MapReduce lib directory.
Important: The hue-plugins package installs the Hue plugins in your MapReduce lib directory,
/usr/lib/hadoop/lib. If you are not using the package-based installation procedure, perform the
following steps to install the Hue plugins.
If your JobTracker and Hue Server are located on the same host, copy the plugin .jar file into your MapReduce lib
directory. If you are currently using CDH 4, your MapReduce library directory might be in /usr/lib/hadoop/lib.
$ cd /usr/lib/hue
$ cp desktop/libs/hadoop/java-lib/hue-plugins-*.jar /usr/lib/hadoop-0.20-mapreduce/lib
If your JobTracker runs on a different host, scp the Hue plugins .jar file to the JobTracker host.
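For example, a sketch of the copy (the destination lib directory is assumed to match the cp example above):
$ scp /usr/lib/hue/desktop/libs/hadoop/java-lib/hue-plugins-*.jar \
<jobtracker_host>:/usr/lib/hadoop-0.20-mapreduce/lib/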
Add the following properties to mapred-site.xml:
<property>
<name>jobtracker.thrift.address</name>
<value>0.0.0.0:9290</value>
</property>
<property>
<name>mapred.jobtracker.plugins</name>
<value>org.apache.hadoop.thriftfs.ThriftJobTrackerPlugin</value>
<description>Comma-separated list of jobtracker plug-ins to be activated.</description>
</property>
You can confirm that the plugins are running correctly by tailing the daemon logs:
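For example, a sketch assuming the default MRv1 log location:
$ tail --lines=500 /var/log/hadoop-0.20-mapreduce/hadoop*jobtracker*.log | grep ThriftPlugin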
Note: If you enable ACLs in the JobTracker, you must add users to the JobTracker
mapred.queue.default.acl-administer-jobs property in order to allow Hue to display jobs in
the Job Browser application. For example, to give the hue user access to the JobTracker, you would
add the following property:
<property>
<name>mapred.queue.default.acl-administer-jobs</name>
<value>hue</value>
</property>
Repeat this for every user that requires access to the job details displayed by the JobTracker.
If you have any mapred queues besides "default", you must add a property for each queue:
<property>
<name>mapred.queue.default.acl-administer-jobs</name>
<value>hue</value>
</property>
<property>
<name>mapred.queue.queue1.acl-administer-jobs</name>
<value>hue</value>
</property>
<property>
<name>mapred.queue.queue2.acl-administer-jobs</name>
<value>hue</value>
</property>
Oozie Configuration
In order to run DistCp, Streaming, Pig, Sqoop, and Hive jobs in Job Designer or the Oozie Editor/Dashboard
application, you must make sure the Oozie shared libraries are installed for the correct version of MapReduce
(MRv1 or YARN). See Installing the Oozie ShareLib in Hadoop HDFS for instructions.
To configure Hue as a default proxy user, add the following properties to /etc/oozie/conf/oozie-site.xml:
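A minimal sketch of the proxy-user properties (these are the standard Oozie ProxyUserService settings; consider restricting the host and group lists in production):
<property>
<name>oozie.service.ProxyUserService.proxyuser.hue.hosts</name>
<value>*</value>
</property>
<property>
<name>oozie.service.ProxyUserService.proxyuser.hue.groups</name>
<value>*</value>
</property>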
Search Configuration
See Search Configuration on page 336 for details on how to configure the Search application for Hue.
HBase Configuration
See HBase Configuration on page 336 for details on how to configure the HBase Browser application.
Hive Configuration
The Beeswax daemon has been replaced by HiveServer2. Hue should therefore point to a running HiveServer2.
This change involved the following major updates to the [beeswax] section of the Hue configuration file, hue.ini.
[beeswax]
# Host where Hive server Thrift daemon is running.
# If Kerberos security is enabled, use fully-qualified domain name (FQDN).
## hive_server_host=<FQDN of HiveServer2>
Permissions
See File System Permissions in the Hive Installation section.
Other Hadoop Settings
HADOOP_CLASSPATH
If you are setting $HADOOP_CLASSPATH in your hadoop-env.sh, be sure to set it in such a way that user-specified
options are preserved. For example:
Correct:
HADOOP_CLASSPATH=<your_additions>:$HADOOP_CLASSPATH
Incorrect:
HADOOP_CLASSPATH=<your_additions>
This enables certain components of Hue to add to Hadoop's classpath using the environment variable.
hadoop.tmp.dir
If your users are likely to be submitting jobs both using Hue and from the same machine via the command line
interface, they will be doing so as the hue user when they are using Hue and via their own user account when
they are using the command line. This leads to some contention on the directory specified by hadoop.tmp.dir,
which defaults to /tmp/hadoop-${user.name}. Specifically, hadoop.tmp.dir is used to unpack JARs in
/usr/lib/hadoop. One workaround is to set hadoop.tmp.dir to
/tmp/hadoop-${user.name}-${hue.suffix} in the core-site.xml file:
<property>
<name>hadoop.tmp.dir</name>
<value>/tmp/hadoop-${user.name}-${hue.suffix}</value>
</property>
Unfortunately, when the hue.suffix variable is unset, you'll end up with directories named
/tmp/hadoop-user.name-${hue.suffix} in /tmp. Despite that, Hue will still work.
Hue Configuration
This section describes configuration you perform in the Hue configuration file hue.ini. The location of the Hue
configuration file varies depending on how Hue is installed. The location of the Hue configuration folder is
displayed when you view the Hue configuration.
Note: Only the root user can edit the Hue configuration file.
When you log in to Hue, the start-up page displays information about any misconfiguration detected.
To view the Hue configuration, do one of the following:
• Visit http://myserver:port and click the Configuration tab.
• Visit http://myserver:port/dump_config.
Hue Server Configuration
This section describes Hue Server settings.
secret_key=qpbdxoewsqlkhztybvfidtvwekftusgdlofbcfghaswuicmqp
Note: If you don't specify a secret key, your session cookies will not be secure. Hue will run but it
will also display error messages telling you to set the secret key.
Authentication
By default, the first user who logs in to Hue can choose any username and password and automatically becomes
an administrator. This user can create other user and administrator accounts. Hue users should correspond to
the Linux users who will use Hue; make sure you use the same name as the Linux username.
By default, user information is stored in the Hue database. However, the authentication system is pluggable.
You can configure authentication to use an LDAP directory (Active Directory or OpenLDAP) to perform the
authentication, or you can import users and groups from an LDAP directory. See Configuring an LDAP Server for
User Admin on page 337.
For more information, see the Hue SDK Documentation.
1. Configure Hue to use your private key by adding the following options to the Hue configuration file:
ssl_certificate=/path/to/certificate
ssl_private_key=/path/to/key
2. On a production system, you should have an appropriate key signed by a well-known Certificate Authority.
If you're just testing, you can create a self-signed key using the openssl command that may be installed on
your system:
# Create a key
$ openssl genrsa 1024 > host.key
# Create a self-signed certificate
$ openssl req -new -x509 -nodes -sha1 -key host.key > host.cert
Note: Uploading files using the Hue File Browser over HTTPS requires using a proper SSL Certificate.
Self-signed certificates don't work.
Note: All backends that delegate authentication to a third-party authentication server eventually
import users into the Hue database. While the metadata is stored in the database, user authentication
will still take place outside Hue.
Beeswax Configuration
In the [beeswax] section of the configuration file, you can optionally specify the following:
DB Query Configuration
The DB Query app can have any number of databases configured in the [[databases]] section under
[librdbms]. A database is known by its section name (sqlite, mysql, postgresql, and oracle as in the list
below). For details on supported databases and versions, see Supported Databases on page 21.
Note: Replace mysql with oracle or postgresql as required.

## name=/tmp/sqlite.db
# For MySQL and PostgreSQL, name is the name of the database.
# For Oracle, name is the instance of the Oracle server. For Express Edition,
# this is 'xe' by default.
## name=mysqldb
# Port numbers:
# 1. MySQL: 3306
# 2. PostgreSQL: 5432
# 3. Oracle Express Edition: 1521
## port=3306
Sqoop Configuration
In the [sqoop] section of the configuration file, you can optionally specify the following:
share_jobs Indicate that jobs should be shared with all users. If set to false, they will
be visible only to the owner and administrators.
Job Designer
In the [jobsub] section of the configuration file, you can optionally specify the following:
remote_data_dir Location in HDFS where the Job Designer examples and templates are
stored.
share_jobs Indicate that workflows, coordinators, and bundles should be shared with
all users. If set to false, they will be visible only to the owner and
administrators.
HBase Configuration
In the [hbase] section of the configuration file, you can optionally specify the following:
truncate_limit Hard limit of rows or columns per row fetched before truncating.
Default: 500
hbase_clusters Comma-separated list of HBase Thrift servers for clusters in the format
of "(name|host:port)".
Default: (Cluster|localhost:9090)
HBase Impersonation: To enable the HBase app to use impersonation, perform the following steps:
1. Ensure you have a secure HBase Thrift server.
2. Enable impersonation for the Thrift server. See Configure doAs Impersonation for the HBase Thrift Gateway.
3. Configure Hue to point to a valid HBase configuration directory. You will find this property under the [hbase]
section of the hue.ini file.
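For example, assuming the HBase client configuration is in /etc/hbase/conf:
[hbase]
hbase_conf_dir=/etc/hbase/conf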
default_user_group The name of the group to which a manually created user is automatically
assigned.
Default: default.
Note: If you import users from LDAP, you must set passwords for them manually; password
information is not imported.
1. In the Hue configuration file, configure the following properties in the [[ldap]] section:
base_dn The search base for finding users and groups. base_dn="DC=mycompany,DC=com"
Note: If you provide a TLS certificate, it must be signed by a Certificate Authority that is trusted by
the LDAP server.
Important: Be aware that when you enable the LDAP back end for user authentication, user
authentication by User Admin will be disabled. This means there will be no superuser accounts to log
into Hue unless you take one of the following actions:
• Import one or more superuser accounts from Active Directory and assign them superuser
permission.
• If you have already enabled the LDAP authentication back end, log into Hue using the LDAP back
end, which will create an LDAP user. Then disable the LDAP authentication back end and use User
Admin to give the superuser permission to the new LDAP user.
After assigning the superuser permission, enable the LDAP authentication back end.
1. In the Hue configuration file, configure the following properties in the [[ldap]] section:
2. If you are using TLS or secure ports, add the following property to specify the path to a TLS certificate file:
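For example (ldap_cert is the standard hue.ini property name; the path shown is only an illustration):
ldap_cert=/etc/hue/cacerts.pem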
Hadoop Configuration
The following configuration variables are under the [hadoop] section in the Hue configuration file.
webhdfs_url The HttpFS URL. The default value is the HTTP port on the NameNode.
submit_to If your Oozie is configured to use a YARN cluster, set this to true to indicate that Hue should
submit jobs to this YARN cluster.
The following MapReduce cluster properties are defined under the [[mapred_clusters]] sub-section:
jobtracker_host The fully-qualified domain name of the host running the JobTracker.
submit_to If your Oozie is configured to use a 0.20 MapReduce service, set this to true to indicate that
Hue should submit jobs to this MapReduce cluster.
# Enter the host on which you are running the failover JobTracker
# jobtracker_host=<localhost-ha>
[[[ha]]]
resourcemanager_host=<second_resource_manager_host_FQDN>
resourcemanager_api_url=http://<second_resource_manager_host_URL>
proxy_api_url=<second_resource_manager_proxy_URL>
history_server_api_url=<history_server_API_URL>
resourcemanager_port=<port_for_RM_IPC>
security_enabled=false
submit_to=true
logical_name=XXXX
Liboozie Configuration
In the [liboozie] section of the configuration file, you can optionally specify the following:
remote_deployment_dir The location in HDFS where the workflows and coordinators are deployed
when submitted by a non-owner.
Sentry Configuration
In the [libsentry] section of the configuration file, specify the following:
Default: /etc/sentry/conf
Hue will also automatically pick up the HiveServer2 server name from Hive's sentry-site.xml file at
/etc/hive/conf.
If you have enabled Kerberos for the Sentry service, allow Hue to connect to the service by adding the hue user
to the following property in the /etc/sentry/conf/sentry-store-site.xml file.
<property>
<name>sentry.service.allow.connect</name>
<value>impala,hive,solr,hue</value>
</property>
ZooKeeper Configuration
In the [zookeeper] section of the configuration file, you can specify the following:
rest_url The URL of the REST Contrib service (required for znode browsing).
Default: http://localhost:9998
init:
[mkdir] Created dir: /home/hue/Development/zookeeper/build/classes
[mkdir] Created dir: /home/hue/Development/zookeeper/build/lib
[mkdir] Created dir: /home/hue/Development/zookeeper/build/package/lib
[mkdir] Created dir: /home/hue/Development/zookeeper/build/test/lib
…
cd src/contrib/rest
nohup ant run&
[zookeeper]
...
[[clusters]]
...
[[[default]]]
# Zookeeper ensemble. Comma separated list of Host/Port.
# e.g. localhost:2181,localhost:2182,localhost:2183
## host_ports=localhost:2181
You should now be able to successfully run the ZooKeeper Browser app.
Administering Hue
The following sections contain details about managing and operating a Hue installation:
• Starting and Stopping the Hue Server on page 342
• Configuring Your Firewall for Hue on page 342
• Anonymous Usage Data Collection on page 342
• Managing Hue Processes on page 343
• Viewing Hue Logs on page 343
Hue Superusers and Users
Hue's User Admin application provides two levels of user privileges: superusers and users.
• Superusers — The first user who logs into Hue after its installation becomes the first superuser. Superusers
have permissions to perform administrative functions such as:
– Add and delete users
– Add and delete groups
– Assign permissions to groups
– Change a user into a superuser
– Import users and groups from an LDAP server
• Users — can change their name, email address and password. They can log in to Hue and run Hue applications,
subject to the permissions provided to the Hue groups to which they belong.
[desktop]
...
# Help improve Hue with anonymous usage analytics.
# Use Google Analytics to see how many times an application or specific section of an application has been used, nothing more.
If you are using an earlier version of Hue, disable this data collection by navigating to Step 3 of Hue's Quick Start
Wizard. Under Anonymous usage analytics, uncheck the Check to enable usage analytics checkbox.
Managing Hue Processes
A script called supervisor manages all Hue processes. The supervisor is a watchdog process; its only purpose
is to spawn and monitor other processes. A standard Hue installation starts and monitors the runcpserver
process.
• runcpserver – a web server that provides the core web functionality of Hue
If you have installed other applications into your Hue instance, you may see other daemons running under the
supervisor as well.
You can see the supervised processes running in the output of ps -f -u hue.
Note that the supervisor automatically restarts these processes if they fail for any reason. If the processes fail
repeatedly within a short time, the supervisor itself shuts down.
Viewing Hue Logs
You can view the Hue logs in the /var/log/hue directory, where you can find:
• An access.log file, which contains a log for all requests against the Hue Server.
• A supervisor.log file, which contains log information for the supervisor process.
• A supervisor.out file, which contains the stdout and stderr for the supervisor process.
• A .log file for each supervised process described above, which contains the logs for that process.
• A .out file for each supervised process described above, which contains the stdout and stderr for that process.
If users on your cluster have problems running Hue, you can often find error messages in these log files.
# sqlite3 /var/lib/hue/desktop.db
SQLite version 3.6.22
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> select username from auth_user;
admin
test
sample
sqlite>
Important: It is strongly recommended that you avoid making any modifications to the database
directly using sqlite3, though sqlite3 is useful for management or troubleshooting.
Note: Cloudera recommends you use InnoDB, not MyISAM, as your MySQL engine.
3. Open <some-temporary-file>.json and remove all JSON objects with useradmin.userprofile in the
model field. Here are some examples of JSON objects that should be deleted.
{
"pk": 1,
"model": "useradmin.userprofile",
"fields": {
"creation_method": "HUE",
"user": 1,
"home_directory": "/user/alice"
}
},
{
"pk": 2,
"model": "useradmin.userprofile",
"fields": {
"creation_method": "HUE",
"user": 1100714,
"home_directory": "/user/bob"
}
},
.....
OS Command
RHEL $ sudo yum install mysql-devel
OS Command
RHEL $ sudo yum install mysql-connector-java
OS Command
RHEL $ sudo yum install mysql-server
[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
bind-address=<ip-address>
default-storage-engine=InnoDB
sql_mode=STRICT_ALL_TABLES
OS Command
RHEL $ sudo service mysqld start
10. Configure MySQL to use a strong password. In the following procedure, your current root password is blank.
Press the Enter key when you're prompted for the root password.
$ sudo /usr/bin/mysql_secure_installation
[...]
Enter current password for root (enter for none):
OK, successfully used password, moving on...
[...]
OS Command
RHEL $ sudo /sbin/chkconfig mysqld on
$ sudo /sbin/chkconfig --list mysqld
mysqld 0:off 1:off 2:on 3:on 4:on 5:on 6:off
12. Create the Hue database and grant privileges to a hue user to manage the database.
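For example, in the mysql shell (substitute your own password):
mysql> create database hue;
mysql> grant all on hue.* to 'hue'@'localhost' identified by '<secretpassword>';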
host=localhost
port=3306
engine=mysql
user=hue
password=<secretpassword>
name=hue
15. As the hue user, load the existing data and create the necessary database tables using syncdb and migrate
commands. When running these commands, Hue will try to access a logs directory, located at
/opt/cloudera/parcels/CDH/lib/hue/logs, which might be missing. If that is the case, first create the
logs directory and give the hue user and group ownership of the directory.
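A minimal sketch of these commands, assuming a parcel-based installation (for package installs the hue binary is typically under /usr/lib/hue instead):
$ sudo mkdir -p /opt/cloudera/parcels/CDH/lib/hue/logs
$ sudo chown hue:hue /opt/cloudera/parcels/CDH/lib/hue/logs
$ sudo -u hue /opt/cloudera/parcels/CDH/lib/hue/build/env/bin/hue syncdb --noinput
$ sudo -u hue /opt/cloudera/parcels/CDH/lib/hue/build/env/bin/hue migrate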
3. Open <some-temporary-file>.json and remove all JSON objects with useradmin.userprofile in the
model field. Here are some examples of JSON objects that should be deleted.
{
"pk": 1,
"model": "useradmin.userprofile",
"fields": {
"creation_method": "HUE",
"user": 1,
"home_directory": "/user/alice"
}
},
{
"pk": 2,
"model": "useradmin.userprofile",
"fields": {
"creation_method": "HUE",
"user": 1100714,
"home_directory": "/user/bob"
}
},
.....
OS Command
RHEL $ sudo yum install postgresql-devel gcc python-devel
OS Command
RHEL $ sudo yum install postgresql-server
OS Command
SLES $ sudo zypper install postgresql-server
$ su - postgres
# /usr/bin/postgres -D /var/lib/pgsql/data > logfile 2>&1 &
11. Create the hue database and grant privileges to a hue user to manage the database.
# psql -U postgres
postgres=# create database hue;
postgres=# \c hue;
You are now connected to database 'hue'.
postgres=# create user hue with password '<secretpassword>';
postgres=# grant all privileges on database hue to hue;
postgres=# \q
OS Command
RHEL $ sudo /sbin/chkconfig postgresql on
$ sudo /sbin/chkconfig --list postgresql
postgresql 0:off 1:off 2:on 3:on 4:on 5:on 6:off
host=localhost
port=5432
engine=postgresql_psycopg2
user=hue
password=<secretpassword>
name=hue
17. As the hue user, configure Hue to load the existing data and create the necessary database tables. You will
need to run both the migrate and syncdb commands. When running these commands, Hue will try to access
a logs directory, located at /opt/cloudera/parcels/CDH/lib/hue/logs, which might be missing. If that
is the case, first create the logs directory and give the hue user and group ownership of the directory.
bash# su - postgres
$ psql -h localhost -U hue -d hue
postgres=# \d auth_permission;
19. Drop the foreign key that you retrieved in the previous step. Then, after the data has been loaded, add the foreign key back with a command such as the following:
bash# su - postgres
$ psql -h localhost -U hue -d hue
postgres=# ALTER TABLE auth_permission ADD CONSTRAINT
content_type_id_refs_id_<XXXXXX> FOREIGN KEY (content_type_id) REFERENCES
django_content_type(id) DEFERRABLE INITIALLY DEFERRED;
Important: Configure the database for character set AL32UTF8 and national character set UTF8.
1. Ensure Python 2.6 or newer is installed on the server Hue is running on.
2. Download the Oracle client libraries at Instant Client for Linux x86-64 Version 11.1.0.7.0, Basic and SDK (with
headers) zip files to the same directory.
3. Unzip the zip files.
$ cd $ORACLE_HOME
$ ln -sf libclntsh.so.11.1 libclntsh.so
8. Edit the Hue configuration file hue.ini. Directly below the [[database]] section under the [desktop] line,
add the following options (and modify accordingly for your setup):
host=localhost
port=1521
engine=oracle
user=hue
password=<secretpassword>
name=<SID of the Oracle database, for example, 'XE'>
To use the Oracle service name instead of the SID, use the following configuration instead:
port=0
engine=oracle
user=hue
password=password
name=oracle.example.com:1521/orcl.example.com
The directive port=0 allows Hue to use a service name. The name string is the connect string, including
hostname, port, and service name.
To add support for a multithreaded environment, set the threaded option to true under the
[desktop]>[[database]] section.
options={'threaded':true}
10. As the hue user, configure Hue to load the existing data and create the necessary database tables. You will
need to run both the syncdb and migrate commands. When running these commands, Hue will try to access a
logs directory, located at /opt/cloudera/parcels/CDH/lib/hue/logs, which might be missing. If that is the
case, first create the logs directory and give the hue user and group ownership of the directory.
KMS Installation
Hadoop Key Management Service (KMS) is a cryptographic key management server based on Hadoop's KeyProvider
API. It provides a client which is a KeyProvider implementation that interacts with the KMS using the HTTP REST
API. Both the KMS and its client support HTTP SPNEGO Kerberos authentication and SSL-secured communication.
The KMS is a Java-based web application which runs using a pre-configured Tomcat server bundled with the
Hadoop distribution.
Installing and Upgrading KMS
Problem
The problem occurs when you try to upgrade the hadoop-kms package, for example:
/var/cache/zypp/packages/cdh/RPMS/x86_64/hadoop-kms-2.5.0+cdh5.3.2+801-1.cdh5.3.2.p0.224.sles11.x86_64.rpm:
Header V4 DSA signature: NOKEY, key ID e8f86acd
12:54:19 error: %postun(hadoop-kms-2.5.0+cdh5.3.1+791-1.cdh5.3.1.p0.17.sles11.x86_64)
scriptlet failed, exit status 1
12:54:19
Note:
• The hadoop-kms package is not installed automatically with CDH, so you will encounter this error
only if you are explicitly upgrading an existing version of KMS.
• The examples in this section show an upgrade from CDH 5.3.x; the 5.2.x case looks very similar.
What to Do
If you see an error similar to the one in the example above, proceed as follows:
1. Abort, or ignore the error (it doesn't matter which):
2. Perform cleanup.
a. # rpm -qa hadoop-kms
You will see two versions of hadoop-kms; for example:
hadoop-kms-2.5.0+cdh5.3.1+791-1.cdh5.3.1.p0.17.sles11
hadoop-kms-2.5.0+cdh5.3.2+801-1.cdh5.3.2.p0.224.sles11
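For example, to remove the older package without running its failing scriptlet (version string taken from the listing above):
# rpm -e --noscripts hadoop-kms-2.5.0+cdh5.3.1+791-1.cdh5.3.1.p0.17.sles11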
3. Verify that the older version of the package has been removed:
hadoop-kms-2.5.0+cdh5.3.2+801-1.cdh5.3.2.p0.224.sles11
Mahout Installation
Apache Mahout is a machine-learning tool. By enabling you to build machine-learning libraries that are scalable
to "reasonably large" datasets, it aims to make building intelligent applications easier and faster.
Note:
To see which version of Mahout is shipping in CDH 5, check the Version and Packaging Information.
For important information on new and changed components, see the CDH 5 Release Notes.
Important:
If you have not already done so, install Cloudera's yum, zypper/YaST or apt repository before using
the instructions below to install Mahout. For instructions, see Installing the Latest CDH 5 Release on
page 166.
Note:
To see which version of Mahout is shipping in CDH 5, check the Version and Packaging Information.
For important information on new and changed components, see the CDH 5 Release Notes.
Upgrading Mahout from an Earlier CDH 5 Release to the Latest CDH 5 Release
To upgrade Mahout to the latest release, simply install the new version; see Installing Mahout on page 354.
Installing Mahout
You can install Mahout from an RPM or Debian package, or from a tarball.
Note:
To see which version of Mahout is shipping in CDH 5, check the Version and Packaging Information.
For important information on new and changed components, see the CDH 5 Release Notes.
Installing from packages is more convenient than installing the tarball because the packages:
• Handle dependencies
• Provide for easy upgrades
• Automatically install resources to conventional locations
These instructions assume that you will install from packages if possible.
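For example, a typical package install on each platform (assuming the Cloudera repository is configured):
RHEL-compatible: $ sudo yum install mahout
SLES: $ sudo zypper install mahout
Ubuntu or Debian: $ sudo apt-get install mahout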
Oozie Installation
About Oozie
Apache Oozie Workflow Scheduler for Hadoop is a workflow and coordination service for managing Apache
Hadoop jobs:
• Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions; actions are typically Hadoop jobs
(MapReduce, Streaming, Pipes, Pig, Hive, Sqoop, etc).
• Oozie Coordinator jobs trigger recurrent Workflow jobs based on time (frequency) and data availability.
• Oozie Bundle jobs are sets of Coordinator jobs managed as a single job.
Oozie is an extensible, scalable and data-aware service that you can use to orchestrate dependencies among
jobs running on Hadoop.
• To find out more about Oozie, see http://archive.cloudera.com/cdh5/cdh/5/oozie/.
• To install or upgrade Oozie, follow the directions on this page.
Oozie Packaging
There are two packaging options for installing Oozie:
• Separate RPM packages for the Oozie server (oozie) and client (oozie-client)
• Separate Debian packages for the Oozie server (oozie) and client (oozie-client)
You can also download an Oozie tarball.
Oozie Prerequisites
• Prerequisites for installing Oozie server:
– An operating system supported by CDH 5
– Oracle JDK
– A supported database if you are not planning to use the default (Derby).
• Prerequisites for installing Oozie client:
– Oracle JDK
Note:
• To see which version of Oozie is shipping in CDH 5, check the Version and Packaging Information.
For important information on new and changed components, see the CDH 5 Release Notes.
Upgrading Oozie
Follow these instructions to upgrade Oozie to CDH 5 from RPM or Debian Packages.
Upgrading Oozie from CDH 4 to CDH 5
To upgrade Oozie from CDH 4 to CDH 5, back up the configuration files and database, uninstall the CDH 4 version
and then install and configure the CDH 5 version. Proceed as follows.
Note:
If you have already performed the steps to uninstall CDH 4 and all components, as described under
Upgrading from CDH 4 to CDH 5, you can skip Step 1 below and proceed with installing the new CDH
5 version of Oozie.
3. Uninstall Oozie.
To uninstall Oozie, run the appropriate command on each host:
• On RHEL-compatible systems:
• On SLES systems:
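For example, the commands are typically:
$ sudo yum remove oozie oozie-client (RHEL-compatible)
$ sudo zypper remove oozie oozie-client (SLES)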
Installing Oozie
Oozie is distributed as two separate packages; a client package (oozie-client) and a server package (oozie).
Depending on what you are planning to install, choose the appropriate packages and install them using your
preferred package manager application.
Note:
The Oozie server package, oozie, is preconfigured to work with MRv2 (YARN). To configure the Oozie
server to work with MRv1, see Configuring the Hadoop Version to Use.
To install the Oozie server package on Ubuntu and other Debian systems:
To install the Oozie client package on Ubuntu and other Debian systems:
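For example (assuming the Cloudera apt repository is configured):
$ sudo apt-get install oozie
$ sudo apt-get install oozie-client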
Note:
Installing the oozie package creates an oozie service configured to start Oozie at system startup
time.
You are now ready to configure Oozie. See the next section.
Configuring Oozie
This section explains how to configure which Hadoop version to use, and provides separate procedures for each
of the following:
• Configuring Oozie after Upgrading from CDH 4 on page 360
• Configuring Oozie after Upgrading from an Earlier CDH 5 Release on page 362
• Configuring Oozie after a New Installation on page 365.
Configuring which Hadoop Version to Use
The Oozie client does not interact directly with Hadoop MapReduce, and so it does not require any MapReduce
configuration.
The Oozie server can work with either MRv1 or YARN. It cannot work with both simultaneously.
You set the MapReduce version the Oozie server works with by means of the alternatives command (or
update-alternatives, depending on your operating system). As well as distinguishing between YARN and
MRv1, the commands differ depending on whether or not you are using SSL.
• To use YARN (without SSL):
Important: If you are upgrading from a release earlier than CDH 5 Beta 2
In earlier releases, the mechanism for setting the MapReduce version was the CATALINA_BASE variable
in /etc/oozie/conf/oozie-env.sh. This does not work as of CDH 5 Beta 2, and in fact could cause
problems. Check your /etc/oozie/conf/oozie-env.sh and make sure you have the new version.
The new version contains the line:
export CATALINA_BASE=/var/lib/oozie/tomcat-deployment
Note: If you are installing Oozie for the first time, skip this section and proceed with Configuring Oozie
after a New Installation.
Important: Do not copy over the CDH 4 configuration files into the CDH 5 configuration directory.
2. If necessary, do the same for the oozie-log4j.properties, oozie-env.sh, and adminusers.txt files.
Important:
• Do not proceed before you have edited the configuration files as instructed in Step 1.
• Before running the database upgrade tool, copy or symbolically link the JDBC driver JAR for the
database you are using into the /var/lib/oozie/ directory.
Oozie CDH 5 provides a command-line tool to perform the database schema and data upgrade that is required
when you upgrade Oozie from CDH 4 to CDH 5. The tool uses Oozie configuration files to connect to the database
and perform the upgrade.
The database upgrade tool works in two modes: it can do the upgrade in the database or it can produce an SQL
script that a database administrator can run manually. If you use the tool to perform the upgrade, you must do
it as a database user who has permissions to run DDL operations in the Oozie database.
• To run the Oozie database upgrade tool against the database:
Important: This step must be done as the oozie Unix user, otherwise Oozie may fail to start or
work properly because of incorrect file permissions.
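A minimal sketch of running the tool in-place (the ooziedb.sh path assumes a package install; adjust it for a parcel install):
$ sudo -u oozie /usr/lib/oozie/bin/ooziedb.sh upgrade -run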
You will see output such as this (the output of the script may differ slightly depending on the database
vendor):
Validate DB Connection
DONE
Check DB schema exists
DONE
Verify there are not active Workflow Jobs
DONE
Check OOZIE_SYS table does not exist
DONE
Get Oozie DB version
DONE
Upgrade SQL schema
DONE
Upgrading to db schema for Oozie 4.0
Update db.version in OOZIE_SYS table to 2
DONE
Post-upgrade COORD_JOBS new columns default values
DONE
Post-upgrade COORD_JOBS & COORD_ACTIONS status values
DONE
Post-upgrade MISSING_DEPENDENCIES column in Derby
DONE
Table 'WF_ACTIONS' column 'execution_path', length changed to 1024
Table 'WF_ACTIONS, column 'error_message', changed to varchar/varchar2
Table 'COORD_JOB' column 'frequency' changed to varchar/varchar2
DONE
Post-upgrade BUNDLE_JOBS, COORD_JOBS, WF_JOBS to drop AUTH_TOKEN column
DONE
Upgrading to db schema for Oozie 4.0.0-cdh5.0.0
Update db.version in OOZIE_SYS table to 3
DONE
Dropping discriminator column
DONE
Important: This step must be done as the oozie Unix user, otherwise Oozie may fail to start or
work properly because of incorrect file permissions.
For example:
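A sketch, assuming a package install; the SQL file name is arbitrary:
$ sudo -u oozie /usr/lib/oozie/bin/ooziedb.sh upgrade -sqlfile oozie-upgrade.sql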
You should see output such as the following (the output of the script may differ slightly depending on the
database vendor):
Validate DB Connection
DONE
Check DB schema exists
DONE
Verify there are not active Workflow Jobs
DONE
Check OOZIE_SYS table does not exist
DONE
Get Oozie DB version
DONE
Upgrade SQL schema
DONE
Upgrading to db schema for Oozie 4.0
Update db.version in OOZIE_SYS table to 2
DONE
Post-upgrade COORD_JOBS new columns default values
DONE
Post-upgrade COORD_JOBS & COORD_ACTIONS status values
DONE
Post-upgrade MISSING_DEPENDENCIES column in Derby
DONE
Table 'WF_ACTIONS' column 'execution_path', length changed to 1024
Table 'WF_ACTIONS, column 'error_message', changed to varchar/varchar2
Table 'COORD_JOB' column 'frequency' changed to varchar/varchar2
DONE
Post-upgrade BUNDLE_JOBS, COORD_JOBS, WF_JOBS to drop AUTH_TOKEN column
DONE
Upgrading to db schema for Oozie 4.0.0-cdh5.0.0
Update db.version in OOZIE_SYS table to 3
DONE
Dropping discriminator column
DONE
WARN: The SQL commands have NOT been executed, you must use the '-run' option
Important: If you used the -sqlfile option instead of -run, the Oozie database schema has not been
upgraded. You need to run the oozie-upgrade script against your database.
Important: This step is required; CDH 5 Oozie does not work with CDH 4 shared libraries.
CDH 5 Oozie has a new shared library which bundles CDH 5 JAR files for streaming, DistCp and for Pig, Hive,
HiveServer 2, Sqoop, and HCatalog.
Note:
The Oozie installation bundles two shared libraries, one for MRv1 and one for YARN. Make sure you
install the right one for the MapReduce version you are using:
• The shared library file for YARN is oozie-sharelib-yarn.
• The shared library file for MRv1 is oozie-sharelib-mr1.
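A minimal sketch of installing the YARN shared library (the oozie-setup sharelib syntax and local library path assume a standard CDH 5 package install):
$ sudo oozie-setup sharelib create -fs FS_URI -locallib /usr/lib/oozie/oozie-sharelib-yarn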
where FS_URI is the HDFS URI of the filesystem that the shared library should be installed on (for example,
hdfs://HOST:PORT).
Important: If you are installing Oozie to work with MRv1, make sure you use oozie-sharelib-mr1
instead.
Note: If you are installing Oozie for the first time, skip this section and proceed with Configuring Oozie
after a New Installation on page 365.
2. If necessary, do the same for the oozie-log4j.properties, oozie-env.sh, and adminusers.txt files.
Important:
• Do not proceed before you have edited the configuration files as instructed in Step 1.
• Before running the database upgrade tool, copy or symbolically link the JDBC driver JAR for the
database you are using into the /var/lib/oozie/ directory.
Oozie CDH 5 provides a command-line tool to perform the database schema and data upgrade. The tool uses
Oozie configuration files to connect to the database and perform the upgrade.
The database upgrade tool works in two modes: it can do the upgrade in the database or it can produce an SQL
script that a database administrator can run manually. If you use the tool to perform the upgrade, you must do
it as a database user who has permissions to run DDL operations in the Oozie database.
• To run the Oozie database upgrade tool against the database:
Important: This step must be done as the oozie Unix user, otherwise Oozie may fail to start or
work properly because of incorrect file permissions.
You will see output such as this (the output of the script may differ slightly depending on the database
vendor):
Validate DB Connection
DONE
Check DB schema exists
DONE
Verify there are not active Workflow Jobs
DONE
Check OOZIE_SYS table does not exist
DONE
Get Oozie DB version
DONE
Upgrade SQL schema
DONE
Upgrading to db schema for Oozie 4.0.0-cdh5.0.0
Update db.version in OOZIE_SYS table to 3
DONE
Converting text columns to bytea for all tables
DONE
Get Oozie DB version
DONE
Important: This step must be done as the oozie Unix user, otherwise Oozie may fail to start or
work properly because of incorrect file permissions.
For example:
You should see output such as the following (the output of the script may differ slightly depending on the
database vendor):
Validate DB Connection
DONE
Check DB schema exists
DONE
Verify there are not active Workflow Jobs
DONE
Check OOZIE_SYS table does not exist
DONE
Get Oozie DB version
DONE
Upgrade SQL schema
DONE
Upgrading to db schema for Oozie 4.0.0-cdh5.0.0
Update db.version in OOZIE_SYS table to 3
DONE
Converting text columns to bytea for all tables
DONE
Get Oozie DB version
DONE
WARN: The SQL commands have NOT been executed, you must use the '-run' option
Important: If you used the -sqlfile option instead of -run, the Oozie database schema has not been
upgraded. You need to run the oozie-upgrade script against your database.
Important: This step is required; the current version of Oozie does not work with shared libraries
from an earlier version.
The Oozie installation bundles two shared libraries, one for MRv1 and one for YARN. Make sure you install the
right one for the MapReduce version you are using:
• The shared library file for YARN is oozie-sharelib-yarn.
• The shared library file for MRv1 is oozie-sharelib-mr1.
To upgrade the shared library, proceed as follows.
1. Delete the Oozie shared libraries from HDFS. For example:
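A sketch, assuming the shared libraries are in the default location under the oozie user's HDFS home directory:
$ sudo -u hdfs hadoop fs -rm -r /user/oozie/share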
Note:
• If Kerberos is enabled, do not use commands in the form sudo -u <user> <command>; they
will fail with a security error. Instead, use the following commands: $ kinit <user> (if you
are using a password) or $ kinit -kt <keytab> <principal> (if you are using a keytab)
and then, for each command executed by this user, $ <command>
• If the current shared libraries are in another location, make sure you use this other location
when you run the above command(s).
where FS_URI is the HDFS URI of the filesystem that the shared library should be installed on (for example,
hdfs://<HOST>:<PORT>).
Important:
If you are installing Oozie to work with MRv1, make sure you use oozie-sharelib-mr1 instead.
Note: Follow the instructions in this section if you are installing Oozie for the first time. If you are
upgrading Oozie from CDH 4 or from an earlier CDH 5 release, skip this subsection and choose the
appropriate instructions earlier in this section: Configuring Oozie after Upgrading from CDH 4 on page
360 or Configuring Oozie after Upgrading from an Earlier CDH 5 Release on page 362.
When you install Oozie from an RPM or Debian package, Oozie server creates all configuration, documentation,
and runtime files in the standard Linux directories, as follows.
configuration /etc/oozie/conf/
data /var/lib/oozie/
logs /var/log/oozie/
temp /var/tmp/oozie/
$ psql -U postgres
Password for user postgres: *****
postgres=# \q
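A minimal sketch of creating the Oozie role and database in the psql session above, assuming both are named oozie (choose a stronger password in practice):
postgres=# CREATE ROLE oozie LOGIN ENCRYPTED PASSWORD 'oozie' NOSUPERUSER INHERIT CREATEDB NOCREATEROLE;
postgres=# CREATE DATABASE "oozie" WITH OWNER = oozie;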
...
<property>
<name>oozie.service.JPAService.jdbc.driver</name>
<value>org.postgresql.Driver</value>
</property>
<property>
<name>oozie.service.JPAService.jdbc.url</name>
<value>jdbc:postgresql://localhost:5432/oozie</value>
</property>
<property>
<name>oozie.service.JPAService.jdbc.username</name>
<value>oozie</value>
</property>
<property>
<name>oozie.service.JPAService.jdbc.password</name>
<value>oozie</value>
</property>
...
Note: In the JDBC URL property, replace localhost with the hostname where PostgreSQL is running.
In the case of PostgreSQL, unlike MySQL or Oracle, there is no need to download and install the JDBC
driver separately, as it is license-compatible with Oozie and bundled with it.
$ mysql -u root -p
Enter password: ******
mysql> exit
Bye
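A minimal sketch of the usual commands in the mysql shell (database, user, and password are all assumed to be oozie; choose a stronger password in practice):
mysql> create database oozie;
mysql> grant all privileges on oozie.* to 'oozie'@'localhost' identified by 'oozie';
mysql> grant all privileges on oozie.* to 'oozie'@'%' identified by 'oozie';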
...
<property>
<name>oozie.service.JPAService.jdbc.driver</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>oozie.service.JPAService.jdbc.url</name>
<value>jdbc:mysql://localhost:3306/oozie</value>
</property>
<property>
<name>oozie.service.JPAService.jdbc.username</name>
<value>oozie</value>
</property>
<property>
<name>oozie.service.JPAService.jdbc.password</name>
<value>oozie</value>
</property>
...
Note: In the JDBC URL property, replace localhost with the hostname where MySQL is running.
Note: You must manually download the MySQL JDBC driver JAR file.
$ sqlplus system@localhost
SQL> create user oozie identified by oozie default tablespace users temporary tablespace
temp;
User created.
SQL> exit
Important:
Do not make the following grant:
...
<property>
<name>oozie.service.JPAService.jdbc.driver</name>
<value>oracle.jdbc.OracleDriver</value>
</property>
<property>
<name>oozie.service.JPAService.jdbc.url</name>
<value>jdbc:oracle:thin:@//myhost:1521/oozie</value>
</property>
<property>
<name>oozie.service.JPAService.jdbc.username</name>
<value>oozie</value>
</property>
<property>
<name>oozie.service.JPAService.jdbc.password</name>
<value>oozie</value>
</property>
...
Note: In the JDBC URL property, replace myhost with the hostname where Oracle is running and
replace oozie with the TNS name of the Oracle database.
Note: You must manually download the Oracle JDBC driver JAR file.
Note: The Oozie database tool uses Oozie configuration files to connect to the database to perform
the schema creation; before you use the tool, make you have created a database and configured Oozie
to work with it as described above.
The Oozie database tool works in two modes: it can create the database, or it can produce an SQL script that a
database administrator can run to create the database manually. If you use the tool to create the database
schema, you must have the permissions needed to execute DDL operations.
Important: This step must be done as the oozie Unix user, otherwise Oozie may fail to start or work
properly because of incorrect file permissions.
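A minimal sketch (package-install path assumed; use -sqlfile <file> instead of -run to generate a script for a DBA to run manually):
$ sudo -u oozie /usr/lib/oozie/bin/ooziedb.sh create -run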
You should see output such as the following (the output of the script may differ slightly depending on the
database vendor):
Validate DB Connection.
DONE
Check DB schema does not exist
DONE
Check OOZIE_SYS table does not exist
DONE
Create SQL schema
DONE
DONE
Create OOZIE_SYS table
DONE
Oozie DB has been created for Oozie version '4.0.0-cdh5.0.0'
Important: This step must be done as the oozie Unix user, otherwise Oozie may fail to start or work
properly because of incorrect file permissions.
You should see output such as the following (the output of the script may differ slightly depending on the
database vendor):
Validate DB Connection.
DONE
Check DB schema does not exist
DONE
Check OOZIE_SYS table does not exist
DONE
Create SQL schema
DONE
DONE
Create OOZIE_SYS table
DONE
WARN: The SQL commands have NOT been executed, you must use the '-run' option
Important: If you used the -sqlfile option instead of -run, the Oozie database schema has not been
created. You must run the oozie-create.sql script against your database.
Important: If Hadoop is configured with Kerberos security enabled, you must first configure Oozie
with Kerberos Authentication. For instructions, see Oozie Security Configuration. Before running the
commands in the following instructions, you must run the sudo -u oozie kinit -k -t
/etc/oozie/oozie.keytab and kinit -k hdfs commands. Then, instead of using commands in
the form sudo -u user command, use just command; for example, $ hadoop fs -mkdir
/user/oozie
To install the Oozie shared library in Hadoop HDFS in the oozie user home directory
where FS_URI is the HDFS URI of the filesystem that the shared library should be installed on (for example,
hdfs://<HOST>:<PORT>).
Important: If you are installing Oozie to work with MRv1 use oozie-sharelib-mr1 instead.
...
<property>
<name>oozie.action.mapreduce.uber.jar.enable</name>
<value>true</value>
</property>
...
When this property is set, users can use the oozie.mapreduce.uber.jar configuration property in their
MapReduce workflows to notify Oozie that the specified JAR file is an uber JAR.
Configuring Oozie to Run against a Federated Cluster
To run Oozie against a federated HDFS cluster using ViewFS, configure the
oozie.service.HadoopAccessorService.supported.filesystems property in oozie-site.xml as
follows:
<property>
<name>oozie.service.HadoopAccessorService.supported.filesystems</name>
<value>hdfs,viewfs</value>
</property>
If you see the message Oozie System ID [oozie-oozie] started in the oozie.log log file, the system has
started successfully.
Note:
By default, Oozie server runs on port 11000 and its URL is http://<OOZIE_HOSTNAME>:11000/oozie.
To make it convenient to use this utility, set the environment variable OOZIE_URL to point to the URL of the
Oozie server. Then you can skip the -oozie option.
For example, if you want to invoke the client on the same machine where the Oozie server is running, set the
OOZIE_URL to http://localhost:11000/oozie.
$ export OOZIE_URL=http://localhost:11000/oozie
$ oozie admin -version
Oozie server build version: 4.0.0-cdh5.0.0
Important:
If Oozie is configured with Kerberos Security enabled:
• You must have a Kerberos session running. For example, you can start a session by running the
kinit command.
$ export OOZIE_URL=http://myoozieserver.mydomain.com:11000/oozie
If you use an alternate hostname or the IP address of the service, Oozie will not work properly.
Note:
If the Oozie server is configured to use Kerberos HTTP SPNEGO Authentication, you must use a web
browser that supports Kerberos HTTP SPNEGO (for example, Firefox or Internet Explorer).
Note:
The functionality described below is supported in CDH 5, but Cloudera recommends that you use the
new capabilities introduced in CDH 5 instead.
1. Set up your database for High Availability (see the database documentation for details).
Note:
Oozie database configuration properties may need special configuration (see the JDBC driver
documentation for details).
Pig Installation
Apache Pig enables you to analyze large amounts of data using Pig's query language called Pig Latin. Pig Latin
queries run in a distributed way on a Hadoop cluster.
Use the following sections to install or upgrade Pig:
• Upgrading Pig
• Installing Pig
• Using Pig with HBase
• Installing DataFu
• Apache Pig Documentation
Upgrading Pig
Note:
To see which version of Pig is shipping in CDH 5, check the Version and Packaging Information. For
important information on new and changed components, see the Release Notes.
Note:
If you have already performed the steps to uninstall CDH 4 and all components, as described under
Upgrading from CDH 4 to CDH 5, you can skip Step 1 below and proceed with installing the new CDH
5 version of Pig.
Installing Pig
Note:
Pig automatically uses the active Hadoop configuration (whether standalone, pseudo-distributed
mode, or distributed). After installing the Pig package, you can start Pig.
Important:
• For each user who will be submitting MapReduce jobs using MapReduce v2 (YARN), or running
Pig, Hive, or Sqoop in a YARN installation, make sure that the HADOOP_MAPRED_HOME environment
variable is set correctly, as follows:
$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
• For each user who will be submitting MapReduce jobs using MapReduce v1 (MRv1), or running
Pig, Hive, or Sqoop in an MRv1 installation, set the HADOOP_MAPRED_HOME environment variable
as follows:
$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce
$ pig
$ pig
Examples
To verify that the input and output directories from the YARN or MRv1 example grep job exist, list an HDFS
directory from the Grunt Shell:
grunt> ls
hdfs://localhost/user/joe/input <dir>
hdfs://localhost/user/joe/output <dir>
Note:
To check the status of your job while it is running, look at the ResourceManager web console (YARN)
or JobTracker web console (MRv1).
register /usr/lib/zookeeper/zookeeper-<ZooKeeper_version>-cdh<CDH_version>.jar
register /usr/lib/hbase/hbase-<HBase_version>-cdh<CDH_version>-security.jar
For example,
register /usr/lib/zookeeper/zookeeper-3.4.5-cdh5.0.0.jar
register /usr/lib/hbase/hbase-0.95.2-cdh5.0.0-security.jar
In addition, Pig needs to be able to access the hbase-site.xml file on the Hadoop client. Pig searches for the
file within the /etc/hbase/conf directory on the client, or in Pig's CLASSPATH variable.
For more information about using Pig with HBase, see Importing Data Into HBase.
Installing DataFu
DataFu is a collection of Apache Pig UDFs (User-Defined Functions) for statistical evaluation that were developed
by LinkedIn and have now been open sourced under an Apache 2.0 license.
To use DataFu:
1. Install the DataFu package:
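For example, on RHEL-compatible systems (the package name pig-udf-datafu is an assumption and may differ by release):
$ sudo yum install pig-udf-datafu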
REGISTER /usr/lib/pig/datafu-<DataFu_version>-cdh<CDH_version>.jar
For example,
REGISTER /usr/lib/pig/datafu-0.0.4-cdh5.0.0.jar
Search Installation
This documentation describes how to install Cloudera Search powered by Solr. It also explains how to install
and start supporting tools and services such as the ZooKeeper Server, MapReduce tools for use with Cloudera
Search, and Flume Solr Sink.
After installing Cloudera Search as described in this document, you can configure and use Cloudera Search as
described in the Cloudera Search User Guide. The user guide includes the Cloudera Search Tutorial, as well as
topics that describe extracting, transforming, and loading data, establishing high availability, and troubleshooting.
Cloudera Search documentation includes:
• CDH 5 Release Notes
• CDH Version and Packaging Information
• Cloudera Search User Guide
• Cloudera Search Frequently Asked Questions
Preparing to Install Cloudera Search
Cloudera Search provides interactive search and scalable indexing. Before you begin installing Cloudera Search:
• Decide whether to install Cloudera Search using Cloudera Manager or using package management tools.
• Decide on which machines to install Cloudera Search and with which other services to collocate Search.
• Consider the sorts of tasks, workloads, and types of data you will be searching. This information can help
guide your deployment process.
Choosing Where to Deploy the Cloudera Search Processes
You can collocate a Cloudera Search server (solr-server package) with a Hadoop TaskTracker (MRv1) and a
DataNode. When collocating with TaskTrackers, be sure that the machine resources are not oversubscribed.
Start with a small number of MapReduce slots and increase them gradually.
For instructions describing how and where to install solr-mapreduce, see Installing MapReduce Tools for use
with Cloudera Search. For information about the Search package, see the Using Cloudera Search section in the
Cloudera Search Tutorial.
Guidelines for Deploying Cloudera Search
Memory
CDH initially deploys Solr with a Java virtual machine (JVM) size of 1 GB. In the context of Search, 1 GB is a small
value. Starting with this small value simplifies JVM deployment, but the value is insufficient for most actual use
cases. Consider the following when determining an optimal JVM size for production usage:
• The more searchable material you have, the more memory you need. All things being equal, 10 TB of searchable
data requires more memory than 1 TB of searchable data.
• What is indexed in the searchable material. Indexing all fields in a collection of logs, email messages, or
Wikipedia entries requires more memory than indexing only the Date Created field.
• The level of performance required. If the system must be stable and respond quickly, more memory may
help. If slow responses are acceptable, you may be able to use less memory.
To ensure an appropriate amount of memory, consider your requirements and experiment in your environment.
In general:
• 4 GB is sufficient for some smaller loads or for evaluation.
• 12 GB is sufficient for some production environments.
• 48 GB is sufficient for most situations.
Deployment Requirements
The information in this topic should be considered as guidance instead of absolute requirements. Using a sample
application to benchmark different use cases and data types and sizes can help you identify the most important
performance factors.
To determine how best to deploy search in your environment, define use cases. The same Solr index can have
very different hardware requirements, depending on queries performed. The most common variation in hardware
requirement is memory. For example, the memory requirements for faceting vary depending on the number of
unique terms in the faceted field. Suppose you want to use faceting on a field that has ten unique values. In
this case, only ten logical containers are required for counting. No matter how many documents are in the index,
memory overhead is almost nonexistent.
Conversely, the same index could have a unique timestamp for every entry on which you want to facet. In this
case, each unique value requires its own logical container. With this organization, if you had a large number of
documents (500 million, for example), faceting across 10 such fields would increase the RAM requirements
significantly.
For this reason, use cases and some characterization of the data are required before you can estimate hardware
requirements. Important parameters to consider are:
• Number of documents. For Cloudera Search, sharding is almost always required.
• Approximate word count for each potential field.
• What information is stored in the Solr index and what information is only for searching. Information stored
in the index is returned with the search results.
• Foreign language support:
– How many different languages appear in your data?
– What percentage of documents are in each language?
– Is language-specific search supported? This determines whether accent folding and storing the text in a
single field is sufficient.
– What language families will be searched? For example, you could combine all Western European languages
into a single field, but combining English and Chinese into a single field is not practical. Even with more
similar sets of languages, using a single field for different languages can be problematic. For example,
sometimes accents alter the meaning of a word, and in such a case, accent folding loses important
distinctions.
• Faceting requirements:
– Be wary of faceting on fields that have many unique terms. For example, faceting on timestamps or
free-text fields typically has a high cost. Faceting on a field with more than 10,000 unique values is typically
not useful. Ensure that any such faceting requirement is necessary.
– What types of facets are needed? You can facet on queries as well as field values. Faceting on queries is
often useful for dates. For example, “in the last day” or “in the last week” can be valuable. Using Solr Date
Math to facet on a bare “NOW” is almost always inefficient. Facet-by-query is not memory-intensive
because the number of logical containers is limited by the number of queries specified, no matter how
many unique values are in the underlying field. This can enable faceting on fields that contain information
such as dates or times, while avoiding the problem described for faceting on fields with unique terms (see the sketch after this list).
• Sorting requirements:
– Sorting requires one integer for each document (maxDoc), which can take up significant memory.
Additionally, sorting on strings requires storing each unique string value.
• Is an “advanced” search capability planned? If so, how will it be implemented? Significant design decisions
depend on user motivation levels:
– Can users be expected to learn about the system? “Advanced” screens could intimidate e-commerce
users, but these screens may be most effective if users can be expected to learn them.
– How long will your users wait for results? Data mining results in longer user wait times. You want to limit
user wait times, but other design requirements can affect response times.
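As an illustration of the facet-by-query approach described earlier in this list, the following sketch issues two date-range facet queries against a collection over HTTP. The host, collection name, and timestamp field are hypothetical, and the field names would depend on your schema:
$ curl -G 'http://myhost.example.com:8983/solr/collection1/select' \
    --data-urlencode 'q=*:*' \
    --data-urlencode 'rows=0' \
    --data-urlencode 'facet=true' \
    --data-urlencode 'facet.query=timestamp:[NOW/DAY-1DAY TO NOW]' \
    --data-urlencode 'facet.query=timestamp:[NOW/DAY-7DAYS TO NOW]'
Each facet.query adds only one counting container, regardless of how many unique timestamps exist in the index.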
Note: Depending on which installation approach you use, Search is installed to different locations.
• Installing Search with Cloudera Manager using parcels results in changes under
/opt/cloudera/parcels.
• Installing using packages, either manually or using Cloudera Manager, results in changes to various
locations throughout the file system. Common locations for changes include /usr/lib/,
/etc/default/, and /usr/share/doc/.
Note: This page describes how to install CDH using packages as well as how to install CDH using
Cloudera Manager.
You can also install Cloudera Search manually in some situations; for example, if you have an existing installation
to which you want to add Search.
To use CDH 5, which includes Cloudera Search:
• For general information about using repositories to install or upgrade Cloudera software, see Understanding
Custom Installation Solutions.
• For instructions on installing or upgrading CDH, see CDH 5 Installation and the instructions for Upgrading
from CDH 4 to CDH 5.
• For CDH 5 repository locations and client .repo files, which include Cloudera Search, see CDH Version and
Packaging Information.
Cloudera Search provides the following packages:
• solr: Solr
Important:
• Running services: When starting, stopping, and restarting CDH components, always use the
service(8) command instead of running /etc/init.d scripts directly. This is important because
service sets the current working directory to the root directory (/) and removes environment
variables except LANG and TERM. This creates a predictable environment in which to administer
the service. If you use /etc/init.d scripts directly, any environment variables continue to be
applied, potentially causing unexpected results. If you install CDH from packages, service is
installed as part of the Linux Standard Base (LSB).
• Install the Cloudera repository: Before using the instructions in this guide to install or upgrade
Cloudera Search from packages, install the Cloudera yum, zypper/YaST or apt repository, and
install or upgrade CDH and make sure it is functioning correctly.
Cloudera Search packages are configured according to the Linux Filesystem Hierarchy Standard.
Next, enable the server daemons you want to use with Hadoop. You can also enable Java-based client access
by adding the JAR files in /usr/lib/solr/ and /usr/lib/solr/lib/ to your Java class path.
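For example, a minimal sketch of adding these JAR files to the class path for a client shell session might look like the following (the exact set of JARs depends on your installation):
$ export CLASSPATH="$CLASSPATH:/usr/lib/solr/*:/usr/lib/solr/lib/*"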
Deploying Cloudera Search
When you deploy Cloudera Search, SolrCloud partitions your data set into multiple indexes and processes, using
ZooKeeper to simplify management, resulting in a cluster of coordinating Solr servers.
Edit the SOLR_ZK_ENSEMBLE property in /etc/default/solr (or /opt/cloudera/parcels/CDH-*/etc/default/solr) to
configure the hosts with the address of the ZooKeeper service. You must make this configuration change on every
Solr Server host. The following example shows a configuration with three ZooKeeper hosts:
SOLR_ZK_ENSEMBLE=<zkhost1>:2181,<zkhost2>:2181,<zkhost3>:2181/solr
SOLR_HDFS_HOME=hdfs://namenodehost:8020/solr
Replace namenodehost with the hostname of your HDFS NameNode (as specified by fs.default.name or
fs.defaultFS in your conf/core-site.xml file). You may also need to change the port number from the
default (8020). On an HA-enabled cluster, ensure that the HDFS URI you use reflects the designated name
service utilized by your cluster. This value should be reflected in fs.default.name; instead of a hostname,
you would see hdfs://nameservice1 or something similar.
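For example, on an HA-enabled cluster with a name service called nameservice1, the setting might look like this:
SOLR_HDFS_HOME=hdfs://nameservice1/solr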
2. In some cases, such as for configuring Solr to work with HDFS High Availability (HA), you may want to configure
the Solr HDFS client by setting the HDFS configuration directory in /etc/default/solr or
/opt/cloudera/parcels/CDH-*/etc/default/solr. On every Solr Server host, locate the appropriate
HDFS configuration directory and edit the following property with the absolute path to this directory:
SOLR_HDFS_CONFIG=/etc/hadoop/conf
Replace the path with the correct directory containing the proper HDFS configuration files, core-site.xml
and hdfs-site.xml.
For more information, see Step 4: Create and Deploy the Kerberos Principals and Keytab Files
SOLR_KERBEROS_ENABLED=true
SOLR_KERBEROS_KEYTAB=/etc/solr/conf/solr.keytab
SOLR_KERBEROS_PRINCIPAL=solr/[email protected]
$ solrctl init
Warning: solrctl init takes a --force option as well. solrctl init --force clears the Solr
data in ZooKeeper and interferes with any running hosts. If you clear Solr data from ZooKeeper to
start over, be sure to stop the cluster first.
Starting Solr
To start the cluster, start Solr Server on each host:
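A sketch of this step, assuming the solr-server service name used by the Cloudera Search packages:
$ sudo service solr-server restart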
After you have started the Cloudera Search Server, the Solr server should be running. To verify that all daemons
are running, use the jps tool from the Oracle JDK, which you can obtain from the Java SE Downloads page. If
you are running a pseudo-distributed HDFS installation and a Solr search installation on one machine, jps
shows the following output:
You can customize it by directly editing the solrconfig.xml and schema.xml files created in
$HOME/solr_configs/conf.
These configuration files are compatible with the standard Solr tutorial example documents.
After configuration is complete, you can make it available to Solr by issuing the following command, which
uploads the content of the entire instance directory to ZooKeeper:
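A sketch of this step, assuming the collection name collection1 used later in this section and the instance directory generated under $HOME/solr_configs:
$ solrctl instancedir --create collection1 $HOME/solr_configs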
Use the solrctl tool to verify that your instance directory uploaded successfully and is available to ZooKeeper.
List the contents of an instance directory as follows:
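For example:
$ solrctl instancedir --list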
If you used the earlier --create command to create collection1, the --list command should return
collection1.
Important:
If you are familiar with Apache Solr, you might configure a collection directly in solr home:
/var/lib/solr. Although this is possible, Cloudera recommends using solrctl instead.
You should be able to check that the collection is active. For example, for the server myhost.example.com, you
should be able to navigate to
http://myhost.example.com:8983/solr/collection1/select?q=*%3A*&wt=json&indent=true and
verify that the collection is active. Similarly, you should be able to view the topology of your SolrCloud using a
URL similar to http://myhost.example.com:8983/solr/#/~cloud.
Adding Another Collection with Replication
To support scaling for the query load, create a second collection with replication. Having multiple servers with
replicated collections distributes the request load for each shard. Create one shard cluster with a replication
factor of two. Your cluster must have at least two running servers to support this configuration, so ensure
Cloudera Search is installed on at least two servers. A replication factor of two causes two copies of the index
files to be stored in two different locations.
1. Generate the config files for the collection:
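A sketch of this step, assuming a hypothetical local directory named $HOME/solr_configs2 for the new collection's configuration:
$ solrctl instancedir --generate $HOME/solr_configs2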
4. Verify that the collection is live and that the one shard is served by two hosts. For example, for the server
myhost.example.com, you should receive content from:
http://myhost.example.com:8983/solr/#/~cloud.
Creating Replicas of Existing Shards
You can create additional replicas of existing shards using a command of the following form:
For example, to create a new replica of the collection named collection1 that comprises shard1, use the
following command:
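A sketch of what such a command might look like, assuming the solrctl core --create syntax and a hypothetical target server; the placeholders are explained below:
$ solrctl --solr http://<target_solr_server>:8983/solr core --create <core_name> \
    -p collection=collection1 -p shard=shard1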
Where:
• target_solr_server: The server to host the new shard
• core_name: <collection_name><shard_id><replica_id>
• shard_id: New shard identifier
For example, to add a new second shard named shard2 to a solr server named mySolrServer, where the
collection is named myCollection, you would use the following command:
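Under the same assumptions as the previous sketch, the command might look like this:
$ solrctl --solr http://mySolrServer:8983/solr core --create myCollection_shard2_replica1 \
    -p collection=myCollection -p shard=shard2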
For information on using Spark to batch index documents, see the Spark Indexing Reference (CDH 5.2 or later
only).
Installing MapReduce Tools for use with Cloudera Search
Cloudera Search provides the ability to batch index documents using MapReduce jobs. Install the solr-mapreduce
package on hosts where you want to submit a batch indexing job.
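For example, on a RHEL-compatible system:
$ sudo yum install solr-mapreduce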
For information on using MapReduce to batch index documents, see the MapReduce Batch Indexing Reference.
Installing the Lily HBase Indexer Service
To query data stored in HBase, you must install the Lily HBase Indexer service. This service indexes the stream
of records being added to HBase tables. This process is scalable, fault tolerant, transactional, and operates at
near real-time (NRT). The typical delay is a few seconds between the time data arrives and the time the same
data appears in search results.
Choosing where to Deploy the Lily HBase Indexer Service Processes
To accommodate the HBase ingest load, you can run as many Lily HBase Indexer services on different hosts as
required. See the HBase replication documentation for details on how to plan the capacity. You can co-locate
Lily HBase Indexer service processes with SolrCloud on the same set of hosts.
To install the Lily HBase Indexer service on RHEL systems:
To install the Lily HBase Indexer service on Ubuntu and Debian systems:
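A sketch of these commands, assuming the hbase-solr-indexer and hbase-solr-doc package names used by CDH:
$ sudo yum install hbase-solr-indexer hbase-solr-doc        # RHEL systems
$ sudo apt-get install hbase-solr-indexer hbase-solr-doc    # Ubuntu and Debian systems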
Important: For the Lily HBase Indexer to work with CDH 5, you may need to run the following command
before issuing Lily HBase MapReduce jobs:
The update process is different for Search 1.x and Search for CDH 5. With Search 1.x, Search is a separate package
from CDH. Therefore, to upgrade from Search 1.x, you must upgrade to CDH 5, which includes Search as part of
the CDH 5 repository.
Important: Before upgrading, make backup copies of the following configuration files:
• /etc/default/solr or /opt/cloudera/parcels/CDH-*/etc/default/solr
• All collection configurations
Make sure you make these copies on every host that is part of the SolrCloud.
• If you are running CDH 4 and want to upgrade to Search for CDH 5, see Upgrading Search 1.x to Search for
CDH 5 on page 388.
• Cloudera Search for CDH 5 is included as part of CDH 5. Therefore, to upgrade from previous versions of
Cloudera Search for CDH 5 to the latest version of Cloudera Search, simply upgrade CDH. For more information,
see Upgrading from an Earlier CDH 5 Release to the Latest Release on page 590.
Upgrading Search 1.x to Search for CDH 5
If you are running Cloudera Manager, you must upgrade to Cloudera Manager 5 to run CDH 5. Because Search
1.x is in a separate repository from CDH 4, you must remove the Search 1.x packages and the Search .repo or
.list file before upgrading CDH. This is true whether or not you are upgrading through Cloudera Manager.
1. Check which packages are installed using one of the following commands, depending on your operating
system:
2. Remove the packages using the appropriate remove command for your OS. For example:
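A sketch of these two steps on a RHEL-compatible system; the package names shown are examples, so remove whatever the first command actually reports:
$ rpm -qa | grep -i solr              # list installed Search 1.x packages
$ sudo yum remove solr-server solr    # remove the packages reported above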
SLES: /etc/zypp/repos.d/cloudera-search.repo
• To upgrade without using Cloudera Manager, see Upgrading from CDH 4 to CDH 5.
4. If you upgraded to CDH 5 without using Cloudera Manager, you need to install the new version of Search:
$ cd /usr/share/hue
$ sudo tar -xzvf hue-search-####.tar.gz
$ sudo /usr/share/hue/tools/app_reg/app_reg.py \
--install /usr/share/hue/apps/search
SOLR_SECURITY_ALLOWED_PROXYUSERS=hue
SOLR_SECURITY_PROXYUSER_hue_HOSTS=*
SOLR_SECURITY_PROXYUSER_hue_GROUPS=*
For more information about Secure Impersonation or to set up additional users for Secure Impersonation,
see Enabling Secure Impersonation.
5. (Optional) To view files in HDFS, ensure that the correct webhdfs_url is included in hue.ini and WebHDFS
is properly configured as described in Configuring CDH Components for Hue.
6. Restart Hue:
$ cd /usr/share/hue
$ sudo tar -xzvf hue-search-####.tar.gz
$ sudo /usr/share/hue/tools/app_reg/app_reg.py \
--install /usr/share/hue/apps/search
2. Restart Hue:
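For example:
$ sudo service hue restart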
Sentry Installation
Sentry enables role-based, fine-grained authorization for HiveServer2 and Cloudera Impala. It provides classic
database-style authorization for Hive and Impala. For more information, and instructions on configuring Sentry
for Hive and Impala, see The Sentry Service.
Installing Sentry
Use the following instructions, depending on your operating system, to install the latest version of Sentry.
RHEL: $ sudo yum install sentry
Upgrading Sentry
Note: If you have already performed the steps to uninstall CDH 4 and all components, as described
under Upgrading from CDH 4 to CDH 5 on page 573, you can skip this step and proceed with installing
the latest version of Sentry.
RHEL: $ sudo yum remove sentry
To stop the Sentry service, identify the PID of the Sentry Service and use the kill command to end the
process:
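A minimal sketch of this step (the grep pattern is an example; adjust it to match how the Sentry service runs on your system):
$ ps -ef | grep -i sentry     # note the PID of the Sentry service process
$ sudo kill <PID>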
RHEL: $ sudo yum remove sentry
Snappy Installation
Snappy is a compression/decompression library. It aims for very high speeds and reasonable compression,
rather than maximum compression or compatibility with other compression libraries. Use the following sections
to install, upgrade, and use Snappy.
• Upgrading Snappy
• Installing Snappy
• Using Snappy for MapReduce Compression
• Using Snappy for Pig Compression
• Using Snappy for Hive Compression
• Using Snappy Compression in Sqoop Imports
• Using Snappy Compression with HBase
• Apache Snappy Documentation
Upgrading Snappy
To upgrade Snappy, simply install the hadoop package if you haven't already done so. This applies whether you
are upgrading from CDH 4 or from an earlier CDH 5 release.
Note:
To see which version of Hadoop is shipping in CDH 5, check the Version and Packaging Information.
For important information on new and changed components, see the CDH 5 Release Notes.
Installing Snappy
Snappy is provided in the hadoop package along with the other native libraries (such as native gzip compression).
Warning:
If you install Hadoop from a tarball, Snappy may not work, because the Snappy native library may not
be compatible with the version of Linux on your system. If you want to use Snappy, install CDH 5 from
the RHEL or Debian packages.
To take advantage of Snappy compression you need to set certain configuration properties, which are explained
in the following sections.
Using Snappy for MapReduce Compression
It's very common to enable MapReduce intermediate compression, since this can make jobs run faster without
you having to make any application changes. Only the temporary intermediate files created by Hadoop for the
shuffle phase are compressed (the final output may or may not be compressed). Snappy is ideal in this case
because it compresses and decompresses very fast compared to other compression algorithms, such as Gzip.
For information about choosing a compression format, see Choosing a Data Compression Format.
To enable Snappy for MapReduce intermediate compression for the whole cluster, set the following properties
in mapred-site.xml:
• For MRv1:
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
<property>
<name>mapred.map.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
• For YARN:
<property>
<name>mapreduce.map.output.compress</name>
<value>true</value>
</property>
<property>
<name>mapreduce.map.output.compress.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
Note:
The MRv1 property names are also supported (though deprecated) in MRv2 (YARN), so it's not
mandatory to update them in this release.
To enable Snappy compression for Hive output, set the following properties in your Hive session:
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
• For Sqoop 1: on the command line, use the following option to enable Snappy compression for the import:
--compression-codec org.apache.hadoop.io.compress.SnappyCodec
It is a good idea to use the --as-sequencefile option with this compression option.
• For Sqoop 2:
When you create a job (sqoop:000> create job), choose 7 (SNAPPY) as the compression format.
Spark Installation
Spark is a fast, general engine for large-scale data processing.
The following sections describe how to install and configure Spark.
• Spark Packaging on page 394
• Spark Prerequisites on page 395
• Installing and Upgrading Spark on page 395
• Configuring and Running Spark on page 396
See also the Apache Spark Documentation.
Spark Packaging
The packaging options for installing Spark are:
• RPM packages
• Debian packages
There are five Spark packages:
Spark Prerequisites
• An operating system supported by CDH 5
• Oracle JDK
• The hadoop-client package (see Installing the Latest CDH 5 Release on page 166)
Installing and Upgrading Spark
To see which version of Spark is shipping in the current release, check the CDH Version and Packaging Information.
For important information, see the CDH 5 Release Notes, in particular:
• New Features in CDH 5
• Apache Spark Incompatible Changes
• Apache Spark Known Issues
To install or upgrade the Spark packages on a RHEL-compatible system:
• To install all Spark packages:
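A sketch of this command, assuming the five Spark package names shipped with CDH 5:
$ sudo yum install spark-core spark-master spark-worker spark-history-server spark-python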
You are now ready to configure Spark. See the next section.
Note:
If you uploaded the Spark JAR file as described under Optimizing YARN Mode on page 403, use the
same instructions to upload the new version of the file each time you upgrade to a new minor release
of CDH (for example, any CDH 5.4.x release, including 5.4.0).
Note:
As of CDH5, Cloudera recommends running Spark Applications on YARN, rather than in standalone
mode. Cloudera does not support running Spark applications on Mesos.
Before you can run Spark in standalone mode, you must do the following on every host in the cluster:
• Edit the following portion of /etc/spark/conf/spark-env.sh to point to the host where the Spark Master
runs:
###
### === IMPORTANT ===
### Change the following to specify a real cluster's Master host
###
export STANDALONE_SPARK_MASTER_HOST=`hostname`
Change 'hostname' in the last line to the actual hostname of the host where the Spark Master will run.
You can change other elements of the default configuration by modifying /etc/spark/conf/spark-env.sh.
You can change the following:
• SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports
• SPARK_WORKER_CORES, to set the number of cores to use on this machine
• SPARK_WORKER_MEMORY, to set how much memory to use (for example 1000MB, 2GB)
• SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT
• SPARK_WORKER_INSTANCES, to set the number of worker processes per node
• SPARK_WORKER_DIR, to set the working directory of worker processes
On Spark clients (systems from which you intend to launch Spark jobs), do the following:
1. Create /etc/spark/conf/spark-defaults.conf on the Spark client:
cp /etc/spark/conf/spark-defaults.conf.template /etc/spark/conf/spark-defaults.conf
spark.eventLog.dir=/user/spark/applicationHistory
spark.eventLog.enabled=true
This causes Spark applications running on this client to write their history to the directory that the history server
reads.
In addition, if you want the YARN ResourceManager to link directly to the Spark History Server, you can set the
spark.yarn.historyServer.address property in /etc/spark/conf/spark-defaults.conf:
spark.yarn.historyServer.address=http://HISTORY_HOST:HISTORY_PORT
For instructions for configuring the History Server to use Kerberos, see Spark Authentication.
Starting, Stopping, and Running Spark in Standalone Mode
This section provides instructions for running Spark in standalone mode; to run Spark application on YARN, see
Running Spark Applications on YARN on page 401.
Note:
As of CDH5, Cloudera recommends running Spark Applications on YARN, rather than in standalone
mode. Cloudera does not support running Spark applications on Mesos.
You can go to the Spark Master UI, by default at http://spark-master:18080, to see the Spark Shell application,
its executors, and logs.
You can also use the standard SparkPi examples to test your deployment; see Running Spark Applications on
page 398.
To prevent data loss if a receiver fails, the receivers used must be able to replay data from the original data
sources if required.
• The Kafka receiver will automatically replay if the spark.streaming.receiver.writeAheadLog.enable
parameter is set to true.
• The receiver-less Direct Kafka DStream does not require the
spark.streaming.receiver.writeAheadLog.enable parameter, and can function without data loss even
without Streaming recovery.
• Both the Flume receivers that come packaged with Spark also replay the data automatically on receiver
failure.
Running Spark Applications
Spark applications are similar to MapReduce “jobs.” Each application is a self-contained computation which
runs some user-supplied code to compute a result. As with MapReduce jobs, Spark applications can make use
of the resources of multiple nodes. Spark revolves around the concept of a resilient distributed dataset (RDD),
which is a fault-tolerant collection of elements that can be operated on in parallel. There are currently two types
of RDDs: parallelized collections, which take an existing Scala collection and run functions on it in parallel, and
Hadoop datasets, which run functions on each record of a file in Hadoop distributed file system or any other
storage system supported by Hadoop. Both types of RDDs can be operated on through the same methods.
Each application has a driver process which coordinates its execution. This process can run in the foreground
(client mode) or in the background (cluster mode). Client mode is a little simpler, but cluster mode allows you to
easily log out after starting a Spark application without terminating the application.
Spark starts executors to perform computations. There may be many executors, distributed across the cluster,
depending on the size of the job. After loading some of the executors, Spark attempts to match tasks to executors.
CDH 5.3 introduces a performance optimization (via SPARK-1767), which causes Spark to prefer RDDs which
are already cached locally in HDFS. This is important enough that Spark will wait for the executors near these
caches to be free for a short time.
Note that if Spark does not start executors on nodes with cached data, there is no further chance to select
them during the task-matching phase. This is not a problem for most workloads, since most workloads start
executors on most or all nodes in the cluster. However, if you do have problems with the optimization, an
alternate API, a SparkContext constructor marked @DeveloperApi, is provided for writing a Spark application
that explicitly spells out the preferred locations on which to start executors. See the following example, as well as
examples/src/main/scala/org/apache/spark/examples/SparkHdfsLR.scala, for a working example of
using this API.
...
val sparkConf = new SparkConf().setAppName("SparkHdfsLR")
val inputPath = args(0)
val conf = SparkHadoopUtil.get.newConfiguration()
val sc = new SparkContext(sparkConf,
    InputFormatInfo.computePreferredLocations(
      Seq(new InputFormatInfo(conf, classOf[org.apache.hadoop.mapred.TextInputFormat],
        inputPath))))
...
/**
 * :: DeveloperApi ::
 * Alternative constructor for setting preferred locations where Spark will create executors.
 *
 * @param preferredNodeLocationData used in YARN mode to select nodes to launch containers on.
 * Can be generated using [[org.apache.spark.scheduler.InputFormatInfo.computePreferredLocations]]
 * from a list of input files or InputFormats for the application.
 */
@DeveloperApi
def this(config: SparkConf, preferredNodeLocationData: Map[String, Set[SplitInfo]]) = {
  this(config)
  this.preferredNodeLocationData = preferredNodeLocationData
}
Note:
As of CDH5, Cloudera recommends running Spark Applications on YARN, rather than in standalone
mode. Cloudera does not support running Spark applications on Mesos.
Multiple Spark applications can run at once. If you decide to run Spark on YARN, you can decide on an
application-by-application basis whether to run in YARN client mode or cluster mode. When you run Spark in
client mode, the driver process runs locally; in cluster mode, it runs remotely on an ApplicationMaster.
Note:
Some applications that have nested definitions and are run in the Spark shell may encounter a Task
not serializable exception, because of a limitation in the way Scala compiles code. Cloudera
recommends running such applications in a Spark job.
The following sections use a sample application, SparkPi, which is packaged with Spark and computes the value
of Pi, to illustrate the three modes.
Configuring Spark Using the Command Line
Note:
To use Cloudera Manager to configure Spark, see Using Cloudera Manager to Configure Spark to Run
on YARN on page 402.
The easiest way to configure Spark using the command line is to use $SPARK_HOME/conf/spark-defaults.conf.
This file contains lines in the form: “key value”. You can create a comment by putting a hash mark ( # ) at the
beginning of a line.
spark.master spark://mysparkmaster.cloudera.com:7077
spark.eventLog.enabled true
spark.eventLog.dir hdfs:///user/spark/eventlog
# Set spark executor memory
spark.executor.memory 2g
spark.logConf true
It is a good idea to put configuration keys that you want to use for every application into spark-defaults.conf.
See the Spark Configuration documentation for more information.
Note: Spark cannot handle command line options of the form --key=value; use --key value
instead. (That is, use a space instead of an equals sign.)
To run spark-submit, you need a compiled Spark application JAR. The following sections use a sample JAR,
SparkPi, which is packaged with Spark. It computes an approximation to the value of Pi.
spark-submit \
--class org.apache.spark.examples.SparkPi \
--deploy-mode client \
--master spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT \
$SPARK_HOME/examples/lib/spark-examples_version.jar 10
spark-submit \
--class org.apache.spark.examples.SparkPi \
--deploy-mode client \
--master yarn-client \
$SPARK_HOME/examples/lib/spark-examples_version.jar 10
spark-submit \
--class org.apache.spark.examples.SparkPi \
--deploy-mode cluster \
--master yarn-cluster \
$SPARK_HOME/examples/lib/spark-examples_version.jar 10
Note:
Fat JARs must always be built against the version of Spark you intend to run (see Apache Spark
Known Issues).
Note:
As of CDH5, Cloudera recommends running Spark Applications on YARN, rather than in standalone
mode. Cloudera does not support running Spark applications on Mesos.
When Spark applications run on YARN, resource management, scheduling, and security are controlled by YARN.
You can run any given application in client mode or cluster mode. In client mode, the driver for the application
runs on the host where the job is submitted, whereas in cluster mode the driver runs on a cluster host chosen
by YARN.
For instructions for configuring the Spark History Server using the command line, see Configuring the Spark
History Server on page 396.
Using Cloudera Manager to Configure Spark to Run on YARN
To set up Spark to run on YARN on clusters managed by Cloudera Manager, use the Add Service wizard to do
the following:
1. Add the Spark service to your cluster. Make sure you choose Spark, not Spark (Standalone).
Once the service is added, Cloudera Manager will set up the required libraries and configuration on all nodes
in the cluster.
2. From the Customize Role Assignments screen of the wizard, select a host to run the Job History Server.
3. From the Customize Role Assignments screen of the wizard, add a gateway role to all hosts from which
applications will be submitted.
This ensures that these hosts have the binaries and configuration needed for submitting an application to
Spark.
4. Don't forget to click Continue (after you see the message Successfully deployed all client
configurations) and then Finish.
Submitting an Application to YARN
To submit an application to YARN, use the spark-submit script with the --master argument set to yarn-client
or yarn-cluster. Pass the FQCN of the class that contains the main method as the value of the --class
argument, and pass the JAR that contains that class after the other arguments, as shown below. Any parameters
to be passed to the application must be passed in after that.
Note:
As noted in the table above, you can specify some of the arguments as configuration parameters
instead. You can do any of the following:
• Pass them directly to the SparkConf that is used to create the SparkContext in your Spark
application; for example:
OR:
• Specify them in $SPARK_HOME/conf/spark-defaults.conf (see Configuring Spark Using the
Command Line on page 400.)
The order of precedence is:
1. Parameters passed to SparkConf
2. Arguments passed to spark-submit (or spark-shell)
3. Properties set in spark-defaults.conf
For more information, see Running Spark on YARN and Spark Configuration.
Dynamically Scaling the Number of Executors for an Application
Spark can dynamically increase and decrease the number of executors to be used for an application if the resource
requirement for the application changes over time. To enable dynamic allocation, set
spark.dynamicAllocation.enabled to true. Specify the minimum number of executors that should be
allocated to an application by means of the spark.dynamicAllocation.minExecutors configuration parameter,
and specify the maximum number of executors by means of the spark.dynamicAllocation.maxExecutors
parameter. Set the initial number of executors in the spark.dynamicAllocation.initialExecutors
configuration parameter. Do not use the --num-executors command line argument or the
spark.executor.instances parameter; they are incompatible with dynamic allocation. You can find more
information on dynamic allocation in the Apache Spark documentation.
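For example, a minimal sketch of these settings in spark-defaults.conf (the values shown are illustrative only):
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 2
spark.dynamicAllocation.initialExecutors 4
spark.dynamicAllocation.maxExecutors 20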
Optimizing YARN Mode
Normally, Spark copies the Spark assembly JAR file to HDFS each time you run spark-submit, as you can see
in the following sample log messages:
You can avoid doing this copy each time by manually uploading the Spark assembly JAR file to your HDFS. Then
set the SPARK_JAR environment variable to this HDFS path:
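A sketch of this procedure; the local path to the assembly JAR and the HDFS target directory are assumptions that depend on how Spark was installed:
$ hdfs dfs -mkdir -p /user/spark/share/lib
$ hdfs dfs -put /usr/lib/spark/lib/spark-assembly.jar /user/spark/share/lib/spark-assembly.jar
$ export SPARK_JAR=hdfs://namenodehost:8020/user/spark/share/lib/spark-assembly.jar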
Note:
In a cluster that is not managed by Cloudera Manager, you need to do this manual upload again each
time you upgrade Spark to a new minor CDH release (for example, any CDH 5.4.x release, including
5.4.0).
export SPARK_SUBMIT_CLASSPATH=./commons-codec-1.4.jar:$SPARK_HOME/assembly/lib/*:./myapp-jar-with-dependencies.jar
Important:
The commons-codec-1.4 dependency must come before the SPARK_HOME dependencies.
4. Now you can start the pipeline using your Crunch app jar-with-dependencies file using the spark-submit
script, just as you would for a regular Spark pipeline.
Sqoop 1 Installation
Apache Sqoop 1 is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured
datastores such as relational databases. You can use Sqoop 1 to import data from external structured datastores
into the Hadoop Distributed File System (HDFS) or related systems such as Hive and HBase. Conversely, you
can use Sqoop 1 to extract data from Hadoop and export it to external structured datastores such as relational
databases and enterprise data warehouses.
Note:
To see which version of Sqoop 1 is shipping in CDH 5, check the CDH Version and Packaging Information.
For important information on new and changed components, see the CDH 5 Release Notes.
Note:
If you have already performed the steps to uninstall CDH 4 and all components, as described under
Upgrading from CDH 4 to CDH 5 on page 573, you can skip Step 1 below and proceed with installing
the new CDH 5 version of Sqoop 1.
Sqoop 1 Packaging
The packaging options for installing Sqoop 1 are:
• RPM packages
• Tarball
• Debian packages
Sqoop 1 Prerequisites
• An operating system supported by CDH 5
• Oracle JDK
• Services which you wish to use with Sqoop, such as HBase, Hive HCatalog, and Accumulo. Sqoop checks for
these services when you run it, and finds services which are installed and configured. It logs warnings for
services it does not find. These warnings, shown below, are harmless.
> Warning: /usr/lib/sqoop/../hbase does not exist! HBase imports will fail.
> Please set $HBASE_HOME to the root of your HBase installation.
> Warning: /usr/lib/sqoop/../hive-hcatalog does not exist! HCatalog jobs will fail.
> Please set $HCAT_HOME to the root of your HCatalog installation.
> Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
> Please set $ACCUMULO_HOME to the root of your Accumulo installation.
If you have already configured CDH on your system, there is no further configuration necessary for Sqoop 1. You
can start using Sqoop 1 by using commands such as:
$ sqoop help
$ sqoop version
$ sqoop import
Important:
Make sure you have read and understood the section on tarballs under How Packaging Affects CDH
5 Deployment on page 166 before you proceed with a tarball installation.
To install Sqoop 1 from the tarball, unpack the tarball in a convenient location. Once it is unpacked, add the bin
directory to the shell path for easy access to Sqoop 1 commands. Documentation for users and developers can
be found in the docs directory.
To install the Sqoop 1 tarball on Linux-based systems:
Run the following command:
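A sketch of this step, assuming a hypothetical tarball name:
$ tar -xvzf sqoop-<version>.tar.gz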
Note:
When installing Sqoop 1 from the tarball package, you must make sure that the environment variables
JAVA_HOME and HADOOP_MAPRED_HOME are configured correctly. The variable HADOOP_MAPRED_HOME
should point to the root directory of Hadoop installation. Optionally, if you intend to use any Hive or
HBase related functionality, you must also make sure that they are installed and the variables
HIVE_HOME and HBASE_HOME are configured correctly to point to the root directory of their respective
installation.
Note:
The JDBC drivers need to be installed only on the machine where Sqoop is executed; you do not need
to install them on all nodes in your Hadoop cluster.
mkdir -p /var/lib/sqoop
chown sqoop:sqoop /var/lib/sqoop
chmod 755 /var/lib/sqoop
$ sudo cp mysql-connector-java-version/mysql-connector-java-version-bin.jar
/var/lib/sqoop/
Note:
At the time of publication, version was 5.1.31, but the version may have changed by the time you
read this.
Important:
Make sure you have at least version 5.1.31. Some systems ship with an earlier version that
may not work correctly with Sqoop.
$ curl -L
'http://download.microsoft.com/download/0/2/A/02AAE597-3865-456C-AE7F-613F99F850A8/sqljdbc_4.0.2206.100_enu.tar.gz'
| tar xz
$ sudo cp sqljdbc_4.0/enu/sqljdbc4.jar /var/lib/sqoop/
$ curl -L 'http://jdbc.postgresql.org/download/postgresql-9.2-1002.jdbc4.jar' -o
postgresql-9.2-1002.jdbc4.jar
$ sudo cp postgresql-9.2-1002.jdbc4.jar /var/lib/sqoop/
Setting HADOOP_MAPRED_HOME
• For each user who will be submitting MapReduce jobs using MapReduce v2 (YARN), or running Pig, Hive, or
Sqoop 1 in a YARN installation, make sure that the HADOOP_MAPRED_HOME environment variable is set correctly,
as follows:
$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
• For each user who will be submitting MapReduce jobs using MapReduce v1 (MRv1), or running Pig, Hive, or
Sqoop 1 in an MRv1 installation, set the HADOOP_MAPRED_HOME environment variable as follows:
$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce
Sqoop 2 Installation
Sqoop 2 is a server-based tool designed to transfer data between Hadoop and relational databases. You can
use Sqoop 2 to import data from a relational database management system (RDBMS) such as MySQL or Oracle
into the Hadoop Distributed File System (HDFS), transform the data with Hadoop MapReduce, and then export
it back into an RDBMS.
Sqoop 2 Packaging
There are three packaging options for installing Sqoop 2:
• Tarball (.tgz) that contains both the Sqoop 2 server and the client.
• Separate RPM packages for Sqoop 2 server (sqoop2-server) and client (sqoop2-client)
• Separate Debian packages for Sqoop 2 server (sqoop2-server) and client (sqoop2-client)
Sqoop 2 Installation
• Upgrading Sqoop 2 from CDH 4 to CDH 5 on page 409
• Upgrading Sqoop 2 from an Earlier CDH 5 Release on page 411
• Installing Sqoop 2
• Configuring Sqoop 2
• Starting, Stopping and Using the Server
• Apache Documentation
See also Feature Differences - Sqoop 1 and Sqoop 2 on page 416.
Upgrading Sqoop 2 from CDH 4 to CDH 5
To upgrade Sqoop 2 from CDH 4 to CDH 5, proceed as follows.
Note:
If you have already performed the steps to uninstall CDH 4 and all components, as described under
Upgrading from CDH 4 to CDH 5 on page 573, you can skip Step 1 below and proceed with installing
the new CDH 5 version of Sqoop 2.
For more detailed instructions for upgrading Sqoop 2, see the Apache Sqoop Upgrade page.
mv /etc/defaults/sqoop2-server.rpmnew /etc/defaults/sqoop2-server
b. Update alternatives:
sqoop2-tool upgrade
mv /etc/defaults/sqoop2-server.rpmnew /etc/defaults/sqoop2-server
b. Update alternatives:
sqoop2-tool upgrade
Installing Sqoop 2
Sqoop 2 Prerequisites
• An operating system supported by CDH 5
• Oracle JDK
• Hadoop must be installed on the node which runs the Sqoop 2 server component.
• Services which you wish to use with Sqoop, such as HBase, Hive HCatalog, and Accumulo. Sqoop checks for
these services when you run it, and finds services which are installed and configured. It logs warnings for
services it does not find. These warnings, shown below, are harmless.
> Warning: /usr/lib/sqoop/../hbase does not exist! HBase imports will fail.
> Please set $HBASE_HOME to the root of your HBase installation.
> Warning: /usr/lib/sqoop/../hive-hcatalog does not exist! HCatalog jobs will fail.
> Please set $HCAT_HOME to the root of your HCatalog installation.
> Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
> Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Installing Sqoop 2
Sqoop 2 is distributed as two separate packages: a client package (sqoop2-client) and a server package
(sqoop2-server). Install the server package on one node in the cluster; because the Sqoop 2 server acts as a
MapReduce client, this node must have Hadoop installed and configured.
Install the client package on each node that will act as a client. A Sqoop 2 client will always connect to the Sqoop
2 server to perform any actions, so Hadoop does not need to be installed on the client nodes.
Depending on what you are planning to install, choose the appropriate package and install it using your preferred
package manager application.
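For example, on a RHEL-compatible system:
$ sudo yum install sqoop2-server    # on the server host
$ sudo yum install sqoop2-client    # on each client host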
Note: The Sqoop 2 packages cannot be installed on the same machines as Sqoop 1 packages. However,
you can use both versions in the same Hadoop cluster by installing Sqoop 1 and Sqoop 2 on different
nodes.
Note:
Installing the sqoop2-server package creates a sqoop-server service configured to start Sqoop 2
at system startup time.
You are now ready to configure Sqoop 2. See the next section.
Configuring Sqoop 2
This section explains how to configure the Sqoop 2 server.
• To use YARN:
• To use MRv1:
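A sketch of what these steps might look like, assuming the sqoop2-tomcat-conf alternatives link used by the CDH packages; run only the line that matches your MapReduce version:
$ sudo alternatives --set sqoop2-tomcat-conf /etc/sqoop2/tomcat-conf.dist    # to use YARN
$ sudo alternatives --set sqoop2-tomcat-conf /etc/sqoop2/tomcat-conf.mr1     # to use MRv1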
Important: If you are upgrading from a release earlier than CDH 5 Beta 2
In earlier releases, the mechanism for setting the MapReduce version was the CATALINA_BASE variable
in the /etc/defaults/sqoop2-server file. This does not work as of CDH 5 Beta 2, and in fact could
cause problems. Check your /etc/defaults/sqoop2-server file and make sure CATALINA_BASE
is not set.
Note:
There is currently no recommended way to migrate data from an existing Derby database into the
new PostgreSQL database.
Use the procedure that follows to configure Sqoop 2 to use PostgreSQL instead of Apache Derby.
$ psql -U postgres
Password for user postgres: *****
TABLESPACE = pg_default
LC_COLLATE = 'en_US.UTF8'
LC_CTYPE = 'en_US.UTF8'
CONNECTION LIMIT = -1;
CREATE DATABASE
postgres=# \q
org.apache.sqoop.repository.jdbc.handler=org.apache.sqoop.repository.postgresql.PostgresqlRepositoryHandler
org.apache.sqoop.repository.jdbc.transaction.isolation=isolation level
org.apache.sqoop.repository.jdbc.maximum.connections=max connections
org.apache.sqoop.repository.jdbc.url=jdbc URL
org.apache.sqoop.repository.jdbc.driver=org.postgresql.Driver
org.apache.sqoop.repository.jdbc.user=username
org.apache.sqoop.repository.jdbc.password=password
org.apache.sqoop.repository.jdbc.properties.property=value
Note:
• Replace isolation level with a value such as READ_COMMITTED.
• Replace max connections with a value such as 10.
• Replace jdbc URL with the JDBC URL pointing to the host on which you installed PostgreSQL.
• Replace username with (in this example) sqoop
• Replace password with (in this example) sqoop
• Use org.apache.sqoop.repository.jdbc.properties.property to set each additional
property you want to configure; see https://jdbc.postgresql.org/documentation/head/connect.html
for details. For example, replace property with loglevel and value with 3
Note:
The JDBC drivers need to be installed only on the machine where Sqoop is executed; you do not need
to install them on all nodes in your Hadoop cluster.
$ sudo cp mysql-connector-java-version/mysql-connector-java-version-bin.jar
/var/lib/sqoop2/
At the time of publication, version was 5.1.31, but the version may have changed by the time you read this.
Important:
Make sure you have at least version 5.1.31. Some systems ship with an earlier version that may not
work correctly with Sqoop.
$ curl -L
'http://download.microsoft.com/download/0/2/A/02AAE597-3865-456C-AE7F-613F99F850A8/sqljdbc_4.0.2206.100_enu.tar.gz'
| tar xz
$ sudo cp sqljdbc_4.0/enu/sqljdbc4.jar /var/lib/sqoop2/
$ curl -L 'http://jdbc.postgresql.org/download/postgresql-9.2-1002.jdbc4.jar' -o
postgresql-9.2-1002.jdbc4.jar
$ sudo cp postgresql-9.2-1002.jdbc4.jar /var/lib/sqoop2/
You should get a text fragment in JSON format similar to the following:
{"version":"1.99.2-cdh5.0.0",...}
sqoop2
Identify the host where your server is running (we will use localhost in this example):
Test the connection by running the command show version --all to obtain the version number from the server.
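A sketch of this interaction in the Sqoop 2 shell; the port and webapp name shown are defaults and may differ in your deployment:
sqoop:000> set server --host localhost --port 12000 --webapp sqoop
sqoop:000> show version --all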
You should see output similar to the following:
Note:
Moving from Sqoop 1 to Sqoop 2: Sqoop 2 is essentially the future of the Apache Sqoop project.
However, because Sqoop 2 currently lacks some of the features of Sqoop 1, Cloudera recommends that you
use Sqoop 2 only if it contains all the features required for your use case; otherwise, continue to use
Sqoop 1.
Whirr Installation
Apache Whirr is a set of libraries for running cloud services. You can use Whirr to run CDH 5 clusters on cloud
providers such as Amazon Elastic Compute Cloud (Amazon EC2). There's no need to install the RPMs
for CDH 5 or do any configuration; a working cluster will start immediately with one command. It's ideal for
running temporary Hadoop clusters to carry out a proof of concept, or to run a few one-time jobs. When you are
finished, you can destroy the cluster and all of its data with one command.
Use the following sections to install, upgrade, and deploy Whirr:
• Upgrading Whirr
• Installing Whirr
• Generating an SSH Key Pair
• Defining a Cluster
• Launching a Cluster
• Apache Whirr Documentation
Upgrading Whirr
Note:
To see which version of Whirr is shipping in CDH 5, check the Version and Packaging Information. For
important information on new and changed components, see the CDH 5 Release Notes.
Note:
If you have already performed the steps to uninstall CDH 4 and all components, as described under
Upgrading from CDH 4 to CDH 5, you can skip Step 1 below and proceed with installing the new CDH
5 version of Whirr.
On SLES systems:
whirr.env.repo=cdh5
whirr.hadoop.install-function=install_cdh_hadoop
whirr.hadoop.configure-function=configure_cdh_hadoop
whirr.env.repo=cdh5
whirr.hadoop.install-function=install_cdh_hadoop
whirr.hadoop.configure-function=configure_cdh_hadoop
whirr.hbase.install-function=install_cdh_hbase
whirr.hbase.configure-function=configure_cdh_hbase
whirr.zookeeper.install-function=install_cdh_zookeeper
whirr.zookeeper.configure-function=configure_cdh_zookeeper
whirr.env.repo=cdh5
whirr.zookeeper.install-function=install_cdh_zookeeper
whirr.zookeeper.configure-function=configure_cdh_zookeeper
Important:
If you are upgrading from Whirr version 0.3.0, and are using an explicit image (AMI), make sure it
comes from one of the supplied Whirr recipe files.
$ whirr version
Note:
If you specify a non-standard location for the key files in the ssh-keygen command (that is, not
~/.ssh/id_rsa), then you must specify the location of the private key file in the
whirr.private-key-file property and the public key file in the whirr.public-key-file property.
For more information, see the next section.
Note:
For information on finding your cloud credentials, see the Whirr FAQ.
After generating an SSH key pair, the only task left to do before using Whirr is to define a cluster by creating a
properties file. You can name the properties file whatever you like. The example properties file used in these
instructions is named hadoop.properties. Save the properties file in your home directory. After defining a
cluster in the properties file, you will be ready to launch a cluster and run MapReduce jobs.
Important:
The properties shown below are sufficient to get a bare-bones cluster up and running, but you will
probably need to do more configuration to do real-life tasks, especially if you are using HBase and
ZooKeeper. You can find more comprehensive template files in the recipes directory, for example
recipes/hbase-cdh.properties.
MRv1 Cluster
The following file defines a cluster with a single machine for the NameNode and JobTracker, and another machine
for a DataNode and TaskTracker.
whirr.cluster-name=myhadoopcluster
whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,1
hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
whirr.identity=<cloud-provider-identity>
whirr.credential=<cloud-provider-credential>
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub
whirr.env.repo=cdh5
whirr.hadoop-install-function=install_cdh_hadoop
whirr.hadoop-configure-function=configure_cdh_hadoop
whirr.hardware-id=m1.large
whirr.image-id=us-east-1/ami-ccb35ea5
whirr.location-id=us-east-1
YARN Cluster
The following configuration provides the essentials for a YARN cluster. Change the number of instances for
hadoop-datanode+yarn-nodemanager from 2 to a larger number if you need to.
whirr.cluster-name=myhadoopcluster
whirr.instance-templates=1 hadoop-namenode+yarn-resourcemanager+mapreduce-historyserver,2
hadoop-datanode+yarn-nodemanager
whirr.provider=aws-ec2
whirr.identity=<cloud-provider-identity>
whirr.credential=<cloud-provider-credential>
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub
whirr.env.mapreduce_version=2
whirr.env.repo=cdh5
whirr.hadoop.install-function=install_cdh_hadoop
whirr.hadoop.configure-function=configure_cdh_hadoop
whirr.mr_jobhistory.start-function=start_cdh_mr_jobhistory
whirr.yarn.configure-function=configure_cdh_yarn
whirr.yarn.start-function=start_cdh_yarn
whirr.hardware-id=m1.large
whirr.image-id=us-east-1/ami-ccb35ea5
whirr.location-id=us-east-1
Launching a Cluster
To launch a cluster:
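For example, using the hadoop.properties file defined earlier:
$ whirr launch-cluster --config hadoop.properties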
As the cluster starts up, messages are displayed in the console. You can see debug-level log messages in a file
named whirr.log in the directory where you ran the whirr command. After the cluster has started, a message
appears in the console showing the URL you can use to access the web UI for Whirr.
$ . ~/.whirr/myhadoopcluster/hadoop-proxy.sh
$ cp -r /etc/hadoop/conf.empty /etc/hadoop/conf.whirr
$ rm -f /etc/hadoop/conf.whirr/*-site.xml
$ cp ~/.whirr/myhadoopcluster/hadoop-site.xml /etc/hadoop/conf.whirr
2. If you are using an Ubuntu, Debian, or SLES system, type these commands:
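A sketch of these commands, assuming the hadoop-conf alternatives link:
$ sudo update-alternatives --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.whirr 50
$ update-alternatives --display hadoop-conf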
$ hadoop fs -ls /
$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce
$ hadoop fs -mkdir input
$ hadoop fs -put $HADOOP_MAPRED_HOME/CHANGES.txt input
$ hadoop jar $HADOOP_MAPRED_HOME/hadoop-examples.jar wordcount input output
$ hadoop fs -cat output/part-* | head
• For YARN:
$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
$ hadoop fs -mkdir input
$ hadoop fs -put $HADOOP_MAPRED_HOME/CHANGES.txt input
$ hadoop jar $HADOOP_MAPRED_HOME/hadoop-mapreduce-examples.jar wordcount input output
$ hadoop fs -cat output/part-* | head
Destroying a cluster
When you are finished using a cluster, you can terminate the instances and clean up the resources using the
commands shown in this section.
WARNING
All data will be deleted when you destroy the cluster.
To destroy a cluster:
1. Run the following command to destroy a cluster:
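For example:
$ whirr destroy-cluster --config hadoop.properties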
2. Shut down the SSH proxy to the cluster if you started one earlier.
Viewing the Whirr Documentation
For additional documentation see the Whirr Documentation.
ZooKeeper Installation
Apache ZooKeeper is a highly reliable and available service that provides coordination between distributed
processes.
Note:
To see which version of ZooKeeper is shipping in CDH 5, check the CDH Version and Packaging
Information. For important information on new and changed components, see the Cloudera Release
Guide.
Note: If you have already performed the steps to uninstall CDH 4 described under Upgrading from
CDH 4 to CDH 5, you can skip Step 1 below and proceed with Step 2.
or
• The zookeeper-server package contains the init.d scripts necessary to run ZooKeeper as a daemon
process. Because zookeeper-server depends on zookeeper, installing the server package automatically
installs the base package.
Installing the ZooKeeper Server Package and Starting ZooKeeper on a Single Server
The instructions provided here deploy a single ZooKeeper server in "standalone" mode. This is appropriate for
evaluation, testing and development purposes, but may not provide sufficient reliability for a production
application. See Installing ZooKeeper in a Production Environment on page 426 for more information.
To install the ZooKeeper Server on Red Hat-compatible systems:
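For example:
$ sudo yum install zookeeper-server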
mkdir -p /var/lib/zookeeper
chown -R zookeeper /var/lib/zookeeper/
To start ZooKeeper
Note:
ZooKeeper may start automatically on installation on Ubuntu and other Debian systems. This automatic
start will happen only if the data directory exists; otherwise you will be prompted to initialize as shown
below.
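A minimal sketch of initializing and starting a standalone server, using the zookeeper-server service installed above:
$ sudo service zookeeper-server init
$ sudo service zookeeper-server start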
Note:
If you are deploying multiple ZooKeeper servers after a fresh install, you need to create a myid file in
the data directory. You can do this by means of an init command option: $ sudo service
zookeeper-server init --myid=1
tickTime=2000
dataDir=/var/lib/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2
server.1=zoo1:2888:3888
server.2=zoo2:2888:3888
server.3=zoo3:2888:3888
In this example, the final three lines are in the form server.id=hostname:port:port. The first port is for
a follower in the ensemble to listen on for the leader; the second is for leader election. You set id for each
server in the next step.
4. Create a file named myid in the server's DataDir; in this example, /var/lib/zookeeper/myid . The file
must contain only a single line, and that line must consist of a single unique number between 1 and 255;
this is the id component mentioned in the previous step. In this example, the server whose hostname is
zoo1 must have a myid file that contains only 1.
5. Start each server as described in the previous section.
6. Test the deployment by running a ZooKeeper client:
For example:
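A sketch of this step, using one of the hostnames and the client port from the example configuration above:
$ zookeeper-client -server zoo1:2181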
For more information on configuring a multi-server deployment, see Clustered (Multi-Server) Setup in the
ZooKeeper Administrator's Guide.
Setting up Supervisory Process for the ZooKeeper Server
The ZooKeeper server is designed to be both highly reliable and highly available. This means that:
• If a ZooKeeper server encounters an error it cannot recover from, it will "fail fast" (the process will exit
immediately)
• When the server shuts down, the ensemble remains active, and continues serving requests
• Once restarted, the server rejoins the ensemble without any further manual intervention.
Cloudera recommends that you fully automate this process by configuring a supervisory service to manage each
server, and restart the ZooKeeper server process automatically if it fails. See the ZooKeeper Administrator's
Guide for more information.
Maintaining a ZooKeeper Server
The ZooKeeper server continually saves znode snapshot files and, optionally, transactional logs in a Data Directory
to enable you to recover data. It's a good idea to back up the ZooKeeper Data Directory periodically. Although
ZooKeeper is highly reliable because a persistent copy is replicated on each server, recovering from backups
may be necessary if a catastrophic failure or user error occurs.
When you use the default configuration, the ZooKeeper server does not remove the snapshots and log files, so
they will accumulate over time. You will need to clean up this directory occasionally, taking into account your
backup schedules and processes. To automate the cleanup, a zkCleanup.sh script is provided in the bin directory
of the zookeeper base package. Modify this script as necessary for your situation. In general, you want to run
this as a cron task based on your backup schedule.
The data directory is specified by the dataDir parameter in the ZooKeeper configuration file, and the data log
directory is specified by the dataLogDir parameter.
For more information, see Ongoing Data Directory Cleanup.
Viewing the ZooKeeper Documentation
For additional ZooKeeper documentation, see http://archive.cloudera.com/cdh5/cdh/5/zookeeper/.
Avro Usage
Apache Avro is a serialization system. Avro supports rich data structures, a compact binary encoding, and a
container file for sequences of Avro data (often referred to as "Avro data files"). Avro is designed to be
language-independent and there are several language bindings for it, including Java, C, C++, Python, and Ruby.
Avro does not rely on generated code, which means that processing data imported from Flume or Sqoop 1 is
simpler than using Hadoop Writables in Sequence Files, where you have to take care that the generated classes
are on the processing job's classpath. Furthermore, Pig and Hive cannot easily process Sequence Files with
custom Writables, so users often revert to using text, which has disadvantages from a compactness and
compressibility point of view (compressed text is not generally splittable, making it difficult to process efficiently
using MapReduce).
All components in CDH 5 that produce or consume files support Avro data files as a file format. But bear in mind
that because uniform Avro support is new, there may be some rough edges or missing features.
The following sections contain brief notes on how to get started using Avro in the various CDH 5 components:
• Avro Data Files
• Compression
• Flume
• Sqoop
• MapReduce
• Streaming
• Pig
• Hive
Avro Data Files
Avro data files have the .avro extension. Make sure the files you create have this extension, since some tools
look for it to determine which files to process as Avro (e.g. AvroInputFormat and AvroAsTextInputFormat
for MapReduce and Streaming).
Compression
By default Avro data files are not compressed, but it is generally advisable to enable compression to reduce disk
usage and increase read and write performance. Avro data files support Deflate and Snappy compression. Snappy
is faster, while Deflate is slightly more compact.
You do not need to do any additional configuration to read a compressed Avro data file rather than an
uncompressed one. However, to write an Avro data file you need to specify the type of compression to use. How
you specify compression depends on the component being used, as explained in the sections below.
Flume
The HDFSEventSink that is used to serialize event data onto HDFS supports plug-in implementations of the
EventSerializer interface. Implementations of this interface have full control over the serialization format and
can be used in cases where the default serialization format provided by the sink does not suffice.
An abstract implementation of the EventSerializer interface, called AbstractAvroEventSerializer, is provided
with Flume. This class can be extended to support a custom schema for Avro serialization over HDFS. A simple
implementation that maps the events to an Avro representation of a String header map and byte payload is
provided by the class FlumeEventAvroEventSerializer, which can be used by setting the serializer property
of the sink as follows:
<agent-name>.sinks.<sink-name>.serializer = AVRO_EVENT
Sqoop 1
On the command line, use the following option to import to Avro data files:
--as-avrodatafile
Sqoop 1 automatically generates an Avro schema that corresponds to the database table from which the data is
imported.
To enable Snappy compression, add the following option:
--compression-codec snappy
Note:
Sqoop 2 does not currently support Avro.
MapReduce
The Avro MapReduce API is an Avro module for running MapReduce programs which produce or consume Avro
data files.
If you are using Maven, simply add the following dependency to your POM:
<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro-mapred</artifactId>
<version>1.7.3</version>
<classifier>hadoop2</classifier>
</dependency>
Then write your program using the Avro MapReduce javadoc for guidance.
At runtime, include the avro and avro-mapred JARs in the HADOOP_CLASSPATH; and the avro, avro-mapred
and paranamer JARs in -libjars.
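For example, a job might be launched as follows. This is only a sketch: the JAR paths, driver class, and
input/output paths are placeholders, and it assumes the driver uses ToolRunner so that -libjars is honored.
$ export HADOOP_CLASSPATH=/path/to/avro.jar:/path/to/avro-mapred-hadoop2.jar
$ hadoop jar my-avro-app.jar com.example.MyAvroDriver \
    -libjars /path/to/avro.jar,/path/to/avro-mapred-hadoop2.jar,/path/to/paranamer.jar \
    input-dir output-dir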
To enable Snappy compression on output files call AvroJob.setOutputCodec(job, "snappy") when configuring
the job. You will also need to include the snappy-java JAR in -libjars.
Streaming
To read Avro data files from a streaming program, specify
org.apache.avro.mapred.AvroAsTextInputFormat as the input format. This input format will convert each
datum in the Avro data file to a string. For a "bytes" schema, this will be the raw bytes, while in the general
case it will be a single-line JSON representation of the datum.
To write to Avro data files from a streaming program, specify org.apache.avro.mapred.AvroTextOutputFormat
as the output format. This output format will create Avro data files with a "bytes" schema, where each datum
is a tab-delimited key-value pair.
At runtime specify the avro, avro-mapred and paranamer JARs in -libjars in the streaming command.
To enable Snappy compression on output files, set the property avro.output.codec to snappy. You will also
need to include the snappy-java JAR in -libjars.
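Putting these pieces together, a streaming invocation might look like the following sketch; the streaming JAR
location, library paths, and input/output paths are placeholders.
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D avro.output.codec=snappy \
    -libjars avro.jar,avro-mapred.jar,paranamer.jar,snappy-java.jar \
    -inputformat org.apache.avro.mapred.AvroAsTextInputFormat \
    -input /user/examples/records.avro \
    -output /user/examples/out \
    -mapper /bin/cat \
    -reducer /usr/bin/wc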
Pig
CDH provides AvroStorage for Avro integration in Pig.
To use it, first register the piggybank JAR file and supporting libraries:
REGISTER piggybank.jar
REGISTER lib/avro-1.7.3.jar
REGISTER lib/json-simple-1.1.jar
REGISTER lib/snappy-java-1.0.4.1.jar
In the case of store, Pig generates an Avro schema from the Pig schema. It is possible to override the Avro
schema, either by specifying it literally as a parameter to AvroStorage, or by using the same schema as an
existing Avro data file. See the Pig wiki for details.
To store two relations in one script, specify an index to each store function. Here is an example:
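A minimal sketch, assuming two relations already loaded as set1 and set2; the relation names and output paths
are illustrative:
STORE set1 INTO 'output1' USING org.apache.pig.piggybank.storage.avro.AvroStorage('index', '1');
STORE set2 INTO 'output2' USING org.apache.pig.piggybank.storage.avro.AvroStorage('index', '2');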
For more information, see the AvroStorage wiki; look for "index".
To enable Snappy compression on output files do the following before issuing the STORE statement:
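A sketch of the statements typically used; the property names follow the Hadoop and Avro compression settings
described earlier in this section:
SET mapred.output.compress true;
SET mapred.output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;
SET avro.output.codec snappy;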
There is some additional documentation on the Pig wiki. Note, however, that the version numbers of the JAR
files to register are different on that page, so you should adjust them as shown above.
Hive
The following example demonstrates how to create a Hive table that is backed by Avro data files:
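A sketch using an inline schema; the table name, field names, and schema are illustrative, while the SerDe and
container input/output format classes are the standard Hive Avro classes:
CREATE TABLE doctors
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='{
  "name": "doctors",
  "type": "record",
  "fields": [
    {"name": "number", "type": "int"},
    {"name": "first_name", "type": "string"},
    {"name": "last_name", "type": "string"}
  ]}');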
You can also create an Avro-backed Hive table by using an Avro schema file:
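A sketch using a schema file instead of an inline schema; the table name and schema path are illustrative:
CREATE TABLE episodes
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='file:///tmp/episodes.avsc');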
The avro.schema.url property is a URL (here a file:// URL) pointing to an Avro schema file that is used for
reading and writing. It can also be an HDFS URL, for example, hdfs://hadoop-namenode-uri/examplefile.
To enable Snappy compression on output files, run the following before writing to the table:
SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;
You will also need to include the snappy-java JAR in --auxpath. The snappy-java JAR is located at:
/usr/lib/hive/lib/snappy-java-1.0.4.1.jar
Haivvreo SerDe has been merged into Hive as AvroSerDe, and it is no longer supported in its original form.
schema.url and schema.literal have been changed to avro.schema.url and avro.schema.literal as a
result of the merge. If you were using the Haivvreo SerDe, you can use the new Hive AvroSerDe with tables
created with the Haivvreo SerDe. For example, if you have a table my_avro_table that uses the Haivvreo SerDe,
you can do the following to make the table use the new AvroSerDe:
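A minimal sketch of the change, assuming the table already exists and its data files are unchanged:
ALTER TABLE my_avro_table SET SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe';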
Using the Parquet File Format with Impala, Hive, Pig, HBase, and MapReduce
Parquet is automatically installed when you install any of the above components, and the necessary libraries
are automatically placed in the classpath for all of them. Copies of the libraries are in /usr/lib/parquet or
inside the parcels in /lib/parquet.
The Parquet file format incorporates several features that make it highly suited to data warehouse-style
operations:
• Columnar storage layout. A query can examine and perform calculations on all values for a column while
reading only a small fraction of the data from a data file or table.
• Flexible compression options. The data can be compressed with any of several codecs. Different data files
can be compressed differently. The compression is transparent to applications that read the data files.
• Innovative encoding schemes. Sequences of identical, similar, or related data values can be represented in
ways that save disk space and memory. The encoding schemes provide an extra level of space savings beyond
the overall compression for each data file.
• Large file size. The layout of Parquet data files is optimized for queries that process large volumes of data,
with individual files in the multi-megabyte or even gigabyte range.
Among components of the CDH distribution, Parquet support originated in Cloudera Impala. Impala can create
Parquet tables, insert data into them, convert data from other file formats to Parquet, and then perform SQL
queries on the resulting data files. Parquet tables created by Impala can be accessed by Hive, and vice versa.
The CDH software stack lets you use the tool of your choice with the Parquet file format, for each phase of data
processing. For example, you can read and write Parquet files using Pig and MapReduce jobs. You can convert,
transform, and query Parquet tables through Impala and Hive. And you can interchange data files between all
of those components.
Using Parquet Tables with Impala
The Cloudera Impala component can create tables that use Parquet data files; insert data into those tables,
converting the data into Parquet format; and query Parquet data files produced by Impala or by other components.
The only syntax required is the STORED AS PARQUET clause on the CREATE TABLE statement. After that, all
SELECT, INSERT, and other statements recognize the Parquet format automatically. For example, a session in
the impala-shell interpreter might look as follows:
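For example, a session might look like the following sketch; the table and column names are illustrative:
[localhost:21000] > create table parquet_table (id INT, name STRING) STORED AS PARQUET;
[localhost:21000] > insert into parquet_table select id, name from text_table;
[localhost:21000] > select count(*) from parquet_table;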
Once you create a Parquet table this way in Impala, you can query it or insert into it through either Impala or
Hive.
Remember that Parquet format is optimized for working with large data files. In Impala 2.0 and later, the default
size of Parquet files written by Impala is 256 MB; in earlier releases, 1 GB. Avoid using the INSERT ... VALUES
syntax, or partitioning the table at too granular a level, if that would produce a large number of small files that
cannot take advantage of the Parquet optimizations for large data chunks.
Inserting data into a partitioned Impala table can be a memory-intensive operation, because each data file
requires a memory buffer to hold the data before being written. Such inserts can also exceed HDFS limits on
simultaneous open files, because each node could potentially write to a separate data file for each partition, all
at the same time. Always make sure table and column statistics are in place for any table used as the source
for an INSERT ... SELECT operation into a Parquet table. If capacity problems still occur, consider splitting up
such insert operations into one INSERT statement per partition.
For complete instructions and examples, see Using the Parquet File Format with Impala Tables.
Using Parquet Tables in Hive
To create a table named PARQUET_TABLE that uses the Parquet format, you would use a command like the
following, substituting your own table name, column names, and data types:
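A sketch of the statement, using placeholder names (the STORED AS PARQUET form assumes a recent Hive release
such as the one shipped in CDH 5):
CREATE TABLE parquet_table_name (x INT, y STRING) STORED AS PARQUET;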
Note:
• Once you create a Parquet table this way in Hive, you can query it or insert into it through either
Impala or Hive. Before the first time you access a newly created Hive table through Impala, issue
a one-time INVALIDATE METADATA statement in the impala-shell interpreter to make Impala
aware of the new table.
• dfs.block.size should be set to 1 GB (1073741824 bytes) in hdfs-site.xml, as shown in the example
following this note.
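A sketch of the corresponding hdfs-site.xml property; the value is 1 GB expressed in bytes:
<property>
  <name>dfs.block.size</name>
  <value>1073741824</value>
</property>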
If the table will be populated with data files generated outside of Impala and Hive, it is often useful to create
the table as an external table pointing to the location where the files will be created:
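A sketch, with a placeholder HDFS location:
CREATE EXTERNAL TABLE parquet_table_name (x INT, y STRING)
STORED AS PARQUET
LOCATION '/user/examples/parquet_table';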
To populate the table with an INSERT statement, and to read the table with a SELECT statement, see Using the
Parquet File Format with Impala Tables.
Select the compression to use when writing data with the parquet.compression property, for example:
set parquet.compression=GZIP;
INSERT OVERWRITE TABLE tinytable SELECT * FROM texttable;
Using Parquet Files in Pig
When writing Parquet files from Pig, there are three compression options: uncompressed, snappy, and gzip. The
default is snappy. You can specify one of them once before the first store instruction in a Pig script:
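For example (Pig set syntax, with no equals sign):
SET parquet.compression gzip;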
Using Parquet Files in MapReduce
MapReduce needs Thrift in its CLASSPATH and in libjars to access Parquet files, and also needs the
parquet-format JAR in libjars. Perform the following setup before running MapReduce jobs that access Parquet
data files:
if [ -e /opt/cloudera/parcels/CDH ] ; then
CDH_BASE=/opt/cloudera/parcels/CDH
else
CDH_BASE=/usr
fi
THRIFTJAR=`ls -l $CDH_BASE/lib/hive/lib/libthrift*jar | awk '{print $9}' | head -1`
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$THRIFTJAR
export LIBJARS=`echo "$CLASSPATH" | awk 'BEGIN { RS = ":" } { print }' | grep
parquet-format | tail -1`
export LIBJARS=$LIBJARS,$THRIFTJAR
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Mapper.Context;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import parquet.Log;
import parquet.example.data.Group;
import parquet.hadoop.example.ExampleInputFormat;
public class TestReadParquet extends Configured implements Tool {
    private static final Log LOG = Log.getLog(TestReadParquet.class);
/*
* Read a Parquet record
*/
public static class MyMap extends
Mapper<LongWritable, Group, NullWritable, Text> {
@Override
public void map(LongWritable key, Group value, Context context) throws IOException,
InterruptedException {
NullWritable outKey = NullWritable.get();
String outputRecord = "";
// Get the schema and field values of the record
String inputRecord = value.toString();
// Process the value, create an output record
// ...
context.write(outKey, new Text(outputRecord));
}
}
public int run(String[] args) throws Exception {
    Job job = new Job(getConf());
    job.setJarByClass(getClass());
job.setJobName(getClass().getName());
job.setMapOutputKeyClass(LongWritable.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapperClass(MyMap.class);
job.setNumReduceTasks(0);
job.setInputFormatClass(ExampleInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
    // Input and output paths are taken from the command-line arguments
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
return 0;
}
...
The write example uses the following additional imports; records are written using the Parquet example classes
(Group, GroupWriteSupport, and ExampleOutputFormat):
import parquet.Log;
import parquet.example.data.Group;
import parquet.hadoop.example.GroupWriteSupport;
import parquet.hadoop.example.ExampleInputFormat;
import parquet.hadoop.example.ExampleOutputFormat;
import parquet.hadoop.metadata.CompressionCodecName;
import parquet.hadoop.ParquetFileReader;
import parquet.hadoop.metadata.ParquetMetadata;
import parquet.schema.MessageType;
import parquet.schema.MessageTypeParser;
import parquet.schema.Type;
...
public int run(String[] args) throws Exception {
...
// The write schema can be specified as a string in the run method
// (the message name and fields shown here are illustrative):
String writeSchema = "message example {\n" +
    "required int32 x;\n" +
    "required int32 y;\n" +
    "}";
ExampleOutputFormat.setSchema(
    job,
    MessageTypeParser.parseMessageType(writeSchema));
job.submit();
Alternatively, the schema can be extracted from the input file(s) if they are in Parquet format:
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.RemoteIterator;
...
RemoteIterator<LocatedFileStatus> it = FileSystem.get(getConf()).listFiles(new
Path(inputFile), true);
while(it.hasNext()) {
FileStatus fs = it.next();
if(fs.isFile()) {
parquetFilePath = fs.getPath();
break;
}
}
if(parquetFilePath == null) {
LOG.error("No file found for " + inputFile);
return 1;
}
ParquetMetadata readFooter =
ParquetFileReader.readFooter(getConf(), parquetFilePath);
MessageType schema =
readFooter.getFileMetaData().getSchema();
GroupWriteSupport.setSchema(schema, getConf());
job.submit();
Records can then be written in the mapper by composing a Group as value using the Example classes and no
key:
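A minimal sketch of such a mapper, assuming the write schema set on the job above. The input parsing and field
names are illustrative, and parquet.example.data.simple.SimpleGroupFactory is used to build the records:
// Additional import assumed: parquet.example.data.simple.SimpleGroupFactory
public static class WriteMap extends Mapper<LongWritable, Text, Void, Group> {
    private SimpleGroupFactory factory;

    @Override
    protected void setup(Context context) {
        // Build records against the schema that was set on the job configuration
        factory = new SimpleGroupFactory(GroupWriteSupport.getSchema(context.getConfiguration()));
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Parse a comma-separated input line (illustrative) and compose a Group
        String[] fields = value.toString().split(",");
        Group group = factory.newGroup()
            .append("x", Integer.parseInt(fields[0]))
            .append("y", Integer.parseInt(fields[1]));
        context.write(null, group);   // no key; the Group is the value
    }
}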
Compression can be enabled by setting the codec on the output format before submitting the job, for example:
ExampleOutputFormat.setCompression(job, codec);
Building RPMs from CDH Source RPMs
This section describes how to build binary packages (RPMs) from the published CDH source packages (SRPMs):
• Prerequisites
• Setting up an Environment for Building RPMs
• Building an RPM
Prerequisites
• Oracle Java Development Kit (JDK) version 6.
• Apache Ant version 1.7 or later.
• Apache Maven 3.0 or later.
• The following environment variables must be set: JAVA_HOME, JAVA5_HOME, FORREST_HOME, and ANT_HOME.
• Your PATH must include the JAVA_HOME, ANT_HOME, FORREST_HOME and maven bin directories.
• If you are using Red Hat or CentOS systems, the rpmdevtools package is required for the rpmdev-setuptree
command used below.
Setting up an Environment for Building RPMs
Red Hat or CentOS systems
Users of these systems can run the rpmdev-setuptree command to set up their environment (this creates the
~/rpmbuild directory and the ~/.rpmmacros file).
SLES systems
Users of these systems can run the following commands to set up their environment:
$ mkdir -p ~/rpmbuild/{BUILD,RPMS,S{OURCE,PEC,RPM}S}
$ echo "%_topdir $HOME/rpmbuild"> ~/.rpmmacros
Building an RPM
Download SRPMs from archive.cloudera.com. The source RPMs for CDH 5 reside at
http://archive.cloudera.com/cdh5/redhat/5/x86_64/cdh/5/SRPMS/,
http://archive.cloudera.com/cdh5/sles/11/x86_64/cdh/5/SRPMS/ or
http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/5/SRPMS/. Run the following commands as a
non-root user, substituting the particular SRPM that you intend to build:
$ export SRPM=hadoop-0.20-0.20.2+320-1.src.rpm
$ rpmbuild --nodeps --rebuild $SRPM # Builds the native RPMs
$ rpmbuild --nodeps --rebuild --target noarch $SRPM # Builds the java RPMs
Apache License
All software developed by Cloudera for CDH is released with an Apache 2.0 license. Please let us know if you
find any file that doesn't explicitly state the Apache license at the top and we'll immediately fix it.
Apache License Version 2.0, January 2004 http://www.apache.org/licenses/
Copyright 2010-2013 Cloudera
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at:
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed
on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations under the License.
Third-Party Licenses
For a list of third-party licenses associated with CDH, see
http://www.cloudera.com/content/cloudera-content/cloudera-docs/Licenses/Third-Party-Licenses/Third-Party-Licenses.html.
Debian and Ubuntu: apt-get remove or apt-get purge. apt-get can be run with the remove option to remove only
the installed packages, or with the purge option to remove the packages and their configuration.
Warning:
For this reason, run apt-get remove or apt-get purge only with great care, and only after making sure you have
backed up all your configuration data.
The apt-get remove commands to uninstall the Hadoop components from a Debian or Ubuntu system are:
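The exact packages depend on which components you installed; a representative command might look like the
following sketch (the package names shown are common CDH base packages and are only an example):
$ sudo apt-get remove bigtop-utils bigtop-jsvc bigtop-tomcat hue-common sqoop2-client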
Additional clean-up
The uninstall commands may not remove all traces of Hadoop from your system. The apt-get purge commands
available for Debian and Ubuntu systems delete more files than the commands that use the remove option but
are still not comprehensive. If you want to remove all vestiges of Hadoop from your system, look for the following
and remove them manually:
• log files
• modified system configuration files
• Hadoop configuration files in directories under /etc such as hadoop, hbase, hue, hive, oozie, sqoop,
zookeeper, and zookeeper.dist
• user/group identifiers
• Oozie and Hue databases
• Documentation packages
Upgrade
This section describes how to upgrade Cloudera Manager and Cloudera Navigator. In addition, this section
describes how to upgrade CDH. Depending on whether your CDH deployment is managed by Cloudera Manager
or unmanaged, you can upgrade CDH and managed services using Cloudera Manager or upgrade an unmanaged
CDH 5 deployment using the command line.
Note: When an upgraded Cloudera Manager adds support for a new feature (for example, Sqoop 2,
WebHCat, and so on), it does not install the software on which the new feature depends. If you install
CDH and managed services from packages, you must add the packages to your managed hosts before
adding a service or role that supports the new feature.
Understanding Upgrades
The process for upgrading Cloudera Manager varies depending on the starting point. Categories of tasks to be
completed include the following:
• Install databases required for the release. In Cloudera Manager 5, the Host Monitor and Service Monitor roles
use an internal database that provides greater capacity and flexibility. You do not need to configure an
external database for these roles. If you are upgrading from Cloudera Manager 4, this transition is handled
automatically. If you are upgrading a Free Edition installation and you are running a MapReduce service, you
are asked to configure an additional database for the Activity Monitor that is part of Cloudera Express.
• Upgrade the Cloudera Manager Server.
• Upgrade the Cloudera Manager Agent. You can use an upgrade wizard that is invoked when you connect to
the Admin Console or manually install the Cloudera Manager Agent packages.
– Cloudera Manager 5 continues to support a CDH 4 cluster with an existing high availability deployment
using NFS shared edits directories. However, if you disable high availability in Cloudera Manager 5, you
can re-enable high availability only by using Quorum-based Storage. CDH 5 does not support enabling
NFS shared edits directories with high availability.
Upgrading CDH
Cloudera Manager 5 can manage both CDH 4 and CDH 5, so upgrading existing CDH 4 installations is not required.
However, to benefit from the most current CDH features, you must upgrade CDH. For more information on
upgrading CDH, see Upgrading CDH and Managed Services Using Cloudera Manager on page 479.
Note:
Cloudera Manager 4.5 added support for Hive, which includes the Hive Metastore Server role type.
This role manages the metastore process when Hive is configured with a remote metastore.
When upgrading from Cloudera Manager versions before 4.5, Cloudera Manager automatically creates
new Hive services to capture the previous implicit Hive dependency from Hue and Impala. Your
previous services continue to function without impact. If Hue was using a Hive metastore backed by
a Derby database, the newly created Hive Metastore Server also uses Derby. Because Derby does not
allow concurrent connections, Hue continues to work, but the new Hive Metastore Server does not
run. The failure is harmless (because nothing uses this new Hive Metastore Server at this point) and
intentional, to preserve the set of cluster functionality as it was before upgrade. Cloudera discourages
the use of a Derby-backed Hive metastore due to its limitations and recommends switching to a
different supported database.
After you have completed these steps, the upgrade processes automatically complete any additional updates to
the database schema and stored service data. You do not need to complete any data migration.
Backing up Databases
Before beginning the upgrade process, shut down the services that are using databases. This includes the
Cloudera Manager Management Service roles, the Hive Metastore Server, and Cloudera Navigator, if it is in use.
Cloudera strongly recommends that you then back up all databases; however, backing up the Activity Monitor
database is optional. For information on backing up databases, see Backing Up Databases on page 65.
4. Review the contents of the exported database for non-standard characters. If you find unexpected characters,
modify these so the database backup file contains the expected data.
5. Import the database backup to the newly created database.
From the maximum number of connections, you can determine the number of anticipated sessions using the
following formula:
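The relationship, consistent with the worked example that follows, is:
sessions = (1.1 * maximum connections) + 5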
For example, if a host has two databases, you anticipate 250 maximum connections. If you anticipate a maximum
of 250 connections, plan for 280 sessions.
Once you know the number of sessions, you can determine the number of anticipated transactions using the
following formula:
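The relationship, consistent with the example values that follow, is:
transactions = 1.1 * sessions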
Continuing with the previous example, if you anticipate 280 sessions, you can plan for 308 transactions.
Work with your Oracle database administrator to apply these derived values to your system.
Using the sample values above, Oracle attributes would be set as follows:
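A sketch of how the sample values might be applied; the exact statements and scope depend on your Oracle
configuration, so treat this only as an illustration:
ALTER SYSTEM SET processes=250 SCOPE=SPFILE;
ALTER SYSTEM SET sessions=280 SCOPE=SPFILE;
ALTER SYSTEM SET transactions=308 SCOPE=SPFILE;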
Next Steps
After you have completed any required database preparatory tasks, continue to Upgrading Cloudera Manager
4 to Cloudera Manager 5 on page 457 or Upgrading Cloudera Manager 5 to the Latest Cloudera Manager on page
445.
Required Role:
This process applies to upgrading all versions of Cloudera Manager 5.
In most cases it is possible to complete the following upgrade without shutting down most CDH services, although
you may need to stop some dependent services. CDH daemons can continue running, unaffected, while Cloudera
Manager is upgraded. The upgrade process does not affect your CDH installation. After upgrading Cloudera
Manager you may also want to upgrade CDH 4 clusters to CDH 5.
Upgrading Cloudera Manager 5 to the latest version of Cloudera Manager involves the following steps.
Review Warnings
Warning:
• Cloudera Management Service SSL configuration
If you have enabled TLS security for the Cloudera Manager Admin Console, as of Cloudera Manager
5.1, Cloudera Management Service roles try to communicate with Cloudera Manager using TLS,
and fail to start until SSL properties have been configured.
• Navigator
If you have enabled auditing with Cloudera Navigator, during the upgrade to Cloudera Manager 5,
auditing is suspended and is only restarted when you restart the roles of audited services.
• JDK upgrade
If you upgrade the JDK during the installation of the Cloudera Manager Agent, you must restart
all services. Additionally, if you have enabled SSL, you must reinstall CA certificates to your
truststores. See Creating Truststores.
Condition: Running a version of Cloudera Manager that has the Cloudera Management Service.
Procedure: Stop the Cloudera Management Service.

Condition: Running the embedded PostgreSQL database.
Procedure: Stop the Hive service and all services such as Impala and Hue that use the Hive metastore.

Condition: Running Cloudera Navigator.
Procedure: Stop any of the following roles whose service's Queue Policy configuration
(navigator.batch.queue_policy) is set to SHUTDOWN:
• HDFS - NameNode
• HBase - Master and RegionServers
• Hive - HiveServer2
• Hue - Beeswax Server
Stopping these roles renders any service depending on these roles unavailable. For the HDFS - NameNode case,
this implies most of the services in the cluster will be unavailable until the upgrade is finished.
3. If you are using the embedded PostgreSQL database for Cloudera Manager, stop the database:
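On package-based installations, this is typically:
$ sudo service cloudera-scm-server-db stop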
Important: If you are not running the embedded database service and you attempt to stop it, you
receive a message indicating that the service cannot be found. If instead you get a message that
the shutdown failed, the embedded database is still running, probably because services are
connected to the Hive metastore. If the database shutdown fails due to connected services, issue
the following command:
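The command referred to is typically the fast stop action of the embedded database service; confirm it against
your release:
$ sudo service cloudera-scm-server-db fast_stop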
4. If the Cloudera Manager host is also running the Cloudera Manager Agent, stop the Cloudera Manager Agent:
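On package-based installations, this is typically:
$ sudo service cloudera-scm-agent stop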
(Optional) Upgrade the JDK on Cloudera Manager Server and Agent Hosts
If you are manually upgrading the Cloudera Manager Agent software in Upgrade and Start Cloudera Manager
Agents (Packages) on page 467 or Install Cloudera Manager Server and Agent Software (Tarballs) on page 463,
and you are upgrading to CDH 5, install the Oracle JDK on the Agent hosts as described in Java Development Kit
Installation on page 41.
If you are not running Cloudera Manager Server on the same host as a Cloudera Manager Agent, and you want
all hosts to run the same JDK version, optionally install the Oracle JDK on that host.
[cloudera-manager]
# Packages for Cloudera Manager, Version 5, on RedHat or CentOS 6 x86_64
name=Cloudera Manager
baseurl=http://archive.cloudera.com/cm5/redhat/6/x86_64/cm/5/
gpgkey = http://archive.cloudera.com/cm5/redhat/6/x86_64/cm/RPM-GPG-KEY-cloudera
gpgcheck = 1
For Ubuntu or Debian systems, navigate to the appropriate release directory, for example,
http://archive.cloudera.com/cm5/debian/wheezy/amd64/cm. The repo file, in this case cloudera.list, is
similar to the following:
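A sketch of the repo file contents for a Debian Wheezy system; verify the suite and component names against
the archive for your release:
# Packages for Cloudera Manager, Version 5, on Debian 7.0 amd64
deb http://archive.cloudera.com/cm5/debian/wheezy/amd64/cm wheezy-cm5 contrib
deb-src http://archive.cloudera.com/cm5/debian/wheezy/amd64/cm wheezy-cm5 contrib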
b. Replace the repo file in the configuration location for the package management software for your system.
Note:
• yum clean all cleans yum cache directories, ensuring that you
download and install the latest versions of the packages.
• If your system is not up to date, any underlying system components
must be upgraded before yum update can succeed. yum indicates
which components must be upgraded.
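On RHEL-compatible systems, the corresponding commands are similar to the following sketch; the exact package
pattern is an assumption:
$ sudo yum clean all
$ sudo yum upgrade 'cloudera-*'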
Ubuntu or Debian: The following commands clean cached repository information and update Cloudera Manager
components:
$ sudo apt-get clean
$ sudo apt-get update
$ sudo apt-get dist-upgrade
$ sudo apt-get install cloudera-manager-server
cloudera-manager-agent cloudera-manager-daemons
During this process, you may be prompted about your configuration file version:
Configuration file `/etc/cloudera-scm-agent/config.ini'
==> Modified (by you or by a script) since installation.
==> Package distributor has shipped an updated version.
What would you like to do about it ? Your options are:
Y or I : install the package maintainer's version
N or O : keep your currently-installed version
D : show the differences between the versions
Z : start a shell to examine the situation
The default action is to keep your current version.
You will receive a similar prompt for /etc/cloudera-scm-server/db.properties.
Answer N to both prompts.
You should now have the following packages, corresponding to the version of Cloudera Manager you installed,
on the host that will be the Cloudera Manager Server host.
OS Packages
RPM-based distributions $ rpm -qa 'cloudera-manager-*'
cloudera-manager-agent-5.4.2-0.cm5.p0.932.el6.x86_64
cloudera-manager-server-5.4.2-0.cm5.p0.932.el6.x86_64
cloudera-manager-daemons-5.4.2-0.cm5.p0.932.el6.x86_64
You may also see an entry for the cloudera-manager-server-db-2 if you are using the embedded database,
and additional packages for plug-ins, depending on what was previously installed on the server host. If the
cloudera-manager-server-db-2 package is installed, and you do not plan to use the embedded database,
you can remove this package.
Install Cloudera Manager Server and Agent Software (Tarballs)
Tarballs contain both the Cloudera Manager Server and Cloudera Manager Agent in a single file. Download
tarballs from the locations listed in Cloudera Manager Version and Download Information. Copy the tarballs and
unpack them on all hosts on which you intend to install Cloudera Manager Server and Cloudera Manager Agents,
in a directory of your choosing. If necessary, create a new directory to accommodate the files you extract from
the tarball. For instance, if /opt/cloudera-manager does not exist, create it using a command similar to:
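For example:
$ sudo mkdir -p /opt/cloudera-manager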
When you have a directory to which to extract the contents of the tarball, extract the contents. For example, to
copy a tar file to your home directory and extract the contents of all tar files to the /opt/ directory, use a
command similar to the following:
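A sketch, assuming the tarball was downloaded to your home directory and the target directory created above:
$ sudo tar xzf cloudera-manager*.tar.gz -C /opt/cloudera-manager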
The files are extracted to a subdirectory named according to the Cloudera Manager version being extracted. For
example, files could extract to /opt/cloudera-manager/cm-5.0/. This full path is needed later and is referred
to as tarball_root directory.
Property Description
server_host Name of the host where Cloudera Manager Server is running.
server_port Port on the host where Cloudera Manager Server is running.
• By default, a tarball installation has a var subdirectory where state is stored. In a non-tarball installation,
state is stored in /var. Cloudera recommends that you reconfigure the tarball installation to use an external
directory as the /var equivalent (/var or any other directory outside the tarball) so that when you upgrade
Cloudera Manager, the new tarball installation can access this state. Configure the installation to use an
external directory for storing state by editing tarball_root/etc/default/cloudera-scm-agent and setting
the CMF_VAR variable to the location of the /var equivalent. If you do not reuse the state directory between
different tarball installations, duplicate Cloudera Manager Agent entries can occur in the Cloudera Manager
database.
Start the Cloudera Manager Server (Tarballs)
The way in which you start the Cloudera Manager Server varies according to which account you want the Server
to run under:
• As root: run the cloudera-scm-server script in tarball_root/etc/init.d as root. If the Cloudera Manager
Server starts successfully, output similar to the following displays:
Starting cloudera-scm-server: [ OK ]
• As another user. If you run as another user, ensure the user you created for Cloudera Manager owns the
location to which you extracted the tarball including the newly created database files. If you followed the
earlier examples and created the directory /opt/cloudera-manager and the user cloudera-scm, you could
use the following command to change ownership of the directory:
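For example:
$ sudo chown -R cloudera-scm:cloudera-scm /opt/cloudera-manager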
Once you have established ownership of directory locations, you can start Cloudera Manager Server using
the user account you chose. For example, you might run the Cloudera Manager Server as cloudera-service.
In this case, you have the following options:
– Run the following command:
– Edit the configuration files so the script internally changes the user. Then run the script as root:
1. Remove the following line from tarball_root/etc/default/cloudera-scm-server:
USER=cloudera-service
GROUP=cloudera-service
$ cp tarball_root/etc/init.d/cloudera-scm-server
/etc/init.d/cloudera-scm-server
$ chkconfig cloudera-scm-server on
• Debian/Ubuntu
$ cp tarball_root/etc/init.d/cloudera-scm-server
/etc/init.d/cloudera-scm-server
$ update-rc.d cloudera-scm-server defaults
2. On the Cloudera Manager Server host, open the /etc/init.d/cloudera-scm-server file and change
the value of CMF_DEFAULTS from ${CMF_DEFAULTS:-/etc/default} to tarball_root/etc/default.
If the Cloudera Manager Server does not start, see Troubleshooting Installation and Upgrade Problems on page
612.
Important: All hosts in the cluster must have access to the Internet if you plan to use
archive.cloudera.com as the source for installation files. If you do not have Internet access, create
a custom repository.
• If local laws permit you to deploy unlimited strength encryption, and you are running a secure
cluster, check the Install Java Unlimited Strength Encryption Policy Files checkbox.
Click Continue.
4. Specify credentials and initiate Agent installation:
• Select root or enter the user name for an account that has password-less sudo permission.
• Select an authentication method:
– If you choose password authentication, enter and confirm the password.
– If you choose public-key authentication, provide a passphrase and path to the required key
files.
• You can specify an alternate SSH port. The default value is 22.
• You can specify the maximum number of host installations to run at once. The default value is 10.
5. Click Continue. The Cloudera Manager Agent packages are installed.
6. Click Continue. The Host Inspector runs to inspect your managed hosts for correct versions and
configurations. If there are problems, you can make changes and then rerun the inspector. When you
are satisfied with the inspection results, click Finish.
• Manually install Agent software
1. On all cluster hosts except the Cloudera Manager Server host, stop the Agent:
2. In the Cloudera Admin Console, select No, I would like to skip the agent upgrade now and click Continue.
3. Copy the appropriate repo file as described in Upgrade Cloudera Manager Server (Packages) on page
461.
4. Run the following commands:
Note:
• yum clean all cleans yum cache directories, ensuring that you
download and install the latest versions of the packages.
• If your system is not up to date, any underlying system components
must be upgraded before yum update can succeed. yum indicates
which components must be upgraded.
Ubuntu or Debian: Use the following commands to clean cached repository information and update Cloudera
Manager components:
$ sudo apt-get clean
$ sudo apt-get update
$ sudo apt-get dist-upgrade
3. Click Continue. The Host Inspector runs to inspect your managed hosts for correct versions and configurations.
If there are problems, you can make changes and then re-run the inspector.
4. Click Finish. If you are using an external database for Cloudera Navigator, the Database Setup page displays.
a. Configure database settings:
a. Enter the database host, database type, database name, username, and password for the database
that you created when you set up the database.
b. Click Test Connection to confirm that Cloudera Manager can communicate with the database using
the information you have supplied. If the test succeeds in all cases, click Continue; otherwise check
and correct the information you have provided for the database and then try the test again. (For some
servers, if you are using the embedded database, you will see a message saying the database will be
created at a later step in the installation process.) The Review Changes page displays.
• If you are running single user mode, stop Cloudera Manager Agent using the user account you chose. For
example, if you are running the Cloudera Manager Agent as cloudera-scm, you have the following options:
– Run the following command:
– Edit the configuration files so the script internally changes the user, and then run the script as root:
USER=cloudera-scm
GROUP=cloudera-scm
– Edit the configuration files so the script internally changes the user, and then run the script as root:
1. Remove the following line from tarball_root/etc/default/cloudera-scm-agent:
USER=cloudera-scm
GROUP=cloudera-scm
$ cp tarball_root/etc/init.d/cloudera-scm-agent /etc/init.d/cloudera-scm-agent
$ chkconfig cloudera-scm-agent on
• Debian/Ubuntu
$ cp tarball_root/etc/init.d/cloudera-scm-agent /etc/init.d/cloudera-scm-agent
$ update-rc.d cloudera-scm-agent defaults
2. On each Agent, open the tarball_root/etc/init.d/cloudera-scm-agent file and change the value
of CMF_DEFAULTS from ${CMF_DEFAULTS:-/etc/default} to tarball_root/etc/default.
Property Description
SSL Client Truststore File Location: Path to the client truststore file used in HTTPS communication. The
contents of this truststore can be modified without restarting the Cloudera Management Service roles. By
default, changes to its contents are picked up within ten seconds.
SSL Client Truststore File Password: Password for the client truststore file.
If the Cloudera Manager Server does not start, see Troubleshooting Installation and Upgrade Problems on
page 612.
2. Restart all services:
a.
From the Home page click next to the cluster name and select Restart.
b. In the confirmation dialog that displays, click Restart.
Required Role:
This process applies to upgrading all versions of Cloudera Manager 4 to Cloudera Manager 5.
In most cases, you can upgrade without shutting down most CDH services, although you may need to stop some
dependent services. CDH daemons can run unaffected while Cloudera Manager is upgraded, and the upgrade
process does not affect your CDH installation. However, to use Cloudera Manager 5 features, all services must
be restarted after the upgrade. After upgrading Cloudera Manager you may also want to upgrade CDH 4 clusters
to CDH 5.
Follow these steps to upgrade Cloudera Manager 4 to the latest version of Cloudera Manager.
Warning:
• Cloudera Management Service databases
Cloudera Manager 5 stores Host and Service Monitor data in a local datastore. The Cloudera
Manager 4 to Cloudera Manager 5 upgrade wizard automatically migrates data from existing
embedded PostgreSQL or external databases to the local datastore. For more information, see
Data Storage for Monitoring Data on page 67.
The Host Monitor and Service Monitor databases are stored on the partition hosting /var. Ensure
that you have at least 20 GB available on this partition.
If you have been storing the data in an external database, you can drop those databases after
upgrade completes.
• Cloudera Management Service SSL configuration
If you have enabled TLS security for the Cloudera Manager Admin Console, as of Cloudera Manager
5.1, Cloudera Management Service roles try to communicate with Cloudera Manager using TLS,
and fail to start until SSL properties have been configured.
• Impala
Cloudera Manager 5 supports Impala 1.2.1 or later. If the version of your Impala service is 1.1 or
earlier, the following upgrade instructions will work, but once the upgrade has completed, you will
see a validation warning for your Impala service, and you will not be able to restart your Impala
(or Hue) services until you upgrade your Impala service to 1.2.1 or later. If you want to continue
to use Impala 1.1 or earlier, do not upgrade to Cloudera Manager 5.
• Navigator
If you have enabled auditing with Cloudera Navigator, during the upgrade to Cloudera Manager 5,
auditing is suspended and is only restarted when you restart the roles of audited services.
• JDK upgrade
If you upgrade the JDK during the installation of the Cloudera Manager Agent, you must restart
all services. Additionally, if you have enabled SSL, you must reinstall CA certificates to your
truststores. See Creating Truststores.
• Hard Restart of Cloudera Manager Agents
Certain circumstances require that you hard restart the Cloudera Manager Agent on each host:
• Deploying a fix to an issue where Cloudera Manager did not always correctly restart services
• Using the maximum file descriptor feature
• Enabling HDFS DataNodes to start if you perform the step (Optional) Upgrade CDH on page 475
after upgrading Cloudera Manager
Important:
• Hive
Cloudera Manager 4.5 added support for Hive, which includes the Hive Metastore Server role type.
This role manages the metastore process when Hive is configured with a remote metastore.
When upgrading from Cloudera Manager versions before 4.5, Cloudera Manager automatically
creates new Hive services to capture the previous implicit Hive dependency from Hue and Impala.
Your previous services continue to function without impact. If Hue was using a Hive metastore
backed by a Derby database, the newly created Hive Metastore Server also uses Derby. Because
Derby does not allow concurrent connections, Hue continues to work, but the new Hive Metastore
Server does not run. The failure is harmless (because nothing uses this new Hive Metastore Server
at this point) and intentional, to preserve the set of cluster functionality as it was before upgrade.
Cloudera discourages the use of a Derby-backed Hive metastore due to its limitations and
recommends switching to a different supported database.
Cloudera Manager provides a Hive configuration option to bypass the Hive Metastore Server. When
this configuration is enabled, Hive clients, Hue, and Impala connect directly to the Hive metastore
database. Prior to Cloudera Manager 4.5, Hue and Impala connected directly to the Hive metastore
database, so the bypass mode is enabled by default when upgrading to Cloudera Manager 4.5 or
later. This ensures that the upgrade does not disrupt your existing setup. You should plan to
disable the bypass mode, especially when using CDH 4.2 or later. Using the Hive Metastore Server
is the recommended configuration, and the WebHCat Server role requires the Hive Metastore
Server to not be bypassed. To disable bypass mode, see Disabling Bypass Mode.
Cloudera Manager 4.5 or later also supports HiveServer2 with CDH 4.2. In CDH 4, HiveServer2 is
not added by default, but can be added as a new role under the Hive service (see Role Instances).
In CDH 5, HiveServer2 is a mandatory role.
Note: If you are upgrading from Cloudera Manager Free Edition 4.5 or earlier, you are upgraded to
Cloudera Express, which includes a number of features that were previously available only with
Cloudera Enterprise. Of those features, activity monitoring requires a database. Thus, upon upgrading
to Cloudera Manager 5, you must specify Activity Monitor database information. You have the option
to use the embedded PostgreSQL database, which Cloudera Manager can set up automatically.
Warning: Cloudera Manager 5 does not support CDH 3 and you cannot upgrade Cloudera Manager 4
to Cloudera Manager 5 if you have a cluster running CDH 3. Therefore, to upgrade CDH 3 clusters to
CDH 4 using Cloudera Manager, you must use Cloudera Manager 4.
• Cloudera Manager 5 supports HDFS high availability only with automatic failover. If your cluster has enabled
high availability without automatic failover, you must enable automatic failover before upgrading to Cloudera
Manager 5. See Configuring HDFS High Availability.
Condition: Running a version of Cloudera Manager that has the Cloudera Management Service.
Procedure: Stop the Cloudera Management Service.

Condition: Upgrading from Cloudera Manager 4.5 or later, and using the embedded PostgreSQL database for the
Hive metastore.
Procedure: Stop the services that have a dependency on the Hive metastore (Hue, Impala, and Hive). You cannot
stop the Cloudera Manager Server database while these services are running. If you attempt to upgrade while
the embedded database is running, the upgrade fails. Stop services that depend on the Hive metastore in the
following order:
1. Stop the Hue and Impala services.
2. Stop the Hive service.

Condition: Running Cloudera Navigator.
Procedure: Stop any of the following roles whose service's Queue Policy configuration
(navigator.batch.queue_policy) is set to SHUTDOWN:
• HDFS - NameNode
• HBase - Master and RegionServers
• Hive - HiveServer2
• Hue - Beeswax Server
Stopping these roles renders any service depending on these roles unavailable. For HDFS - NameNode, this
implies most of the services in the cluster will be unavailable until the upgrade is finished.
3. If you are using the embedded PostgreSQL database for Cloudera Manager, stop the database:
Important: If you are not running the embedded database service and you attempt to stop it, you
receive a message indicating that the service cannot be found. If instead you get a message that
the shutdown failed, the embedded database is still running, probably because services are
connected to the Hive metastore. If the database shutdown fails due to connected services, issue
the following command:
4. If the Cloudera Manager host is also running the Cloudera Manager Agent, stop the Cloudera Manager Agent:
(Optional) Upgrade the JDK on Cloudera Manager Server and Agent Hosts
If you are manually upgrading the Cloudera Manager Agent software in Upgrade and Start Cloudera Manager
Agents (Packages) on page 467 or Install Cloudera Manager Server and Agent Software (Tarballs) on page 463,
and you are upgrading to CDH 5, install the Oracle JDK on the Agent hosts as described in Java Development Kit
Installation on page 41.
If you are not running Cloudera Manager Server on the same host as a Cloudera Manager Agent, and you want
all hosts to run the same JDK version, optionally install the Oracle JDK on that host.
[cloudera-manager]
# Packages for Cloudera Manager, Version 5, on RedHat or CentOS 6 x86_64
name=Cloudera Manager
baseurl=http://archive.cloudera.com/cm5/redhat/6/x86_64/cm/5/
gpgkey = http://archive.cloudera.com/cm5/redhat/6/x86_64/cm/RPM-GPG-KEY-cloudera
gpgcheck = 1
For Ubuntu or Debian systems, navigate to the appropriate release directory, for example,
http://archive.cloudera.com/cm5/debian/wheezy/amd64/cm. The repo file, in this case cloudera.list, is
similar to the following:
b. Replace the repo file in the configuration location for the package management software for your system.
Note:
• yum clean all cleans yum cache directories, ensuring that you
download and install the latest versions of the packages.
• If your system is not up to date, any underlying system components
must be upgraded before yum update can succeed. yum indicates
which components must be upgraded.
Ubuntu or Debian: The following commands clean cached repository information and update Cloudera Manager
components:
$ sudo apt-get clean
$ sudo apt-get update
$ sudo apt-get dist-upgrade
$ sudo apt-get install cloudera-manager-server
cloudera-manager-agent cloudera-manager-daemons
During this process, you may be prompted about your configuration file version:
Configuration file `/etc/cloudera-scm-agent/config.ini'
==> Modified (by you or by a script) since installation.
==> Package distributor has shipped an updated version.
What would you like to do about it ? Your options are:
Y or I : install the package maintainer's version
N or O : keep your currently-installed version
D : show the differences between the versions
Z : start a shell to examine the situation
The default action is to keep your current version.
You will receive a similar prompt for /etc/cloudera-scm-server/db.properties.
Answer N to both prompts.
You should now have the following packages, corresponding to the version of Cloudera Manager you installed,
on the host that will be the Cloudera Manager Server host.
OS Packages
RPM-based distributions $ rpm -qa 'cloudera-manager-*'
cloudera-manager-agent-5.4.2-0.cm5.p0.932.el6.x86_64
cloudera-manager-server-5.4.2-0.cm5.p0.932.el6.x86_64
cloudera-manager-daemons-5.4.2-0.cm5.p0.932.el6.x86_64
You may also see an entry for the cloudera-manager-server-db-2 if you are using the embedded database,
and additional packages for plug-ins, depending on what was previously installed on the server host. If the
cloudera-manager-server-db-2 package is installed, and you do not plan to use the embedded database,
you can remove this package.
Install Cloudera Manager Server and Agent Software (Tarballs)
Tarballs contain both the Cloudera Manager Server and Cloudera Manager Agent in a single file. Download
tarballs from the locations listed in Cloudera Manager Version and Download Information. Copy the tarballs and
unpack them on all hosts on which you intend to install Cloudera Manager Server and Cloudera Manager Agents,
in a directory of your choosing. If necessary, create a new directory to accommodate the files you extract from
the tarball. For instance, if /opt/cloudera-manager does not exist, create it using a command similar to:
When you have a directory to which to extract the contents of the tarball, extract the contents. For example, to
copy a tar file to your home directory and extract the contents of all tar files to the /opt/ directory, use a
command similar to the following:
The files are extracted to a subdirectory named according to the Cloudera Manager version being extracted. For
example, files could extract to /opt/cloudera-manager/cm-5.0/. This full path is needed later and is referred
to as tarball_root directory.
Create Users
The Cloudera Manager Server and managed services require a user account to complete tasks. When installing
Cloudera Manager from tarballs, you must create this user account on all hosts manually. Because Cloudera
Manager Server and managed services are configured to use the user account cloudera-scm by default, creating
a user with this name is the simplest approach. The created user is used automatically after installation is
complete.
To create user cloudera-scm, use a command such as the following:
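A sketch of the command; adjust the --home path to match where you extracted the tarball:
$ sudo useradd --system --home=/opt/cm-5.0/run/cloudera-scm-server \
  --no-create-home --shell=/bin/false --comment "Cloudera SCM User" cloudera-scm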
Ensure the --home argument path matches your environment. This argument varies according to where you
place the tarball, and the version number varies among releases. For example, the --home location could be
/opt/cm-5.0/run/cloudera-scm-server.
Property Description
server_host Name of the host where Cloudera Manager Server is running.
server_port Port on the host where Cloudera Manager Server is running.
• By default, a tarball installation has a var subdirectory where state is stored. In a non-tarball installation,
state is stored in /var. Cloudera recommends that you reconfigure the tarball installation to use an external
directory as the /var equivalent (/var or any other directory outside the tarball) so that when you upgrade
Cloudera Manager, the new tarball installation can access this state. Configure the installation to use an
external directory for storing state by editing tarball_root/etc/default/cloudera-scm-agent and setting
the CMF_VAR variable to the location of the /var equivalent. If you do not reuse the state directory between
different tarball installations, duplicate Cloudera Manager Agent entries can occur in the Cloudera Manager
database.
If you are using a custom username and custom directories for Cloudera Manager, you must create these
directories on the Cloudera Manager Server host and assign ownership of these directories to the custom
username. Cloudera Manager installer makes no changes to any directories that already exist. Cloudera Manager
cannot write to any existing directories for which it does not have proper permissions, and if you do not change
ownership, Cloudera Management Service roles may not perform as expected. To resolve these issues, do one
of the following:
mkdir /var/cm_logs/cloudera-scm-headlamp
chown cloudera-scm /var/cm_logs/cloudera-scm-headlamp
Note: The configuration property for the Cloudera Manager Server Local Data Storage Directory
(default value is: /var/lib/cloudera-scm-server) is located on a different page:
1. Select Administration > Settings.
2. Type directory in the Search box.
3. Enter the directory path in the Cloudera Manager Server Local Data Storage Directory
property.
If the Cloudera Manager Server starts successfully, output similar to the following displays:
Starting cloudera-scm-server: [ OK ]
If the Cloudera Manager Server does not start, see Troubleshooting Installation and Upgrade Problems on page
612.
Start the Cloudera Manager Server (Tarballs)
The way in which you start the Cloudera Manager Server varies according to what account you want the Server
to run under:
• As root:
• As another user. If you run as another user, ensure the user you created for Cloudera Manager owns the
location to which you extracted the tarball including the newly created database files. If you followed the
earlier examples and created the directory /opt/cloudera-manager and the user cloudera-scm, you could
use the following command to change ownership of the directory:
Once you have established ownership of directory locations, you can start Cloudera Manager Server using
the user account you chose. For example, you might run the Cloudera Manager Server as cloudera-service.
In this case, you have the following options:
– Run the following command:
– Edit the configuration files so the script internally changes the user. Then run the script as root:
1. Remove the following line from tarball_root/etc/default/cloudera-scm-server:
USER=cloudera-service
GROUP=cloudera-service
$ cp tarball_root/etc/init.d/cloudera-scm-server
/etc/init.d/cloudera-scm-server
$ chkconfig cloudera-scm-server on
• Debian/Ubuntu
$ cp tarball_root/etc/init.d/cloudera-scm-server
/etc/init.d/cloudera-scm-server
$ update-rc.d cloudera-scm-server defaults
2. On the Cloudera Manager Server host, open the /etc/init.d/cloudera-scm-server file and change
the value of CMF_DEFAULTS from ${CMF_DEFAULTS:-/etc/default} to tarball_root/etc/default.
If the Cloudera Manager Server does not start, see Troubleshooting Installation and Upgrade Problems on page
612.
Important: All hosts in the cluster must have access to the Internet if you use archive.cloudera.com
as the source for installation files. If you do not have Internet access, create a custom repository.
2. In the Cloudera Admin Console, select No, I would like to skip the agent upgrade now and click Continue.
3. Copy the appropriate repo file as described in Upgrade Cloudera Manager Server (Packages) on page
461.
Note:
• yum clean all cleans yum cache directories, ensuring that you
download and install the latest versions of the packages.
• If your system is not up to date, any underlying system components
must be upgraded before yum update can succeed. yum indicates
which components must be upgraded.
Ubuntu or Debian: Use the following commands to clean cached repository information and update Cloudera
Manager components:
$ sudo apt-get clean
$ sudo apt-get update
$ sudo apt-get dist-upgrade
$ sudo apt-get install cloudera-manager-agent
cloudera-manager-daemons
During this process, you may be prompted about your configuration file version:
Configuration file `/etc/cloudera-scm-agent/config.ini'
==> Modified (by you or by a script) since installation.
==> Package distributor has shipped an updated version.
What would you like to do about it ? Your options are:
Y or I : install the package maintainer's version
N or O : keep your currently-installed version
D : show the differences between the versions
Z : start a shell to examine the situation
The default action is to keep your current version.
You will receive a similar prompt for
/etc/cloudera-scm-server/db.properties. Answer N to both prompts.
3. If you are upgrading from a free version of Cloudera Manager prior to 4.6:
a. Click Continue to assign the Cloudera Management Services roles to hosts.
b. If you are upgrading to Cloudera Enterprise, specify required databases:
a. Configure database settings:
a. Choose the database type:
• Keep the default setting of Use Embedded Database to have Cloudera Manager create and
configure required databases. Record the auto-generated passwords.
4. Click Finish.
5. If you are upgrading from Cloudera Manager prior to 4.5:
a. Select the host for the Hive Metastore Server role.
b. Review the configuration values and click Accept to continue.
Note:
• If Hue was using a Hive metastore backed by a Derby database, the newly created Hive
Metastore Server also uses Derby. Because Derby does not allow concurrent connections,
Hue continues to work, but the new Hive Metastore Server does not run. The failure is
harmless (because nothing uses this new Hive Metastore Server at this point) and intentional,
to preserve the set of cluster functionality as it was before upgrade. Cloudera discourages
the use of a Derby-backed Hive metastore due to its limitations and recommends switching
to a different supported database.
• Prior to Cloudera Manager 4.5, Hue and Impala connected directly to the Hive metastore
database, so the bypass mode is enabled by default when upgrading to Cloudera Manager
4.5 or later. This ensures that the upgrade does not disrupt your existing setup. You should
plan to disable the bypass mode, especially when using CDH 4.2 or later. Using the Hive
Metastore Server is the recommended configuration, and the WebHCat Server role requires
the Hive Metastore Server to not be bypassed. To disable bypass mode, see Disabling Bypass
Mode. After changing this configuration, you must redeploy your client configurations,
restart Hive, and restart any Hue or Impala services configured to use that Hive.
• If you are using CDH 4.0 or CDH 4.1, see known issues related to Hive in Known Issues and
Workarounds in Cloudera Manager 5.
6. If you are upgrading from Cloudera Manager 4.5 or earlier, correct the Hive home directory permissions
(/user/hive) as follows:
7. If you are upgrading from Cloudera Manager prior to 4.8 and have an Impala service, assign the Impala Catalog
Server role to a host.
8. Review the configuration changes to be applied.
9. Click Finish.
All services (except for the services you stopped in Stop Selected Services and Roles on page 460) should be
running.
Restart Cloudera Manager Agents (Tarballs)
Stop Cloudera Manager Agents (Tarballs)
• To stop the Cloudera Manager Agent, run this command on each Agent host:
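The command itself is not shown above. For a tarball installation, a minimal sketch using the init script shipped in the tarball (tarball_root is a placeholder for your installation root):
# Stop the Agent using the tarball's init script; run on each Agent host
$ sudo tarball_root/etc/init.d/cloudera-scm-agent stop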
• If you are running single user mode, stop Cloudera Manager Agent using the user account you chose. For
example, if you are running the Cloudera Manager Agent as cloudera-scm, you have the following options:
– Run the following command:
– Edit the configuration files so the script internally changes the user, and then run the script as root:
1. Remove the following line from tarball_root/etc/default/cloudera-scm-agent:
USER=cloudera-scm
GROUP=cloudera-scm
$ cp tarball_root/etc/init.d/cloudera-scm-agent /etc/init.d/cloudera-scm-agent
$ chkconfig cloudera-scm-agent on
• Debian/Ubuntu
$ cp tarball_root/etc/init.d/cloudera-scm-agent /etc/init.d/cloudera-scm-agent
$ update-rc.d cloudera-scm-agent defaults
2. On each Agent, open the tarball_root/etc/init.d/cloudera-scm-agent file and change the value
of CMF_DEFAULTS from ${CMF_DEFAULTS:-/etc/default} to tarball_root/etc/default.
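As a sketch, the edit described in step 2 can be scripted with sed; tarball_root is again a placeholder for your actual installation root:
# Replace the CMF_DEFAULTS default value in the init script on each Agent host
$ sudo sed -i 's|${CMF_DEFAULTS:-/etc/default}|tarball_root/etc/default|' tarball_root/etc/init.d/cloudera-scm-agent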
Upgrade Impala
If your version of Impala is 1.1 or earlier, upgrade to Impala 1.2.1 or later.
Property Description
SSL Client Truststore File Location: Path to the client truststore file used in HTTPS communication. The contents
of this truststore can be modified without restarting the Cloudera Management Service roles. By default, changes
to its contents are picked up within ten seconds.
SSL Client Truststore File Password: Password for the client truststore file.
• Tarballs
– To start the Cloudera Manager Agent, run this command on each Agent host:
– If you are running single user mode, start the Cloudera Manager Agent using the user account you chose.
For example, to run the Cloudera Manager Agent as cloudera-scm, you have the following options:
– Run the following command:
– Edit the configuration files so the script internally changes the user, and then run the script as
root:
1. Remove the following line from tarball_root/etc/default/cloudera-scm-agent:
USER=cloudera-scm
GROUP=cloudera-scm
• Tarballs
If the Cloudera Manager Server does not start, see Troubleshooting Installation and Upgrade Problems on
page 612.
2. If you have not restarted services in previous steps, restart all services:
a. On the Home page, click the drop-down menu next to the cluster name and select Restart.
b. In the confirmation dialog, click Restart.
Warning: Cloudera Manager 3 and CDH 3 have reached End of Maintenance (EOM) as of June 20,
2013. Cloudera does not support or provide patches for Cloudera Manager 3 and CDH 3 releases.
You cannot upgrade directly from Cloudera Manager 3.7.x to Cloudera Manager 5; you must upgrade to Cloudera
Manager 4 first before upgrading to Cloudera Manager 5. Follow the instructions for upgrading Cloudera Manager
3.7.x to Cloudera Manager 4 in Upgrade Cloudera Manager 3.7.x to the Latest Cloudera Manager.
The last step in the Cloudera Manager upgrade process is an optional step to upgrade CDH. If you are running
CDH 3, this step is not optional. Cloudera Manager 5 does not support CDH 3 and will not allow you to complete
the upgrade if it detects a CDH 3 cluster. You must upgrade to CDH 4 before you can upgrade to Cloudera Manager
5. Follow the steps in Upgrading CDH 3 to CDH 4 in a Cloudera Manager Deployment before you attempt to
upgrade to Cloudera Manager 5.
Required Role:
The first time you log in to the Cloudera Manager server after upgrading your Cloudera Manager software, the
upgrade wizard runs. If you did not complete the wizard at that time, or if you had hosts that were unavailable
at that time and still need to be upgraded, you can re-run the upgrade wizard:
1. Click the Hosts tab.
2. Click Re-run Upgrade Wizard. This takes you back through the installation wizard to upgrade Cloudera
Manager Agents on your hosts as necessary.
3. Select the release of the Cloudera Manager Agent to install. Normally, this is the Matched Release for this
Cloudera Manager Server. However, if you used a custom repository (instead of archive.cloudera.com) for
the Cloudera Manager server, select Custom Repository and provide the required information. The custom
repository allows you to use an alternative location, but that location must contain the matched Agent
version.
4. Specify credentials and initiate Agent installation:
• Select root or enter the user name for an account that has password-less sudo permission.
• Select an authentication method:
– If you choose password authentication, enter and confirm the password.
– If you choose public-key authentication, provide a passphrase and path to the required key files.
• You can specify an alternate SSH port. The default value is 22.
• You can specify the maximum number of host installations to run at once. The default value is 10.
When you click Continue, the Cloudera Manager Agent is upgraded on all currently managed hosts. You
cannot search for new hosts through this process. To add hosts to your cluster, click the Add New Hosts to
Cluster button.
Important: The following instructions assume that a Cloudera Manager upgrade failed, and that the
upgraded server never started, so that the remaining steps of the upgrade process were not performed.
The steps below are not sufficient to revert from a running Cloudera Manager 5 deployment.
2. Reinstall the same Cloudera Manager Server version that you were previously running. You can reinstall
from the Cloudera repository at http://archive.cloudera.com/cm4/ or
http://archive.cloudera.com/cm5/ or alternately, you can create your own repository, as described in
Understanding Custom Installation Solutions on page 135.
a. Find the Cloudera repo file for your distribution by starting at http://archive.cloudera.com/cm4/ or
http://archive.cloudera.com/cm5/ and navigating to the directory that matches your operating
system.
For example, for Red Hat or CentOS 6, you would navigate to
http://archive.cloudera.com/cm5/redhat/6/x86_64/cm/. Within that directory, find the repo file that
contains information including the repository's base URL and GPG key. On CentOS 6, the contents of the
cloudera-manager.repo file might appear as follows:
[cloudera-manager]
# Packages for Cloudera Manager, Version 5, on RedHat or CentOS 6 x86_64
name=Cloudera Manager
baseurl=http://archive.cloudera.com/cm5/redhat/6/x86_64/cm/5/
gpgkey = http://archive.cloudera.com/cm5/redhat/6/x86_64/cm/RPM-GPG-KEY-cloudera
gpgcheck = 1
For Ubuntu or Debian systems, the repo file can be found by navigating to the appropriate directory, for
example, http://archive.cloudera.com/cm5/debian/wheezy/amd64/cm or
http://archive.cloudera.com/cm4/debian/squeeze/amd64/cm.
The repo file, in this case cloudera.list, may appear as follows:
You must edit the file, if it exists, and modify the URL to reflect the exact version of Cloudera Manager you
are using (unless you want the downgrade to also upgrade to the latest version of Cloudera Manager 4).
The possible versions are shown in the directory listing on the archive site. For example, to set the URL:
RHEL: Replace
baseurl=http://archive.cloudera.com/cm5/redhat/5/x86_64/cm/5/
with
baseurl=http://archive.cloudera.com/cm5/redhat/5/x86_64/cm/5.0.5/
b. Copy the repo file to the configuration location for the package management software for your system:
Ubuntu or Debian: There is no action that will downgrade to the version currently in the repository.
Read DowngradeHowto, download the script described therein, run it, and then run
apt-get install for the name=version pairs that it provides for Cloudera Manager.
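For RHEL-compatible systems, the configuration location is the yum repository directory. A minimal sketch, assuming the file is named cloudera-manager.repo as in the example above (SLES uses /etc/zypp/repos.d/ instead):
$ sudo cp cloudera-manager.repo /etc/yum.repos.d/
$ sudo yum clean all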
At the end of this process you should have the following packages, corresponding to the version of Cloudera
Manager you installed, on the Cloudera Manager Server host. For example, for CentOS,
For Ubuntu or Debian, you should have packages similar to those shown below.
~# dpkg-query -l 'cloudera-manager-*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Description
+++-======================-======================-============================================================
ii cloudera-manager-agent 5.0.5-1.cm505.p0.163~sq The Cloudera Manager Agent
ii cloudera-manager-daemo 5.0.5-1.cm505.p0.163~sq Provides daemons for monitoring
Hadoop and related tools.
ii cloudera-manager-serve 5.0.5-1.cm505.p0.163~sq The Cloudera Manager Server
You may also see an entry for the cloudera-manager-server-db if you are using the embedded database,
and additional packages for plug-ins, depending on what was previously installed on the server host. If the
commands to update the server complete without errors, you can assume the upgrade has completed as desired.
For additional assurance, you will have the option to check that the server versions have been updated after
you start the server.
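On package-based installations the server is managed as a system service, so starting it is typically:
$ sudo service cloudera-scm-server start
which should produce output similar to the following: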
Starting cloudera-scm-server: [ OK ]
Note: If you have problems starting the server, such as database permissions problems, you can
use the server's log /var/log/cloudera-scm-server/cloudera-scm-server.log to
troubleshoot the problem.
Important:
Cloudera does not provide an upgrade path from the Navigator Metadata component which was a
beta release in Cloudera Navigator 1.2 to the Cloudera Navigator 2 release. If you are upgrading from
Cloudera Navigator 1.2 (included with Cloudera Manager 5.0), you must perform a clean install of
Cloudera Navigator 2. Therefore, if you have Cloudera Navigator roles from a 1.2 release:
1. Delete the Navigator Metadata Server role.
2. Remove the contents of the Navigator Metadata Server storage directory.
3. Add the Navigator Metadata Server role according to the process described in Adding the Navigator
Metadata Server Role.
4. Clear the cache of any browser that had used the 1.2 release of the Navigator Metadata component.
Otherwise, you may observe errors in the Navigator Metadata UI.
Related Information
• Cloudera Navigator 2 Overview
• Installing Cloudera Navigator on page 163
• Cloudera Navigator Administration
• Cloudera Data Management
• Configuring Authentication in Cloudera Navigator
• Configuring SSL for Cloudera Navigator
• Cloudera Navigator User Roles
Cloudera Manager 5 supports clusters running CDH 4 and CDH 5. To ensure the highest level of functionality
and stability, consider upgrading to the most recent version of CDH.
The Cloudera Manager minor version must always be equal to or greater than the CDH minor version because
older versions of Cloudera Manager may not support features in newer versions of CDH. For example, if you
want to upgrade to CDH 5.1.2 you must first upgrade to Cloudera Manager 5.1 or higher.
Cloudera Manager 5.3 introduces an enhanced CDH upgrade wizard that supports major (CDH 4 to CDH 5), minor
(CDH 5.x to 5.y), and maintenance upgrades (CDH a.b.x to CDH a.b.y). Both parcels and package installations are
supported, but packages must be manually installed, whereas parcels are installed by Cloudera Manager. For
an easier upgrade experience, consider switching from packages to parcels so Cloudera Manager can automate
more of the process.
Depending on the nature of the changes in CDH between the old and new versions, the enhanced upgrade wizard
performs service-specific upgrades that in the past you would have had to perform manually. When you start
the wizard, it displays notices describing steps you must perform before upgrading to safeguard existing data
that will be upgraded by the wizard.
If you use parcels, have a Cloudera Enterprise license, and have enabled HDFS high availability, you can perform
a rolling upgrade that lets you avoid cluster downtime.
See the following topics for details on upgrading to the specific versions of CDH 4 or CDH 5.
Required Role:
Cloudera Manager has version-specific features based on the minor and patch versions. For example, Sqoop 2
and the Hive HiveServer2 and WebHCat roles are only available for CDH 4.2.0 or later. These behaviors are
controlled by what is configured in Cloudera Manager, not by what is actually installed on the hosts. The versions
should be in sync, which is the case when parcels are used.
In package-based clusters, you can manually upgrade CDH packages. For example, you can upgrade the packages
from CDH 4.1.0 to CDH 4.2.1. However, in previous releases Cloudera Manager did not detect this change and
behaved as if the cluster were still 4.1.0. In such cases, it would not display Sqoop 2 as a service. You would
have to set the CDH version manually using the cluster Configure CDH Version action.
Cloudera Manager now sets the CDH version correctly. However, if you had an older Cloudera Manager, forgot
to set the version, and then upgraded to the latest Cloudera Manager, you would need to set the version manually.
To inform Cloudera Manager of the CDH version, select ClusterName > Configure CDH Version. In the dialog,
Cloudera Manager displays the installed CDH version, and asks for confirmation to configure itself with the new
version. The dialog will also detect if a major upgrade was done, and direct you to use the major upgrade flow
documented in Upgrading from CDH 4 Packages to CDH 5 Packages on page 555.
Required Role:
Important: This feature is available only with a Cloudera Enterprise license; it is not available in
Cloudera Express. For information on Cloudera Enterprise licenses, see Managing Licenses.
The rolling upgrade feature takes advantage of parcels and HDFS high availability to enable you to upgrade
your cluster software and restart the upgraded services without taking the entire cluster down. You must have
HDFS high availability enabled to perform a rolling upgrade.
This page describes how to perform a rolling upgrade between maintenance and minor versions of CDH 5
(excluding Beta versions). For a rolling upgrade between CDH 4 versions, see Performing a Rolling Upgrade on
a CDH 4 Cluster on page 483.
It is not possible to perform a rolling upgrade from CDH 4 to CDH 5 because of incompatibilities between the
two major versions. Instead, follow the instructions for a full upgrade at Upgrading from CDH 4 to CDH 5 Parcels
on page 548.
The steps to perform a rolling upgrade of a cluster are as follows:
rolling restart operation. If you have JobTracker high availability configured, Cloudera Manager will fail over the
JobTracker during the rolling restart, but this is not a requirement for performing a rolling upgrade.
3. Restart all the Cloudera Manager Agents to force an update of the symlinks to point to the newly installed
components on each host:
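The restart command is not reproduced above. For a package-based Cloudera Manager installation, a minimal sketch (run on every Agent host; adjust for tarball installations) is:
# Restarting the agent refreshes the symlinks to the newly installed components
$ sudo service cloudera-scm-agent restart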
4. If your Hue service uses the embedded SQLite DB, restore the DB you backed up:
a. Stop the Hue service.
b. Copy the backup from the temporary location to the newly created Hue database directory
/opt/cloudera/parcels/CDH/share/hue/desktop (a sketch follows these steps).
c. Start the Hue service.
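A hypothetical sketch of the restore in step 4, assuming the backup was copied to /tmp/hue_backup (a placeholder path) and that the embedded SQLite database file is named desktop.db:
# Copy the backed-up SQLite file into the new Hue database directory and fix ownership
$ sudo cp /tmp/hue_backup/desktop.db /opt/cloudera/parcels/CDH/share/hue/desktop/
$ sudo chown hue:hue /opt/cloudera/parcels/CDH/share/hue/desktop/desktop.db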
Required Role:
Important: This feature is available only with a Cloudera Enterprise license; it is not available in
Cloudera Express. For information on Cloudera Enterprise licenses, see Managing Licenses.
The rolling upgrade feature takes advantage of parcels and HDFS high availability to enable you to upgrade
your cluster software and restart the upgraded services without taking the entire cluster down. You must have
HDFS high availability enabled to perform a rolling upgrade.
This page describes how to perform a rolling upgrade between minor versions of CDH 4. For a rolling upgrade
between CDH 5 versions, see Performing a Rolling Upgrade on a CDH 5 Cluster on page 480.
It is not possible to perform a rolling upgrade from CDH 4 to CDH 5 because of incompatibilities between the
two major versions. Instead, follow the instructions for a full upgrade at Upgrading from CDH 4 to CDH 5 Parcels
on page 548.
A rolling upgrade involves two steps:
1. Download, distribute, and activate the parcel for the new software you want to install.
2. Perform a rolling restart to restart the services in your cluster. You can do a rolling restart of individual
services, or if you have high availability enabled, you can perform a restart of the entire cluster. Cloudera
Manager will manually fail over your NameNode at the appropriate point in the process so that your cluster
will not be without a functional NameNode.
To avoid lots of alerts during the upgrade process, you can enable maintenance mode on your cluster before
you start the upgrade. This will stop email alerts and SNMP traps from being sent, but will not stop checks and
configuration validations from being made. Be sure to exit maintenance mode when you have finished the
upgrade in order to re-enable Cloudera Manager alerts.
The steps to perform a rolling upgrade of a cluster are as follows:
Upgrade Sqoop
1. Go to the Sqoop service.
2. Select Actions > Stop and click Stop to confirm.
3. Select Actions > Upgrade Sqoop and click Upgrade Sqoop to confirm.
Upgrade Spark
1. Go to the Spark service.
2. Select Actions > Stop and click Stop to confirm.
3. Select Actions > Upload Spark Jar and click Upload Spark Jar to confirm.
4. Select Actions > Create Spark History Log Dir and click Create Spark History Log Dir to confirm.
Note: If you have just upgraded your Cloudera Manager deployment to 4.6, and are now doing a
rolling upgrade of your cluster, you must ensure that MapReduce is restarted before the rest of
your services, or the restart may fail. This is necessary to ensure the MapReduce configuration
changes are propagated.
Further, if you are upgrading from CDH 4.1 with Impala to CDH 4.2 or 4.3, you must restart
MapReduce before Impala restarts (by default Impala is restarted before MapReduce).
The workaround is to perform a restart of MapReduce alone as the first step, then perform a
cluster restart of the remaining services.
Important: Removing the Hue Common package will remove your Hue database; if you do not
back it up you may lose all your Hue user account information.
3. Restart all the Cloudera Manager Agents to force an update of the symlinks to point to the newly installed
components on each host:
If you use parcels, have a Cloudera Enterprise license, and have enabled HDFS high availability, you can perform
a rolling upgrade that lets you avoid cluster downtime.
Required Role:
Use the instructions in this section to upgrade to a CDH maintenance release, that is, from CDH a.b.x to CDH
a.b.y; for example, from CDH 4.7.0 to CDH 4.7.1 or from CDH 5.1.0 to CDH 5.1.4.
You can upgrade your cluster to another maintenance version using parcels from within the Cloudera Manager
Admin Console. Your current CDH cluster can have been installed with either parcels or packages. The new
version will use parcels.
The following procedure requires cluster downtime. If you use parcels, have a Cloudera Enterprise license, and
have enabled HDFS high availability, you can perform a rolling upgrade that lets you avoid cluster downtime.
To upgrade CDH using parcels, the steps are as follows.
1. Click Continue. Cloudera Manager displays links to documentation describing the required upgrade
steps.
Upgrade Sqoop
1. Go to the Sqoop service.
2. Select Actions > Stop and click Stop to confirm.
3. Select Actions > Upgrade Sqoop and click Upgrade Sqoop to confirm.
Upgrade Spark
1. Go to the Spark service.
2. Select Actions > Stop and click Stop to confirm.
3. Select Actions > Upload Spark Jar and click Upload Spark Jar to confirm.
4. Select Actions > Create Spark History Log Dir and click Create Spark History Log Dir to confirm.
Required Role:
Use the instructions in this section to upgrade to a CDH maintenance release, that is, from CDH a.b.x to CDH
a.b.y; for example, from CDH 4.7.0 to CDH 4.7.1 or from CDH 5.1.0 to CDH 5.1.4.
The following procedure requires cluster downtime. If you use parcels, have a Cloudera Enterprise license, and
have enabled HDFS high availability, you can perform a rolling upgrade that lets you avoid cluster downtime.
To upgrade CDH using packages, the steps are as follows.
• On SLES systems:
1. Run the following command:
2. Edit the repo file to point to the release you want to install or upgrade to.
• On Red Hat-compatible systems:
Open the repo file you have just saved and change the 5 at the end of the line that begins baseurl= to
the version number you want.
For example, if you have saved the file for Red Hat 6, it will look like this when you open it for editing:
[cloudera-cdh5]
name=Cloudera's Distribution for Hadoop, Version 5
baseurl=http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/5/
gpgkey = http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck = 1
For example, to upgrade to CDH 5.1.0, change the baseurl line to:
baseurl=http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/5.1.0/
so that the file looks like this:
[cloudera-cdh5]
name=Cloudera's Distribution for Hadoop, Version 5
baseurl=http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/5.1.0/
gpgkey = http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck = 1
• On SLES systems:
Open the repo file that you have just added to your system and change the 5 at the end of the line that
begins baseurl= to the version number you want.
The file should look like this when you open it for editing:
[cloudera-cdh5]
name=Cloudera's Distribution for Hadoop, Version 5
baseurl=http://archive.cloudera.com/cdh5/sles/11/x86_64/cdh/5/
gpgkey = http://archive.cloudera.com/cdh5/sles/11/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck = 1
For example, to upgrade to CDH 5.1.0, change the baseurl line to:
baseurl=http://archive.cloudera.com/cdh5/sles/11/x86_64/cdh/5.1.0/
so that the file looks like this:
[cloudera-cdh5]
name=Cloudera's Distribution for Hadoop, Version 5
baseurl=http://archive.cloudera.com/cdh5/sles/11/x86_64/cdh/5.1.0/
gpgkey = http://archive.cloudera.com/cdh5/sles/11/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck = 1
– Red Hat/CentOS/Oracle 6
• SLES
• Ubuntu Precise
$ curl -s http://archive.cloudera.com/cdh5/ubuntu/precise/amd64/cdh/archive.key
| sudo apt-key add -
– Debian Wheezy
$ curl -s http://archive.cloudera.com/cdh5/debian/wheezy/amd64/cdh/archive.key
| sudo apt-key add -
• SLES
Note: Installing these packages will also install all the other CDH packages that are needed for a
full CDH 5 installation.
2. From the Home tab Status page, click the drop-down menu next to the cluster name and select Upgrade
Cluster. The Upgrade Wizard starts.
3. In the Choose Method field, click the Use Packages radio button.
4. In the Choose CDH Version (Packages) field, specify the CDH version of the packages you have installed on
your cluster. Click Continue.
5. Read the notices for steps you must complete before upgrading, click the Yes, I ... checkboxes after completing
the steps, and click Continue.
6. Cloudera Manager checks that hosts have the correct software installed. If the packages have not been
installed, a warning displays to that effect. Install the packages and click Check Again. When there are no
errors, click Continue.
7. The Host Inspector runs and displays the CDH version on the hosts. Click Continue. The Shut down and
upgrade the cluster screen displays.
8. Choose the type of upgrade and restart:
• Cloudera Manager upgrade - Cloudera Manager performs all service upgrades and restarts the cluster.
1. Click Continue. The Command Progress screen displays the result of the commands run by the wizard
as it shuts down all services, upgrades services as necessary, deploys client configuration files, and
restarts services.
2. Click Continue. The wizard reports the result of the upgrade.
• Manual upgrade - Select the Let me upgrade the cluster checkbox. Cloudera Manager configures the
cluster to the specified CDH version but performs no upgrades or service restarts. Manually doing the
upgrade is difficult and is for advanced users only.
1. Click Continue. Cloudera Manager displays links to documentation describing the required upgrade
steps.
Upgrade Sqoop
1. Go to the Sqoop service.
2. Select Actions > Stop and click Stop to confirm.
3. Select Actions > Upgrade Sqoop and click Upgrade Sqoop to confirm.
Upgrade Spark
1. Go to the Spark service.
2. Select Actions > Stop and click Stop to confirm.
3. Select Actions > Upload Spark Jar and click Upload Spark Jar to confirm.
4. Select Actions > Create Spark History Log Dir and click Create Spark History Log Dir to confirm.
• Date partition columns: as of Hive version 13, implemented in CDH 5.2, Hive validates the format of dates in
partition columns, if they are stored as dates. A partition column with a date in invalid form can neither be
used nor dropped once you upgrade to CDH 5.2 or higher. To avoid this problem, do one of the following:
– Fix any invalid dates before you upgrade. Hive expects dates in partition columns to be in the form
YYYY-MM-DD.
– Store dates in partition columns as strings or integers.
You can use a SQL query against the Hive metastore database to find any partition-column values stored as
dates (a sketch follows this list):
• Check your SQL against new Impala keywords whenever upgrading Impala, whether Impala is in CDH or a
standalone parcel or package.
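The query referenced in the first bullet is not reproduced in this section. A hypothetical sketch, assuming a MySQL-backed Hive metastore database named metastore and the standard metastore schema (DBS, TBLS, PARTITIONS, PARTITION_KEYS, PARTITION_KEY_VALS); adjust the connection details and identifier quoting for your database:
# List values of date-typed partition keys that are not in YYYY-MM-DD form
$ mysql -u hive -p metastore -e "
SELECT d.NAME AS db_name, t.TBL_NAME AS table_name, pk.PKEY_NAME AS partition_key, pkv.PART_KEY_VAL AS value
FROM PARTITION_KEY_VALS pkv
JOIN PARTITIONS p ON pkv.PART_ID = p.PART_ID
JOIN TBLS t ON p.TBL_ID = t.TBL_ID
JOIN DBS d ON t.DB_ID = d.DB_ID
JOIN PARTITION_KEYS pk ON pk.TBL_ID = t.TBL_ID AND pk.INTEGER_IDX = pkv.INTEGER_IDX
WHERE pk.PKEY_TYPE = 'date'
AND pkv.PART_KEY_VAL NOT REGEXP '^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$';"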
Required Role:
You can upgrade your CDH 5 cluster to CDH 5.4 using parcels from within the Cloudera Manager Admin Console.
Your current CDH 5 cluster can have been installed with either parcels or packages. The new version will use
parcels.
The following procedure requires cluster downtime. If you use parcels, have a Cloudera Enterprise license, and
have enabled HDFS high availability, you can perform a rolling upgrade that lets you avoid cluster downtime.
To upgrade CDH using parcels, the steps are as follows.
• Date partition columns: as of Hive version 13, implemented in CDH 5.2, Hive validates the format of dates in
partition columns, if they are stored as dates. A partition column with a date in invalid form can neither be
used nor dropped once you upgrade to CDH 5.2 or higher. To avoid this problem, do one of the following:
– Fix any invalid dates before you upgrade. Hive expects dates in partition columns to be in the form
YYYY-MM-DD.
– Store dates in partition columns as strings or integers.
You can use the following SQL query to find any partition-column values stored as dates:
• Check your SQL against new Impala keywords whenever upgrading Impala, whether Impala is in CDH or a
standalone parcel or package.
• Run the Host Inspector and fix every issue.
• If using security, run the Security Inspector.
• Run hdfs fsck / and hdfs dfsadmin -report and fix every issue.
• Run hbase hbck.
• Review the upgrade procedure and reserve a maintenance window with enough time allotted to perform all
steps. For production clusters, Cloudera recommends allocating up to a full day maintenance window to
perform the upgrade, depending on the number of hosts, the amount of experience you have with Hadoop
and Linux, and the particular hardware you are using.
• To avoid lots of alerts during the upgrade process, you can enable maintenance mode on your cluster before
you start the upgrade. This will stop email alerts and SNMP traps from being sent, but will not stop checks
and configuration validations from being made. Be sure to exit maintenance mode when you have finished
the upgrade in order to re-enable Cloudera Manager alerts.
# cd /mnt/hadoop/hdfs/name
# tar -cvf /root/nn_backup_data.tar .
./
./current/
./current/fsimage
./current/fstime
./current/VERSION
./current/edits
./image/
./image/fsimage
Warning: If you see a file containing the word lock, the NameNode is probably still running. Repeat
the preceding steps, starting by shutting down the CDH services.
3. Restart all the Cloudera Manager Agents to force an update of the symlinks to point to the newly installed
components on each host:
4. If your Hue service uses the embedded SQLite DB, restore the DB you backed up:
a. Stop the Hue service.
b. Copy the backup from the temporary location to the newly created Hue database directory
/opt/cloudera/parcels/CDH/share/hue/desktop.
c. Start the Hue service.
Finalize the HDFS metadata upgrade. It is not unusual to wait days or even weeks before finalizing the upgrade.
To determine when finalization is warranted, run important workloads and ensure they are successful. Once
you have finalized the upgrade, it is not possible to roll back to a previous version of HDFS without using backups.
1. Go to the HDFS service.
2. Click the Instances tab.
3. Click the NameNode instance.
4. Select Actions > Finalize Metadata Upgrade and click Finalize Metadata Upgrade to confirm.
Upgrade Wizard Actions
Do the steps in this section only if you chose a manual upgrade or the upgrade wizard reports a failure.
Upgrade Sqoop
1. Go to the Sqoop service.
2. Select Actions > Stop and click Stop to confirm.
3. Select Actions > Upgrade Sqoop and click Upgrade Sqoop to confirm.
Upgrade Spark
1. Go to the Spark service.
2. Select Actions > Stop and click Stop to confirm.
3. Select Actions > Upload Spark Jar and click Upload Spark Jar to confirm.
4. Select Actions > Create Spark History Log Dir and click Create Spark History Log Dir to confirm.
Required Role:
If you originally used Cloudera Manager to install CDH 5 using packages, you can upgrade to CDH 5.4 using either
packages or parcels. Using parcels is recommended, because the upgrade wizard for parcels handles the upgrade
almost completely automatically.
The following procedure requires cluster downtime. If you use parcels, have a Cloudera Enterprise license, and
have enabled HDFS high availability, you can perform a rolling upgrade that lets you avoid cluster downtime.
To upgrade CDH using packages, the steps are as follows.
• Date partition columns: as of Hive version 13, implemented in CDH 5.2, Hive validates the format of dates in
partition columns, if they are stored as dates. A partition column with a date in invalid form can neither be
used nor dropped once you upgrade to CDH 5.2 or higher. To avoid this problem, do one of the following:
– Fix any invalid dates before you upgrade. Hive expects dates in partition columns to be in the form
YYYY-MM-DD.
– Store dates in partition columns as strings or integers.
You can use the following SQL query to find any partition-column values stored as dates:
• Check your SQL against new Impala keywords whenever upgrading Impala, whether Impala is in CDH or a
standalone parcel or package.
• Run the Host Inspector and fix every issue.
• If using security, run the Security Inspector.
• Run hdfs fsck / and hdfs dfsadmin -report and fix every issue.
• Run hbase hbck.
• Review the upgrade procedure and reserve a maintenance window with enough time allotted to perform all
steps. For production clusters, Cloudera recommends allocating up to a full day maintenance window to
perform the upgrade, depending on the number of hosts, the amount of experience you have with Hadoop
and Linux, and the particular hardware you are using.
• To avoid lots of alerts during the upgrade process, you can enable maintenance mode on your cluster before
you start the upgrade. This will stop email alerts and SNMP traps from being sent, but will not stop checks
and configuration validations from being made. Be sure to exit maintenance mode when you have finished
the upgrade in order to re-enable Cloudera Manager alerts.
4. From the command line on the NameNode host, back up the directory listed in the NameNode Data Directories
property. If more than one is listed, then you only need to make a backup of one directory, since each directory
is a complete copy. For example, if the data directory is /mnt/hadoop/hdfs/name, do the following as root:
# cd /mnt/hadoop/hdfs/name
# tar -cvf /root/nn_backup_data.tar .
./
./current/
./current/fsimage
./current/fstime
./current/VERSION
./current/edits
./image/
./image/fsimage
Warning: If you see a file containing the word lock, the NameNode is probably still running. Repeat
the preceding steps, starting by shutting down the CDH services.
If you are upgrading from CDH 5 Beta 1 or later, and you used the "1-click" package for the previous
CDH 5 release, you should see:
CDH5-repository-1-0
In this case, skip to installing the CDH 5 packages. If instead you see:
If the repository is installed, skip to installing the CDH 5 packages; otherwise proceed with installing
the "1-click" package.
2. If the CDH 5 "1-click" repository is not already installed on each host in the cluster, follow the instructions
below for that host's operating system.
• Red Hat compatible
1. Download and install the "1-click Install" package.
a. Download the CDH 5 "1-click Install" package.
Click the entry in the table below that matches your Red Hat or CentOS system, choose Save
File, and save the file to a directory to which you have write access (for example, your home
directory).
• Red Hat/CentOS/Oracle 6
• Red Hat/CentOS/Oracle 6
• SLES
1. Download and install the "1-click Install" package:
a. Download the CDH 5 "1-click Install" package.
Click this link, choose Save File, and save it to a directory to which you have write access (for
example, your home directory).
b. Install the RPM:
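The install command itself is not reproduced here. A hypothetical sketch, assuming the downloaded file is named cloudera-cdh-5-0.x86_64.rpm (substitute the filename you actually saved):
$ sudo rpm -i cloudera-cdh-5-0.x86_64.rpm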
• Ubuntu Trusty
$ curl -s
http://archive.cloudera.com/cdh5/ubuntu/trusty/amd64/cdh/archive.key |
sudo apt-key add -
• Ubuntu Precise
$ curl -s
http://archive.cloudera.com/cdh5/ubuntu/precise/amd64/cdh/archive.key
| sudo apt-key add -
• Debian Wheezy
$ curl -s
http://archive.cloudera.com/cdh5/debian/wheezy/amd64/cdh/archive.key |
sudo apt-key add -
• SLES
Note: Installing these packages will also install all the other CDH packages that are needed
for a full CDH 5 installation.
• Use your operating system's package management tools to update all packages to the latest version using
standard repositories. This approach works well because it minimizes the amount of configuration required
and uses the simplest commands. Be aware that this can take a considerable amount of time if you have
not upgraded the system recently. To update all packages on your system, use the following command:
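The command is not shown in this extraction. A minimal sketch for the common package managers; run the one that matches your operating system:
# Update every installed package from the standard repositories
$ sudo yum update            # RHEL-compatible
$ sudo zypper update         # SLES
$ sudo apt-get update && sudo apt-get dist-upgrade    # Ubuntu or Debian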
Important: Following these instructions will install the required software to add the Key Trustee KMS
service to your cluster; this enables you to use an existing Cloudera Navigator Key Trustee Server as
the underlying keystore for HDFS Data At Rest Encryption. This does not install Cloudera Navigator
Key Trustee Server. Contact Cloudera Support for Key Trustee Server documentation or assistance
deploying Key Trustee Server.
2. Add the repository to your system, using the appropriate procedure for your operating system:
• RHEL-compatible
Download the repository and copy it to the /etc/yum.repos.d/ directory. Refresh the package index by
running sudo yum clean all.
• SLES
Add the repository to your system using the following command:
3. Install the keytrustee-keyprovider package, using the appropriate command for your operating system:
• RHEL-compatible
• SLES
• Ubuntu or Debian
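The per-OS commands are not reproduced above (the SLES repository command in step 2 is also omitted). A minimal sketch of step 3, using the package name given above and the standard package managers:
# Install the Key Trustee KMS key provider package
$ sudo yum install keytrustee-keyprovider       # RHEL-compatible
$ sudo zypper install keytrustee-keyprovider    # SLES
$ sudo apt-get install keytrustee-keyprovider   # Ubuntu or Debian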
6. Cloudera Manager checks that hosts have the correct software installed. If the packages have not been
installed, a warning displays to that effect. Install the packages and click Check Again. When there are no
errors, click Continue.
7. The Host Inspector runs and displays the CDH version on the hosts. Click Continue. The Shut down and
upgrade the cluster screen displays.
8. Choose the type of upgrade and restart:
• Cloudera Manager upgrade - Cloudera Manager performs all service upgrades and restarts the cluster.
1. Click Continue. The Command Progress screen displays the result of the commands run by the wizard
as it shuts down all services, upgrades services as necessary, deploys client configuration files, and
restarts services.
2. Click Continue. The wizard reports the result of the upgrade.
• Manual upgrade - Select the Let me upgrade the cluster checkbox. Cloudera Manager configures the
cluster to the specified CDH version but performs no upgrades or service restarts. Manually doing the
upgrade is difficult and is for advanced users only.
1. Click Continue. Cloudera Manager displays links to documentation describing the required upgrade
steps.
Upgrade Sqoop
1. Go to the Sqoop service.
2. Select Actions > Stop and click Stop to confirm.
3. Select Actions > Upgrade Sqoop and click Upgrade Sqoop to confirm.
Upgrade Spark
1. Go to the Spark service.
2. Select Actions > Stop and click Stop to confirm.
3. Select Actions > Upload Spark Jar and click Upload Spark Jar to confirm.
4. Select Actions > Create Spark History Log Dir and click Create Spark History Log Dir to confirm.
• Date partition columns: as of Hive version 13, implemented in CDH 5.2, Hive validates the format of dates in
partition columns, if they are stored as dates. A partition column with a date in invalid form can neither be
used nor dropped once you upgrade to CDH 5.2 or higher. To avoid this problem, do one of the following:
– Fix any invalid dates before you upgrade. Hive expects dates in partition columns to be in the form
YYYY-MM-DD.
– Store dates in partition columns as strings or integers.
You can use the following SQL query to find any partition-column values stored as dates:
• Check your SQL against new Impala keywords whenever upgrading Impala, whether Impala is in CDH or a
standalone parcel or package.
• Run the Host Inspector and fix every issue.
• If using security, run the Security Inspector.
• Run hdfs fsck / and hdfs dfsadmin -report and fix every issue.
• Run hbase hbck.
• Review the upgrade procedure and reserve a maintenance window with enough time allotted to perform all
steps. For production clusters, Cloudera recommends allocating up to a full day maintenance window to
perform the upgrade, depending on the number of hosts, the amount of experience you have with Hadoop
and Linux, and the particular hardware you are using.
• To avoid lots of alerts during the upgrade process, you can enable maintenance mode on your cluster before
you start the upgrade. This will stop email alerts and SNMP traps from being sent, but will not stop checks
and configuration validations from being made. Be sure to exit maintenance mode when you have finished
the upgrade in order to re-enable Cloudera Manager alerts.
Required Role:
You can upgrade your CDH 5 cluster to CDH 5.3 using parcels from within the Cloudera Manager Admin Console.
Your current CDH 5 cluster can have been installed with either parcels or packages. The new version will use
parcels.
The following procedure requires cluster downtime. If you use parcels, have a Cloudera Enterprise license, and
have enabled HDFS high availability, you can perform a rolling upgrade that lets you avoid cluster downtime.
To upgrade CDH using parcels, the steps are as follows.
• Date partition columns: as of Hive version 13, implemented in CDH 5.2, Hive validates the format of dates in
partition columns, if they are stored as dates. A partition column with a date in invalid form can neither be
used nor dropped once you upgrade to CDH 5.2 or higher. To avoid this problem, do one of the following:
– Fix any invalid dates before you upgrade. Hive expects dates in partition columns to be in the form
YYYY-MM-DD.
– Store dates in partition columns as strings or integers.
You can use the following SQL query to find any partition-column values stored as dates:
• Check your SQL against new Impala keywords whenever upgrading Impala, whether Impala is in CDH or a
standalone parcel or package.
• Run the Host Inspector and fix every issue.
• If using security, run the Security Inspector.
• Run hdfs fsck / and hdfs dfsadmin -report and fix every issue.
• Run hbase hbck.
• Review the upgrade procedure and reserve a maintenance window with enough time allotted to perform all
steps. For production clusters, Cloudera recommends allocating up to a full day maintenance window to
perform the upgrade, depending on the number of hosts, the amount of experience you have with Hadoop
and Linux, and the particular hardware you are using.
• To avoid lots of alerts during the upgrade process, you can enable maintenance mode on your cluster before
you start the upgrade. This will stop email alerts and SNMP traps from being sent, but will not stop checks
and configuration validations from being made. Be sure to exit maintenance mode when you have finished
the upgrade in order to re-enable Cloudera Manager alerts.
# cd /mnt/hadoop/hdfs/name
# tar -cvf /root/nn_backup_data.tar .
./
./current/
./current/fsimage
./current/fstime
./current/VERSION
./current/edits
./image/
./image/fsimage
Warning: If you see a file containing the word lock, the NameNode is probably still running. Repeat
the preceding steps, starting by shutting down the CDH services.
2. From the Home tab Status page, click the drop-down menu next to the cluster name and select Upgrade
Cluster. The Upgrade Wizard starts.
3. If the option to pick between packages and parcels displays, click the Use Parcels radio button.
4. In the Choose CDH Version (Parcels) field, select the CDH version. If there are no qualifying parcels, click the
click here link to go to the Parcel Configuration Settings on page 88 page where you can add the locations
of parcel repositories. Click Continue.
5. Read the notices for steps you must complete before upgrading, click the Yes, I ... checkboxes after completing
the steps, and click Continue.
6. Cloudera Manager checks that hosts have the correct software installed. Click Continue.
7. The selected parcels are downloaded and distributed. Click Continue.
8. The Host Inspector runs and displays the CDH version on the hosts. Click Continue. The Shut down and
upgrade the cluster screen displays.
9. Choose the type of upgrade and restart:
• Cloudera Manager upgrade - Cloudera Manager performs all service upgrades and restarts the cluster.
1. Click Continue. The Command Progress screen displays the result of the commands run by the wizard
as it shuts down all services, activates the new parcel, upgrades services as necessary, deploys client
configuration files, and restarts services.
2. Click Continue. The wizard reports the result of the upgrade.
• Manual upgrade - Select the Let me upgrade the cluster checkbox. Cloudera Manager configures the
cluster to the specified CDH version but performs no upgrades or service restarts. Manually doing the
upgrade is difficult and is for advanced users only.
1. Click Continue. Cloudera Manager displays links to documentation describing the required upgrade
steps.
3. Restart all the Cloudera Manager Agents to force an update of the symlinks to point to the newly installed
components on each host:
4. If your Hue service uses the embedded SQLite DB, restore the DB you backed up:
a. Stop the Hue service.
b. Copy the backup from the temporary location to the newly created Hue database directory
/opt/cloudera/parcels/CDH/share/hue/desktop.
c. Start the Hue service.
Upgrade Sqoop
1. Go to the Sqoop service.
2. Select Actions > Stop and click Stop to confirm.
3. Select Actions > Upgrade Sqoop and click Upgrade Sqoop to confirm.
Upgrade Spark
1. Go to the Spark service.
2. Select Actions > Stop and click Stop to confirm.
3. Select Actions > Upload Spark Jar and click Upload Spark Jar to confirm.
4. Select Actions > Create Spark History Log Dir and click Create Spark History Log Dir to confirm.
Required Role:
If you originally used Cloudera Manager to install CDH 5 using packages, you can upgrade to CDH 5.3 using either
packages or parcels. Using parcels is recommended, because the upgrade wizard for parcels handles the upgrade
almost completely automatically.
The following procedure requires cluster downtime. If you use parcels, have a Cloudera Enterprise license, and
have enabled HDFS high availability, you can perform a rolling upgrade that lets you avoid cluster downtime.
To upgrade CDH using packages, the steps are as follows.
• Date partition columns: as of Hive version 13, implemented in CDH 5.2, Hive validates the format of dates in
partition columns, if they are stored as dates. A partition column with a date in invalid form can neither be
used nor dropped once you upgrade to CDH 5.2 or higher. To avoid this problem, do one of the following:
– Fix any invalid dates before you upgrade. Hive expects dates in partition columns to be in the form
YYYY-MM-DD.
– Store dates in partition columns as strings or integers.
You can use the following SQL query to find any partition-column values stored as dates:
• Check your SQL against new Impala keywords whenever upgrading Impala, whether Impala is in CDH or a
standalone parcel or package.
• Run the Host Inspector and fix every issue.
• If using security, run the Security Inspector.
• Run hdfs fsck / and hdfs dfsadmin -report and fix every issue.
• Run hbase hbck.
• Review the upgrade procedure and reserve a maintenance window with enough time allotted to perform all
steps. For production clusters, Cloudera recommends allocating up to a full day maintenance window to
perform the upgrade, depending on the number of hosts, the amount of experience you have with Hadoop
and Linux, and the particular hardware you are using.
• To avoid lots of alerts during the upgrade process, you can enable maintenance mode on your cluster before
you start the upgrade. This will stop email alerts and SNMP traps from being sent, but will not stop checks
and configuration validations from being made. Be sure to exit maintenance mode when you have finished
the upgrade in order to re-enable Cloudera Manager alerts.
# cd /mnt/hadoop/hdfs/name
# tar -cvf /root/nn_backup_data.tar .
./
./current/
./current/fsimage
./current/fstime
./current/VERSION
./current/edits
./image/
./image/fsimage
Warning: If you see a file containing the word lock, the NameNode is probably still running. Repeat
the preceding steps, starting by shutting down the CDH services.
• On SLES systems:
1. Run the following command:
2. Edit the repo file to point to the release you want to install or upgrade to.
• On Red Hat-compatible systems:
Open the repo file you have just saved and change the 5 at the end of the line that begins baseurl= to
the version number you want.
For example, if you have saved the file for Red Hat 6, it will look like this when you open it for editing:
[cloudera-cdh5]
name=Cloudera's Distribution for Hadoop, Version 5
baseurl=http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/5/
gpgkey = http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck = 1
For example, to upgrade to CDH 5.1.0, change the baseurl line to:
baseurl=http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/5.1.0/
so that the file looks like this:
[cloudera-cdh5]
name=Cloudera's Distribution for Hadoop, Version 5
baseurl=http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/5.1.0/
gpgkey = http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck = 1
• On SLES systems:
Open the repo file that you have just added to your system and change the 5 at the end of the line that
begins baseurl= to the version number you want.
The file should look like this when you open it for editing:
[cloudera-cdh5]
name=Cloudera's Distribution for Hadoop, Version 5
baseurl=http://archive.cloudera.com/cdh5/sles/11/x86_64/cdh/5/
gpgkey = http://archive.cloudera.com/cdh5/sles/11/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck = 1
For example, to upgrade to CDH 5.1.0, change the baseurl line to:
baseurl=http://archive.cloudera.com/cdh5/sles/11/x86_64/cdh/5.1.0/
so that the file looks like this:
[cloudera-cdh5]
name=Cloudera's Distribution for Hadoop, Version 5
baseurl=http://archive.cloudera.com/cdh5/sles/11/x86_64/cdh/5.1.0/
gpgkey = http://archive.cloudera.com/cdh5/sles/11/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck = 1
– Red Hat/CentOS/Oracle 6
• SLES
• Ubuntu Precise
$ curl -s http://archive.cloudera.com/cdh5/ubuntu/precise/amd64/cdh/archive.key
| sudo apt-key add -
– Debian Wheezy
$ curl -s http://archive.cloudera.com/cdh5/debian/wheezy/amd64/cdh/archive.key
| sudo apt-key add -
• SLES
Note: Installing these packages will also install all the other CDH packages that are needed for a
full CDH 5 installation.
Important: Following these instructions will install the required software to add the Key Trustee KMS
service to your cluster; this enables you to use an existing Cloudera Navigator Key Trustee Server as
the underlying keystore for HDFS Data At Rest Encryption. This does not install Cloudera Navigator
Key Trustee Server. Contact Cloudera Support for Key Trustee Server documentation or assistance
deploying Key Trustee Server.
2. Add the repository to your system, using the appropriate procedure for your operating system:
• RHEL-compatible
Download the repository and copy it to the /etc/yum.repos.d/ directory. Refresh the package index by
running sudo yum clean all.
• SLES
Add the repository to your system using the following command:
3. Install the keytrustee-keyprovider package, using the appropriate command for your operating system:
• RHEL-compatible
• SLES
• Ubuntu or Debian
Upgrade Sqoop
1. Go to the Sqoop service.
2. Select Actions > Stop and click Stop to confirm.
3. Select Actions > Upgrade Sqoop and click Upgrade Sqoop to confirm.
Upgrade Spark
1. Go to the Spark service.
2. Select Actions > Stop and click Stop to confirm.
3. Select Actions > Upload Spark Jar and click Upload Spark Jar to confirm.
4. Select Actions > Create Spark History Log Dir and click Create Spark History Log Dir to confirm.
• Date partition columns: as of Hive version 13, implemented in CDH 5.2, Hive validates the format of dates in
partition columns, if they are stored as dates. A partition column with a date in invalid form can neither be
used nor dropped once you upgrade to CDH 5.2 or higher. To avoid this problem, do one of the following:
– Fix any invalid dates before you upgrade. Hive expects dates in partition columns to be in the form
YYYY-MM-DD.
– Store dates in partition columns as strings or integers.
You can use the following SQL query to find any partition-column values stored as dates:
• Check your SQL against new Impala keywords whenever upgrading Impala, whether Impala is in CDH or a
standalone parcel or package.
• Run the Host Inspector and fix every issue.
• If using security, run the Security Inspector.
• Run hdfs fsck / and hdfs dfsadmin -report and fix every issue.
• Run hbase hbck.
• Review the upgrade procedure and reserve a maintenance window with enough time allotted to perform all
steps. For production clusters, Cloudera recommends allocating up to a full day maintenance window to
perform the upgrade, depending on the number of hosts, the amount of experience you have with Hadoop
and Linux, and the particular hardware you are using.
• To avoid lots of alerts during the upgrade process, you can enable maintenance mode on your cluster before
you start the upgrade. This will stop email alerts and SNMP traps from being sent, but will not stop checks
and configuration validations from being made. Be sure to exit maintenance mode when you have finished
the upgrade in order to re-enable Cloudera Manager alerts.
Required Role:
You can upgrade your CDH 5 cluster to CDH 5.2 using parcels from within the Cloudera Manager Admin Console.
Your current CDH 5 cluster can have been installed with either parcels or packages. The new version will use
parcels.
The following procedure requires cluster downtime. If you use parcels, have a Cloudera Enterprise license, and
have enabled HDFS high availability, you can perform a rolling upgrade that lets you avoid cluster downtime.
To upgrade CDH using parcels, the steps are as follows.
• Date partition columns: as of Hive version 13, implemented in CDH 5.2, Hive validates the format of dates in
partition columns, if they are stored as dates. A partition column with a date in invalid form can neither be
used nor dropped once you upgrade to CDH 5.2 or higher. To avoid this problem, do one of the following:
– Fix any invalid dates before you upgrade. Hive expects dates in partition columns to be in the form
YYYY-MM-DD.
– Store dates in partition columns as strings or integers.
You can use the following SQL query to find any partition-column values stored as dates:
• Check your SQL against new Impala keywords whenever upgrading Impala, whether Impala is in CDH or a
standalone parcel or package.
• Run the Host Inspector and fix every issue.
• If using security, run the Security Inspector.
• Run hdfs fsck / and hdfs dfsadmin -report and fix every issue.
• Run hbase hbck.
• Review the upgrade procedure and reserve a maintenance window with enough time allotted to perform all
steps. For production clusters, Cloudera recommends allocating up to a full day maintenance window to
perform the upgrade, depending on the number of hosts, the amount of experience you have with Hadoop
and Linux, and the particular hardware you are using.
• To avoid lots of alerts during the upgrade process, you can enable maintenance mode on your cluster before
you start the upgrade. This will stop email alerts and SNMP traps from being sent, but will not stop checks
and configuration validations from being made. Be sure to exit maintenance mode when you have finished
the upgrade in order to re-enable Cloudera Manager alerts.
# cd /mnt/hadoop/hdfs/name
# tar -cvf /root/nn_backup_data.tar .
./
./current/
./current/fsimage
./current/fstime
./current/VERSION
./current/edits
./image/
./image/fsimage
Warning: If you see a file containing the word lock, the NameNode is probably still running. Repeat
the preceding steps, starting by shutting down the CDH services.
3. Restart all the Cloudera Manager Agents to force an update of the symlinks to point to the newly installed
components on each host:
4. If your Hue service uses the embedded SQLite DB, restore the DB you backed up:
a. Stop the Hue service.
b. Copy the backup from the temporary location to the newly created Hue database directory
/opt/cloudera/parcels/CDH/share/hue/desktop.
c. Start the Hue service.
Finalize the HDFS metadata upgrade. It is not unusual to wait days or even weeks before finalizing the upgrade.
To determine when finalization is warranted, run important workloads and ensure they are successful. Once
you have finalized the upgrade, it is not possible to roll back to a previous version of HDFS without using backups.
1. Go to the HDFS service.
2. Click the Instances tab.
3. Click the NameNode instance.
4. Select Actions > Finalize Metadata Upgrade and click Finalize Metadata Upgrade to confirm.
Upgrade Wizard Actions
Do the steps in this section only if you chose a manual upgrade or the upgrade wizard reports a failure.
Upgrade Sqoop
1. Go to the Sqoop service.
2. Select Actions > Stop and click Stop to confirm.
3. Select Actions > Upgrade Sqoop and click Upgrade Sqoop to confirm.
Upgrade Spark
1. Go to the Spark service.
2. Select Actions > Stop and click Stop to confirm.
3. Select Actions > Upload Spark Jar and click Upload Spark Jar to confirm.
4. Select Actions > Create Spark History Log Dir and click Create Spark History Log Dir to confirm.
Required Role:
If you originally used Cloudera Manager to install CDH 5 using packages, you can upgrade to CDH 5.2 using either
packages or parcels. Using parcels is recommended, because the upgrade wizard for parcels handles the upgrade
almost completely automatically.
The following procedure requires cluster downtime. If you use parcels, have a Cloudera Enterprise license, and
have enabled HDFS high availability, you can perform a rolling upgrade that lets you avoid cluster downtime.
To upgrade CDH using packages, the steps are as follows.
• Date partition columns: as of Hive version 13, implemented in CDH 5.2, Hive validates the format of dates in
partition columns, if they are stored as dates. A partition column with a date in invalid form can neither be
used nor dropped once you upgrade to CDH 5.2 or higher. To avoid this problem, do one of the following:
– Fix any invalid dates before you upgrade. Hive expects dates in partition columns to be in the form
YYYY-MM-DD.
– Store dates in partition columns as strings or integers.
You can use the following SQL query to find any partition-column values stored as dates:
• Check your SQL against new Impala keywords whenever upgrading Impala, whether Impala is in CDH or a
standalone parcel or package.
• Run the Host Inspector and fix every issue.
• If using security, run the Security Inspector.
• Run hdfs fsck / and hdfs dfsadmin -report and fix every issue.
• Run hbase hbck.
• Review the upgrade procedure and reserve a maintenance window with enough time allotted to perform all
steps. For production clusters, Cloudera recommends allocating up to a full day maintenance window to
perform the upgrade, depending on the number of hosts, the amount of experience you have with Hadoop
and Linux, and the particular hardware you are using.
• To avoid lots of alerts during the upgrade process, you can enable maintenance mode on your cluster before
you start the upgrade. This will stop email alerts and SNMP traps from being sent, but will not stop checks
and configuration validations from being made. Be sure to exit maintenance mode when you have finished
the upgrade in order to re-enable Cloudera Manager alerts.
4. From the command line on the NameNode host, back up the directory listed in the NameNode Data Directories
property. If more than one is listed, then you only need to make a backup of one directory, since each directory
is a complete copy. For example, if the data directory is /mnt/hadoop/hdfs/name, do the following as root:
# cd /mnt/hadoop/hdfs/name
# tar -cvf /root/nn_backup_data.tar .
./
./current/
./current/fsimage
./current/fstime
./current/VERSION
./current/edits
./image/
./image/fsimage
Warning: If you see a file containing the word lock, the NameNode is probably still running. Repeat
the preceding steps, starting by shutting down the CDH services.
• On SLES systems:
1. Run the following command:
2. Edit the repo file to point to the release you want to install or upgrade to.
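For step 1, a command along the following lines adds the Cloudera repository on SLES 11; the URL shown
assumes the default Cloudera archive and may differ if you use your own repository:
$ sudo zypper addrepo -f http://archive.cloudera.com/cdh5/sles/11/x86_64/cdh/cloudera-cdh5.repo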
• On Red Hat-compatible systems:
Open the repo file you have just saved and change the 5 at the end of the line that begins baseurl= to
the version number you want.
For example, if you have saved the file for Red Hat 6, it will look like this when you open it for editing:
[cloudera-cdh5]
name=Cloudera's Distribution for Hadoop, Version 5
baseurl=http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/5/
gpgkey = http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck = 1
For example, to upgrade to CDH 5.1.0, change the baseurl line to:
baseurl=http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/5.1.0/
The file should then look like this:
[cloudera-cdh5]
name=Cloudera's Distribution for Hadoop, Version 5
baseurl=http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/5.1.0/
gpgkey = http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck = 1
• On SLES systems:
Open the repo file that you have just added to your system and change the 5 at the end of the line that
begins baseurl= to the version number you want.
The file should look like this when you open it for editing:
[cloudera-cdh5]
name=Cloudera's Distribution for Hadoop, Version 5
baseurl=http://archive.cloudera.com/cdh5/sles/11/x86_64/cdh/5/
gpgkey = http://archive.cloudera.com/cdh5/sles/11/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck = 1
For example, to upgrade to CDH 5.1.0, change the baseurl line to:
baseurl=http://archive.cloudera.com/cdh5/sles/11/x86_64/cdh/5.1.0/
The file should then look like this:
[cloudera-cdh5]
name=Cloudera's Distribution for Hadoop, Version 5
baseurl=http://archive.cloudera.com/cdh5/sles/11/x86_64/cdh/5.1.0/
gpgkey = http://archive.cloudera.com/cdh5/sles/11/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck = 1
– Red Hat/CentOS/Oracle 6
• SLES
– Ubuntu Precise
$ curl -s http://archive.cloudera.com/cdh5/ubuntu/precise/amd64/cdh/archive.key
| sudo apt-key add -
– Debian Wheezy
$ curl -s http://archive.cloudera.com/cdh5/debian/wheezy/amd64/cdh/archive.key
| sudo apt-key add -
• SLES
Note: Installing these packages will also install all the other CDH packages that are needed for a
full CDH 5 installation.
Upgrade Sqoop
1. Go to the Sqoop service.
2. Select Actions > Stop and click Stop to confirm.
3. Select Actions > Upgrade Sqoop and click Upgrade Sqoop to confirm.
Upgrade Spark
1. Go to the Spark service.
2. Select Actions > Stop and click Stop to confirm.
3. Select Actions > Upload Spark Jar and click Upload Spark Jar to confirm.
4. Select Actions > Create Spark History Log Dir and click Create Spark History Log Dir to confirm.
• Make sure there are no Oozie workflows in RUNNING or SUSPENDED status; otherwise the Oozie database
upgrade will fail and you will have to reinstall CDH 4 to complete or kill those running workflows.
• Check your SQL against new Impala keywords whenever upgrading Impala, whether Impala is in CDH or a
standalone parcel or package.
• Run the Host Inspector and fix every issue.
• If using security, run the Security Inspector.
• Run hdfs fsck / and hdfs dfsadmin -report and fix every issue.
• Run hbase hbck.
• Review the upgrade procedure and reserve a maintenance window with enough time allotted to perform all
steps. For production clusters, Cloudera recommends allocating up to a full day maintenance window to
perform the upgrade, depending on the number of hosts, the amount of experience you have with Hadoop
and Linux, and the particular hardware you are using.
• To avoid lots of alerts during the upgrade process, you can enable maintenance mode on your cluster before
you start the upgrade. This will stop email alerts and SNMP traps from being sent, but will not stop checks
and configuration validations from being made. Be sure to exit maintenance mode when you have finished
the upgrade in order to re-enable Cloudera Manager alerts.
Required Role:
You can upgrade your CDH 5 cluster to CDH 5.1 using parcels from within the Cloudera Manager Admin Console.
Your current CDH 5 cluster can have been installed with either parcels or packages. The new version will use
parcels.
The following procedure requires cluster downtime. If you use parcels, have a Cloudera Enterprise license, and
have enabled HDFS high availability, you can perform a rolling upgrade that lets you avoid cluster downtime.
To upgrade CDH using parcels, the steps are as follows.
• Make sure there are no Oozie workflows in RUNNING or SUSPENDED status; otherwise the Oozie database
upgrade will fail and you will have to reinstall CDH 4 to complete or kill those running workflows.
• Check your SQL against new Impala keywords whenever upgrading Impala, whether Impala is in CDH or a
standalone parcel or package.
• Run the Host Inspector and fix every issue.
• If using security, run the Security Inspector.
• Run hdfs fsck / and hdfs dfsadmin -report and fix every issue.
• Run hbase hbck.
• Review the upgrade procedure and reserve a maintenance window with enough time allotted to perform all
steps. For production clusters, Cloudera recommends allocating up to a full day maintenance window to
perform the upgrade, depending on the number of hosts, the amount of experience you have with Hadoop
and Linux, and the particular hardware you are using.
• To avoid lots of alerts during the upgrade process, you can enable maintenance mode on your cluster before
you start the upgrade. This will stop email alerts and SNMP traps from being sent, but will not stop checks
and configuration validations from being made. Be sure to exit maintenance mode when you have finished
the upgrade in order to re-enable Cloudera Manager alerts.
3. If the option to pick between packages and parcels displays, click the Use Parcels radio button.
4. In the Choose CDH Version (Parcels) field, select the CDH version. If there are no qualifying parcels, click the
click here link to go to the Parcel Configuration Settings on page 88 page where you can add the locations
of parcel repositories. Click Continue.
5. Read the notices for steps you must complete before upgrading, click the Yes, I ... checkboxes after completing
the steps, and click Continue.
6. Cloudera Manager checks that hosts have the correct software installed. Click Continue.
7. The selected parcels are downloaded and distributed. Click Continue.
8. The Host Inspector runs and displays the CDH version on the hosts. Click Continue. The Shut down and
upgrade the cluster screen displays.
9. Choose the type of upgrade and restart:
• Cloudera Manager upgrade - Cloudera Manager performs all service upgrades and restarts the cluster.
1. Click Continue. The Command Progress screen displays the result of the commands run by the wizard
as it shuts down all services, activates the new parcel, upgrades services as necessary, deploys client
configuration files, and restarts services.
2. Click Continue. The wizard reports the result of the upgrade.
• Manual upgrade - Select the Let me upgrade the cluster checkbox. Cloudera Manager configures the
cluster to the specified CDH version but performs no upgrades or service restarts. Manually doing the
upgrade is difficult and is for advanced users only.
1. Click Continue. Cloudera Manager displays links to documentation describing the required upgrade
steps.
3. Restart all the Cloudera Manager Agents to force an update of the symlinks to point to the newly installed
components on each host:
4. If your Hue service uses the embedded SQLite DB, restore the DB you backed up:
a. Stop the Hue service.
b. Copy the backup from the temporary location to the newly created Hue database directory
/opt/cloudera/parcels/CDH/share/hue/desktop.
c. Start the Hue service.
Upgrade Sqoop
1. Go to the Sqoop service.
2. Select Actions > Stop and click Stop to confirm.
3. Select Actions > Upgrade Sqoop and click Upgrade Sqoop to confirm.
Upgrade Spark
1. Go to the Spark service.
2. Select Actions > Stop and click Stop to confirm.
3. Select Actions > Upload Spark Jar and click Upload Spark Jar to confirm.
4. Select Actions > Create Spark History Log Dir and click Create Spark History Log Dir to confirm.
Required Role:
If you installed or upgraded to CDH 5 using packages, you can upgrade to CDH 5.1 using either packages or
parcels. Using parcels is recommended, because the upgrade wizard for parcels handles the upgrade almost
completely automatically.
To upgrade CDH using packages, the steps are as follows.
• Make sure there are no Oozie workflows in RUNNING or SUSPENDED status; otherwise the Oozie database
upgrade will fail and you will have to reinstall CDH 4 to complete or kill those running workflows.
• Check your SQL against new Impala keywords whenever upgrading Impala, whether Impala is in CDH or a
standalone parcel or package.
• Run the Host Inspector and fix every issue.
• If using security, run the Security Inspector.
• Run hdfs fsck / and hdfs dfsadmin -report and fix every issue.
• Run hbase hbck.
• Review the upgrade procedure and reserve a maintenance window with enough time allotted to perform all
steps. For production clusters, Cloudera recommends allocating up to a full day maintenance window to
perform the upgrade, depending on the number of hosts, the amount of experience you have with Hadoop
and Linux, and the particular hardware you are using.
• To avoid lots of alerts during the upgrade process, you can enable maintenance mode on your cluster before
you start the upgrade. This will stop email alerts and SNMP traps from being sent, but will not stop checks
and configuration validations from being made. Be sure to exit maintenance mode when you have finished
the upgrade in order to re-enable Cloudera Manager alerts.
• Mahout
• Pig
• Whirr
For information on upgrading these unmanaged components, see Upgrading Mahout on page 353, Upgrading
Pig on page 374, and Upgrading Whirr on page 417.
• On SLES systems:
1. Run the following command:
2. Edit the repo file to point to the release you want to install or upgrade to.
• On Red Hat-compatible systems:
Open the repo file you have just saved and change the 5 at the end of the line that begins baseurl= to
the version number you want.
For example, if you have saved the file for Red Hat 6, it will look like this when you open it for editing:
[cloudera-cdh5]
name=Cloudera's Distribution for Hadoop, Version 5
baseurl=http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/5/
gpgkey = http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck = 1
For example, to upgrade to CDH 5.1.0, change the baseurl line to:
baseurl=http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/5.1.0/
The file should then look like this:
[cloudera-cdh5]
name=Cloudera's Distribution for Hadoop, Version 5
baseurl=http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/5.1.0/
gpgkey = http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck = 1
• On SLES systems:
Open the repo file that you have just added to your system and change the 5 at the end of the line that
begins baseurl= to the version number you want.
The file should look like this when you open it for editing:
[cloudera-cdh5]
name=Cloudera's Distribution for Hadoop, Version 5
baseurl=http://archive.cloudera.com/cdh5/sles/11/x86_64/cdh/5/
gpgkey = http://archive.cloudera.com/cdh5/sles/11/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck = 1
For example, to upgrade to CDH 5.1.0, change the baseurl line to:
baseurl=http://archive.cloudera.com/cdh5/sles/11/x86_64/cdh/5.1.0/
The file should then look like this:
[cloudera-cdh5]
name=Cloudera's Distribution for Hadoop, Version 5
baseurl=http://archive.cloudera.com/cdh5/sles/11/x86_64/cdh/5.1.0/
gpgkey = http://archive.cloudera.com/cdh5/sles/11/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck = 1
– Red Hat/CentOS/Oracle 6
• SLES
– Ubuntu Precise
$ curl -s http://archive.cloudera.com/cdh5/ubuntu/precise/amd64/cdh/archive.key
| sudo apt-key add -
– Debian Wheezy
$ curl -s http://archive.cloudera.com/cdh5/debian/wheezy/amd64/cdh/archive.key
| sudo apt-key add -
• SLES
Note: Installing these packages will also install all the other CDH packages that are needed for a
full CDH 5 installation.
• Manual upgrade - Select the Let me upgrade the cluster checkbox. Cloudera Manager configures the
cluster to the specified CDH version but performs no upgrades or service restarts. Manually doing the
upgrade is difficult and is for advanced users only.
1. Click Continue. Cloudera Manager displays links to documentation describing the required upgrade
steps.
Upgrade Sqoop
1. Go to the Sqoop service.
2. Select Actions > Stop and click Stop to confirm.
3. Select Actions > Upgrade Sqoop and click Upgrade Sqoop to confirm.
Upgrade Spark
1. Go to the Spark service.
2. Select Actions > Stop and click Stop to confirm.
3. Select Actions > Upload Spark Jar and click Upload Spark Jar to confirm.
4. Select Actions > Create Spark History Log Dir and click Create Spark History Log Dir to confirm.
Important:
• You cannot perform a rolling upgrade from CDH 4 to CDH 5. There are incompatibilities between
the major versions, so a rolling restart is not possible. Rolling upgrade is also not supported from
CDH 5 Beta 2 to CDH 5.
• If you have just upgraded to Cloudera Manager 5, you must hard restart the Cloudera Manager
Agents as described in the (Optional) Deploy a Cloudera Manager Agent Upgrade on page 473.
• HBase - After you upgrade you must recompile all HBase coprocessor and custom JARs.
• Impala
– If you upgrade to CDH 5.1, Impala will be upgraded to 1.4.1. See New Features in Impala for
information about Impala 1.4.x features.
– If you upgrade to CDH 5.0, Impala will be upgraded to 1.3.2. If you have CDH 4 installed with
Impala 1.4.0, Impala will be downgraded to Impala 1.3.2. See New Features in Impala for
information about Impala 1.3 features.
• MapReduce and YARN
– In a Cloudera Manager deployment of a CDH 4 cluster, the MapReduce service is the default
MapReduce computation framework. You can create a YARN service in a CDH 4 cluster, but it
is not considered production ready.
– In a Cloudera Manager deployment of a CDH 5 cluster, the YARN service is the default
MapReduce computation framework. In CDH 5, the MapReduce service has been deprecated.
However, the MapReduce service is fully supported for backward compatibility through the
CDH 5 lifecycle.
– For production use, Cloudera recommends that only one MapReduce framework run at any
given time. If development needs or another use case requires switching between MapReduce
and YARN, both services can be configured at the same time, but only one should be running
(to fully optimize the available hardware resources).
For information on migrating from MapReduce to YARN, see Managing MapReduce and YARN.
• Make sure there are no Oozie workflows in RUNNING or SUSPENDED status; otherwise the Oozie database
upgrade will fail and you will have to reinstall CDH 4 to complete or kill those running workflows.
• When upgrading from CDH 4 to CDH 5, Oozie upgrade can take a very long time. For upgrades from CDH 4.3
and higher, you can reduce this time by reducing the amount of history Oozie retains. To reduce Oozie history:
1. Go to the Oozie service.
2. Click the Configuration tab.
3. Click Category > Advanced.
4. In Oozie Server Advanced Configuration Snippet (Safety Valve) for oozie-site.xml, enter the following:
<property>
<name>oozie.service.PurgeService.older.than</name>
<value>7</value>
</property>
<property>
<name>oozie.service.PurgeService.purge.limit</name>
<value>1000</value>
</property>
When the purge runs after you save these settings and restart the Oozie service, the Oozie server log shows
messages similar to the following:
STARTED Purge to purge Workflow Jobs older than [7] days, Coordinator Jobs older
than [7] days, and Bundlejobs older than [7] days.
ENDED Purge deleted [x] workflows, [y] coordinatorActions, [z] coordinators,
[w] bundles
9. Revert the purge service and log level settings to the default.
• If Using MySQL as Hue Backend: You may face issues after the upgrade if the default engine for MySQL
doesn't match the engine used by the Hue tables. To confirm the match:
1. Open the my.cnf file for MySQL, search for "default-storage-engine" and note its value.
2. Connect to MySQL and run the following commands:
use hue;
show create table auth_user;
3. Search for the "ENGINE=" line and confirm that its value matches the one for the
"default-storage-engine" above.
If the default engines do not match, Hue will display a warning on its start-up page
(http://$HUE_HOST:$HUE_PORT/about). Work with your database administrator to convert the current
Hue MySQL tables to the engine in use by MySQL, as noted by the "default-storage-engine" property.
• Check your SQL against new Impala keywords whenever upgrading Impala, whether Impala is in CDH or a
standalone parcel or package.
• Run the Host Inspector and fix every issue.
• If using security, run the Security Inspector.
• Run hdfs fsck / and hdfs dfsadmin -report and fix every issue.
• If using HBase:
– Run hbase hbck.
– Before you can upgrade HBase from CDH 4 to CDH 5, your HFiles must be upgraded from HFile v1 format
to HFile v2, because CDH 5 no longer supports HFile v1. The upgrade procedure differs depending on
whether you use Cloudera Manager or the command line, but the results are the same. The first step is
to check for instances of HFile v1 in the HFiles and mark them to be upgraded to HFile v2, and to check
for and report corrupted files or files with unknown versions, which must be removed manually. The next
step is to rewrite the HFiles during the next major compaction. After the HFiles are upgraded, you can
continue the upgrade. After the upgrade is complete, you must recompile custom coprocessors and JARs.
To check and upgrade the files:
1. In the Cloudera Manager Admin Console, go to the HBase service and run Actions > Check HFile Version.
2. Check the output of the command in the stderr log.
Your output should be similar to the following:
Tables Processed:
hdfs://localhost:41020/myHBase/.META.
hdfs://localhost:41020/myHBase/usertable
hdfs://localhost:41020/myHBase/TestTable
hdfs://localhost:41020/myHBase/t
Count of HFileV1: 2
HFileV1:
hdfs://localhost:41020/myHBase/usertable
/fa02dac1f38d03577bd0f7e666f12812/family/249450144068442524
hdfs://localhost:41020/myHBase/usertable
/ecdd3eaee2d2fcf8184ac025555bb2af/family/249450144068442512
In the example above, you can see that the script has detected two HFile v1 files, one corrupt file and
the regions to major compact.
3. Trigger a major compaction on each of the reported regions. This major compaction rewrites the files
from HFile v1 to HFile v2 format. To run the major compaction, start HBase Shell and issue the
major_compact command.
$ /usr/lib/hbase/bin/hbase shell
hbase> major_compact 'usertable'
You can also do this in a single step by using the echo shell built-in command.
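For example, the compaction request can be piped into the HBase shell from the command line:
$ echo "major_compact 'usertable'" | /usr/lib/hbase/bin/hbase shell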
• Review the upgrade procedure and reserve a maintenance window with enough time allotted to perform all
steps. For production clusters, Cloudera recommends allocating up to a full day maintenance window to
perform the upgrade, depending on the number of hosts, the amount of experience you have with Hadoop
and Linux, and the particular hardware you are using.
• To avoid lots of alerts during the upgrade process, you can enable maintenance mode on your cluster before
you start the upgrade. This will stop email alerts and SNMP traps from being sent, but will not stop checks
and configuration validations from being made. Be sure to exit maintenance mode when you have finished
the upgrade in order to re-enable Cloudera Manager alerts.
Required Role:
This topic covers upgrading a CDH 4 cluster to a CDH 5 cluster using the upgrade wizard, which will install CDH
5 parcels. Your CDH 4 cluster can be using either parcels or packages; you can use the cluster upgrade wizard
to upgrade using parcels in either case.
If you want to upgrade using CDH 5 packages, you can do so using a manual process. See Upgrading from CDH
4 Packages to CDH 5 Packages on page 555.
The steps to upgrade a CDH installation managed by Cloudera Manager using parcels are as follows.
• Make sure there are no Oozie workflows in RUNNING or SUSPENDED status; otherwise the Oozie database
upgrade will fail and you will have to reinstall CDH 4 to complete or kill those running workflows.
• When upgrading from CDH 4 to CDH 5, Oozie upgrade can take a very long time. For upgrades from CDH 4.3
and higher, you can reduce this time by reducing the amount of history Oozie retains. To reduce Oozie history:
1. Go to the Oozie service.
2. Click the Configuration tab.
3. Click Category > Advanced.
4. In Oozie Server Advanced Configuration Snippet (Safety Valve) for oozie-site.xml, enter the following:
<property>
<name>oozie.service.PurgeService.older.than</name>
<value>7</value>
</property>
<property>
<name>oozie.service.PurgeService.purge.limit</name>
<value>1000</value>
</property>
When the purge runs after you save these settings and restart the Oozie service, the Oozie server log shows
messages similar to the following:
STARTED Purge to purge Workflow Jobs older than [7] days, Coordinator Jobs older
than [7] days, and Bundlejobs older than [7] days.
ENDED Purge deleted [x] workflows, [y] coordinatorActions, [z] coordinators,
[w] bundles
9. Revert the purge service and log level settings to the default.
• If Using MySQL as Hue Backend: You may face issues after the upgrade if the default engine for MySQL
doesn't match the engine used by the Hue tables. To confirm the match:
1. Open the my.cnf file for MySQL, search for "default-storage-engine" and note its value.
2. Connect to MySQL and run the following commands:
use hue;
show create table auth_user;
3. Search for the "ENGINE=" line and confirm that its value matches the one for the
"default-storage-engine" above.
If the default engines do not match, Hue will display a warning on its start-up page
(http://$HUE_HOST:$HUE_PORT/about). Work with your database administrator to convert the current
Hue MySQL tables to the engine in use by MySQL, as noted by the "default-storage-engine" property.
• Check your SQL against new Impala keywords whenever upgrading Impala, whether Impala is in CDH or a
standalone parcel or package.
• Run the Host Inspector and fix every issue.
• If using security, run the Security Inspector.
• Run hdfs fsck / and hdfs dfsadmin -report and fix every issue.
• If using HBase:
– Run hbase hbck.
– Before you can upgrade HBase from CDH 4 to CDH 5, your HFiles must be upgraded from HFile v1 format
to HFile v2, because CDH 5 no longer supports HFile v1. The upgrade procedure differs depending on
whether you use Cloudera Manager or the command line, but the results are the same. The first step is
to check for instances of HFile v1 in the HFiles and mark them to be upgraded to HFile v2, and to check
for and report corrupted files or files with unknown versions, which must be removed manually. The next
step is to rewrite the HFiles during the next major compaction. After the HFiles are upgraded, you can
continue the upgrade. After the upgrade is complete, you must recompile custom coprocessors and JARs.
To check and upgrade the files:
1. In the Cloudera Manager Admin Console, go to the HBase service and run Actions > Check HFile Version.
2. Check the output of the command in the stderr log.
Your output should be similar to the following:
Tables Processed:
hdfs://localhost:41020/myHBase/.META.
hdfs://localhost:41020/myHBase/usertable
hdfs://localhost:41020/myHBase/TestTable
hdfs://localhost:41020/myHBase/t
Count of HFileV1: 2
HFileV1:
hdfs://localhost:41020/myHBase/usertable
/fa02dac1f38d03577bd0f7e666f12812/family/249450144068442524
hdfs://localhost:41020/myHBase/usertable
/ecdd3eaee2d2fcf8184ac025555bb2af/family/249450144068442512
In the example above, you can see that the script has detected two HFile v1 files, one corrupt file and
the regions to major compact.
3. Trigger a major compaction on each of the reported regions. This major compaction rewrites the files
from HFile v1 to HFile v2 format. To run the major compaction, start HBase Shell and issue the
major_compact command.
$ /usr/lib/hbase/bin/hbase shell
hbase> major_compact 'usertable'
You can also do this in a single step by using the echo shell built-in command.
• Review the upgrade procedure and reserve a maintenance window with enough time allotted to perform all
steps. For production clusters, Cloudera recommends allocating up to a full day maintenance window to
perform the upgrade, depending on the number of hosts, the amount of experience you have with Hadoop
and Linux, and the particular hardware you are using.
• To avoid lots of alerts during the upgrade process, you can enable maintenance mode on your cluster before
you start the upgrade. This will stop email alerts and SNMP traps from being sent, but will not stop checks
and configuration validations from being made. Be sure to exit maintenance mode when you have finished
the upgrade in order to re-enable Cloudera Manager alerts.
b. Click Stop to confirm. The Command Details window shows the progress of stopping the roles.
c. When Command completed with n/n successful subcommands appears, the task is complete. Click Close.
From the command line on the NameNode host, back up the directory listed in the NameNode Data Directories
property. If more than one directory is listed, you only need to make a backup of one of them, since each
directory is a complete copy. For example, if the data directory is /mnt/hadoop/hdfs/name,
do the following as root:
# cd /mnt/hadoop/hdfs/name
# tar -cvf /root/nn_backup_data.tar .
./
./current/
./current/fsimage
./current/fstime
./current/VERSION
./current/edits
./image/
./image/fsimage
Warning: If you see a file containing the word lock, the NameNode is probably still running.
Repeat the preceding steps, starting by shutting down the CDH services.
3. Restart all the Cloudera Manager Agents to force an update of the installed binaries reported by the Agent.
On each host:
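A typical command, run on each host, is:
$ sudo service cloudera-scm-agent restart
If you have also just upgraded the Cloudera Manager packages themselves, a hard restart (sudo service
cloudera-scm-agent hard_restart) may be required instead, as noted earlier for Cloudera Manager 5 upgrades.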
4. Run the Host Inspector to verify that the packages have been removed:
a. Click Hosts tab and then click the Host Inspector button.
b. When the command completes, click Show Inspector Results.
5. If your Hue service uses the embedded SQLite DB, restore the DB you backed up:
The actions performed by the upgrade wizard are listed in Upgrade Wizard Actions on page 554. If any of the
steps in the Command Progress screen fails, complete the step as described in that section before proceeding.
Recompile JARs
• MapReduce and YARN - Recompile JARs used in MapReduce applications. For further information, see For
MapReduce Programmers: Writing and Running Jobs on page 182.
• HBase - Recompile coprocessor and custom JARs used by HBase applications.
Upgrade HBase
1. Go to the HBase service.
2. Select Actions > Upgrade HBase and click Upgrade HBase to confirm.
Upgrade Oozie
1. Go to the Oozie service.
2. Select Actions > Upgrade Database and click Upgrade Database to confirm.
3. Start the Oozie service.
4. Select Actions > Install Oozie ShareLib and click Install Oozie ShareLib to confirm.
Upgrade Sqoop
1. Go to the Sqoop service.
2. Select Actions > Upgrade Sqoop and click Upgrade Sqoop to confirm.
Required Role:
If you originally used Cloudera Manager to install CDH using packages, you can upgrade to CDH 5 either using
packages or parcels. Parcels is the preferred and recommended way to upgrade, as the upgrade wizard provided
for parcels handles the upgrade process almost completely automatically.
The steps to upgrade a CDH installation managed by Cloudera Manager using packages are as follows.
• Make sure there are no Oozie workflows in RUNNING or SUSPENDED status; otherwise the Oozie database
upgrade will fail and you will have to reinstall CDH 4 to complete or kill those running workflows.
• When upgrading from CDH 4 to CDH 5, Oozie upgrade can take a very long time. For upgrades from CDH 4.3
and higher, you can reduce this time by reducing the amount of history Oozie retains. To reduce Oozie history:
1. Go to the Oozie service.
2. Click the Configuration tab.
3. Click Category > Advanced.
4. In Oozie Server Advanced Configuration Snippet (Safety Valve) for oozie-site.xml, enter the following:
<property>
<name>oozie.service.PurgeService.older.than</name>
<value>7</value>
</property>
<property>
<name>oozie.service.PurgeService.purge.limit</name>
<value>1000</value>
</property>
When the purge runs after you save these settings and restart the Oozie service, the Oozie server log shows
messages similar to the following:
STARTED Purge to purge Workflow Jobs older than [7] days, Coordinator Jobs older
than [7] days, and Bundlejobs older than [7] days.
ENDED Purge deleted [x] workflows, [y] coordinatorActions, [z] coordinators,
[w] bundles
9. Revert the purge service and log level settings to the default.
• If Using MySQL as Hue Backend: You may face issues after the upgrade if the default engine for MySQL
doesn't match the engine used by the Hue tables. To confirm the match:
1. Open the my.cnf file for MySQL, search for "default-storage-engine" and note its value.
2. Connect to MySQL and run the following commands:
use hue;
show create table auth_user;
3. Search for the "ENGINE=" line and confirm that its value matches the one for the
"default-storage-engine" above.
If the default engines do not match, Hue will display a warning on its start-up page
(http://$HUE_HOST:$HUE_PORT/about). Work with your database administrator to convert the current
Hue MySQL tables to the engine in use by MySQL, as noted by the "default-storage-engine" property.
• Check your SQL against new Impala keywords whenever upgrading Impala, whether Impala is in CDH or a
standalone parcel or package.
• Run the Host Inspector and fix every issue.
• If using security, run the Security Inspector.
• Run hdfs fsck / and hdfs dfsadmin -report and fix every issue.
• If using HBase:
– Run hbase hbck.
– Before you can upgrade HBase from CDH 4 to CDH 5, your HFiles must be upgraded from HFile v1 format
to HFile v2, because CDH 5 no longer supports HFile v1. The upgrade procedure differs depending on
whether you use Cloudera Manager or the command line, but the results are the same. The first step is
to check for instances of HFile v1 in the HFiles and mark them to be upgraded to HFile v2, and to check
for and report corrupted files or files with unknown versions, which must be removed manually. The next
step is to rewrite the HFiles during the next major compaction. After the HFiles are upgraded, you can
continue the upgrade. After the upgrade is complete, you must recompile custom coprocessors and JARs.
To check and upgrade the files:
1. In the Cloudera Manager Admin Console, go to the HBase service and run Actions > Check HFile Version.
2. Check the output of the command in the stderr log.
Your output should be similar to the following:
Tables Processed:
hdfs://localhost:41020/myHBase/.META.
hdfs://localhost:41020/myHBase/usertable
hdfs://localhost:41020/myHBase/TestTable
hdfs://localhost:41020/myHBase/t
Count of HFileV1: 2
HFileV1:
hdfs://localhost:41020/myHBase/usertable
/fa02dac1f38d03577bd0f7e666f12812/family/249450144068442524
hdfs://localhost:41020/myHBase/usertable
/ecdd3eaee2d2fcf8184ac025555bb2af/family/249450144068442512
In the example above, you can see that the script has detected two HFile v1 files, one corrupt file and
the regions to major compact.
3. Trigger a major compaction on each of the reported regions. This major compaction rewrites the files
from HFile v1 to HFile v2 format. To run the major compaction, start HBase Shell and issue the
major_compact command.
$ /usr/lib/hbase/bin/hbase shell
hbase> major_compact 'usertable'
You can also do this in a single step by using the echo shell built-in command.
• Review the upgrade procedure and reserve a maintenance window with enough time allotted to perform all
steps. For production clusters, Cloudera recommends allocating up to a full day maintenance window to
perform the upgrade, depending on the number of hosts, the amount of experience you have with Hadoop
and Linux, and the particular hardware you are using.
• To avoid lots of alerts during the upgrade process, you can enable maintenance mode on your cluster before
you start the upgrade. This will stop email alerts and SNMP traps from being sent, but will not stop checks
and configuration validations from being made. Be sure to exit maintenance mode when you have finished
the upgrade in order to re-enable Cloudera Manager alerts.
When All services successfully stopped appears, the task is complete and you can close the Command
Details window.
b. Click Stop to confirm. The Command Details window shows the progress of stopping the roles.
c. When Command completed with n/n successful subcommands appears, the task is complete. Click Close.
# cd /mnt/hadoop/hdfs/name
# tar -cvf /root/nn_backup_data.tar .
./
./current/
./current/fsimage
./current/fstime
./current/VERSION
./current/edits
./image/
./image/fsimage
Warning: If you see a file containing the word lock, the NameNode is probably still running.
Repeat the preceding steps, starting by shutting down the CDH services.
Uninstall CDH 4
Uninstall CDH 4 on each host as follows:
Important:
• Before removing the files, make sure you have not added any custom entries that you want to
preserve. (To preserve custom entries, back up the files before removing them.)
• Make sure you remove Impala and Search repository files, as well as the CDH repository file.
• Red Hat/CentOS/Oracle 6
• Red Hat/CentOS/Oracle 6
Note: Installing these packages also installs all the other CDH packages required for a full CDH
5 installation.
• SLES
1. Download and install the "1-click Install" package.
a. Download the CDH 5 "1-click Install" package.
Click this link, choose Save File, and save it to a directory to which you have write access (for example,
your home directory).
b. Install the RPM:
Note: Installing these packages also installs all the other CDH packages required for a full CDH
5 installation.
• Debian Wheezy
$ curl -s http://archive.cloudera.com/cdh5/debian/wheezy/amd64/cdh/archive.key
| sudo apt-key add -
• Ubuntu Precise
$ curl -s http://archive.cloudera.com/cdh5/ubuntu/precise/amd64/cdh/archive.key
| sudo apt-key add -
Note: Installing these packages also installs all the other CDH packages required for a full CDH
5 installation.
• Leave OK, set up YARN and import existing configuration from my MapReduce service checked.
1. Click Continue to proceed. Cloudera Manager stops the YARN service (if running) and its dependencies.
2. Click Continue to proceed. The next page indicates some additional configuration required by YARN.
3. Verify or modify the configurations and click Continue. The Switch Cluster to MR2 step proceeds.
4. When all steps have completed, click Continue.
• Deselect OK, set up YARN and import existing configuration from my MapReduce service.
11. Click Finish to return to the Home page.
12. (Optional) Remove the MapReduce service.
a. In the MapReduce row, right-click and select Delete. Click Delete to confirm.
The actions performed by the upgrade wizard are listed in Upgrade Wizard Actions on page 563. If any of the
steps in the Command Progress screen fails, complete the step as described in that section before proceeding.
Recompile JARs
• MapReduce and YARN - Recompile JARs used in MapReduce applications. For further information, see For
MapReduce Programmers: Writing and Running Jobs on page 182.
• HBase - Recompile coprocessor and custom JARs used by HBase applications.
Upgrade HBase
1. Go to the HBase service.
2. Select Actions > Upgrade HBase and click Upgrade HBase to confirm.
Upgrade Oozie
1. Go to the Oozie service.
2. Select Actions > Upgrade Database and click Upgrade Database to confirm.
3. Start the Oozie service.
4. Select Actions > Install Oozie ShareLib and click Install Oozie ShareLib to confirm.
Upgrade Sqoop
1. Go to the Sqoop service.
2. Select Actions > Upgrade Sqoop and click Upgrade Sqoop to confirm.
Upgrading CDH 4
Use the instructions in this section to upgrade to a higher CDH 4 minor release, that is, from CDH 4.a.x to
CDH 4.b.y. For example, from CDH 4.6.0 to CDH 4.7.1.
You can upgrade to CDH 4.1.3 (or later) within the Cloudera Manager Admin Console, using parcels and an upgrade
wizard. This vastly simplifies the upgrade process. In addition, using parcels enables Cloudera Manager to
automate the deployment and rollback of CDH versions. Electing to upgrade using packages means that future
upgrades and rollbacks will still need to be done manually. Upgrading to a CDH 4 release prior to CDH 4.1.3 is
possible using packages, though upgrading to a more current release is strongly recommended.
If you use parcels, have a Cloudera Enterprise license, and have enabled HDFS high availability, you can perform
a rolling upgrade that lets you avoid cluster downtime.
Important: The following instructions describe how to upgrade from a CDH 4 release to a newer CDH
4 release in a Cloudera Manager deployment. If you are running CDH 3, you must upgrade to CDH 4
using the instructions at Upgrading CDH 3 to CDH 4 in a Cloudera Managed Deployment.
To upgrade from CDH 4 to CDH 5, see Upgrading CDH 4 to CDH 5 on page 546.
Upgrade Procedures
Important:
• Impala - If you have CDH 4.1.x with Cloudera Impala installed, and you plan to upgrade to CDH 4.2
or later, you must also upgrade Impala to version 1.2.1 or later. With a parcel installation you can
download and activate both parcels before you proceed to restart the cluster. You will need to
change the remote parcel repo URL to point to the location of the released product as described
in the upgrade procedures referenced below.
• HBase - In CDH 4.1.x, an HBase table could have an owner that had full administrative permissions
on the table. The owner construct was removed as of CDH 4.2.0, and the code now relies exclusively
on entries in the ACL table. Since table owners do not have an entry in this table, their permissions
are removed on upgrade from CDH 4.1.x to CDH 4.2.0 or later. If you are upgrading from CDH 4.1.x
to CDH 4.2 or later, and using HBase, you must add permissions for HBase owner users to the
HBase ACL table before you perform the upgrade. See the Known Issues in the CDH 4 Release
Notes, specifically the item "Must explicitly add permissions for owner users before upgrading
from 4.1.x" in the Known Issues in Apache HBase section.
• Hive - Hive has undergone major version changes from CDH 4.0 to 4.1 and between CDH 4.1 and
4.2. (CDH 4.0 had Hive 0.8.0, CDH 4.1 used Hive 0.9.0, and 4.2 or later has 0.10.0). This requires you
to manually back up and upgrade the Hive metastore database when upgrading between major
Hive versions. If you are upgrading from a version of CDH 4 prior to CDH 4.2 to a newer CDH 4
version, you must follow the steps for upgrading the metastore included in the upgrade procedures
referenced below.
Required Role:
You can upgrade your CDH 4 cluster to a higher minor version of CDH 4 using parcels from within the Cloudera
Manager Admin Console. Your current CDH 4 cluster can have been installed with either parcels or packages.
The new version will use parcels.
The following procedure requires cluster downtime. If you use parcels, have a Cloudera Enterprise license, and
have enabled HDFS high availability, you can perform a rolling upgrade that lets you avoid cluster downtime.
Important:
• Impala - If you have CDH 4.1.x with Cloudera Impala installed, and you plan to upgrade to CDH 4.2
or later, you must also upgrade Impala to version 1.2.1 or later. With a parcel installation you can
download and activate both parcels before you proceed to restart the cluster. You will need to
change the remote parcel repo URL to point to the location of the released product as described
in the upgrade procedures referenced below.
• HBase - In CDH 4.1.x, an HBase table could have an owner that had full administrative permissions
on the table. The owner construct was removed as of CDH 4.2.0, and the code now relies exclusively
on entries in the ACL table. Since table owners do not have an entry in this table, their permissions
are removed on upgrade from CDH 4.1.x to CDH 4.2.0 or later. If you are upgrading from CDH 4.1.x
to CDH 4.2 or later, and using HBase, you must add permissions for HBase owner users to the
HBase ACL table before you perform the upgrade. See the Known Issues in the CDH 4 Release
Notes, specifically the item "Must explicitly add permissions for owner users before upgrading
from 4.1.x" in the Known Issues in Apache HBase section.
• Hive - Hive has undergone major version changes from CDH 4.0 to 4.1 and between CDH 4.1 and
4.2. (CDH 4.0 had Hive 0.8.0, CDH 4.1 used Hive 0.9.0, and 4.2 or later has 0.10.0). This requires you
to manually back up and upgrade the Hive metastore database when upgrading between major
Hive versions. If you are upgrading from a version of CDH 4 prior to CDH 4.2 to a newer CDH 4
version, you must follow the steps for upgrading the metastore included in the upgrade procedures
referenced below.
To upgrade your version of CDH using parcels, the steps are as follows.
Upgrade Sqoop
1. Go to the Sqoop service.
2. Select Actions > Stop and click Stop to confirm.
3. Select Actions > Upgrade Sqoop and click Upgrade Sqoop to confirm.
Important: Removing the Hue Common package will remove your Hue database; if you do not
back it up you may lose all your Hue user account information.
3. Restart all the Cloudera Manager Agents to force an update of the symlinks to point to the newly installed
components on each host:
Required Role:
If you originally used Cloudera Manager to install your CDH service using packages, you can upgrade to a higher
minor version of CDH 4 either using packages or parcels. Parcels is the preferred and recommended way to
upgrade, as the upgrade wizard provided for parcels handles the upgrade process almost completely automatically.
Important:
• Impala - If you have CDH 4.1.x with Cloudera Impala installed, and you plan to upgrade to CDH 4.2
or later, you must also upgrade Impala to version 1.2.1 or later. With a parcel installation you can
download and activate both parcels before you proceed to restart the cluster. You will need to
change the remote parcel repo URL to point to the location of the released product as described
in the upgrade procedures referenced below.
• HBase - In CDH 4.1.x, an HBase table could have an owner that had full administrative permissions
on the table. The owner construct was removed as of CDH 4.2.0, and the code now relies exclusively
on entries in the ACL table. Since table owners do not have an entry in this table, their permissions
are removed on upgrade from CDH 4.1.x to CDH 4.2.0 or later. If you are upgrading from CDH 4.1.x
to CDH 4.2 or later, and using HBase, you must add permissions for HBase owner users to the
HBase ACL table before you perform the upgrade. See the Known Issues in the CDH 4 Release
Notes, specifically the item "Must explicitly add permissions for owner users before upgrading
from 4.1.x" in the Known Issues in Apache HBase section.
• Hive - Hive has undergone major version changes from CDH 4.0 to 4.1 and between CDH 4.1 and
4.2. (CDH 4.0 had Hive 0.8.0, CDH 4.1 used Hive 0.9.0, and 4.2 or later has 0.10.0). This requires you
to manually back up and upgrade the Hive metastore database when upgrading between major
Hive versions. If you are upgrading from a version of CDH 4 prior to CDH 4.2 to a newer CDH 4
version, you must follow the steps for upgrading the metastore included in the upgrade procedures
referenced below.
To upgrade your version of CDH using packages, the steps are as follows.
• Use the cloudera.com repository that is added during a typical installation, updating only Cloudera
components. This limits the scope of updates to be completed, so the process takes less time; however, this
process does not work if you created and used a custom repository. To install the new version, you can upgrade
from Cloudera's repository by adding an entry to your operating system's package management configuration
file. The repository location varies by operating system:
For example, under Red Hat, to upgrade from Cloudera's repository you can run commands such as the
following on the CDH host to update only CDH:
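As a sketch, assuming the repository is named cloudera-cdh4 as described in the Note below, you could limit
the update to that repository:
$ sudo yum clean all
$ sudo yum update --disablerepo='*' --enablerepo=cloudera-cdh4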
Note:
• cloudera-cdh4 is the name of the repository on your system; the name is usually in square
brackets on the first line of the repo file, in this example
/etc/yum.repos.d/cloudera-cdh4.repo:
• yum clean all cleans up yum's cache directories, ensuring that you download and install the
latest versions of the packages.
• If your system is not up to date, and any underlying system components need to be upgraded
before this yum update can succeed, yum will tell you what those are.
On a SLES system, use commands like this to clean cached repository information and then update only the
CDH components. For example:
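As a sketch, assuming the default Cloudera CDH 4 repository URL (verify the actual URL as described below):
$ sudo zypper clean --all
$ sudo zypper up -r http://archive.cloudera.com/cdh4/sles/11/x86_64/cdh/4/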
To verify the URL, open the Cloudera repo file in /etc/zypp/repos.d on your system (for example
/etc/zypp/repos.d/cloudera-cdh4.repo) and look at the line beginning
baseurl=
After cleaning the cache, use one of the following upgrade commands to upgrade CDH.
Precise:
Lucid:
Squeeze:
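As a sketch, assuming the default Cloudera repository target names (precise-cdh4, lucid-cdh4, squeeze-cdh4);
adjust the -t target if your sources list names the distribution differently:
$ sudo apt-get clean
$ sudo apt-get update
Precise:
$ sudo apt-get upgrade -t precise-cdh4
Lucid:
$ sudo apt-get upgrade -t lucid-cdh4
Squeeze:
$ sudo apt-get upgrade -t squeeze-cdh4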
• Use a custom repository. This process can be more complicated, but enables updating CDH components for
hosts that are not connected to the Internet. You can create your own repository, as described in Understanding
Custom Installation Solutions on page 135. Creating your own repository is necessary if you are upgrading a
cluster that does not have access to the Internet.
If you used a custom repository to complete the installation of your current files and now you want to update
using a custom repository, the details of the steps to complete the process are variable. In general, begin by
updating any existing custom repository that you will use with the installation files you wish to use. This
can be completed in a variety of ways. For example, you might use wget to copy the necessary installation
files. Once the installation files have been updated, use the custom repository you established for the initial
installation to update CDH.
OS Command
RHEL Ensure you have a custom repo that is configured to use your internal repository.
For example, you could have a custom repo file in /etc/yum.repos.d/ called
cdh_custom.repo in which you specify a local repository. In such a case, you
might use the following commands:
$ sudo yum clean all
$ sudo yum update 'cloudera-*'
SLES Use commands such as the following to clean cached repository information
and then update only the CDH components:
$ sudo zypper clean --all
$ sudo zypper up -r
http://internalserver.example.com/path_to_cdh_repo
Ubuntu or Debian Use a command that targets upgrade of your CDH distribution using the custom
repository specified in your apt configuration files. These files are typically either
the /etc/apt/apt.conf file or in various files in the /etc/apt/apt.conf.d/
directory. Information about your custom repository must be included in the
repo files. The general form of entries in Debian/Ubuntu is:
deb http://server.example.com/directory/ dist-name pool
For example, the entry for the default repo is:
deb http://us.archive.ubuntu.com/ubuntu/ precise universe
On a Debian/Ubuntu system, use commands such as the following to clean
cached repository information and then update only the CDH components:
$ sudo apt-get clean
$ sudo apt-get upgrade -t your_cdh_repo
Upgrade Sqoop
1. Go to the Sqoop service.
2. Select Actions > Stop and click Stop to confirm.
3. Select Actions > Upgrade Sqoop and click Upgrade Sqoop to confirm.
Upgrading CDH 3
Warning: Cloudera Manager 3 and CDH 3 have reached End of Maintenance (EOM) as of June 20,
2013. Cloudera does not support or provide patches for Cloudera Manager 3 and CDH 3 releases.
To upgrade CDH 3 to CDH 4 with Cloudera Manager 4, follow the instructions at Upgrading CDH 3 to CDH 4 in a
Cloudera Manager Deployment.
Important:
• If you use Cloudera Manager, do not use these command-line instructions.
• This information applies specifically to CDH 5.4.x. If you use an earlier version of CDH, see the
documentation for that version located at Cloudera Documentation.
Note: If you are using Cloudera Manager to manage CDH, do not use the instructions in this section.
• If you are running Cloudera Manager 4, you must upgrade to Cloudera Manager 5 first, as Cloudera
Manager 4 cannot manage CDH 5; see Upgrading Cloudera Manager on page 442.
• Follow directions in Upgrading CDH 4 to CDH 5 on page 546 to upgrade CDH 4 to CDH 5 in a Cloudera
Manager deployment.
Use the following information and instructions to upgrade to the latest CDH 5 release from a CDH 4 release:
• Before You Begin on page 573
• Upgrading to CDH 5 on page 574
Important: This involves uninstalling the CDH 4 packages and installing the CDH 5 packages.
Note:
If you are migrating from MapReduce v1 (MRv1) to MapReduce v2 (MRv2, YARN), see Migrating from
MapReduce 1 (MRv1) to MapReduce 2 (MRv2, YARN) on page 182 for important information and
instructions.
Warning:
It's particularly important that you read the Install and Upgrade Known Issues.
Plan Downtime
If you are upgrading a cluster that is part of a production system, be sure to plan ahead. As with any operational
work, be sure to reserve a maintenance window with enough extra time allotted in case of complications. The
Hadoop upgrade process is well understood, but it is best to be cautious. For production clusters, Cloudera
recommends allocating up to a full day maintenance window to perform the upgrade, depending on the number
of hosts, the amount of experience you have with Hadoop and Linux, and the particular hardware you are using.
Install Java 1.7
CDH 5 requires Java 1.7 or later. See Upgrading to Oracle JDK 1.7 before Upgrading to CDH 5 on page 609, and
make sure you have read the Install and Upgrade Known Issues before you proceed with the upgrade.
Delete Symbolic Links in HDFS
If there are symbolic links in HDFS when you upgrade from CDH 4 to CDH 5, the upgrade will fail and you will
have to downgrade to CDH 4, delete the symbolic links, and start over. To prevent this, proceed as follows.
3. Use a command such as the following to find the path names of any symbolic links listed in
/tmp/YYYY-MM-DD_FSIMAGE.txt and write them out to the file /tmp/symlinks.txt:
Important:
If you decide to configure HA for the NameNode, do not install hadoop-hdfs-secondarynamenode.
After completing the HDFS HA software configuration, follow the installation instructions under
Deploying HDFS High Availability.
• To upgrade an existing configuration, follow the instructions under Upgrading to CDH 5 on page 574.
Upgrading to CDH 5
Important:
1. To upgrade from CDH 4, you must uninstall CDH 4, and then install CDH 5. Make sure you allow
sufficient time for this, and do the necessary backup and preparation as described below.
2. If you have configured HDFS HA with NFS shared storage, do not proceed. This configuration is
not supported on CDH 5; Quorum-based storage is the only supported HDFS HA configuration on
CDH 5. Unconfigure your NFS shared storage configuration before you attempt to upgrade.
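Saving the namespace while the NameNode is in safe mode is what produces the new fsimage referred to below.
As a sketch, run the following as the HDFS superuser on the NameNode:
$ sudo -u hdfs hdfs dfsadmin -safemode enter
$ sudo -u hdfs hdfs dfsadmin -saveNamespace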
This will result in a new fsimage being written out with no edit log entries.
c. With the NameNode still in safe mode, shut down all services as instructed below.
2. For each component you are using, back up configuration data, databases, and other important files.
3. Shut down the Hadoop services across your entire cluster:
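One typical way to do this on each host is to stop every Hadoop init script, for example:
$ for x in `cd /etc/init.d ; ls hadoop-*` ; do sudo service $x stop ; done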
4. Check each host to make sure that there are no processes running as the hdfs or mapred users from root:
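One way to check, as root, is to list processes owned by those users, for example:
# ps -aef | grep -E 'hdfs|mapred'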
Important:
Do this step when you are sure that all Hadoop services have been shut down. It is particularly
important that the NameNode service is not running so that you can make a consistent backup.
Note:
• Cloudera recommends backing up HDFS metadata on a regular basis, as well as before a major
upgrade.
• dfs.name.dir is deprecated but still works; dfs.namenode.name.dir is preferred. This example
uses dfs.name.dir.
<property>
<name>dfs.name.dir</name>
<value>/mnt/hadoop/hdfs/name</value>
</property>
2. Back up the directory. The path inside the <value> XML element is the path to your HDFS metadata. If you
see a comma-separated list of paths, there is no need to back up all of them; they store the same data. Back
up the first directory, for example, by using the following commands:
# cd /mnt/hadoop/hdfs/name
# tar -cvf /root/nn_backup_data.tar .
./
./current/
./current/fsimage
./current/fstime
./current/VERSION
./current/edits
./image/
./image/fsimage
Warning: If you see a file containing the word lock, the NameNode is probably still running. Repeat
the preceding steps, starting by shutting down the Hadoop services.
Warning: Do not proceed before you have backed up the HDFS metadata, and the files and databases
for the individual components, as instructed in the previous steps.
To uninstall Hadoop:
Run this command on each host:
On Red Hat-compatible systems:
On SLES systems:
On Ubuntu systems:
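The exact package list depends on which CDH 4 components you installed; the hadoop and hue-common packages
below are illustrative examples only, not the complete list:
On Red Hat-compatible systems: $ sudo yum remove hadoop hue-common
On SLES systems: $ sudo zypper remove hadoop hue-common
On Ubuntu systems: $ sudo apt-get remove hadoop hue-common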
Important:
• Before removing the files, make sure you have not added any custom entries that you want to
preserve. (To preserve custom entries, back up the files before removing them.)
• Make sure you remove Impala and Search repository files, as well as the CDH repository file.
Note:
For instructions on how to add a CDH 5 yum repository or build your own CDH 5 yum repository, see
Installing the Latest CDH 5 Release on page 166.
• Red Hat/CentOS/Oracle 6
This ensures that the system repositories contain the latest software (it does not actually install
anything).
On SLES systems:
1. Download the CDH 5 "1-click Install" package.
Click this link, choose Save File, and save it to a directory to which you have write access (for example, your
home directory).
2. Install the RPM:
This ensures that the system repositories contain the latest software (it does not actually install
anything).
This ensures that the system repositories contain the latest software (it does not actually install
anything).
$ curl -s http://archive.cloudera.com/cdh5/ubuntu/precise/amd64/cdh/archive.key
| sudo apt-key add -
$ curl -s http://archive.cloudera.com/cdh5/debian/wheezy/amd64/cdh/archive.key
| sudo apt-key add -
Note: Skip this step and go to Install CDH 5 with MRv1 on page 580 if you intend to use only MRv1.
Important: Cloudera recommends that you install (or update) and start a ZooKeeper cluster before
proceeding. This is a requirement if you are deploying high availability (HA) for the NameNode or
JobTracker.
Red Hat/CentOS compatible sudo yum clean all; sudo yum install
hadoop-yarn-resourcemanager
Red Hat/CentOS compatible sudo yum clean all; sudo yum install
hadoop-hdfs-namenode
Red Hat/CentOS compatible sudo yum clean all; sudo yum install
hadoop-hdfs-secondarynamenode
Red Hat/CentOS compatible sudo yum clean all; sudo yum install
hadoop-mapreduce-historyserver
hadoop-yarn-proxyserver
Red Hat/CentOS compatible sudo yum clean all; sudo yum install
hadoop-client
Note: The hadoop-yarn and hadoop-hdfs packages are installed on each system automatically
as dependencies of the other packages.
Note:
Skip this step if you intend to use only YARN. If you are installing both YARN and MRv1, you can skip
any packages you have already installed in Step 6a.
Note: If you are also installing YARN, you can skip any packages you have already installed in Step
6a.
Important: Cloudera recommends that you install (or update) and start a ZooKeeper cluster before
proceeding. This is a requirement if you are deploying high availability (HA) for the NameNode or
JobTracker.
Red Hat/CentOS compatible: sudo yum clean all; sudo yum install hadoop-0.20-mapreduce-jobtracker
Red Hat/CentOS compatible: sudo yum clean all; sudo yum install hadoop-hdfs-namenode
Red Hat/CentOS compatible: sudo yum clean all; sudo yum install hadoop-hdfs-secondarynamenode
Red Hat/CentOS compatible: sudo yum clean all; sudo yum install hadoop-0.20-mapreduce-tasktracker hadoop-hdfs-datanode
Red Hat/CentOS compatible: sudo yum clean all; sudo yum install hadoop-client
$ cp /etc/hadoop/conf.empty/log4j.properties /etc/hadoop/conf.my_cluster/log4j.properties
2. Start the JournalNode daemons on each of the machines where they will run:
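For example, using the packaged init script (a sketch):
$ sudo service hadoop-hdfs-journalnode start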
Wait for the daemons to start before proceeding to the next step.
Important: The JournalNodes must be up and running CDH 5 before you proceed.
Note:
What you do in this step differs depending on whether you are upgrading an HDFS HA deployment
using Quorum-based storage, or a non-HA deployment using a secondary NameNode. (If you have
an HDFS HA deployment using NFS storage, do not proceed; you cannot upgrade that configuration
to CDH 5. Unconfigure your NFS shared storage configuration before you attempt to upgrade.)
• For an HA deployment, do sub-steps 1, 2, and 3 below.
• For a non-HA deployment, do sub-steps 1, 3, and 4 below.
1. To upgrade the HDFS metadata, run the following command on the NameNode. If HA is enabled, do this on
the active NameNode only, and make sure the JournalNodes have been upgraded to CDH 5 and are up and
running before you run the command.
Important: In an HDFS HA deployment, it is critically important that you do this on only one
NameNode.
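For example, using the packaged init script (a sketch):
$ sudo service hadoop-hdfs-namenode upgrade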
Look for a line that confirms the upgrade is complete, such as:
/var/lib/hadoop-hdfs/cache/hadoop/dfs/<name> is complete
Note: The NameNode upgrade process can take a while depending on how many files you have.
For more information about the haadmin -failover command, see Administering an HDFS High Availability
Cluster.
3. Start up the DataNodes:
On each DataNode:
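For example, using the packaged init script (a sketch):
$ sudo service hadoop-hdfs-datanode start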
4. Do this step only in a non-HA configuration. Otherwise skip to starting YARN or MRv1.
Wait for NameNode to exit safe mode, and then start the Secondary NameNode.
a. To check that the NameNode has exited safe mode, look for messages in the log file, or the NameNode's
web interface, that say "...no longer in safe mode."
b. To start the Secondary NameNode (if used), enter the following command on the Secondary NameNode
host:
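For example (a sketch):
$ sudo service hadoop-hdfs-secondarynamenode start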
Important: Make sure you are not trying to run MRv1 and YARN on the same set of nodes at the
same time. This is not recommended; it will degrade your performance and may result in an unstable
MapReduce cluster deployment. Steps 10a and 10b are mutually exclusive.
After you have verified HDFS is operating correctly, you are ready to start YARN. First, create directories and set
the correct permissions.
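For example (a sketch; these are the default history and YARN log directories referenced by the MapReduce and YARN configuration):
$ sudo -u hdfs hadoop fs -mkdir -p /user/history
$ sudo -u hdfs hadoop fs -chmod -R 1777 /user/history
$ sudo -u hdfs hadoop fs -chown mapred:hadoop /user/history
$ sudo -u hdfs hadoop fs -mkdir -p /var/log/hadoop-yarn
$ sudo -u hdfs hadoop fs -chown yarn:mapred /var/log/hadoop-yarn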
Note: You need to create this directory because it is the parent of /var/log/hadoop-yarn/apps
which is explicitly configured in the yarn-site.xml.
Note: Make sure you always start ResourceManager before starting NodeManager services.
On each NodeManager system (typically the same ones where DataNode service runs):
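For example (a sketch using the packaged init scripts; start the ResourceManager on its host first):
$ sudo service hadoop-yarn-resourcemanager start    # on the ResourceManager host
$ sudo service hadoop-yarn-nodemanager start        # on each NodeManager host
$ sudo service hadoop-mapreduce-historyserver start # on the JobHistory Server host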
For each user who will be submitting MapReduce jobs using MapReduce v2 (YARN), or running Pig, Hive, or Sqoop
in a YARN installation, make sure that the HADOOP_MAPRED_HOME environment variable is set correctly as follows:
$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
Note:
For important configuration information, see Deploying MapReduce v2 (YARN) on a Cluster.
1. Create a home directory on HDFS for the user who will be running the job (for example, joe):
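For example (a sketch):
$ sudo -u hdfs hadoop fs -mkdir /user/joe
$ sudo -u hdfs hadoop fs -chown joe /user/joe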
$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
4. Run an example Hadoop job to grep with a regular expression in your input data.
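For example (a sketch using the bundled examples JAR):
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output23 'dfs[a-z.]+'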
5. After the job completes, you can find the output in the HDFS directory named output23 because you specified
that output directory to Hadoop.
$ hadoop fs -ls
Found 2 items
drwxr-xr-x - joe supergroup 0 2009-08-18 18:36 /user/joe/input
drwxr-xr-x - joe supergroup 0 2009-08-18 18:38 /user/joe/output23
Important: If you have client hosts, make sure you also update them to CDH 5, and upgrade the
components running on those clients as well.
Important: Make sure you are not trying to run MRv1 and YARN on the same set of nodes at the
same time. This is not recommended; it will degrade your performance and may result in an unstable
MapReduce cluster deployment. Steps 9a and 9b are mutually exclusive.
After you have verified HDFS is operating correctly, you are ready to start MapReduce. On each TaskTracker
system:
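For example, using the packaged init scripts (a sketch):
$ sudo service hadoop-0.20-mapreduce-tasktracker start
On the JobTracker system:
$ sudo service hadoop-0.20-mapreduce-jobtracker start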
If the permissions of directories are not configured correctly, the JobTracker and TaskTracker processes start
and immediately fail. If this happens, check the JobTracker and TaskTracker logs and set the permissions correctly.
Important: For each user who will be submitting MapReduce jobs using MapReduce v1 (MRv1), or
running Pig, Hive, or Sqoop in an MRv1 installation, set the HADOOP_MAPRED_HOME environment variable
as follows:
$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce
2. Make a directory in HDFS called input and copy some XML files into it by running the following commands:
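For example (a sketch):
$ hadoop fs -mkdir input
$ hadoop fs -put /etc/hadoop/conf/*.xml input
$ hadoop fs -ls input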
$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce/
4. Run an example Hadoop job to grep with a regular expression in your input data.
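For example (a sketch using the bundled examples JAR):
$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar grep input output 'dfs[a-z.]+'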
5. After the job completes, you can find the output in the HDFS directory named output because you specified
that output directory to Hadoop.
$ hadoop fs -ls
Found 2 items
drwxr-xr-x - joe supergroup 0 2009-08-18 18:36 /user/joe/input
drwxr-xr-x - joe supergroup 0 2009-08-18 18:38 /user/joe/output
Important:
If you have client hosts, make sure you also update them to CDH 5, and upgrade the components
running on those clients as well.
Important:
During uninstall, the package manager renames any configuration files you have modified from
<file> to <file>.rpmsave. During re-install, the package manager creates a new <file> with
applicable defaults. You are responsible for applying any changes captured in the original CDH 4
configuration file to the new CDH 5 configuration file. In the case of Ubuntu and Debian upgrades, a
file will not be installed if there is already a version of that file on the system, and you will be prompted
to resolve conflicts; for details, see Automatic handling of configuration files by dpkg.
For example, if you have modified your CDH 4 zoo.cfg configuration file (/etc/zookeeper.dist/zoo.cfg),
RPM uninstall and re-install (using yum remove) renames and preserves a copy of your modified zoo.cfg as
/etc/zookeeper.dist/zoo.cfg.rpmsave. You should compare this to the new
/etc/zookeeper/conf/zoo.cfg and resolve any differences that should be carried forward (typically where
you have changed property value defaults). Do this for each component you upgrade to CDH 5.
Finalize the HDFS Metadata Upgrade
To finalize the HDFS metadata upgrade you began earlier in this procedure, proceed as follows:
1. Make sure you are satisfied that the CDH 5 upgrade has succeeded and everything is running smoothly. It is not unusual to wait days or even weeks before finalizing the upgrade. To determine when finalization is warranted, run important workloads and ensure they are successful. Once you have finalized the upgrade, you cannot roll back to a previous version of HDFS without using backups.
Warning:
Do not proceed until you are sure you are satisfied with the new deployment. Once you have
finalized the HDFS metadata, you cannot revert to an earlier version of HDFS.
Note:
• If you need to restart the NameNode during this period (after having begun the upgrade
process, but before you've run finalizeUpgrade) simply restart your NameNode without
the -upgrade option.
• Make sure you have plenty of free disk space if you believe it will take some time to verify
that you are ready to finalize the upgrade. Before you finalize:
– Deleting files will not in fact free up the space.
– Using the balancer will cause all replicas moved to actually be duplicated.
– The size of all on-disk data representing the NameNodes' metadata will be retained,
potentially doubling (or more than doubling) the amount of space required on the
NameNodes' and JournalNodes' disks.
2. Finalize the HDFS metadata upgrade: use one of the following commands, depending on whether Kerberos
is enabled (see Configuring Hadoop Security in CDH 5).
Important: In an HDFS HA deployment, make sure that both the NameNodes and all of the
JournalNodes are up and functioning normally before you proceed.
• If Kerberos is enabled:
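For example (a sketch; substitute your own keytab path and principal):
$ kinit -kt /path/to/hdfs.keytab hdfs/<fully.qualified.domain.name@YOUR-REALM.COM>
$ hdfs dfsadmin -finalizeUpgrade
• If Kerberos is not enabled, run the command as the hdfs user:
$ sudo -u hdfs hdfs dfsadmin -finalizeUpgrade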
Note: After the metadata upgrade completes, the previous/ and blocksBeingWritten/
directories in the DataNodes' data directories aren't cleared until the DataNodes are restarted.
Important:
• If you are using Cloudera Manager to manage CDH, do not use the instructions in this section.
Follow the directions under Upgrading CDH and Managed Services Using Cloudera Manager on
page 479 to upgrade to the latest version of CDH 5 in a Cloudera Manager deployment.
• The instructions in this section describe how to upgrade to the latest CDH 5 release from an earlier
CDH 5 release. If you are upgrading from a CDH 4 release, use the instructions under Upgrading
from CDH 4 to CDH 5 on page 573 instead.
• MapReduce v1 (MRv1) and MapReduce v2 (YARN): the sections that follow cover upgrades for both MRv1 and YARN. MRv1 and YARN share a common set of configuration files, so it is safe to configure both of them, but Cloudera does not recommend running MRv1 and YARN daemons on the same hosts at the same time. If you want to switch easily between MRv1 and YARN, consider using the Cloudera Manager features for managing these services.
Important Tasks
• Upgrading from any earlier CDH release to CDH 5.4.0 or later requires an HDFS metadata upgrade.
• Upgrading from a release earlier than 5.2.0 requires all of the following:
– Upgrade HDFS metadata
– Upgrade the Sentry database
– Upgrade the Hive database
– Upgrade the Sqoop 2 database
Make sure you also do the following tasks that are required for every upgrade:
Note:
• Before upgrading, read about the latest Incompatible Changes and Known Issues and Workarounds
in CDH 5 in the CDH 5 Release Notes.
Warning:
It's particularly important that you read the Install and Upgrade Known Issues.
• If you are upgrading a cluster that is part of a production system, plan ahead. For production
clusters, Cloudera recommends allocating up to a full day maintenance window to perform the
upgrade, depending on the number of hosts, the amount of experience you have with Hadoop and
Linux, and the particular hardware you are using.
• The instructions in this section assume you are upgrading a multi-node cluster. If you are running a
pseudo-distributed (single-machine) cluster, Cloudera recommends that you copy your data off the cluster,
remove the old CDH release, install Hadoop from CDH 5, and then restore your data.
• If you have a multi-node cluster running an earlier version of CDH 5, use the appropriate instructions to
upgrade your cluster to the latest version:
– Upgrading to the Latest Release on page 592
Problem
The problem occurs when you try to upgrade the hadoop-kms package, for example:
/var/cache/zypp/packages/cdh/RPMS/x86_64/hadoop-kms-2.5.0+cdh5.3.2+801-1.cdh5.3.2.p0.224.sles11.x86_64.rpm:
Header V4 DSA signature: NOKEY, key ID e8f86acd
12:54:19 error: %postun(hadoop-kms-2.5.0+cdh5.3.1+791-1.cdh5.3.1.p0.17.sles11.x86_64)
scriptlet failed, exit status 1
12:54:19
Note:
• The hadoop-kms package is not installed automatically with CDH, so you will encounter this error
only if you are explicitly upgrading an existing version of KMS.
• The examples in this section show an upgrade from CDH 5.3.x; the 5.2.x case looks very similar.
What to Do
If you see an error similar to the one in the example above, proceed as follows:
1. Abort, or ignore the error (it doesn't matter which):
2. Perform cleanup.
a. # rpm -qa hadoop-kms
You will see two versions of hadoop-kms; for example:
hadoop-kms-2.5.0+cdh5.3.1+791-1.cdh5.3.1.p0.17.sles11
hadoop-kms-2.5.0+cdh5.3.2+801-1.cdh5.3.2.p0.224.sles11
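b. Remove the older of the two versions; for example (a sketch; the --noscripts option skips the failing uninstall scriptlet):
# rpm -e --noscripts hadoop-kms-2.5.0+cdh5.3.1+791-1.cdh5.3.1.p0.17.sles11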
3. Verify that the older version of the package has been removed:
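For example:
# rpm -qa hadoop-kms
The output should now show only the newer version: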
hadoop-kms-2.5.0+cdh5.3.2+801-1.cdh5.3.2.p0.224.sles11
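Put the NameNode into safe mode and save the namespace; for example (a sketch run as the hdfs superuser):
$ sudo -u hdfs hdfs dfsadmin -safemode enter
$ sudo -u hdfs hdfs dfsadmin -saveNamespace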
This will result in a new fsimage being written out with no edit log entries.
c. With the NameNode still in safe mode, shut down all services as instructed below.
2. Shut down Hadoop services across your entire cluster by running the following command on every host in
your cluster:
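For example (a sketch that stops every Hadoop service installed on the host):
$ for x in `cd /etc/init.d ; ls hadoop-*` ; do sudo service $x stop ; done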
3. Check each host to make sure that there are no processes running as the hdfs, yarn, mapred or httpfs
users from root:
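For example:
# ps -aef | grep java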
Important: When you are sure that all Hadoop services have been shut down, do the following
step. It is particularly important that the NameNode service is not running so that you can make
a consistent backup.
Note:
• Cloudera recommends backing up HDFS metadata on a regular basis, as well as before a major
upgrade.
• dfs.name.dir is deprecated but still works; dfs.namenode.name.dir is preferred. This
example uses dfs.name.dir.
b. Back up the directory. The path inside the <value> XML element is the path to your HDFS metadata. If
you see a comma-separated list of paths, there is no need to back up all of them; they store the same
data. Back up the first directory, for example, by using the following commands:
$ cd /mnt/hadoop/hdfs/name
# tar -cvf /root/nn_backup_data.tar .
./
./current/
./current/fsimage
./current/fstime
./current/VERSION
./current/edits
./image/
./image/fsimage
Important: If you see a file containing the word lock, the NameNode is probably still running.
Repeat the preceding steps from the beginning; start at Step 1 and shut down the Hadoop
services.
Step 2: If Necessary, Download the CDH 5 "1-click" Package on Each of the Hosts in your Cluster
Before you begin: Check whether you have the CDH 5 "1-click" repository installed.
rpm -q cdh5-repository
If you are upgrading from CDH 5 Beta 1 or later, you should see:
cdh5-repository-1-0
If the repository is installed, skip to Step 3; otherwise proceed with these instructions.
Summary: If the CDH 5 "1-click" repository is not already installed on each host in the cluster, follow the
instructions below for that host's operating system:
• Instructions for Red Hat-compatible systems
• Instructions for SLES systems
• Instructions for Ubuntu and Debian systems
• Red Hat/CentOS/Oracle 6
Note:
For instructions on how to add a CDH 5 yum repository or build your own CDH 5 yum repository, see
Installing CDH 5 On Red Hat-compatible systems.
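Clean the yum cache; for example:
$ sudo yum clean all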
This ensures that the system repositories contain the latest software (it does not actually install
anything).
On SLES systems:
1. Download the CDH 5 "1-click Install" package.
Click this link, choose Save File, and save it to a directory to which you have write access (for example, your
home directory).
2. Install the RPM:
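For example (a sketch, assuming the package was saved to the current directory as cloudera-cdh-5-0.x86_64.rpm):
$ sudo rpm -i cloudera-cdh-5-0.x86_64.rpm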
Note:
For instructions on how to add a repository or build your own repository, see Installing CDH 5 on SLES
Systems.
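Then refresh the system package index; for example:
$ sudo zypper refresh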
This ensures that the system repositories contain the latest software (it does not actually install
anything).
• Choose Save File, save the package to a directory to which you have write access (for example, your home
directory), and install it from the command line. For example:
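A sketch, assuming the repository package was saved to the current directory as cdh5-repository_1.0_all.deb:
$ sudo dpkg -i cdh5-repository_1.0_all.deb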
Note:
For instructions on how to add a repository or build your own repository, see the instructions on
installing CDH 5 on Ubuntu and Debian systems.
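Then update the package index; for example:
$ sudo apt-get update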
This ensures that the system repositories contain the latest software (it does not actually install
anything).
Note:
• Remember that you can install and configure both MRv1 and YARN, but you should not run them
both on the same set of nodes at the same time.
• If you are using HA for the NameNode, do not install hadoop-hdfs-secondarynamenode.
Before installing MRv1 or YARN: (Optionally) add a repository key on each system in the cluster, if you have not
already done so. Add the Cloudera Public GPG Key to your repository by executing one of the following commands:
• For Red Hat/CentOS/Oracle 5 systems:
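For example (a sketch; keys for other Red Hat-compatible releases follow the same URL pattern):
$ sudo rpm --import http://archive.cloudera.com/cdh5/redhat/5/x86_64/cdh/RPM-GPG-KEY-cloudera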
$ curl -s
http://archive.cloudera.com/cdh5/ubuntu/precise/amd64/cdh/archive.key
| sudo apt-key add -
$ curl -s
http://archive.cloudera.com/cdh5/debian/wheezy/amd64/cdh/archive.key
| sudo apt-key add -
Step 3a: If you are using MRv1, upgrade the MRv1 packages on the appropriate hosts.
Skip this step if you are using YARN exclusively. Otherwise upgrade each type of daemon package on the
appropriate hosts as follows:
1. Install and deploy ZooKeeper:
Important:
Cloudera recommends that you install (or update) and start a ZooKeeper cluster before proceeding.
This is a requirement if you are deploying high availability (HA) for the NameNode or JobTracker.
Red Hat/CentOS compatible: $ sudo yum clean all; sudo yum install hadoop-0.20-mapreduce-jobtracker
Red Hat/CentOS compatible: $ sudo yum clean all; sudo yum install hadoop-hdfs-namenode
Red Hat/CentOS compatible: $ sudo yum clean all; sudo yum install hadoop-hdfs-secondarynamenode
Red Hat/CentOS compatible: $ sudo yum clean all; sudo yum install hadoop-0.20-mapreduce-tasktracker hadoop-hdfs-datanode
Red Hat/CentOS compatible: $ sudo yum clean all; sudo yum install hadoop-client
Step 3b: If you are using YARN, upgrade the YARN packages on the appropriate hosts.
Skip this step if you are using MRv1 exclusively. Otherwise upgrade each type of daemon package on the
appropriate hosts as follows:
1. Install and deploy ZooKeeper:
Important:
Cloudera recommends that you install (or update) and start a ZooKeeper cluster before proceeding.
This is a requirement if you are deploying high availability (HA) for the NameNode or JobTracker.
Red Hat/CentOS compatible: $ sudo yum clean all; sudo yum install hadoop-yarn-resourcemanager
Red Hat/CentOS compatible: $ sudo yum clean all; sudo yum install hadoop-hdfs-namenode
Red Hat/CentOS compatible: $ sudo yum clean all; sudo yum install hadoop-hdfs-secondarynamenode
Red Hat/CentOS compatible: $ sudo yum clean all; sudo yum install hadoop-yarn-nodemanager hadoop-hdfs-datanode hadoop-mapreduce
Red Hat/CentOS compatible: $ sudo yum clean all; sudo yum install hadoop-mapreduce-historyserver hadoop-yarn-proxyserver
Red Hat/CentOS compatible: $ sudo yum clean all; sudo yum install hadoop-client
Note:
The hadoop-yarn and hadoop-hdfs packages are installed on each system automatically as
dependencies of the other packages.
2. Start the JournalNode daemons on each of the machines where they will run:
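For example, using the packaged init script (a sketch):
$ sudo service hadoop-hdfs-journalnode start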
Wait for the daemons to start before proceeding to the next step.
Important:
In an HA deployment, the JournalNodes must be up and running CDH 5 before you proceed.
Note:
What you do in this step differs depending on whether you are upgrading an HDFS HA deployment,
or a non-HA deployment using a secondary NameNode:
• For an HA deployment, do sub-steps 5a, 5b, and 5c.
• For a non-HA deployment, do sub-steps 5a, 5c, and 5d.
Important:
In an HDFS HA deployment, it is critically important that you do this on only one NameNode.
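To upgrade the HDFS metadata, run the following command on the NameNode (in an HA deployment, on the active NameNode only); for example, using the packaged init script (a sketch):
$ sudo service hadoop-hdfs-namenode upgrade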
Look for a line that confirms the upgrade is complete, such as:
/var/lib/hadoop-hdfs/cache/hadoop/dfs/<name> is complete
Note:
The NameNode upgrade process can take a while, depending on how many files you have.
Step 5b: Do this step only in an HA deployment. Otherwise skip to starting up the DataNodes.
Wait for NameNode to exit safe mode, and then re-start the standby NameNode.
• If Kerberos is enabled:
Step 5d: Do this step only in a non-HA deployment. Otherwise skip to starting YARN or MRv1.
Wait for NameNode to exit safe mode, and then start the Secondary NameNode.
1. To check that the NameNode has exited safe mode, look for messages in the log file, or the NameNode's web
interface, that say "...no longer in safe mode."
2. To start the Secondary NameNode, enter the following command on the Secondary NameNode host:
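For example (a sketch):
$ sudo service hadoop-hdfs-secondarynamenode start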
Important:
Make sure you are not trying to run MRv1 and YARN on the same set of nodes at the same time. This
is not recommended; it will degrade performance and may result in an unstable MapReduce cluster
deployment. Steps 6a and 6b are mutually exclusive.
After you have verified HDFS is operating correctly, you are ready to start MapReduce. On each TaskTracker
system:
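For example, using the packaged init scripts (a sketch):
$ sudo service hadoop-0.20-mapreduce-tasktracker start
On the JobTracker system:
$ sudo service hadoop-0.20-mapreduce-jobtracker start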
If the permissions of directories are not configured correctly, the JobTracker and TaskTracker processes start
and immediately fail. If this happens, check the JobTracker and TaskTracker logs and set the permissions correctly.
Verify basic cluster operation for MRv1.
At this point your cluster is upgraded and ready to run jobs. Before running your production jobs, verify basic
cluster operation by running an example from the Apache Hadoop web site.
Note:
For important configuration information, see Deploying MapReduce v1 (MRv1) on a Cluster.
1. Create a home directory on HDFS for the user who will be running the job (for example, joe):
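For example (a sketch):
$ sudo -u hdfs hadoop fs -mkdir /user/joe
$ sudo -u hdfs hadoop fs -chown joe /user/joe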
3. Run an example Hadoop job to grep with a regular expression in your input data.
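For example (a sketch using the bundled examples JAR):
$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar grep input output 'dfs[a-z.]+'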
4. After the job completes, you can find the output in the HDFS directory named output because you specified
that output directory to Hadoop.
$ hadoop fs -ls
Found 2 items
drwxr-xr-x - joe supergroup 0 2009-08-18 18:36 /user/joe/input
drwxr-xr-x - joe supergroup 0 2009-08-18 18:38 /user/joe/output
Important:
If you have client hosts, make sure you also update them to CDH 5, and upgrade the components
running on those clients as well.
Important:
Make sure you are not trying to run MRv1 and YARN on the same set of nodes at the same time. This
is not recommended; it will degrade your performance and may result in an unstable MapReduce
cluster deployment. Steps 6a and 6b are mutually exclusive.
After you have verified HDFS is operating correctly, you are ready to start YARN. First, if you have not already
done so, create directories and set the correct permissions.
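For example (a sketch; these are the default history and YARN log directories):
$ sudo -u hdfs hadoop fs -mkdir -p /user/history
$ sudo -u hdfs hadoop fs -chmod -R 1777 /user/history
$ sudo -u hdfs hadoop fs -chown mapred:hadoop /user/history
$ sudo -u hdfs hadoop fs -mkdir -p /var/log/hadoop-yarn
$ sudo -u hdfs hadoop fs -chown yarn:mapred /var/log/hadoop-yarn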
Note: You need to create this directory because it is the parent of /var/log/hadoop-yarn/apps
which is explicitly configured in the yarn-site.xml.
Note:
Make sure you always start ResourceManager before starting NodeManager services.
On each NodeManager system (typically the same ones where DataNode service runs):
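For example (a sketch; start the ResourceManager on its host first):
$ sudo service hadoop-yarn-resourcemanager start    # on the ResourceManager host
$ sudo service hadoop-yarn-nodemanager start        # on each NodeManager host
$ sudo service hadoop-mapreduce-historyserver start # on the JobHistory Server host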
For each user who will be submitting MapReduce jobs using MapReduce v2 (YARN), or running Pig, Hive, or Sqoop
1 in a YARN installation, set the HADOOP_MAPRED_HOME environment variable as follows:
$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
Note:
For important configuration information, see Deploying MapReduce v2 (YARN) on a Cluster.
1. Create a home directory on HDFS for the user who will be running the job (for example, joe):
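For example (a sketch):
$ sudo -u hdfs hadoop fs -mkdir /user/joe
$ sudo -u hdfs hadoop fs -chown joe /user/joe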
$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
4. Run an example Hadoop job to grep with a regular expression in your input data.
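For example (a sketch using the bundled examples JAR):
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output23 'dfs[a-z.]+'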
After the job completes, you can find the output in the HDFS directory named output23 because you specified
that output directory to Hadoop.
$ hadoop fs -ls
Found 2 items
drwxr-xr-x - joe supergroup 0 2009-08-18 18:36 /user/joe/input
drwxr-xr-x - joe supergroup 0 2009-08-18 18:38 /user/joe/output23
Important:
If you have client hosts, make sure you also update them to CDH 5, and upgrade the components
running on those clients as well.
Note:
• For important information on new and changed components, see the CDH 5 Release Notes. To
see whether there is a new version of a particular component in CDH 5, check the CDH Version
and Packaging Information.
• Cloudera recommends that you regularly update the software on each system in the cluster (for
example, on a RHEL-compatible system, regularly run yum update) to ensure that all the
dependencies for any given component are up to date. (If you have not been in the habit of doing
this, be aware that the command may take a while to run the first time you use it.)
CDH 5 Components
Use the following sections to install or upgrade CDH 5 components:
• Crunch Installation on page 227
• Flume Installation on page 228
• HBase Installation on page 239
• HCatalog Installation on page 271
• Hive Installation on page 291
• HttpFS Installation on page 319
• Hue Installation on page 322
• Impala Installation on page 277
• KMS Installation on page 351
• Mahout Installation on page 352
• Oozie Installation on page 355
• Pig Installation on page 374
• Search Installation on page 378
• Sentry Installation on page 391
• Snappy Installation on page 392
• Spark Installation on page 394
• Sqoop 1 Installation on page 404
• Sqoop 2 Installation on page 409
• Whirr Installation on page 417
• ZooKeeper Installation
See also the instructions for installing or updating LZO.
Step 9: Apply Configuration File Changes if Necessary
For example, if you have modified your zoo.cfg configuration file (/etc/zookeeper/zoo.cfg), the upgrade
renames and preserves a copy of your modified zoo.cfg as /etc/zookeeper/zoo.cfg.rpmsave. If you have
not already done so, you should now compare this to the new /etc/zookeeper/conf/zoo.cfg, resolve
differences, and make any changes that should be carried forward (typically where you have changed property
value defaults). Do this for each component you upgrade.
Step 10: Finalize the HDFS Metadata Upgrade
To finalize the HDFS metadata upgrade you began earlier in this procedure, proceed as follows:
• Make sure you are satisfied that the CDH 5 upgrade has succeeded and everything is running smoothly. It
is not unusual to wait days or even weeks before finalizing the upgrade. To determine when finalization is
warranted, run important workloads and ensure they are successful. Once you have finalized the upgrade,
you cannot roll back to a previous version of HDFS without using backups.
Warning:
Do not proceed until you are sure you are satisfied with the new deployment. Once you have
finalized the HDFS metadata, you cannot revert to an earlier version of HDFS.
Note:
• If you need to restart the NameNode during this period (after having begun the upgrade process,
but before you've run finalizeUpgrade) simply restart your NameNode without the -upgrade
option.
• Make sure you have plenty of free disk space if you believe it will take some time to verify that
you are ready to finalize the upgrade. Before you finalize:
– Deleting files will not in fact free up the space.
– Using the balancer will cause all replicas moved to actually be duplicated.
– The size of all on-disk data representing the NameNodes' metadata will be retained,
potentially doubling (or more than doubling) the amount of space required on the
NameNodes' and JournalNodes' disks.
• Finalize the HDFS metadata upgrade: use one of the following commands, depending on whether Kerberos
is enabled (see Configuring Hadoop Security in CDH 5).
Important:
In an HDFS HA deployment, make sure that both the NameNodes and all of the JournalNodes are
up and functioning normally before you proceed.
– If Kerberos is enabled:
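For example (a sketch; substitute your own keytab path and principal):
$ kinit -kt /path/to/hdfs.keytab hdfs/<fully.qualified.domain.name@YOUR-REALM.COM>
$ hdfs dfsadmin -finalizeUpgrade
– If Kerberos is not enabled, run the command as the hdfs user:
$ sudo -u hdfs hdfs dfsadmin -finalizeUpgrade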
Note:
After the metadata upgrade completes, the previous/ and blocksBeingWritten/ directories in
the DataNodes' data directories aren't cleared until the DataNodes are restarted.
org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:231)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:994)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:726)
at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:529)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:585)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:751)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:735)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1410)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1476)
2014-10-16 18:36:29,126 INFO org.mortbay.log: Stopped [email protected]:50070
2014-10-16 18:36:29,127 WARN org.apache.hadoop.http.HttpServer2: HttpServer Acceptor: isRunning is false. Rechecking.
2014-10-16 18:36:29,127 WARN org.apache.hadoop.http.HttpServer2: HttpServer Acceptor: isRunning is false
2014-10-16 18:36:29,127 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping NameNode metrics system...
2014-10-16 18:36:29,128 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system stopped.
2014-10-16 18:36:29,128 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system shutdown complete.
2014-10-16 18:36:29,128 FATAL org.apache.hadoop.hdfs.server.namenode.NameNode: Exception in namenode join
java.io.IOException: File system image contains an old layout version -55. An upgrade to version -59 is required.
Please restart NameNode with the "-rollingUpgrade started" option if a rolling upgrade is already started; or restart NameNode with the "-upgrade" option to start a new upgrade.
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:231)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:994)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:726)
at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:529)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:585)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:751)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:735)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1410)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1476)
2014-10-16 18:36:29,130 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
Warning: Cloudera does not support upgrading to JDK 1.7 while upgrading a cluster to CDH 5. You
must upgrade the JDK, then upgrade the cluster to CDH 5.
The process for upgrading to Oracle JDK 1.7 varies depending on whether you have a Cloudera Manager
Deployment on page 40 or an Unmanaged Deployment on page 41.
Warning: Upgrading the JDK in an unmanaged deployment requires shutting down the cluster.
1. Shut down the cluster, following directions in the documentation for the CDH 4 release you are currently
running.
2. Clean up existing JDK versions.
3. On each cluster host:
a. Install the same supported version of JDK 1.7. See Java Development Kit Installation on page 41 for
instructions.
b. Verify that you have set JAVA_HOME on each host to the directory where you installed JDK 1.7, as
instructed.
4. Start the CDH upgrade.
Warning:
• Cloudera does not support upgrading to JDK 1.8 while upgrading to Cloudera Manager 5.3. The Cloudera Manager Server must be upgraded to 5.3 before you start.
• Cloudera does not support upgrading to JDK 1.8 while upgrading a cluster to CDH 5.3. The cluster
must be running CDH 5.3 before you start.
• Cloudera does not support a rolling upgrade to JDK 1.8. You must shut down the entire cluster.
"Failed to start server" reported by You may have SELinux enabled. Disable SELinux by running sudo
cloudera-manager-installer.bin. setenforce 0 on the Cloudera
/var/log/cloudera-scm-server/cloudera-scm-server.log Manager Server host. To disable it
contains a message beginning permanently, edit
Caused by: /etc/selinux/config.
java.lang.ClassNotFoundException:
com.mysql.jdbc.Driver...
Symptom: Installation interrupted and installer won't restart.
Reason: You need to do some manual cleanup.
Solution: See Uninstalling Cloudera Manager and Managed Software on page 158.
Symptom: Cloudera Manager Server fails to start, and the Server is configured to use a MySQL database to store information about service configuration.
Reason: Tables may be configured with the MyISAM engine. The Server will not start if its tables are configured with the MyISAM engine, and an error such as the following will appear in the log file: Tables ... have unsupported engine type ... . InnoDB is required.
Solution: Make sure that the InnoDB engine is configured, not the MyISAM engine. To check what engine your tables are using, run the following command from the MySQL shell: mysql> show table status; For more information, see MySQL Database on page 55.
Symptom: Agents fail to connect to Server. Error 113 ('No route to host') in /var/log/cloudera-scm-agent/cloudera-scm-agent.log.
Reason: You may have SELinux or iptables enabled.
Solution: Check /var/log/cloudera-scm-server/cloudera-scm-server.log on the Server host and /var/log/cloudera-scm-agent/cloudera-scm-agent.log on the Agent hosts. Disable SELinux and iptables.
Symptom: Some cluster hosts do not appear when you click Find Hosts in the install or update wizard.
Reason: You may have network connectivity problems.
Solution:
• Make sure all cluster hosts have SSH port 22 open.
• Check other common causes of loss of connectivity, such as firewalls and interference from SELinux.
"Access denied" in install or update Hostname mapping or permissions • For hostname configuration, see
wizard during database are incorrectly set up. Configuring Network Names
configuration for Activity Monitor or (CDH 4) or Configuring Network
Reports Manager. Names on page 201 (CDH 5).
grant all on
activity_monitor.* TO
'amon_user'@'myhost1.myco.com'
IDENTIFIED BY
'amon_password';
grant all on
activity_monitor.* TO
'amon_user'@'%'
IDENTIFIED BY
'amon_password';
Symptom: Activity Monitor, Reports Manager, or Service Monitor databases fail to start.
Reason: MySQL binlog format problem.
Solution: Set binlog_format=mixed in /etc/my.cnf. For more information, see this MySQL bug report. See also Cloudera Manager and Managed Service Data Stores on page 44.
Symptom: You have upgraded the Cloudera Manager Server, but now cannot start services.
Reason: You may have mismatched versions of the Cloudera Manager Server and Agents.
Solution: Make sure you have upgraded the Cloudera Manager Agents on all hosts. (The previous version of the Agents will heartbeat with the new version of the Server, but you can't start HDFS and MapReduce with this combination.)
Symptom: Cloudera services fail to start.
Reason: Java may not be installed, or may be installed at a custom location.
Solution: See Configuring a Custom Java Home Location on page 141 for more information on resolving this issue.
Symptom: The Activity Monitor fails to start. Logs contain the error read-committed isolation not safe for the statement binlog format.
Reason: The binlog_format is not set to mixed.
Solution: Modify the mysql.cnf file to include the entry for binlog format as specified in MySQL Database on page 55.
Symptom: Attempts to reinstall older versions of CDH or Cloudera Manager using yum fail.
Reason: It is possible to install, uninstall, and reinstall CDH and Cloudera Manager. In certain cases, this does not complete as expected. If you install Cloudera Manager 5 and CDH 5, then uninstall Cloudera Manager and CDH, and then attempt to install CDH 4 and Cloudera Manager 4, incorrect cached information may result in the installation of an incompatible version of the Oracle JDK.
Solution: Clear information in the yum cache:
1. Connect to the CDH host.
2. Execute either of the following commands:
$ yum --enablerepo='*' clean all
$ rm -rf /var/cache/yum/cloudera*
3. After clearing the cache, proceed with installation.
Symptom: Hive, Impala, or Hue complains about a missing table in the Hive metastore database.
Reason: The Hive Metastore database must be upgraded after a major Hive version change (Hive had a major version change in CDH 4.0, 4.1, 4.2, and 5.0).
Solution: Follow the instructions in Upgrading CDH 4 on page 563 or Upgrading CDH 4 to CDH 5 on page 546 for upgrading the Hive Metastore database schema. Stop all Hive services before performing the upgrade.
Symptom: The Create Hive Metastore Database Tables command fails due to a problem with an escape string.
Reason: PostgreSQL versions 9 and later require special configuration for Hive because of a backward-incompatible change in the default value of the standard_conforming_strings property. Versions up to PostgreSQL 9.0 defaulted to off, but starting with version 9.0 the default is on.
Solution: As the administrator user, use the following command to turn standard_conforming_strings off:
ALTER DATABASE <hive_db_name> SET standard_conforming_strings = off;
Symptom: After upgrading to CDH 5, HDFS DataNodes fail to start with exception:
Exception in secureMain
java.lang.RuntimeException: Cannot start datanode because the configured max locked memory size (dfs.datanode.max.locked.memory) of 4294967296 bytes is
Reason: HDFS caching, which is enabled by default in CDH 5, requires new memlock functionality from Cloudera Manager Agents.
Solution: Do the following:
1. Stop all CDH and managed services.
2. On all hosts with Cloudera Manager Agents, hard restart the Agents. Before performing this step, ensure you understand the semantics of the hard_restart command by reading Hard Stopping and Restarting Agents.
• Tarballs
– To stop the Cloudera Manager Agent, run this command on each Agent host:
$ sudo tarball_root/etc/init.d/cloudera-scm-agent hard_restart
$ sudo -u cloudera-scm tarball_root/etc/init.d/cloudera-scm-agent hard_restart
export CMF_SUDO_CMD=" "
USER=cloudera-scm
GROUP=cloudera-scm
$ sudo tarball_root/etc/init.d/cloudera-scm-agent hard_restart
Symptom: You see the following error in the NameNode log (shown in full below).
Reason: You upgraded CDH to 5.2 using Cloudera Manager and did not run the HDFS Metadata Upgrade command.
Solution: Stop the HDFS service in Cloudera Manager and follow the steps for upgrade (depending on whether you are using packages or parcels) described in Upgrading to CDH 5.2 on page 523.
The NameNode log error:
2014-10-16 18:36:29,112 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception loading fsimage
java.io.IOException: File system image contains an old layout version -55. An upgrade to version -59 is required.
Please restart NameNode with the "-rollingUpgrade started" option if a rolling upgrade is already started; or restart NameNode with the "-upgrade" option to start a new upgrade.
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:231)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:994)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:726)
at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:529)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:585)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:751)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:735)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1410)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1476)
2014-10-16 18:36:29,126 INFO org.mortbay.log: Stopped [email protected]:50070
2014-10-16 18:36:29,127 WARN org.apache.hadoop.http.HttpServer2: HttpServer Acceptor: isRunning is false. Rechecking.
2014-10-16 18:36:29,127 WARN org.apache.hadoop.http.HttpServer2: HttpServer Acceptor: isRunning is false
2014-10-16 18:36:29,127 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping NameNode metrics system...
2014-10-16 18:36:29,128 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system stopped.
2014-10-16 18:36:29,128 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system shutdown complete.
2014-10-16 18:36:29,128 FATAL org.apache.hadoop.hdfs.server.namenode.NameNode: Exception in namenode join
java.io.IOException: File system image contains an old layout version -55. An upgrade to version -59 is required.
Please restart NameNode with the "-rollingUpgrade started" option if a rolling upgrade is already started; or restart NameNode with the "-upgrade" option to start a new upgrade.
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:231)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:994)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:726)
at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:529)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:585)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:751)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:735)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1410)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1476)
2014-10-16 18:36:29,130 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2014-10-16 18:36:29,132 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG: