CDH4 Installation Guide
Important Notice
(c) 2010-2013 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, and any other product or service names or slogans contained in this document are trademarks of Cloudera and its suppliers or licensors, and may not be copied, imitated or used, in whole or in part, without the prior written permission of Cloudera or the applicable trademark holder. Hadoop and the Hadoop elephant logo are trademarks of the Apache Software Foundation. All other trademarks, registered trademarks, product names and company names or logos mentioned in this document are the property of their respective owners. Reference to any products, services, processes or other information, by trade name, trademark, manufacturer, supplier or otherwise does not constitute or imply endorsement, sponsorship or recommendation thereof by us. Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Cloudera. Cloudera may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Cloudera, the furnishing of this document does not give you any license to these patents, trademarks copyrights, or other intellectual property.
The information in this document is subject to change without notice. Cloudera shall not be liable for any damages resulting from technical errors or omissions which may be present in this document, or from use of this document.
Cloudera, Inc. 1001 Page Mill Road, Building 2 Palo Alto, CA 94304-1008 [email protected] US: 1-888-789-1488 Intl: 1-650-362-0488 www.cloudera.com
Release Information
Version: 4.4.0 Date: September 17, 2013
Table of Contents
About this Guide ..... 15
CDH4 Components ..... 15
Other Topics in this Guide ..... 16
What's New in CDH4 ..... 17
Before You Install CDH4 on a Cluster ..... 19
CDH4 Installation ..... 21
CDH4 and MapReduce ..... 21
MapReduce 2.0 (YARN) ..... 21
Upgrading to CDH4 ..... 41
Step 1: Back Up Configuration Data and Uninstall Components ..... 41
Step 2: Back up the HDFS Metadata ..... 42
Step 3: Copy the Hadoop Configuration to the Correct Location and Update Alternatives ..... 43
Step 4: Uninstall CDH3 Hadoop ..... 43
Step 5: Download CDH4 ..... 44
Step 6a: Install CDH4 with MRv1 ..... 46
Step 6b: Install CDH4 with YARN ..... 48
Step 7: Copy the CDH4 Logging File ..... 50
Step 7a: (Secure Clusters Only) Set Variables for Secure DataNodes ..... 50
Step 8: Upgrade the HDFS Metadata ..... 51
Step 9: Create the HDFS /tmp Directory ..... 51
Step 10: Start MapReduce (MRv1) or YARN ..... 52
Step 11: Set the Sticky Bit ..... 56
Step 12: Re-Install CDH4 Components ..... 56
Step 13: Apply Configuration File Changes ..... 57
Step 14: Finalize the HDFS Metadata Upgrade ..... 57
Post-migration Verification ..... 60
Configuring a Remote NameNode Storage Directory ..... 93
Configuring the Secondary NameNode ..... 94
Enabling Trash ..... 95
Configuring Storage-Balancing for the DataNodes ..... 96
Enabling WebHDFS ..... 97
Configuring LZO ..... 98
Deploy MRv1 or YARN ..... 98
Flume Installation ..... 119
Migrating to Flume 1.x from Flume 0.9.x ..... 119
Step 1: Remove Flume 0.9.x from your cluster ..... 120
Step 2: Install the new version of Flume ..... 120
Flume Packaging ..... 120
Installing the Flume Tarball (.tar.gz) ..... 121
Installing the Flume RPM or Debian Packages ..... 121
Flume Configuration ..... 122
Verifying the Installation ..... 123
Running Flume ..... 124
Files Installed by the Flume RPM and Debian Packages ..... 124
Supported Sources, Sinks, and Channels ..... 125
Sources ..... 125
Sinks ..... 127
Channels ..... 128
Sqoop Installation ..... 133
Upgrading Sqoop from CDH3 to CDH4 ..... 133
Step 1: Remove the CDH3 version of Sqoop ..... 133
Step 2: Install the new version of Sqoop ..... 134
Upgrading Sqoop from an Earlier CDH4 release ..... 134
Sqoop Packaging ..... 134
Sqoop Prerequisites ..... 135
Installing the Sqoop RPM or Debian Packages ..... 135
Installing the Sqoop Tarball ..... 135
Installing the JDBC Drivers ..... 136
Installing the MySQL JDBC Driver ..... 136
Installing the Oracle JDBC Driver ..... 136
Installing the Microsoft SQL Server JDBC Driver ..... 137
Installing the PostgreSQL JDBC Driver ..... 137
Sqoop 2 Installation ..... 139
About Sqoop 2 ..... 139
Sqoop 2 Packaging ..... 139
Hue Installation ..... 145
Supported Browsers ..... 146
Upgrading Hue ..... 146
Upgrading Hue from CDH3 to CDH4 ..... 146
Upgrading Hue from an Earlier CDH4 Release to the Latest CDH4 Release ..... 147
Installing Hue ..... 148
Installing the Hue Packages ..... 148
Pig Installation ..... 177
Upgrading Pig ..... 177
Upgrading Pig from CDH3 to CDH4 ..... 177
Upgrading Pig from an Earlier CDH4 release ..... 178
Installing Pig ..... 179
Using Pig with HBase ..... 180
Installing DataFu ..... 180
Viewing the Pig Documentation ..... 181
Oozie Installation ..... 183
About Oozie ..... 183
Oozie Packaging ..... 183
Oozie Prerequisites ..... 184
Upgrading Oozie ..... 184
Upgrading Oozie from CDH3 to CDH4 ..... 184
Upgrading Oozie from an Earlier CDH4 Release to the Latest CDH4 Release ..... 185
Hive Installation ..... 203
About Hive ..... 203
HiveServer2 ..... 204
Upgrading Hive ..... 204
Upgrading Hive from CDH3 to CDH4 ..... 204
Upgrading Hive from an Earlier Version of CDH4 ..... 207
Configuring HiveServer2 ..... 220
Table Lock Manager (Required) ..... 220
hive.zookeeper.client.port ..... 221
JDBC driver ..... 221
Authentication ..... 221
Configuring HiveServer2 for YARN ..... 222
Running HiveServer2 and HiveServer Concurrently ..... 222
Starting the Metastore ..... 222
File System Permissions ..... 223
Starting, Stopping, and Using HiveServer2 ..... 223
Using the BeeLine CLI ..... 224
Starting the Hive Console ..... 224
Using Hive with HBase ..... 224
Installing the Hive JDBC on Clients ..... 225
Setting HADOOP_MAPRED_HOME for YARN ..... 225
Configuring the Metastore to use HDFS High Availability ..... 225
Troubleshooting ..... 226
Hive Queries Fail with "Too many counters" Error ..... 226
Configuration Change on Hosts Used with HCatalog ..... 229
Starting and Stopping the WebHCat REST server ..... 229
Accessing Table Information with the HCatalog Command-line API ..... 229
Accessing Table Data with MapReduce ..... 229
Accessing Table Data with Pig ..... 231
Accessing Table Information with REST ..... 232
Viewing the HCatalog Documentation ..... 232
HBase Installation ..... 233
Upgrading HBase ..... 233
About Checksums in CDH4.2 ..... 234
Upgrading HBase from CDH3 to CDH4 ..... 234
Upgrading HBase from an Earlier CDH4 Release ..... 236
Installing the HBase Master ..... 239
Starting the HBase Master ..... 240
Installing and Configuring REST ..... 240
Accessing HBase by using the HBase Shell ..... 244
Using MapReduce with HBase ..... 245
Troubleshooting ..... 245
Table Creation Fails after Installing LZO ..... 245
ZooKeeper Installation ..... 259
Upgrading ZooKeeper from CDH3 to CDH4 ..... 259
Step 1: Remove ZooKeeper ..... 260
Step 2: Install the ZooKeeper Base Package ..... 260
Step 3: Install the ZooKeeper Server Package ..... 260
Step 4: Edit /etc/zookeeper/conf/zoo.cfg or Move the Data Directory ..... 261
Step 5: Restart the Server ..... 261
Whirr Installation ..... 267
Upgrading Whirr ..... 267
Upgrading Whirr from CDH3 to CDH4 ..... 267
Upgrading Whirr from an Earlier CDH4 Release to the Latest CDH4 Release ..... 269
Launching a Cluster ..... 271
Running a Whirr Proxy ..... 271
Running a MapReduce job ..... 271
Destroying a cluster ..... 272
Snappy Installation ..... 273
Upgrading Snappy to CDH4 ..... 273
Snappy Installation ..... 273
Using Snappy for MapReduce Compression ..... 273
Using Snappy for Pig Compression ..... 275
Using Snappy for Hive Compression ..... 275
Using Snappy compression in Sqoop Imports ..... 275
Using Snappy Compression with HBase ..... 275
Viewing the Snappy Documentation ..... 275
Mahout Installation ..... 277
Upgrading Mahout ..... 277
Upgrading Mahout from CDH3 to CDH4 ..... 277
Upgrading Mahout from an Earlier CDH4 Release to the Latest CDH4 Release ..... 278
Installing Mahout ..... 278
The Mahout Executable ..... 279
Getting Started with Mahout ..... 279
The Mahout Documentation ..... 279
HttpFS Installation ..... 281
About HttpFS ..... 281
HttpFS Packaging ..... 282
Starting the HttpFS Server ..... 284
Stopping the HttpFS Server ..... 284
Using the HttpFS Server with curl ..... 284
Avro Usage ..... 287
Avro Data Files ..... 287
Compression ..... 287
Flume ..... 288
Sqoop ..... 288
MapReduce ..... 288
Streaming ..... 289
Pig ..... 289
Hive ..... 290
Avro Tools ..... 291
Sentry Installation ..... 293
Installing Sentry ..... 293
Building an RPM ..... 319
Getting Support ..... 321
Cloudera Support ..... 321
Community Support ..... 321
Report Issues ..... 321
Get Announcements about New CDH and Cloudera Manager Releases ..... 322
CDH4 Components
Use the following sections to install or upgrade CDH4 components: Flume, Sqoop, Sqoop 2, Hue, Pig, Oozie, Hive, HCatalog, HBase, ZooKeeper, Whirr, Snappy, Mahout, HttpFS, and Avro.
Important: Upgrading from CDH3: If you are upgrading from CDH3, you must first uninstall CDH3, then install CDH4; see Upgrading from CDH3 to CDH4.
Before you install CDH4 on a cluster, there are some important steps you need to take to prepare your system:
1. Verify you are using a supported operating system for CDH4. See CDH4 Requirements and Supported Versions.
2. If you haven't already done so, install the Oracle Java Development Kit. For instructions and recommendations, see Java Development Kit Installation.
Important: On SLES 11 platforms, do not install or try to use the IBM Java version bundled with the SLES distribution; Hadoop will not run correctly with that version. Install the Oracle JDK by following the directions under Java Development Kit Installation.
CDH4 Installation
This section describes how to install CDH4. See the following sections for more information and instructions:
CDH4 and MapReduce
Ways to Install CDH4
Before You Begin Installing CDH4
Installing CDH4 Components
Apache Hadoop Documentation
See also Selecting Appropriate JAR files for your MRv1 and YARN Jobs.
The following instructions describe three ways to install CDH4 without Cloudera Manager: downloading and installing the "1-click Install" package, adding a repository, or building your own repository. In most cases the first method (downloading and installing the "1-click Install" package) is recommended, because it is simpler than adding or building a repository.
Important:
Java Development Kit: If you have not already done so, install the Oracle Java Development Kit (JDK); see Java Development Kit Installation.
Scheduler defaults: Note the following differences between MRv1 and MRv2 (YARN).
MRv1: Cloudera Manager sets the default to the Fair Scheduler. CDH4 sets the default to the Fair Scheduler, with FIFO and Fair Scheduler on the classpath by default.
MRv2 (YARN): Cloudera Manager sets the default to FIFO. CDH4 sets the default to FIFO, with FIFO, Fair Scheduler, and Capacity Scheduler on the classpath by default.
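If you want to override these defaults, the scheduler is selected through a configuration property. The following is a minimal sketch; the property names below are standard Hadoop settings rather than values taken from this guide, so verify them against your release before relying on them:

<!-- MRv1 (mapred-site.xml): explicitly choose the Fair Scheduler -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>

<!-- MRv2/YARN (yarn-site.xml): switch the ResourceManager from FIFO to the Fair Scheduler -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>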
Installing CDH4
This section describes the process for installing CDH4.
Step 1: Add or Build the CDH4 Repository or Download the "1-click Install" package.
If you are installing CDH4 on a Red Hat system, you can download Cloudera packages using yum or your web browser. If you are installing CDH4 on a SLES system, you can download the Cloudera packages using zypper or YaST or your web browser. If you are installing CDH4 on an Ubuntu or Debian system, you can download the Cloudera packages using apt or your web browser.
On Red Hat-compatible Systems
Use one of the following methods to add or build the CDH4 repository or download the package on Red Hat-compatible systems: download and install the CDH4 "1-click Install" package, or add the CDH4 repository, or build a Yum repository. Do this on all the systems in the cluster.
To download and install the CDH4 "1-click Install" package:
1. Click the entry in the table below that matches your Red Hat or CentOS system, choose Save File, and save the file to a directory to which you have write access (it can be your home directory).
[Table: For OS Version / Click this Link — rows for Red Hat/CentOS/Oracle 5 and other supported Red Hat or CentOS versions]
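After downloading, you would install the package; a minimal sketch on a Red Hat-compatible system, assuming the cloudera-cdh-4-0.x86_64.rpm file name used in the SLES instructions below:

$ sudo yum --nogpgcheck localinstall cloudera-cdh-4-0.x86_64.rpm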
Now continue with Step 1a: Optionally Add a Repository Key, and then choose Step 2: Install CDH4 with MRv1, or Step 3: Install CDH4 with YARN; or do both steps if you want to install both implementations.
To add the CDH4 repository: Click the entry in the table below that matches your Red Hat or CentOS system, navigate to the repo file for your system, and save it in the /etc/yum.repos.d/ directory.
[Table: For OS Version / Click this Link — rows for Red Hat/CentOS/Oracle 5 and other supported Red Hat or CentOS versions]
Now continue with Step 1a: Optionally Add a Repository Key, and then choose Step 2: Install CDH4 with MRv1, or Step 3: Install CDH4 with YARN; or do both steps if you want to install both implementations.
To build a Yum repository: If you want to create your own yum repository, download the appropriate repo file, create the repo, distribute the repo file, and set up a web server, as described under Creating a Local Yum Repository. Now continue with Step 1a: Optionally Add a Repository Key, and then choose Step 2: Install CDH4 with MRv1, or Step 3: Install CDH4 with YARN; or do both steps if you want to install both implementations.
On SLES Systems
Use one of the following methods to download the CDH4 repository or package on SLES systems: download and install the CDH4 "1-click Install" package, or add the CDH4 repository, or build a SLES repository.
To download and install the CDH4 "1-click Install" package:
1. Click this link, choose Save File, and save it to a directory to which you have write access (it can be your home directory).
2. Install the RPM:
$ sudo rpm -i cloudera-cdh-4-0.x86_64.rpm
Now continue with Step 1a: Optionally Add a Repository Key, and then choose Step 2: Install CDH4 with MRv1, or Step 3: Install CDH4 with YARN; or do both steps if you want to install both implementations. To add the CDH4 repository: 1. Run the following command:
$ sudo zypper addrepo -f http://archive.cloudera.com/cdh4/sles/11/x86_64/cdh/cloudera-cdh4.repo
Now continue with Step 1a: Optionally Add a Repository Key, and then choose Step 2: Install CDH4 with MRv1, or Step 3: Install CDH4 with YARN; or do both steps if you want to install both implementations.
To build a SLES repository: If you want to create your own SLES repository, create a mirror of the CDH SLES directory, and then follow these instructions to create a SLES repository from the mirror. Now continue with Step 1a: Optionally Add a Repository Key, and then choose Step 2: Install CDH4 with MRv1, or Step 3: Install CDH4 with YARN; or do both steps if you want to install both implementations.
On Ubuntu or Debian Systems
Use one of the following methods to download the CDH4 repository or package: download and install the CDH4 "1-click Install" package, or add the CDH4 repository, or build a Debian repository.
To download and install the CDH4 "1-click Install" package:
1. Click one of the following: this link for a Squeeze system, or this link for a Lucid system, or this link for a Precise system.
2. Install the package. Do one of the following: choose Open with in the download window to use the package manager, or choose Save File, save the package to a directory to which you have write access (it can be your home directory), and install it from the command line, for example:
sudo dpkg -i cdh4-repository_1.0_all.deb
Now continue with Step 1a: Optionally Add a Repository Key, and then choose Step 2: Install CDH4 with MRv1, or Step 3: Install CDH4 with YARN; or do both steps if you want to install both implementations.
To add the CDH4 repository: Create a new file /etc/apt/sources.list.d/cloudera.list with the following contents. For Ubuntu or Debian systems:
deb [arch=amd64] http://archive.cloudera.com/cdh4/<OS-release-arch> <RELEASE>-cdh4 contrib
deb-src http://archive.cloudera.com/cdh4/<OS-release-arch> <RELEASE>-cdh4 contrib
where: <OS-release-arch> is debian/squeeze/amd64/cdh, ubuntu/lucid/amd64/cdh, or ubuntu/precise/amd64/cdh, and <RELEASE> is the name of your distribution, which you can find by running lsb_release -c. For example, to install CDH4 for 64-bit Ubuntu Lucid:
deb [arch=amd64] http://archive.cloudera.com/cdh4/ubuntu/lucid/amd64/cdh lucid-cdh4 contrib
deb-src http://archive.cloudera.com/cdh4/ubuntu/lucid/amd64/cdh lucid-cdh4 contrib
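To confirm the value to use for <RELEASE>, check the distribution codename; on a Lucid system, for example, the output would look roughly like this:

$ lsb_release -c
Codename:       lucid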
Now continue with Step 1a: Optionally Add a Repository Key, and then choose Step 2: Install CDH4 with MRv1, or Step 3: Install CDH4 with YARN; or do both steps if you want to install both implementations. To build a Debian repository: If you want to create your own apt repository, create a mirror of the CDH Debian directory and then create an apt repository from the mirror. Now continue with Step 1a: Optionally Add a Repository Key, and then choose Step 2: Install CDH4 with MRv1, or Step 3: Install CDH4 with YARN; or do both steps if you want to install both implementations.
Step 1a: Optionally Add a Repository Key
(Optionally) add a repository key on each system in the cluster by adding the Cloudera Public GPG Key to your repository. This key enables you to verify that you are downloading genuine packages.
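The corresponding key-import commands appear later in this guide (in the upgrade steps); as a sketch, they follow this pattern:

For Red Hat/CentOS/Oracle 5 systems:
$ sudo rpm --import http://archive.cloudera.com/cdh4/redhat/5/x86_64/cdh/RPM-GPG-KEY-cloudera
For all SLES systems:
$ sudo rpm --import http://archive.cloudera.com/cdh4/sles/11/x86_64/cdh/RPM-GPG-KEY-cloudera
For Ubuntu Lucid systems:
$ curl -s http://archive.cloudera.com/cdh4/ubuntu/lucid/amd64/cdh/archive.key | sudo apt-key add -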
Important: Before proceeding, you need to decide:
1. Whether to configure High Availability (HA) for the NameNode and/or JobTracker; see the CDH4 High Availability Guide for more information and instructions.
2. Where to deploy the NameNode, Secondary NameNode, and JobTracker daemons. As a general rule:
The NameNode and JobTracker run on the same "master" host unless the cluster is large (more than a few tens of nodes), and the master host (or hosts) should not run the Secondary NameNode (if used), DataNode, or TaskTracker services.
In a large cluster, it is especially important that the Secondary NameNode (if used) runs on a separate machine from the NameNode.
Each node in the cluster except the master host(s) should run the DataNode and TaskTracker services.
If you configure HA for the NameNode, do not install hadoop-hdfs-secondarynamenode. After completing the software configuration for your chosen HA method, follow the installation instructions under HDFS High Availability Initial Deployment.
1. Install and deploy ZooKeeper.
Important: Cloudera recommends that you install (or update) and start a ZooKeeper cluster before proceeding. This is a requirement if you are deploying high availability (HA) for the NameNode or JobTracker. Follow the instructions under ZooKeeper Installation.
2. Install each type of daemon package on the appropriate system(s), as follows.
[Table: Where to install / Install commands — for each MRv1 and YARN daemon role, the package to install on Red Hat-compatible, SLES, and Ubuntu or Debian systems. Surviving row descriptions: "All cluster hosts except the JobTracker, NameNode, and Secondary (or Standby) NameNode hosts" and "All cluster hosts except the Resource Manager (analogous to MRv1 TaskTrackers)".]
Note: The hadoop-yarn and hadoop-hdfs packages are installed on each system automatically as dependencies of the other packages.
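As a sketch of what the install commands in the preceding table look like, the following shows the usual CDH4 MRv1 package assignments on a Red Hat-compatible system. The package names are the standard CDH4 names, but confirm them against the table in the original guide; on SLES use sudo zypper install, and on Ubuntu or Debian use sudo apt-get install:

JobTracker host:
$ sudo yum install hadoop-0.20-mapreduce-jobtracker
NameNode host:
$ sudo yum install hadoop-hdfs-namenode
Secondary NameNode host (omit if you configure HA for the NameNode):
$ sudo yum install hadoop-hdfs-secondarynamenode
All cluster hosts except the JobTracker, NameNode, and Secondary (or Standby) NameNode hosts:
$ sudo yum install hadoop-0.20-mapreduce-tasktracker hadoop-hdfs-datanode
All client hosts:
$ sudo yum install hadoop-client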
1. Add the GPL Extras repository on each node, according to your OS version:
Red Hat/CentOS/Oracle 5: Navigate to this link and save the file in the /etc/yum.repos.d/ directory.
Red Hat/CentOS 6: Navigate to this link and save the file in the /etc/yum.repos.d/ directory.
SLES: Run the following command:
$ sudo zypper addrepo -f http://archive.cloudera.com/gplextras/sles/11/x86_64/gplextras/cloudera-gplextras4.repo
Ubuntu or Debian: Add the repository in a new file under /etc/apt/sources.list.d/. Important: Make sure you do not let the file name default to cloudera.list, as that will overwrite your existing cloudera.list.
2. Install the package on each node as follows:
[Table: For OS version / Install commands — rows for Red Hat-compatible, SLES, and Ubuntu or Debian systems]
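A sketch of the install commands for step 2 follows; the package name hadoop-lzo-cdh4 is an assumption based on the GPL Extras repository added above, so confirm it against the repository contents:

On Red Hat-compatible systems:
$ sudo yum install hadoop-lzo-cdh4
On SLES systems:
$ sudo zypper install hadoop-lzo-cdh4
On Ubuntu or Debian systems:
$ sudo apt-get install hadoop-lzo-cdh4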
3. Continue with installing and deploying CDH. As part of the deployment, you will need to do some additional configuration for LZO, as shown under Configuring LZO on page 98.
Important: Make sure you do this configuration after you have copied the default configuration files to a custom location and set alternatives to point to it.
Flume: A distributed, reliable, and available service for moving large amounts of data as it is produced, with a focus on reliable logging. The primary use case is as a logging system that gathers a set of log files on every machine in a cluster and aggregates them to a centralized persistent store such as HDFS.
Sqoop: A tool that imports data from relational databases into Hadoop clusters. Using JDBC to interface with databases, Sqoop imports the contents of tables into a Hadoop Distributed File System (HDFS) and generates Java classes that enable users to interpret the table's schema. Sqoop can also export records from HDFS to a relational database.
Sqoop 2: A server-based tool for transferring data between Hadoop and relational databases. You can use Sqoop 2 to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data with Hadoop MapReduce, and then export it back into an RDBMS.
HCatalog: A tool that provides table data access for CDH components such as Pig and MapReduce.
Hue: A graphical user interface to work with CDH. Hue aggregates several applications which are collected into a desktop-like environment and delivered as a Web application that requires no client installation by individual users.
Pig: Enables you to analyze large amounts of data using Pig's query language, called Pig Latin. Pig Latin queries run in a distributed way on a Hadoop cluster.
Hive: A powerful data warehousing application built on top of Hadoop which enables you to access your data using Hive QL, a language that is similar to SQL.
HBase: Provides large-scale tabular storage for Hadoop using the Hadoop Distributed File System (HDFS). Cloudera recommends installing HBase in standalone mode before you try to run it on a whole cluster.
ZooKeeper: A highly reliable and available service that provides coordination between distributed processes.
Oozie: A server-based workflow engine specialized in running workflow jobs with actions that execute Hadoop jobs. A command-line client is also available that allows remote administration and management of workflows within the Oozie server.
Whirr: Provides a fast way to run cloud services.
Snappy: A compression/decompression library. You do not need to install Snappy if you are already using the native library, but you do need to configure it; see Snappy Installation for more information.
Mahout: A machine-learning tool. By enabling you to build machine-learning libraries that are scalable to "reasonably large" datasets, it aims to make building intelligent applications easier and faster.
To install the CDH4 components, see the following sections:
Flume. For more information, see "Flume Installation" in this guide.
Sqoop. For more information, see "Sqoop Installation" in this guide.
Sqoop 2. For more information, see "Sqoop 2 Installation" in this guide.
HCatalog. For more information, see "Installing and Using HCatalog" in this guide.
Hue. For more information, see "Hue Installation" in this guide.
Pig. For more information, see "Pig Installation" in this guide.
Oozie. For more information, see "Oozie Installation" in this guide.
Hive. For more information, see "Hive Installation" in this guide.
HBase. For more information, see "HBase Installation" in this guide.
ZooKeeper. For more information, see "ZooKeeper Installation" in this guide.
Whirr. For more information, see "Whirr Installation" in this guide.
Snappy. For more information, see "Snappy Installation" in this guide.
Mahout. For more information, see "Mahout Installation" in this guide.
On Red Hat-compatible systems
Step 1. Download and Save the Repo File
Click the entry that matches your OS version (for example, Red Hat/CentOS/Oracle 5) and save the repo file in the /etc/yum.repos.d/ directory.
Step 2. Edit the Repo File Open the repo file you have just saved and change the 4 at the end of the line that begins baseurl= to the version number you want. For example, if you have saved the file for Red Hat 6, it will look like this when you open it for editing:
[cloudera-cdh4]
name=Cloudera's Distribution for Hadoop, Version 4
baseurl=http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/4/
gpgkey = http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck = 1
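For example, to pin the repository to a specific release (a hypothetical 4.4.0 is shown here; substitute the version you actually want), the baseurl line would become:

baseurl=http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/4.4.0/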
Step 3: Proceed with the Installation 1. Go to Documentation for CDH 4 Releases. 2. Find the Installation Guide for your release; for example, for CDH4.0.0 or CDH4.0.1, scroll down to the link for the "CDH4.0.0 Installation Guide" and click on the link to the PDF. 3. Follow the instructions on the "CDH4 Installation" page, starting with the instructions for optionally adding a repository key. (This comes immediately before the steps for installing CDH4 with MRv1 or YARN, and is usually Step 1a.)
On SLES systems
Step 1. Add the Cloudera Repo 1. Run the following command:
$ sudo zypper addrepo -f http://archive.cloudera.com/cdh4/sles/11/x86_64/cdh/cloudera-cdh4.repo
Step 2. Edit the Repo File Open the repo file that you have just added to your system and change the 4 at the end of the line that begins baseurl= to the version number you want. The file should look like this when you open it for editing:
[cloudera-cdh4]
name=Cloudera's Distribution for Hadoop, Version 4
baseurl=http://archive.cloudera.com/cdh4/sles/11/x86_64/cdh/4/
gpgkey = http://archive.cloudera.com/cdh4/sles/11/x86_64/cdh/RPM-GPG-KEY-cloudera
gpgcheck = 1
where: <OS-release-arch> is debian/squeeze/amd64/cdh, ubuntu/lucid/amd64/cdh, or ubuntu/precise/amd64/cdh, and <RELEASE> is the name of your distribution, which you can find by running lsb_release -c. Now replace -cdh4 near the end of each line (before contrib) with the CDH release you need to install. Here are some examples using CDH4.0.0: For 64-bit Ubuntu Lucid:
deb [arch=amd64] http://archive.cloudera.com/cdh4/ubuntu/lucid/amd64/cdh lucid-cdh4.0.0 contrib
deb-src http://archive.cloudera.com/cdh4/ubuntu/lucid/amd64/cdh lucid-cdh4.0.0 contrib
High Availability
In CDH4 you can configure high availability both for the NameNode and the JobTracker. For more information and instructions, see the CDH4 High Availability Guide. Important: If you configure HA for the NameNode, do not install hadoop-hdfs-secondarynamenode. After completing the software configuration for your chosen HA method, follow the installation instructions under HDFS High Availability Initial Deployment.
Plan Downtime
If you are upgrading a cluster that is part of a production system, be sure to plan ahead. As with any operational work, reserve a maintenance window with enough extra time allotted in case of complications. The Hadoop upgrade process is well understood, but it is best to be cautious. For production clusters, Cloudera recommends allocating a maintenance window of up to a full day to perform the upgrade, depending on the number of hosts, the amount of experience you have with Hadoop and Linux, and the particular hardware you are using.
Upgrading to CDH4
Use the instructions that follow to upgrade to CDH4.
Note: Running Services
When starting, stopping, and restarting CDH components, always use the service(8) command rather than running scripts in /etc/init.d directly. This is important because service sets the current working directory to / and removes most environment variables (passing only LANG and TERM), so as to create a predictable environment in which to administer the service. If you run the scripts in /etc/init.d directly, any environment variables you have set remain in force and could produce unpredictable results. (If you install CDH from packages, service will be installed as part of the Linux Standard Base (LSB).)
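For example, to restart the HDFS NameNode you would use the service form rather than invoking the init script directly (the service name shown is the standard CDH4 package service name):

$ sudo service hadoop-hdfs-namenode restart        # preferred: predictable environment
$ sudo /etc/init.d/hadoop-hdfs-namenode restart    # avoid: inherits your shell environment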
This will result in a new fsimage being written out with no edit log entries.
c. With the NameNode still in safe mode, shut down all services as instructed below.
2. For each component you are using, back up configuration data, databases, and other important files, stop the component, then uninstall it. See the following sections for instructions:
Note: At this point, you are only removing the components; do not install the new versions yet.
Removing Flume 0.9.x
Removing Sqoop
Removing Hue
Removing Pig
Removing Oozie
Removing Hive
Removing HBase
Removing ZooKeeper
4. Check each host (as root) to make sure that no processes are running as the hdfs or mapred users:
# ps -aef | grep java
2. Back up the directory. The path inside the <value> XML element is the path to your HDFS metadata. If you see a comma-separated list of paths, there is no need to back up all of them; they store the same data. Back up the first directory, for example, by using the following commands:
$ cd /mnt/hadoop/hdfs/name
# tar -cvf /root/nn_backup_data.tar .
./
./current/
./current/fsimage
./current/fstime
./current/VERSION
./current/edits
./image/
./image/fsimage
Warning: If you see a file containing the word lock, the NameNode is probably still running. Repeat the preceding steps, starting by shutting down the Hadoop services.
Step 3: Copy the Hadoop Configuration to the Correct Location and Update Alternatives
For CDH4, Hadoop looks for the cluster configuration files in a different location from the one used in CDH3, so you need to copy the configuration to the new location and reset the alternatives to point to it. Proceed as follows. On each node in the cluster: 1. Copy the existing configuration to the new location, for example:
$ cp -r /etc/hadoop-0.20/conf.my_cluster /etc/hadoop/conf.my_cluster
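The alternatives update that follows the copy is only referred to above; a sketch of the usual commands (update-alternatives on Ubuntu, Debian, and SLES; alternatives on Red Hat-compatible systems), assuming the priority value 50 commonly used in CDH documentation:

On Ubuntu, Debian, and SLES systems:
$ sudo update-alternatives --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.my_cluster 50
$ sudo update-alternatives --set hadoop-conf /etc/hadoop/conf.my_cluster
On Red Hat-compatible systems:
$ sudo alternatives --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.my_cluster 50
$ sudo alternatives --set hadoop-conf /etc/hadoop/conf.my_cluster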
On SLES systems:
$ sudo zypper remove hadoop-0.20 bigtop-utils
On Ubuntu systems:
sudo apt-get purge hadoop-0.20 bigtop-utils
Warning: If you are upgrading an Ubuntu or Debian system from CDH3u3 or earlier, you must use apt-get purge (rather than apt-get remove) to make sure the re-install succeeds, but be aware that apt-get purge removes all your configuration data. If you have modified any configuration files, DO NOT PROCEED before backing them up. To uninstall the repository packages, run this command on each host:
On SLES systems:
$ sudo zypper remove cloudera-cdh
Important: On Ubuntu and Debian systems, you need to re-create the /usr/lib/hadoop-0.20/ directory after uninstalling CDH3. Make sure you do this before you install CDH4:
$ sudo mkdir -p /usr/lib/hadoop-0.20/
Step 5: Download CDH4
On Red Hat-compatible systems:
1. Download the CDH4 "1-click Install" package.
2. Click the entry that matches your OS version (for example, Red Hat/CentOS/Oracle 5), choose Save File, and save the file to a directory to which you have write access (it can be your home directory).
3. Install the RPM.
Note: For instructions on how to add a CDH4 yum repository or build your own CDH4 yum repository, see Installing CDH4 On Red Hat-compatible systems. 4. (Optionally) add a repository key on each system in the cluster. Add the Cloudera Public GPG Key to your repository by executing one of the following commands: For Red Hat/CentOS/Oracle 5 systems:
$ sudo rpm --import http://archive.cloudera.com/cdh4/redhat/5/x86_64/cdh/RPM-GPG-KEY-cloudera
On SLES systems: 1. Download the CDH4 "1-click Install" Package: 2. Click this link, choose Save File, and save it to a directory to which you have write access (it can be your home directory). 3. Install the RPM:
$ sudo rpm -i cloudera-cdh-4-0.x86_64.rpm
Note: For instructions on how to add a repository or build your own repository, see Installing CDH4 on SLES Systems. 4. Update your system package index by running:
$ sudo zypper refresh
5. (Optionally) add a repository key on each system in the cluster. Add the Cloudera Public GPG Key to your repository by executing the following command: For all SLES systems:
$ sudo rpm --import http://archive.cloudera.com/cdh4/sles/11/x86_64/cdh/RPM-GPG-KEY-cloudera
On Ubuntu and Debian systems: 1. Download the CDH4 "1-click Install" Package: 2. Click one of the following: this link for a Squeeze system, or this link for a Lucid system, or this link for a Precise system. 3. Install the package. Do one of the following: Choose Open with in the download window to use the package manager, or Choose Save File, save the package to a directory to which you have write access (it can be your home directory) and install it from the command line, for example:
sudo dpkg -i cdh4-repository_1.0_all.deb
Note: For instructions on how to add a repository or build your own repository, see Installing CDH4 on Ubuntu Systems. 4. (Optionally) add a repository key on each system in the cluster. Add the Cloudera Public GPG Key to your repository by executing one of the following commands: For Ubuntu Lucid systems:
$ curl -s http://archive.cloudera.com/cdh4/ubuntu/lucid/amd64/cdh/archive.key | sudo apt-key add -
[Table: Where to install / Install commands — for each MRv1 (Step 6a) and YARN (Step 6b) daemon role, the package to install on Red Hat-compatible, SLES, and Ubuntu or Debian systems. Surviving row descriptions: "All cluster hosts except the JobTracker, NameNode, and Secondary (or Standby) NameNode hosts" and "All cluster hosts except the Resource Manager (analogous to MRv1 TaskTrackers)".]
Note: The hadoop-yarn and hadoop-hdfs packages are installed on each system automatically as dependencies of the other packages.
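As a sketch of the YARN install commands summarized in the preceding table, on a Red Hat-compatible system the usual CDH4 package assignments are as follows. The package names are the standard CDH4 names, but confirm them against the table in the original guide; use zypper install on SLES and apt-get install on Ubuntu or Debian:

ResourceManager host:
$ sudo yum install hadoop-yarn-resourcemanager
NameNode host:
$ sudo yum install hadoop-hdfs-namenode
All cluster hosts except the ResourceManager:
$ sudo yum install hadoop-yarn-nodemanager hadoop-hdfs-datanode hadoop-mapreduce
One host in the cluster (history server and proxy server):
$ sudo yum install hadoop-mapreduce-historyserver hadoop-yarn-proxyserver
All client hosts:
$ sudo yum install hadoop-client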
Step 7a: (Secure Clusters Only) Set Variables for Secure DataNodes
Important: You must do the following if you are upgrading a CDH3 cluster that has Kerberos security enabled. Otherwise, skip this step. In order to allow DataNodes to start on a secure Hadoop cluster, you must set the following variables on all DataNodes in /etc/default/hadoop-hdfs-datanode.
export HADOOP_SECURE_DN_USER=hdfs
export HADOOP_SECURE_DN_PID_DIR=/var/lib/hadoop-hdfs
export HADOOP_SECURE_DN_LOG_DIR=/var/log/hadoop-hdfs
export JSVC_HOME=/usr/lib/bigtop-utils/
Note: Depending on the version of Linux you are using, you may not have the /usr/lib/bigtop-utils directory on your system. If that is the case, set the JSVC_HOME variable to the /usr/libexec/bigtop-utils directory by using this command:
export JSVC_HOME=/usr/libexec/bigtop-utils
Note: The NameNode upgrade process can take a while depending on how many files you have. You can watch the progress of the upgrade by running:
$ sudo tail -f /var/log/hadoop-hdfs/hadoop-hdfs-namenode-<hostname>.log
Look for a line that confirms the upgrade is complete, such as:
/var/lib/hadoop-hdfs/cache/hadoop/dfs/<name> is complete
3. Wait for NameNode to exit safe mode, and then start the Secondary NameNode (if used) and complete the cluster upgrade. a. To check that the NameNode has exited safe mode, look for messages in the log file, or the NameNode's web interface, that say "...no longer in safe mode." b. To start the Secondary NameNode (if used), enter the following command on the Secondary NameNode host:
$ sudo service hadoop-hdfs-secondarynamenode start
Note: If Kerberos is enabled, do not use commands in the form sudo -u <user> <command>; they will fail with a security error. Instead, use the following commands: $ kinit <user> (if you are using a password) or $ kinit -kt <keytab> <principal> (if you are using a keytab) and then, for each command executed by this user, $ <command>
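For example (with a hypothetical user joe and keytab; substitute your own principal), the pattern looks like this:

$ kinit joe                                  # password-based
$ kinit -kt joe.keytab joe@EXAMPLE.COM       # keytab-based
$ hadoop fs -ls /user/joe                    # then run each command as that user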
Start MRv1
Step 10a: Start MapReduce (MRv1)
Important: Make sure you are not trying to run MRv1 and YARN on the same set of nodes at the same time. This is not supported; it will degrade your performance and may result in an unstable MapReduce cluster deployment. Steps 10a and 10b are mutually exclusive.
After you have verified that HDFS is operating correctly, you are ready to start MapReduce. On each TaskTracker system:
$ sudo service hadoop-0.20-mapreduce-tasktracker start
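The corresponding JobTracker command is not shown above; as a sketch, using the standard CDH4 MRv1 service name, you would run the following on the JobTracker system:

$ sudo service hadoop-0.20-mapreduce-jobtracker start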
If the permissions of directories are not configured correctly, the JobTracker and TaskTracker processes start and immediately fail. If this happens, check the JobTracker and TaskTracker logs and set the permissions correctly. Verify basic cluster operation for MRv1. At this point your cluster is upgraded and ready to run jobs. Before running your production jobs, verify basic cluster operation by running an example from the Apache Hadoop web site.
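The numbered steps below assume that the user joe already has a home directory on HDFS; the YARN verification later in this guide creates it with the following commands, which apply equally here:

$ sudo -u hdfs hadoop fs -mkdir /user/joe
$ sudo -u hdfs hadoop fs -chown joe /user/joe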
Do the following steps as the user joe. 2. Make a directory in HDFS called input and copy some XML files into it by running the following commands:
$ hadoop fs -mkdir input
$ hadoop fs -put /etc/hadoop/conf/*.xml input
$ hadoop fs -ls input
Found 3 items:
-rw-r--r--   1 joe supergroup  1348 2012-02-13 12:21 input/core-site.xml
-rw-r--r--   1 joe supergroup  1913 2012-02-13 12:21 input/hdfs-site.xml
-rw-r--r--   1 joe supergroup  1001 2012-02-13 12:21 input/mapred-site.xml
3. Run an example Hadoop job to grep with a regular expression in your input data.
$ /usr/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar grep input output 'dfs[a-z.]+'
4. After the job completes, you can find the output in the HDFS directory named output because you specified that output directory to Hadoop.
$ hadoop fs -ls
Found 2 items
drwxr-xr-x   - joe supergroup  0 2009-08-18 18:36 /user/joe/input
drwxr-xr-x   - joe supergroup  0 2009-08-18 18:38 /user/joe/output
You can see that there is a new directory called output. 5. List the output files.
$ hadoop fs -ls output
Found 3 items
drwxr-xr-x   - joe supergroup     0 2009-02-25 10:33 /user/joe/output/_logs
-rw-r--r--   1 joe supergroup  1068 2009-02-25 10:33 /user/joe/output/part-00000
-rw-r--r--   1 joe supergroup     0 2009-02-25 10:33 /user/joe/output/_SUCCESS
You have now confirmed your cluster is successfully running CDH4. Important: If you have client hosts, make sure you also update them to CDH4, and upgrade the components running on those clients as well.
You need to create this directory because it is the parent of /var/log/hadoop-yarn/apps which is explicitly configured in the yarn-site.xml.
To start YARN, start the ResourceManager and NodeManager services.
Note: Make sure you always start ResourceManager before starting NodeManager services.
On the ResourceManager system:
$ sudo service hadoop-yarn-resourcemanager start
On each NodeManager system (typically the same ones where DataNode service runs):
$ sudo service hadoop-yarn-nodemanager start
For each user who will be submitting MapReduce jobs using MapReduce v2 (YARN), or running Pig, Hive, or Sqoop in a YARN installation, set the HADOOP_MAPRED_HOME environment variable as follows:
$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
Verify basic cluster operation for YARN. At this point your cluster is upgraded and ready to run jobs. Before running your production jobs, verify basic cluster operation by running an example from the Apache Hadoop web site.
Note: For important configuration information, see Deploying MapReduce v2 (YARN) on a Cluster.
1. Create a home directory on HDFS for the user who will be running the job (for example, joe):
$ sudo -u hdfs hadoop fs -mkdir /user/joe
$ sudo -u hdfs hadoop fs -chown joe /user/joe
Do the following steps as the user joe. 2. Make a directory in HDFS called input and copy some XML files into it by running the following commands in pseudo-distributed mode:
$ hadoop fs -mkdir input
$ hadoop fs -put /etc/hadoop/conf/*.xml input
$ hadoop fs -ls input
Found 3 items:
-rw-r--r--   1 joe supergroup  1348 2012-02-13 12:21 input/core-site.xml
-rw-r--r--   1 joe supergroup  1913 2012-02-13 12:21 input/hdfs-site.xml
-rw-r--r--   1 joe supergroup  1001 2012-02-13 12:21 input/mapred-site.xml
4. Run an example Hadoop job to grep with a regular expression in your input data.
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output23 'dfs[a-z.]+'
5. After the job completes, you can find the output in the HDFS directory named output23 because you specified that output directory to Hadoop.
$ hadoop fs -ls
Found 2 items
drwxr-xr-x   - joe supergroup  0 2009-08-18 18:36 /user/joe/input
drwxr-xr-x   - joe supergroup  0 2009-08-18 18:38 /user/joe/output23
You can see that there is a new directory called output23. 6. List the output files.
$ hadoop fs -ls output23
Found 2 items
drwxr-xr-x   - joe supergroup     0 2009-02-25 10:33 /user/joe/output23/_SUCCESS
-rw-r--r--   1 joe supergroup  1068 2009-02-25 10:33 /user/joe/output23/part-r-00000
You have now confirmed your cluster is successfully running CDH4. Important: If you have client hosts, make sure you also update them to CDH4, and upgrade the components running on those clients as well.
Note: If you need to restart the NameNode during this period (after having begun the upgrade process, but before you've run finalizeUpgrade), simply restart your NameNode without the -upgrade option.
2. Finalize the HDFS metadata upgrade: use one of the following commands, depending on whether Kerberos is enabled (see Configuring Hadoop Security in CDH4).
If Kerberos is enabled:
$ kinit -kt /path/to/hdfs.keytab hdfs/<[email protected]> && hdfs dfsadmin -finalizeUpgrade
If Kerberos is not enabled:
$ sudo -u hdfs hdfs dfsadmin -finalizeUpgrade
Note: After the metadata upgrade completes, the previous/ and blocksBeingWritten/ directories in the DataNodes' data directories aren't cleared until the DataNodes are restarted.
Requirements
1. The CDH4 cluster must have a MapReduce service running on it. This may be MRv1 or YARN (MRv2).
2. All the MapReduce nodes in the CDH4 cluster should have full network access to all the nodes of the source cluster. This allows you to perform the copy in a distributed manner.
Note: The term source refers to the CDH3 (or other Hadoop) cluster you want to migrate or copy data from; destination refers to the CDH4 cluster.
Run the DistCp copy by issuing a command such as the following on the CDH4 cluster:
$ hadoop distcp hftp://cdh3-namenode:50070/ hdfs://cdh4-nameservice/
DistCp will then submit a regular MapReduce job that performs a file-by-file copy.
Post-migration Verification
After migrating data between the two clusters, it is a good idea to use hadoop fs -ls /basePath to verify the permissions, ownership and other aspects of your files, and correct any problems before using the files in your new cluster.
Important: Use the right instructions: the following instructions describe how to upgrade to the latest CDH4 release from an earlier CDH4 release. If you are upgrading from a CDH3 release, use the instructions under Upgrading from CDH3 to CDH4 instead.
MapReduce v1 (MRv1) and MapReduce v2 (YARN): this page covers upgrade for MapReduce v1 (MRv1) and MapReduce v2 (YARN). MRv1 and YARN share common configuration files, so it is safe to configure both of them so long as you run only one set of daemons at any one time. Cloudera does not support running MRv1 and YARN daemons on the same nodes at the same time; it will degrade performance and may result in an unstable cluster deployment. Before deciding to deploy YARN, make sure you read the discussion on the CDH4 Installation page under MapReduce 2.0 (YARN).
Note: Running Services
When starting, stopping and restarting CDH components, always use the service(8) command rather than running scripts in /etc/init.d directly. This is important because service sets the current working directory to / and removes most environment variables (passing only LANG and TERM), so as to create a predictable environment in which to administer the service. If you run the scripts in /etc/init.d, any environment variables you have set remain in force and could produce unpredictable results. (If you install CDH from packages, service will be installed as part of the Linux Standard Base (LSB).)
The following sections provide the information and instructions you need:
Before You Begin
Upgrading from an Earlier CDH4 Release to the Latest Version
2. As root, check each host to make sure that no processes are running as the hdfs, yarn, mapred, or httpfs users:
# ps -aef | grep java
Important: When you are sure that all Hadoop services have been shut down, do the following step. It is particularly important that the NameNode service is not running so that you can make a consistent backup.
3. Back up the HDFS metadata on the NameNode machine, as follows.
Note: Cloudera recommends backing up HDFS metadata on a regular basis, as well as before a major upgrade. dfs.name.dir is deprecated but still works; dfs.namenode.name.dir is preferred. This example uses dfs.name.dir.
a. Find the location of your dfs.name.dir (or dfs.namenode.name.dir); for example:
$ grep -C1 dfs.name.dir /etc/hadoop/conf/hdfs-site.xml
<property>
  <name>dfs.name.dir</name>
  <value>/mnt/hadoop/hdfs/name</value>
</property>
b. Back up the directory. The path inside the <value> XML element is the path to your HDFS metadata. If you see a comma-separated list of paths, there is no need to back up all of them; they store the same data. Back up the first directory, for example, by using the following commands:
$ cd /mnt/hadoop/hdfs/name
# tar -cvf /root/nn_backup_data.tar .
./
./current/
./current/fsimage
./current/fstime
./current/VERSION
./current/edits
./image/
./image/fsimage
Warning: If you see a file containing the word lock, the NameNode is probably still running. Repeat the preceding steps from the beginning; start at Step 1 and shut down the Hadoop services.
Step 2: Download the CDH4 package on each of the hosts in your cluster.
Before you begin: Check whether you have the CDH4 "1-click" repository installed. On Red Hat/CentOS-compatible and SLES systems:
rpm -q cdh4-repository
If you are upgrading from CDH4 Beta 1 or later, you should see:
cdh4-repository-1-0
If the repository is installed, skip to Step 3; otherwise proceed with this step.
Red Hat/CentOS/Oracle 5
Note: For instructions on how to add a CDH4 yum repository or build your own CDH4 yum repository, see Installing CDH4 On Red Hat-compatible systems.
On SLES systems:
1. Click this link, choose Save File, and save it to a directory to which you have write access (it can be your home directory).
2. Install the RPM:
$ sudo rpm -i cloudera-cdh-4-0.x86_64.rpm
Note: For instructions on how to add a repository or build your own repository, see Installing CDH4 on SLES Systems. Now update your system package index by running:
$ sudo zypper refresh
On Ubuntu and Debian systems:
1. Click one of the following: this link for a Squeeze system, this link for a Lucid system, or this link for a Precise system.
2. Install the package. Do one of the following:
Choose Open with in the download window to use the package manager, or
Choose Save File, save the package to a directory to which you have write access (it can be your home directory), and install it from the command line, for example:
sudo dpkg -i cdh4-repository_1.0_all.deb
Note: For instructions on how to add a repository or build your own repository, see Installing CDH4 on Ubuntu Systems.
Step 3a: If you are using MRv1, upgrade the MRv1 packages on the appropriate hosts.
Skip this step if you are using YARN exclusively. Otherwise upgrade each type of daemon package on the appropriate hosts as follows:
1. Install and deploy ZooKeeper.
Important: Cloudera recommends that you install (or update) and start a ZooKeeper cluster before proceeding. This is a requirement if you are deploying high availability (HA) for the NameNode or JobTracker. Follow the instructions under ZooKeeper Installation.
2. Install each type of daemon package on the appropriate system(s), following the tables of install commands for Red Hat/CentOS, SLES, and Ubuntu or Debian systems; the worker packages go on all cluster hosts except the JobTracker, NameNode, and Secondary (or Standby) NameNode hosts. A representative example of the command pattern is shown below.
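As a sketch of that pattern only (assuming the standard CDH4 MRv1 package names), on a Red Hat-compatible system you would run, for example, on the JobTracker host and on each worker host respectively:
$ sudo yum install hadoop-0.20-mapreduce-jobtracker
$ sudo yum install hadoop-0.20-mapreduce-tasktracker hadoop-hdfs-datanode
On SLES systems use sudo zypper install, and on Ubuntu or Debian systems use sudo apt-get install, with the same package names.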
Step 3b: If you are using YARN, upgrade the YARN packages on the appropriate hosts.
Skip this step if you are using MRv1 exclusively. Otherwise upgrade each type of daemon package on the appropriate hosts as follows:
1. Install and deploy ZooKeeper.
Important: Cloudera recommends that you install (or update) and start a ZooKeeper cluster before proceeding. This is a requirement if you are deploying high availability (HA) for the NameNode or JobTracker. Follow the instructions under ZooKeeper Installation.
2. Install each type of daemon package on the appropriate system(s), following the tables of install commands for Red Hat/CentOS, SLES, and Ubuntu or Debian systems; the worker packages go on all cluster hosts except the ResourceManager (these hosts are analogous to MRv1 TaskTrackers). A representative example of the command pattern is shown below.
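Similarly, as a sketch of the YARN pattern (again assuming the standard CDH4 package names), on a Red Hat-compatible system you would run, for example, on the ResourceManager host and on each worker host respectively:
$ sudo yum install hadoop-yarn-resourcemanager
$ sudo yum install hadoop-yarn-nodemanager hadoop-hdfs-datanode hadoop-mapreduce
On SLES and Ubuntu or Debian systems, use sudo zypper install or sudo apt-get install with the same package names.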
Note: The hadoop-yarn and hadoop-hdfs packages are installed on each system automatically as dependencies of the other packages.
Note: The NameNode upgrade process can take a while depending on how many files you have. You can watch the progress of the upgrade by running:
$ sudo tail -f /var/log/hadoop-hdfs/hadoop-hdfs-namenode-<hostname>.log
Look for a line that confirms the upgrade is complete, such as:
/var/lib/hadoop-hdfs/cache/hadoop/dfs/<name> is complete
3. Wait for NameNode to exit safe mode, and then start the Secondary NameNode (if used) and complete the cluster upgrade.
a. To check that the NameNode has exited safe mode, look for messages in the log file, or the NameNode's web interface, that say "...no longer in safe mode."
b. To start the Secondary NameNode (if used), enter the following command on the Secondary NameNode host:
$ sudo service hadoop-hdfs-secondarynamenode start
Step 5a: Verify that /tmp Exists and Has the Right Permissions
Important: If you do not create /tmp properly, with the right permissions as shown below, you may have problems with CDH components later. Specifically, if you don't create /tmp yourself, another process may create it automatically with restrictive permissions that will prevent your other applications from using it. Create the /tmp directory after HDFS is up and running, and set its permissions to 1777 (drwxrwxrwt), as follows:
$ sudo -u hdfs hadoop fs -mkdir /tmp
$ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
Note: If Kerberos is enabled, do not use commands in the form sudo -u <user> <command>; they will fail with a security error. Instead, use the following commands: $ kinit <user> (if you are using a password) or $ kinit -kt <keytab> <principal> (if you are using a keytab) and then, for each command executed by this user, $ <command>
For MRv1
For YARN
Start MRv1
Step 6a: Start MapReduce (MRv1)
Important: Make sure you are not trying to run MRv1 and YARN on the same set of nodes at the same time. This is not supported; it will degrade your performance and may result in an unstable MapReduce cluster deployment. Steps 6a and 6b are mutually exclusive.
After you have verified HDFS is operating correctly, you are ready to start MapReduce. On each TaskTracker system:
$ sudo service hadoop-0.20-mapreduce-tasktracker start
If the permissions of directories are not configured correctly, the JobTracker and TaskTracker processes start and immediately fail. If this happens, check the JobTracker and TaskTracker logs and set the permissions correctly.
Verify basic cluster operation for MRv1. At this point your cluster is upgraded and ready to run jobs. Before running your production jobs, verify basic cluster operation by running an example from the Apache Hadoop web site. Before you proceed, make sure the HADOOP_HOME environment variable is unset:
$ unset HADOOP_HOME
Note: To submit MapReduce jobs using MRv1 in CDH4 Beta 1, you needed either to set the HADOOP_HOME environment variable or run a launcher script. This is no longer true in later CDH4 releases; HADOOP_HOME has now been fully deprecated, and it is good practice to unset it.
Note: For important configuration information, see Deploying MapReduce v1 (MRv1) on a Cluster.
Do the following steps as the user joe. 2. Make a directory in HDFS called input and copy some XML files into it by running the following commands:
$ hadoop fs -mkdir input
$ hadoop fs -put /etc/hadoop/conf/*.xml input
$ hadoop fs -ls input
Found 3 items:
-rw-r--r--   1 joe supergroup  1348 2012-02-13 12:21 input/core-site.xml
-rw-r--r--   1 joe supergroup  1913 2012-02-13 12:21 input/hdfs-site.xml
-rw-r--r--   1 joe supergroup  1001 2012-02-13 12:21 input/mapred-site.xml
3. Run an example Hadoop job to grep with a regular expression in your input data.
$ /usr/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar grep input output 'dfs[a-z.]+'
4. After the job completes, you can find the output in the HDFS directory named output because you specified that output directory to Hadoop.
$ hadoop fs -ls
Found 2 items
drwxr-xr-x   - joe supergroup  0 2009-08-18 18:36 /user/joe/input
drwxr-xr-x   - joe supergroup  0 2009-08-18 18:38 /user/joe/output
You can see that there is a new directory called output. 5. List the output files.
$ hadoop fs -ls output
Found 3 items
drwxr-xr-x   - joe supergroup     0 2009-02-25 10:33 /user/joe/output/_logs
-rw-r--r--   1 joe supergroup  1068 2009-02-25 10:33 /user/joe/output/part-00000
-rw-r--r--   1 joe supergroup     0 2009-02-25 10:33 /user/joe/output/_SUCCESS
You have now confirmed your cluster is successfully running CDH4. Important: If you have client hosts, make sure you also update them to CDH4, and upgrade the components running on those clients as well.
Note: You need to create this directory because it is the parent of /var/log/hadoop-yarn/apps which is explicitly configured in the yarn-site.xml. Verify the directory structure, ownership, and permissions:
$ sudo -u hdfs hadoop fs -ls -R /
To start YARN, start the ResourceManager and NodeManager services.
Note: Make sure you always start ResourceManager before starting NodeManager services.
On the ResourceManager system:
$ sudo service hadoop-yarn-resourcemanager start
On each NodeManager system (typically the same ones where DataNode service runs):
$ sudo service hadoop-yarn-nodemanager start
For each user who will be submitting MapReduce jobs using MapReduce v2 (YARN), or running Pig, Hive, or Sqoop in a YARN installation, set the HADOOP_MAPRED_HOME environment variable as follows:
$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
Verify basic cluster operation for YARN. At this point your cluster is upgraded and ready to run jobs. Before running your production jobs, verify basic cluster operation by running an example from the Apache Hadoop web site.
Note: For important configuration information, see Deploying MapReduce v2 (YARN) on a Cluster.
1. Create a home directory on HDFS for the user who will be running the job (for example, joe):
$ sudo -u hdfs hadoop fs -mkdir /user/joe
$ sudo -u hdfs hadoop fs -chown joe /user/joe
Do the following steps as the user joe. 2. Make a directory in HDFS called input and copy some XML files into it by running the following commands in pseudo-distributed mode:
$ hadoop fs -mkdir input
$ hadoop fs -put /etc/hadoop/conf/*.xml input
$ hadoop fs -ls input
Found 3 items:
-rw-r--r--   1 joe supergroup  1348 2012-02-13 12:21 input/core-site.xml
-rw-r--r--   1 joe supergroup  1913 2012-02-13 12:21 input/hdfs-site.xml
-rw-r--r--   1 joe supergroup  1001 2012-02-13 12:21 input/mapred-site.xml
4. Run an example Hadoop job to grep with a regular expression in your input data.
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output23 'dfs[a-z.]+'
5. After the job completes, you can find the output in the HDFS directory named output23 because you specified that output directory to Hadoop.
$ hadoop fs -ls
Found 2 items
drwxr-xr-x   - joe supergroup  0 2009-08-18 18:36 /user/joe/input
drwxr-xr-x   - joe supergroup  0 2009-08-18 18:38 /user/joe/output23
You can see that there is a new directory called output23. 6. List the output files:
$ hadoop fs -ls output23
Found 2 items
drwxr-xr-x   - joe supergroup     0 2009-02-25 10:33 /user/joe/output23/_SUCCESS
-rw-r--r--   1 joe supergroup  1068 2009-02-25 10:33 /user/joe/output23/part-r-00000
You have now confirmed your cluster is successfully running CDH4. Important: If you have client hosts, make sure you also update them to CDH4, and upgrade the components running on those clients as well.
Note: If you need to restart the NameNode during this period (after having begun the upgrade process, but before you've run finalizeUpgrade), simply restart your NameNode without the -upgrade option.
2. Finalize the HDFS metadata upgrade: use one of the following commands, depending on whether Kerberos is enabled (see Configuring Hadoop Security in CDH4).
If Kerberos is enabled:
$ kinit -kt /path/to/hdfs.keytab hdfs/<[email protected]> && hdfs dfsadmin -finalizeUpgrade
If Kerberos is not enabled:
$ sudo -u hdfs hdfs dfsadmin -finalizeUpgrade
Note: After the metadata upgrade completes, the previous/ and blocksBeingWritten/ directories in the DataNodes' data directories aren't cleared until the DataNodes are restarted.
Columns: Service | Port | Access Requirement | Configuration | Comment

Hadoop HDFS
DataNode | 50010 TCP | External | dfs.datanode.address |
DataNode | | External | dfs.datanode.address |
DataNode | 50075 TCP | External | dfs.datanode.http.address |
DataNode | | External | dfs.datanode.http.address |
DataNode | 50020 TCP | External | dfs.datanode.ipc.address |
NameNode | 8020 TCP | External | fs.default.name or fs.defaultFS |
NameNode | 50070 TCP | External | dfs.http.address or dfs.namenode.http-address |
NameNode | | External | dfs.https.address |
Secondary NameNode | 50090 TCP | Internal | dfs.secondary.http.address or dfs.namenode.secondary.http-address |
Secondary NameNode | | Internal | dfs.secondary.https.address |
JournalNode | 8485 TCP | Internal | dfs.namenode.shared.edits.dir |
JournalNode | 8480 TCP | Internal | |

Hadoop MRv1
JobTracker | 8021 TCP | External | mapred.job.tracker |
JobTracker | 50030 TCP | External | |
JobTracker | | Internal | jobtracker.thrift.address |
TaskTracker | 50060 TCP | External | |
TaskTracker | TCP | Localhost | mapred.task.tracker.report.address |

Hadoop YARN
ResourceManager | 8032 TCP | | |
ResourceManager | 8030 TCP | | |
ResourceManager | 8031 TCP | | |
ResourceManager | 8033 TCP | | |
ResourceManager | 8088 TCP | | |
NodeManager | 8040 TCP | | |
NodeManager | 8042 TCP | | |
NodeManager | 8041 TCP | | |
| 10020 TCP | | |
| 19888 TCP | | |

HBase
Master | 60000 TCP | External | hbase.master.port | IPC
Master | 60010 TCP | External | hbase.master.info.port | HTTP
RegionServer | 60020 TCP | External | | IPC
RegionServer | 60030 TCP | External | | HTTP
HQuorumPeer (HBase-managed ZK mode) | 2181 TCP | | |
HQuorumPeer (HBase-managed ZK mode) | 2888 TCP | | |
HQuorumPeer (HBase-managed ZK mode) | 3888 TCP | | |
REST | | External | hbase.rest.port |
REST UI | 8085 TCP | External | |
ThriftServer | | External | |
ThriftServer | 9095 TCP | External | |
| | External | | on CLI

Hive
Metastore | 9083 TCP | External | |
HiveServer | 10000 TCP | External | |

Sqoop
Metastore | 16000 TCP | External | |

Sqoop 2
Sqoop 2 server | 12000 TCP | External | |
Sqoop 2 | 8005 TCP | External | | Admin port

ZooKeeper
| 2181 TCP | External | clientPort | Client port
(... Cloudera Manager 4) | 2888 TCP | Internal | X in server.N=host:X:Y | Peer
| 3888 TCP | Internal | X in server.N=host:X:Y | Peer
| 3181 TCP | Internal | X in server.N=host:X:Y | Peer
| 4181 TCP | Internal | X in server.N=host:X:Y | Peer
| 8019 TCP | Internal | | Used for HA
| 9010 TCP | Internal | |
Note: ZooKeeper will also use another randomly selected port for RMI. In order for Cloudera Manager to monitor ZooKeeper, you must open up all ports when the connection originates from the Cloudera Manager server.

Hue
Server | 8888 TCP | External | |
Beeswax Server | 8002 | Internal | |
Beeswax Metastore | 8003 | Internal | |

Oozie
Oozie Server | 11000 TCP | External | OOZIE_HTTP_PORT in oozie-env.sh | HTTP
Oozie Server | 11001 TCP | localhost | OOZIE_ADMIN_PORT in oozie-env.sh | Shutdown port

Ganglia
ganglia-gmond | 8649 UDP/TCP | Internal | |
ganglia-web | 80 TCP | External | | Via Apache httpd

Kerberos
(Secure) | 88 UDP/TCP | External | kdc_ports and kdc_tcp_ports in either the [kdcdefaults] or [realms] sections of kdc.conf |
| 749 TCP | Internal | kadmind_port in the [realms] section of kdc.conf |
Note: This is a temporary measure only. The hostname set by hostname does not survive across reboots.
2. Make sure the /etc/hosts file on each system contains the IP addresses and fully-qualified domain names (FQDN) of all the members of the cluster.
Important: The canonical name of each host in /etc/hosts must be the FQDN (for example myhost-1.mynet.myco.com), not the unqualified hostname (for example myhost-1). The canonical name is the first entry after the IP address.
If you are using DNS, storing this information in /etc/hosts is not required, but it is good practice.
3. Make sure the /etc/sysconfig/network file on each system contains the hostname you have just set (or verified) for that system, for example myhost-1.
4. Check that this system is consistently identified to the network:
a. Run uname -a and check that the hostname matches the output of the hostname command.
b. Run /sbin/ifconfig and note the value of inet addr in the eth0 entry, for example:
$ /sbin/ifconfig eth0 Link encap:Ethernet HWaddr 00:0C:29:A4:E8:97 inet addr:172.29.82.176 Bcast:172.29.87.255 Mask:255.255.248.0 ...
c. Run host -v -t A `hostname` and make sure that hostname matches the output of the hostname command, and has the same IP address as reported by ifconfig for eth0; for example:
$ host -v -t A `hostname` Trying "myhost.mynet.myco.com" ...
5. For MRv1: make sure conf/core-site.xml and conf/mapred-site.xml, respectively, have the hostnames (not the IP addresses) of the NameNode and the JobTracker. These can be FQDNs (for example myhost-1.mynet.myco.com) or unqualified hostnames (for example myhost-1). See Customizing Configuration Files and Deploying MapReduce v1 (MRv1) on a Cluster.
6. For YARN: make sure conf/core-site.xml and conf/yarn-site.xml, respectively, have the hostnames (not the IP addresses) of the NameNode, the ResourceManager, and the ResourceManager Scheduler. See Customizing Configuration Files and Deploying MapReduce v2 (YARN) on a Cluster.
7. Make sure that components that depend on a client-server relationship (Oozie, HBase, ZooKeeper) are configured according to the instructions on their installation pages:
Oozie Installation
HBase Installation
ZooKeeper Installation
You can call this configuration anything you like; in this example, it's called my_cluster. 2. Set alternatives to point to your custom directory, as follows.
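For example, a sketch on a Red Hat-compatible system (the directory name my_cluster and the priority 50 are illustrative values): copy the default configuration into your custom directory, then point the hadoop-conf alternative at it:
$ sudo cp -r /etc/hadoop/conf.dist /etc/hadoop/conf.my_cluster
$ sudo alternatives --verbose --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.my_cluster 50
$ sudo alternatives --set hadoop-conf /etc/hadoop/conf.my_cluster
On Ubuntu and SLES systems, use update-alternatives with the same arguments.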
For more information on alternatives, see the update-alternatives(8) man page on Ubuntu and SLES systems or the alternatives(8) man page On Red Hat-compatible systems. Important: When performing the configuration tasks in this section, and when you go on to deploy MRv1 or YARN, edit the configuration files in your custom directory (for example /etc/hadoop/conf.my_cluster). Do not create your custom configuration in the default directory /etc/hadoop/conf.dist.
Property: fs.defaultFS
Configuration File: conf/core-site.xml
Description: Note: fs.default.name is deprecated. Specifies the NameNode and the default file system, in the form hdfs://<namenode host>:<namenode port>/. The default value is file:///. The default file system is used to resolve relative paths; for example, if fs.default.name or fs.defaultFS is set to hdfs://mynamenode/, the relative URI /mydir/myfile resolves to hdfs://mynamenode/mydir/myfile. Note: for the cluster to function correctly, the <namenode> part of the string must be the hostname (for example mynamenode), not the IP address.
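For example, a minimal core-site.xml entry of this form (namenode-host.company.com and port 8020 are placeholder values to replace with your own):
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenode-host.company.com:8020</value>
</property>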
Property: dfs.permissions.superusergroup
Configuration File: conf/hdfs-site.xml
Description: Specifies the UNIX group containing users that will be treated as superusers by HDFS. You can stick with the value of 'hadoop' or pick your own group depending on the security policies at your site.
hdfs-site.xml:
<property>
  <name>dfs.permissions.superusergroup</name>
  <value>hadoop</value>
</property>
Property: dfs.name.dir or dfs.namenode.name.dir
Description: Specifies the directories where the NameNode stores its metadata and edit logs. Cloudera recommends that you specify at least two directories. One of these should be located on an NFS mount point, unless you will be using a High Availability (HA) configuration.
Property: dfs.data.dir or dfs.datanode.data.dir
Description: Specifies the directories where the DataNode stores blocks. Cloudera recommends that you configure the disks on the DataNode in a JBOD configuration, mounted at /data/1/ through /data/N, and configure dfs.data.dir or dfs.datanode.data.dir to specify /data/1/dfs/dn through /data/N/dfs/dn/.
Note:
dfs.data.dir and dfs.name.dir are deprecated; you should use dfs.datanode.data.dir and dfs.namenode.name.dir instead, though dfs.data.dir and dfs.name.dir will still work.
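A sketch of the corresponding hdfs-site.xml entries, using the example paths referenced below (substitute your own directories):
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///data/1/dfs/nn,file:///nfsmount/dfs/nn</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///data/1/dfs/dn,file:///data/2/dfs/dn,file:///data/3/dfs/dn,file:///data/4/dfs/dn</value>
</property>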
After specifying these directories as shown above, you must create the directories and assign the correct file permissions to them on each node in your cluster.
In the following instructions, local path examples are used to represent Hadoop parameters. Change the path examples to match your configuration.
Local directories:
The dfs.name.dir or dfs.namenode.name.dir parameter is represented by the /data/1/dfs/nn and /nfsmount/dfs/nn path examples.
The dfs.data.dir or dfs.datanode.data.dir parameter is represented by the /data/1/dfs/dn, /data/2/dfs/dn, /data/3/dfs/dn, and /data/4/dfs/dn path examples.
To configure local storage directories for use by HDFS:
1. On a NameNode host: create the dfs.name.dir or dfs.namenode.name.dir local directories:
$ sudo mkdir -p /data/1/dfs/nn /nfsmount/dfs/nn
Important: If you are using High Availability (HA), you should not configure these directories on an NFS mount; configure them on local storage. 2. On all DataNode hosts: create the dfs.data.dir or dfs.datanode.data.dir local directories:
$ sudo mkdir -p /data/1/dfs/dn /data/2/dfs/dn /data/3/dfs/dn /data/4/dfs/dn
Here is a summary of the correct owner and permissions of the local directories:
Directory | Owner | Permissions (see Footnote 1)
dfs.name.dir or dfs.namenode.name.dir | hdfs:hdfs | drwx------
dfs.data.dir or dfs.datanode.data.dir | hdfs:hdfs | drwx------
Footnote: 1 The Hadoop daemons automatically set the correct permissions for you on dfs.data.dir or dfs.datanode.data.dir. But in the case of dfs.name.dir or dfs.namenode.name.dir, permissions are currently incorrectly set to the file-system default, usually drwxr-xr-x (755). Use the chmod command to reset permissions for these dfs.name.dir or dfs.namenode.name.dir directories to drwx------ (700); for example:
$ sudo chmod 700 /data/1/dfs/nn /nfsmount/dfs/nn
or
$ sudo chmod go-rx /data/1/dfs/nn /nfsmount/dfs/nn
Note: If you specified nonexistent directories for the dfs.data.dir or dfs.datanode.data.dir property in the conf/hdfs-site.xml file, CDH4 will shut down. (In previous releases, CDH3 silently ignored nonexistent directories for dfs.data.dir.)
Note: If Kerberos is enabled, do not use commands in the form sudo -u <user> <command>; they will fail with a security error. Instead, use the following commands: $ kinit <user> (if you are using a password) or $ kinit -kt <keytab> <principal> (if you are using a keytab) and then, for each command executed by this user, $ <command>
You'll get a confirmation prompt; for example:
Re-format filesystem in /data/namedir ? (Y or N)
Respond with an upper-case Y; if you use lower case, the process will abort.
These options configure a soft mount over TCP; transactions will be retried ten times (retrans=10) at 1-second intervals (timeo=10) before being deemed to have failed. Example:
mount -t nfs -o tcp,soft,intr,timeo=10,retrans=10 <server>:<export> <mount_point>
where <server> is the remote host, <export> is the exported file system, and <mount_point> is the local mount point. Note: Cloudera recommends similar settings for shared HA mounts, as in the example that follows. Example for HA:
mount -t nfs -o tcp,soft,intr,timeo=50,retrans=12 <server>:<export> <mount_point>
Note: dfs.http.address is deprecated; use dfs.namenode.http-address. In most cases, you should set dfs.namenode.http-address to a routable IP address with port 50070. However, in some cases such as Amazon EC2, when the NameNode should bind to multiple local addresses, you may want to set dfs.namenode.http-address to 0.0.0.0:50070 on the NameNode machine only, and set it to a real, routable address on the Secondary NameNode machine. The different addresses are needed in this case because HDFS uses dfs.namenode.http-address for two different purposes: it defines both the address the NameNode binds to, and the address the Secondary NameNode connects to for checkpointing. Using 0.0.0.0 on the NameNode allows the NameNode to bind to all its local addresses, while using the externally-routable address on the Secondary NameNode provides the Secondary NameNode with a real address to connect to.
For more information, see Multi-host SecondaryNameNode Configuration.
More about the Secondary NameNode
The NameNode stores the HDFS metadata information in RAM to speed up interactive lookups and modifications of the metadata. For reliability, this information is flushed to disk periodically. To ensure that these writes are not a speed bottleneck, only the list of modifications is written to disk, not a full snapshot of the current filesystem. The list of modifications is appended to a file called edits. Over time, the edits log file can grow quite large and consume large amounts of disk space.
When the NameNode is restarted, it takes the HDFS system state from the fsimage file, then applies the contents of the edits log to construct an accurate system state that can be loaded into the NameNode's RAM. If you restart a large cluster that has run for a long period with no Secondary NameNode, the edits log may be quite large, and so it can take some time to reconstruct the system state to be loaded into RAM.
When the Secondary NameNode is configured, it periodically (once an hour, by default) constructs a checkpoint by compacting the information in the edits log and merging it with the most recent fsimage file; it then clears the edits log. So, when the NameNode restarts, it can use the latest checkpoint and apply the contents of the smaller edits log.
Secondary NameNode Parameters
The behavior of the Secondary NameNode is controlled by the following parameters in hdfs-site.xml:
dfs.namenode.checkpoint.check.period
dfs.namenode.checkpoint.txns
dfs.namenode.checkpoint.dir
dfs.namenode.checkpoint.edits.dir
dfs.namenode.num.checkpoints.retained
Enabling Trash
Important: The trash feature is disabled by default. Cloudera recommends that you enable it on all production clusters.
Property: fs.trash.interval
Value: minutes or 0
Description: The number of minutes after which a trash checkpoint directory is deleted. This option may be configured both on the server and the client; in releases prior to CDH4.1 this option was only configured on the client. If trash is enabled in the server configuration, then the value configured on the server is used and the client configuration is ignored. If trash is disabled in the server configuration, then the client-side configuration is checked. If the value of this property is zero (the default), then the trash feature is disabled.

Property: fs.trash.checkpoint.interval
Value: minutes or 0
Description: The number of minutes between trash checkpoints. Every time the checkpointer runs on the NameNode, it creates a new checkpoint of the "Current" directory and removes checkpoints older than fs.trash.interval minutes. This value should be smaller than or equal to fs.trash.interval. This option is configured on the server. If configured to zero (the default), then the value is set to the value of fs.trash.interval.
For example, to enable trash so that files deleted using the Hadoop shell are not deleted for 24 hours, set the value of the fs.trash.interval property in the server's core-site.xml file to a value of 1440.
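A sketch of that server-side core-site.xml entry:
<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
</property>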
Value: org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy
Description: Enables storage balancing among the DataNode's volumes.

Property: dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold
Value: 10737418240 (default)
Description: The amount by which volumes are allowed to differ from each other in terms of bytes of free disk space before they are considered imbalanced. The default is 10737418240 (10 GB). If the free space on each volume is within this range of the other volumes, the volumes will be considered balanced and block assignments will be done on a pure round-robin basis.

What proportion of new block allocations will be sent to volumes with more available disk space than others. The allowable range is 0.0-1.0, but set it in the range 0.5 - 1.0 (that is, 50-100%), since there should be no reason to prefer that volumes with less available disk space receive more block allocations.
Enabling WebHDFS
If you want to use WebHDFS, you must first enable it. To enable WebHDFS:
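A minimal hdfs-site.xml sketch (dfs.webhdfs.enabled is the standard switch for this feature; treat the snippet as illustrative):
<property>
  <name>dfs.webhdfs.enabled</name>
  <value>true</value>
</property>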
Note: If you want to use WebHDFS in a secure cluster, you must set additional properties to configure secure WebHDFS. For instructions, see Configure secure WebHDFS.
Configuring LZO
If you have installed LZO, configure it as follows. To configure LZO: Set the following property in core-site.xml:
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
For instructions on configuring a highly available JobTracker, see Configuring High Availability for the JobTracker (MRv1); you need to configure mapred.job.tracker differently in that case, and you must not use the port number.
Property: mapred.job.tracker
Configuration File: conf/mapred-site.xml
Description: If you plan to run your cluster with MRv1 daemons, you need to specify the hostname and (optionally) port of the JobTracker's RPC server, in the form <host>:<port>. See Configuring Ports for CDH4 for the default port. If the value is set to local, the default, the JobTracker runs on demand when you run a MapReduce job; do not try to start the JobTracker yourself in this case. Note: if you specify the host (rather than using local), this must be the hostname, not the IP address.
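A sketch of the corresponding mapred-site.xml entry (jobtracker-host.company.com is a placeholder for your JobTracker hostname):
<property>
  <name>mapred.job.tracker</name>
  <value>jobtracker-host.company.com:8021</value>
</property>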
Property: mapred.local.dir
Configuration File: mapred-site.xml on each TaskTracker
Description: This property specifies the directories where the TaskTracker will store temporary data and intermediate map output files while running MapReduce jobs. Cloudera recommends that this property specify a directory on each of the JBOD mount points; for example, /data/1/mapred/local through /data/N/mapred/local.
After specifying these directories in the mapred-site.xml file, you must create the directories and assign the correct file permissions to them on each node in your cluster.
To configure local storage directories for use by MapReduce:
In the following instructions, local path examples are used to represent Hadoop parameters. The mapred.local.dir parameter is represented by the /data/1/mapred/local, /data/2/mapred/local, /data/3/mapred/local, and /data/4/mapred/local path examples. Change the path examples to match your configuration.
1. Create the mapred.local.dir local directories:
$ sudo mkdir -p /data/1/mapred/local /data/2/mapred/local /data/3/mapred/local /data/4/mapred/local
The correct owner and permissions of these local directories are:
Owner | Permissions
mapred:hadoop | drwxr-xr-x
In practice, the dfs.data.dir and mapred.local.dir are often configured on the same set of disks, so a disk failure will result in the failure of both a dfs.data.dir and mapred.local.dir. See the section titled "Configuring the Node Health Check Script" in the Apache cluster setup documentation for further details.
2. Manually set alternatives on each node to point to that directory, as follows. To manually set the configuration on Red Hat-compatible systems:
$ sudo alternatives --verbose --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.my_cluster 50
$ sudo alternatives --set hadoop-conf /etc/hadoop/conf.my_cluster
For more information on alternatives, see the update-alternatives(8) man page on Ubuntu and SLES systems or the alternatives(8) man page On Red Hat-compatible systems.
Note: If Kerberos is enabled, do not use commands in the form sudo -u <user> <command>; they will fail with a security error. Instead, use the following commands: $ kinit <user> (if you are using a password) or $ kinit -kt <keytab> <principal> (if you are using a keytab) and then, for each command executed by this user, $ <command>
Important: If you create the mapred.system.dir directory in a different location, specify that path in the conf/mapred-site.xml file. When starting up, MapReduce sets the permissions for the mapred.system.dir directory to drwx------, assuming the user mapred owns that directory.
where <user> is the Linux username of each user. Alternatively, you can log in as each Linux user (or write a script to do so) and create the home directory as follows:
sudo -u hdfs hadoop fs -mkdir /user/$USER
sudo -u hdfs hadoop fs -chown $USER /user/$USER
Property: mapreduce.framework.name
Configuration File: conf/mapred-site.xml
Description: If you plan on running YARN, you must set this property to the value of yarn.
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
Property | Recommended value | Description
yarn.nodemanager.aux-services | mapreduce.shuffle |
yarn.resourcemanager.scheduler.address | |
yarn.resourcemanager.resource-tracker.address | |
yarn.resourcemanager.admin.address | |
yarn.resourcemanager.webapp.address | |
yarn.application.classpath | $HADOOP_CONF_DIR, $HADOOP_COMMON_HOME/*, $HADOOP_COMMON_HOME/lib/*, $HADOOP_HDFS_HOME/*, $HADOOP_HDFS_HOME/lib/*, $HADOOP_MAPRED_HOME/*, $HADOOP_MAPRED_HOME/lib/*, $YARN_HOME/*, $YARN_HOME/lib/* | Classpath for typical applications.
Next, you need to specify, create, and assign the correct permissions to the local directories where you want the YARN daemons to store data.
yarn.nodemanager.local-dirs
Specifies the directories where the NodeManager stores its localized files. All of the files required for running a particular YARN application will be put here for the duration of the application run. Cloudera recommends that this property specify a directory on each of the JBOD mount points; for example, /data/1/yarn/local through /data/N/yarn/local.
yarn.nodemanager.log-dirs
Specifies the directories where the NodeManager stores container log files. Cloudera recommends that this property specify a directory on each of the JBOD mount points; for example, /data/1/yarn/logs through /data/N/yarn/logs.
yarn.nodemanager.remote-app-log-dir
Specifies the directory where logs are aggregated. Set the value to /var/log/hadoop-yarn/apps. See also Step 9.
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce.shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/data/1/yarn/local,/data/2/yarn/local,/data/3/yarn/local</value>
</property>
<property>
  <name>yarn.nodemanager.log-dirs</name>
  <value>/data/1/yarn/logs,/data/2/yarn/logs,/data/3/yarn/logs</value>
</property>
<property>
  <description>Where to aggregate logs</description>
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <value>/var/log/hadoop-yarn/apps</value>
</property>
After specifying these directories in the yarn-site.xml file, you must create the directories and assign the correct file permissions to them on each node in your cluster. In the following instructions, local path examples are used to represent Hadoop parameters. Change the path examples to match your configuration. To configure local storage directories for use by YARN: 1. Create the yarn.nodemanager.local-dirs local directories:
$ sudo mkdir -p /data/1/yarn/local /data/2/yarn/local /data/3/yarn/local /data/4/yarn/local
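The yarn.nodemanager.log-dirs directories are created the same way; a sketch using matching example paths (adjust to your own mount points):
$ sudo mkdir -p /data/1/yarn/logs /data/2/yarn/logs /data/3/yarn/logs /data/4/yarn/logs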
Here is a summary of the correct owner and permissions of the local directories:
Directory | Owner | Permissions
yarn.nodemanager.local-dirs | yarn:yarn | drwxr-xr-x
yarn.nodemanager.log-dirs | yarn:yarn | drwxr-xr-x
Property: mapreduce.jobhistory.webapp.address
Recommended value: historyserver.company.com:19888
Description: The address of the JobHistory Server web application (host:port).
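A sketch of the corresponding entry (this property is typically set in mapred-site.xml; historyserver.company.com is a placeholder):
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>historyserver.company.com:19888</value>
</property>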
2. Once HDFS is up and running, you will create this directory and a history subdirectory under it (see Step 8).
Alternatively, you can do the following:
1. Configure mapreduce.jobhistory.intermediate-done-dir and mapreduce.jobhistory.done-dir in yarn-site.xml.
2. Create these two directories.
3. Set permissions on mapreduce.jobhistory.intermediate-done-dir to 1777.
4. Set permissions on mapreduce.jobhistory.done-dir to 750.
If you configure mapreduce.jobhistory.intermediate-done-dir and mapreduce.jobhistory.done-dir as above, you can skip Step 8.
For more information on alternatives, see the update-alternatives(8) man page on Ubuntu and SLES systems or the alternatives(8) man page On Red Hat-compatible systems.
Note: If Kerberos is enabled, do not use commands in the form sudo -u <user> <command>; they will fail with a security error. Instead, use the following commands: $ kinit <user> (if you are using a password) or $ kinit -kt <keytab> <principal> (if you are using a keytab) and then, for each command executed by this user, $ <command>
Step 8: Create the history Directory and Set Permissions and Owner
This is a subdirectory of the staging directory you configured in Step 4. In this example we're using /user/history. Create it and set permissions as follows:
sudo -u hdfs hadoop fs -mkdir /user/history
sudo -u hdfs hadoop fs -chmod -R 1777 /user/history
sudo -u hdfs hadoop fs -chown yarn /user/history
You need to create this directory because it is the parent of /var/log/hadoop-yarn/apps which is explicitly configured in the yarn-site.xml.
On each NodeManager system (typically the same ones where DataNode service runs):
$ sudo service hadoop-yarn-nodemanager start
To start the MapReduce JobHistory Server On the MapReduce JobHistory Server system:
$ sudo service hadoop-mapreduce-historyserver start
where <user> is the Linux username of each user. Alternatively, you can log in as each Linux user (or write a script to do so) and create the home directory as follows:
sudo -u hdfs hadoop fs -mkdir /user/$USER
sudo -u hdfs hadoop fs -chown $USER /user/$USER
Name | MRv1 location | YARN (MRv2) location
streaming | |
rumen | N/A | /usr/lib/hadoop-mapreduce/hadoop-rumen.jar
hadoop examples | /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar | /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
distcp v1 | /usr/lib/hadoop-0.20-mapreduce/hadoop-tools.jar | /usr/lib/hadoop-mapreduce/hadoop-extras.jar
distcp v2 | N/A | /usr/lib/hadoop-mapreduce/hadoop-distcp.jar
hadoop archives | /usr/lib/hadoop-0.20-mapreduce/hadoop-tools.jar | /usr/lib/hadoop-mapreduce/hadoop-archives.jar
Improving Performance
This section provides solutions to some performance problems, and describes configuration best practices. Important: If you are running CDH over 10Gbps Ethernet, improperly set network configuration or improperly applied NIC firmware or drivers can noticeably degrade performance. Work with your network engineers and hardware vendors to make sure that you have the proper NIC firmware, drivers, and configurations in place and that your network performs properly. Cloudera recognizes that network setup and upgrade are challenging problems, and will make best efforts to share any helpful experiences.
Disabling Transparent Hugepage Compaction
Most Linux platforms supported by CDH4 include a feature called transparent hugepage compaction, which interacts poorly with Hadoop workloads and can seriously degrade performance.
Symptom: top and other system monitoring tools show a large percentage of the CPU usage classified as "system CPU". If system CPU usage is 30% or more of the total CPU usage, your system may be experiencing this issue.
What to do:
Note: In the following instructions, defrag_file_pathname depends on your operating system:
Red Hat/CentOS: /sys/kernel/mm/redhat_transparent_hugepage/defrag
Ubuntu/Debian, OEL, SLES: /sys/kernel/mm/transparent_hugepage/defrag
1. To see whether transparent hugepage compaction is enabled, check the contents of defrag_file_pathname (for example, cat defrag_file_pathname):
[always] never means that transparent hugepage compaction is enabled.
always [never] means that transparent hugepage compaction is disabled.
2. To disable transparent hugepage compaction, add the following command to /etc/rc.local:
echo never > defrag_file_pathname
You can also disable transparent hugepage compaction interactively (but remember this will not survive a reboot). To disable transparent hugepage compaction temporarily as root:
# echo 'never' > defrag_file_pathname
The vm.swappiness kernel parameter controls how aggressively memory pages are swapped to disk. It can be set to a value between 0-100; the higher the value, the more aggressive the kernel is in seeking out inactive memory pages and swapping them to disk. You can see what value vm.swappiness is currently set to by looking at /proc/sys/vm; for example:
cat /proc/sys/vm/swappiness
On most systems, it is set to 60 by default. This is not suitable for Hadoop clusters nodes, because it can cause processes to get swapped out even when there is free memory available. This can affect stability and performance, and may cause problems such as lengthy garbage collection pauses for important system daemons. Cloudera recommends that you set this parameter to 0; for example:
# sysctl -w vm.swappiness=0
Improving Performance in Shuffle Handler and IFile Reader
As of CDH4.1, the MapReduce shuffle handler and IFile reader use native Linux calls (posix_fadvise(2) and sync_file_range(2)) on Linux systems with the Hadoop native libraries installed. The subsections that follow provide details.
Shuffle Handler
You can improve MapReduce shuffle handler performance by enabling shuffle readahead. This causes the TaskTracker or NodeManager to pre-fetch map output before sending it over the socket to the reducer.
To enable this feature for YARN, set the mapreduce.shuffle.manage.os.cache property to true (default). To further tune performance, adjust the value of the mapreduce.shuffle.readahead.bytes property. The default value is 4MB.
To enable this feature for MRv1, set the mapred.tasktracker.shuffle.fadvise property to true (default). To further tune performance, adjust the value of the mapred.tasktracker.shuffle.readahead.bytes property. The default value is 4MB.
Reduce the interval for JobClient status reports on single node systems The jobclient.progress.monitor.poll.interval property defines the interval (in milliseconds) at which JobClient reports status to the console and checks for job completion. The default value is 1000 milliseconds; you may want to set this to a lower value to make tests run faster on a single-node cluster. Adjusting this value on a large production cluster may lead to unwanted client-server traffic.
<property> <name>jobclient.progress.monitor.poll.interval</name> <value>10</value> </property>
Tune the JobTracker heartbeat interval Tuning the minimum interval for the TaskTracker-to-JobTracker heartbeat to a smaller value may improve MapReduce performance on small clusters.
<property> <name>mapreduce.jobtracker.heartbeat.interval.min</name> <value>10</value> </property>
Start MapReduce JVMs immediately The mapred.reduce.slowstart.completed.maps property specifies the proportion of Map tasks in a job that must be completed before any Reduce tasks are scheduled. For small jobs that require fast turnaround, setting this value to 0 can improve performance; larger values (as high as 50%) may be appropriate for larger jobs.
<property> <name>mapred.reduce.slowstart.completed.maps</name> <value>0</value> </property>
Best practices for HDFS Configuration
This section indicates changes you may want to make in hdfs-site.xml.
Improve Performance for Local Reads
Note: Also known as short-circuit local reads, this capability is particularly useful for HBase and Cloudera Impala. It improves the performance of node-local reads by providing a fast path that is enabled in this case. It requires libhadoop.so (the Hadoop Native Library) to be accessible to both the server and the client.
libhadoop.so is not available if you have installed from a tarball. You must install from an .rpm, .deb, or parcel in order to use short-circuit local reads.
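A sketch of the hdfs-site.xml properties involved (the socket path value follows the _PORT convention referenced in the note below; treat the exact values as illustrative):
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/run/hadoop-hdfs/dn._PORT</value>
</property>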
Note: The text _PORT appears just as shown; you do not need to substitute a number. If /var/run/hadoop-hdfs/ is group-writable, make sure its group is root.
Tips and Best Practices for Jobs
This section describes changes you can make at the job level.
Use the Distributed Cache to Transfer the Job JAR
Use the distributed cache to transfer the job JAR rather than using the JobConf(Class) constructor and the JobConf.setJar() and JobConf.setJarByClass() methods. To add JARs to the classpath, use -libjars <jar1>,<jar2>, which will copy the local JAR files to HDFS and then use the distributed cache mechanism to make sure they are available on the task nodes and are added to the task classpath. The advantage of this over JobConf.setJar is that if the JAR is on a task node, it won't need to be copied again if a second task from the same job runs on that node, though it will still need to be copied from the launch machine to HDFS.
Note:
-libjars works only if your MapReduce driver uses ToolRunner. If it doesn't, you would need to use
the DistributedCache APIs (Cloudera does not recommend this). For more information, see item 1 in the blog post How to Include Third-Party Libraries in Your MapReduce Job.
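For example (a sketch with hypothetical JAR and class names), a driver that uses ToolRunner can be launched like this, and the listed JARs are shipped via the distributed cache:
$ hadoop jar myjob.jar com.example.MyDriver -libjars mylib1.jar,mylib2.jar input output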
Flume Installation
Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store. This guide is for Flume 1.x (specifically the 1.3.0 release).
Note: If you are looking for Flume 0.9.x: For information on Flume 0.9.x, see the Flume 0.9.x documentation. To install Flume 0.9.x instead of Flume 1.x, go to http://archive.cloudera.com/cdh4-deprecated. You cannot install both Flume 0.9.x and Flume 1.x together on the same host.
The following sections provide more information and instructions:
Migrating from Flume 0.9x
Packaging
Installing a Tarball
Installing Packages
Configuring Flume
Verifying the Installation
Running Flume
Files Installed by the Packages
Supported Sources, Sinks, and Channels
Using an On-disk Encrypted File Channel
Apache Flume Documentation
Note: Running Services
When starting, stopping and restarting CDH components, always use the service(8) command rather than running scripts in /etc/init.d directly. This is important because service sets the current working directory to / and removes most environment variables (passing only LANG and TERM), so as to create a predictable environment in which to administer the service. If you run the scripts in /etc/init.d, any environment variables you have set remain in force and could produce unpredictable results. (If you install CDH from packages, service will be installed as part of the Linux Standard Base (LSB).)
On SLES systems:
$ sudo zypper remove flume
On Ubuntu systems:
$ sudo apt-get remove flume
Flume Packaging
There are currently three packaging options available for installing Flume: Tarball (.tar.gz) RPM packages Debian packages
For example,
$ cd /usr/local/lib
$ sudo tar -zxvf <path_to_flume-ng-1.3.0-cdh4.4.0.tar.gz>
$ sudo mv flume-ng-1.3.0-cdh4.4.0 flume-ng
2. To complete the configuration of a tarball installation, you must set your PATH variable to include the bin/ subdirectory of the directory where you installed Flume. For example:
$ export PATH=/usr/local/lib/flume-ng/bin:$PATH
The Flume RPM and Debian packages consist of three packages:
flume-ng: Everything you need to run Flume
flume-ng-agent: Handles starting and stopping the Flume agent as a service
flume-ng-doc: Flume documentation
All Flume installations require the common code provided by flume-ng.
Important: If you have not already done so, install Cloudera's yum, zypper/YaST or apt repository before using the following commands to install Flume. For instructions, see CDH4 Installation.
To install Flume on Ubuntu and other Debian systems:
$ sudo apt-get install flume-ng
You may also want to enable automatic start-up on boot. To do this, install the Flume agent. To install the Flume agent so Flume starts automatically on boot on Ubuntu and other Debian systems:
$ sudo apt-get install flume-ng-agent
To install the Flume agent so Flume starts automatically on boot on Red Hat-compatible systems:
$ sudo yum install flume-ng-agent
To install the Flume agent so Flume starts automatically on boot on SLES systems:
$ sudo zypper install flume-ng-agent
To install the documentation: To install the documentation on Ubuntu and other Debian systems:
$ sudo apt-get install flume-ng-doc
Flume Configuration
Flume 1.x provides a template configuration file for flume.conf called conf/flume-conf.properties.template and a template for flume-env.sh called conf/flume-env.sh.template. 1. Copy the Flume template property file conf/flume-conf.properties.template to conf/flume.conf, then edit it as appropriate.
$ sudo cp conf/flume-conf.properties.template conf/flume.conf
This is where you define your sources, sinks, and channels, and the flow within an agent. By default, the properties file is configured to work out of the box using a sequence generator source, a logger sink, and a memory channel. For information on configuring agent flows in Flume 1.x, as well as more details about the supported sources, sinks and channels, see the documents listed under Viewing the Flume Documentation.
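As an illustration only (the agent, source, channel, and sink names here are arbitrary), a minimal flume.conf of that out-of-the-box shape looks like this:
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1
agent1.sources.src1.type = seq
agent1.sources.src1.channels = ch1
agent1.channels.ch1.type = memory
agent1.sinks.sink1.type = logger
agent1.sinks.sink1.channel = ch1
You would then start the agent with the -n agent1 option so that these component definitions are picked up.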
2. Optionally, copy the template flume-env.sh file conf/flume-env.sh.template to conf/flume-env.sh.
$ sudo cp conf/flume-env.sh.template conf/flume-env.sh
The flume-ng executable looks for a file named flume-env.sh in the conf directory, and sources it if it finds it. Some use cases for using flume-env.sh are to specify a bigger heap size for the flume agent, or to specify debugging or profiling options via JAVA_OPTS when developing your own custom Flume NG components such as sources and sinks. If you do not make any changes to this file, then you need not perform the copy as it is effectively empty by default.
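For example, a minimal flume-env.sh might contain nothing more than a JAVA_OPTS setting such as the following; the heap sizes and debug port shown are illustrative values, not recommendations:

# Give the agent a larger heap (illustrative sizes)
export JAVA_OPTS="-Xms512m -Xmx1024m"
# Optionally append remote-debugging options when developing custom components
# export JAVA_OPTS="$JAVA_OPTS -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=8000"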
Verifying the Installation

At this point the flume-ng executable should be on your PATH. Running flume-ng help should print usage information that includes the following options:

agent options:
  --conf-file,-f <file>   specify a config file (required)
  --name,-n <name>        the name of this agent (required)
  --help,-h               display help text

avro-client options:
  --rpcProps,-P <file>    RPC client properties file with server connection params
  --host,-H <host>        hostname to which events will be sent (required)
  --port,-p <port>        port of the avro source (required)
  --dirname <dir>         directory to stream to avro source
  --filename,-F <file>    text file to stream to avro source [default: std input]
  --headerFile,-R <file>  headerFile containing headers as key/value pairs on each new line
  --help,-h               display help text

Either --rpcProps or both --host and --port must be specified.

Note that if the <conf> directory is specified, then it is always included first in the classpath.
Note: If Flume is not found and you installed Flume from a tarball, make sure that $FLUME_HOME/bin is in your $PATH.
Running Flume
If Flume is installed via an RPM or Debian package, you can use the following commands to start, stop, and restart the Flume agent via init scripts:
$ sudo service flume-ng-agent <start | stop | restart>
You can also run the agent in the foreground directly by using the flume-ng agent command:
$ /usr/bin/flume-ng agent -c <config-dir> -f <config-file> -n <agent-name>
For example:
$ /usr/bin/flume-ng agent -c /etc/flume-ng/conf -f /etc/flume-ng/conf/flume.conf -n agent
Files Installed by the Flume RPM and Debian Packages

Resource and location (with notes where applicable):
Config directory: /etc/flume-ng/conf
Config file: /etc/flume-ng/conf/flume-env.sh.template (if you want to modify this file, copy it first and modify the copy)
Log directory: /var/log/flume-ng
Flume libraries: /usr/lib/flume-ng
Flume agent init script: /etc/init.d/flume-ng-agent
Flume wrapper script: /usr/bin/flume-ng
Flume agent defaults file: /etc/default/flume-ng-agent (allows you to specify non-default values for the agent name and for the configuration file location)
Supported Sources, Sinks, and Channels

Sources
Each entry below gives the source type, its implementation class, and a description.

avro (AvroSource): Avro Netty RPC event source. Listens on an Avro port and receives events from external Avro client streams.
netcat (NetcatSource): Netcat-style TCP event source. Listens on a given port and turns each line of text into an event.
seq (SequenceGeneratorSource): Monotonically incrementing sequence generator event source, mainly useful for testing.
exec (ExecSource): Runs a Unix command and turns each line of the command's standard output into an event.
syslogtcp (SyslogTcpSource): Reads syslog data and generates Flume events. Creates a new event for each string of characters terminated by a newline (\n).
syslogudp (SyslogUDPSource): Reads syslog data and generates Flume events. Treats an entire message as a single event.
org.apache.flume.source.avroLegacy.AvroLegacySource (AvroLegacySource): Allows the Flume 1.x agent to receive events from Flume 0.9.4 agents over Avro RPC.
org.apache.flume.source.thriftLegacy.ThriftLegacySource (ThriftLegacySource): Allows the Flume 1.x agent to receive events from Flume 0.9.4 agents over Thrift RPC.
org.apache.flume.source.StressSource (StressSource): Mainly for testing purposes; not meant for production use. Serves as a continuous source of events where each event has the same payload.
org.apache.flume.source.scribe.ScribeSource (ScribeSource): Scribe event source. Listens on the Scribe port and receives events from Scribe. Note that as of CDH4.2, the Scribe source is at an experimental stage of development and should not be considered production-ready.
multiport_syslogtcp (MultiportSyslogTCPSource): Like syslogtcp, but can listen on multiple ports.
spooldir (SpoolDirectorySource): Used for ingesting data by placing files to be ingested into a "spooling" directory on disk.
http (HTTPSource): Accepts Flume events by HTTP POST and GET. GET should be used for experimentation only.
org.apache.flume.source.jms.JMSSource (JMSSource): Reads messages from a JMS destination such as a queue or topic.
org.apache.flume.agent.embedded.EmbeddedSource (EmbeddedSource): Used only by the Flume embedded agent. See the Flume Developer Guide for more details.
Other (custom): You need to specify the fully qualified name of the custom source, and provide that class (and its dependent code) in Flume's classpath. You can do this by creating a JAR file to hold the custom code, and placing the JAR in Flume's lib directory.
Sinks
Each entry below gives the sink type, its implementation class, and a description.

null (NullSink): Discards all events it receives.
logger (LoggerSink): Logs events at INFO level via the configured logging subsystem (log4j by default).
avro (AvroSink): Sink that invokes a pre-defined Avro protocol method for all events it receives (when paired with an avro source, forms tiered collection).
hdfs (HDFSEventSink): Writes all events received to HDFS (with support for rolling, bucketing, HDFS-200 append, and more).
file_roll (RollingFileSink): Writes events to files in a directory on the local filesystem, rolling to a new file periodically.
irc (IRCSink): Takes messages from the attached channel and relays them to configured IRC destinations.
org.apache.flume.sink.hbase.HBaseSink (HBaseSink): A simple sink that reads events from a channel and writes them synchronously to HBase. The AsyncHBaseSink is recommended instead.
org.apache.flume.sink.hbase.AsyncHBaseSink (AsyncHBaseSink): A simple sink that reads events from a channel and writes them asynchronously to HBase. This is the recommended HBase sink.
org.apache.flume.sink.solr.morphline.MorphlineSolrSink (MorphlineSolrSink): Extracts and transforms data from Flume events, and loads it into Apache Solr servers. See the section on MorphlineSolrSink in the Flume User Guide listed under Viewing the Flume Documentation on page 131.
Other (custom): You need to specify the fully qualified name of the custom sink, and provide that class (and its dependent code) in Flume's classpath. You can do this by creating a JAR file to hold the custom code, and placing the JAR in Flume's lib directory.
Channels
Each entry below gives the channel type, its implementation class, and a description.

memory (MemoryChannel): Events are stored in an in-memory queue. Fast, but events are lost if the agent process dies.
jdbc (JDBCChannel): Events are stored in a persistent, database-backed store (embedded Derby).
file (FileChannel): Events are stored in a write-ahead log on local disk. Durable across agent restarts; this is the channel type used with the on-disk encryption described below.
Other (custom): You need to specify the fully qualified name of the custom channel, and provide that class (and its dependent code) in Flume's classpath. You can do this by creating a JAR file to hold the custom code, and placing the JAR in Flume's lib directory.
Using an On-disk Encrypted File Channel
The command to generate a 128-bit key that uses a different password from that used by the key store is:
keytool -genseckey -alias key-0 -keypass keyPassword -keyalg AES \ -keysize 128 -validity 9000 -keystore test.keystore \ -storetype jceks -storepass keyStorePassword
The key store and password files can be stored anywhere on the file system; both files should have flume as the owner and 0600 permissions. Please note that -keysize controls the strength of the AES encryption key, in bits; 128, 192, and 256 are the allowed values.
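For example, assuming the key store created above and a password file at /path/to/my.keystore.password (both paths are illustrative), the ownership and permission requirements can be satisfied as follows:

$ echo -n "keyStorePassword" > /path/to/my.keystore.password
$ sudo chown flume:flume test.keystore /path/to/my.keystore.password
$ sudo chmod 0600 test.keystore /path/to/my.keystore.password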
Configuration
Flume on-disk encryption is enabled by setting parameters in the /etc/flume-ng/conf/flume.conf file. Basic Configuration The first example is a basic configuration with an alias called key-0 that uses the same password as the key store:
agent.channels.ch-0.type = file
agent.channels.ch-0.capacity = 10000
agent.channels.ch-0.encryption.cipherProvider = AESCTRNOPADDING
agent.channels.ch-0.encryption.activeKey = key-0
agent.channels.ch-0.encryption.keyProvider = JCEKSFILE
agent.channels.ch-0.encryption.keyProvider.keyStoreFile = /path/to/my.keystore
agent.channels.ch-0.encryption.keyProvider.keyStorePasswordFile = /path/to/my.keystore.password
agent.channels.ch-0.encryption.keyProvider.keys = key-0
In the next example, key-0 uses its own password which may be different from the key store password:
agent.channels.ch-0.type = file
agent.channels.ch-0.capacity = 10000
agent.channels.ch-0.encryption.cipherProvider = AESCTRNOPADDING
agent.channels.ch-0.encryption.activeKey = key-0
agent.channels.ch-0.encryption.keyProvider = JCEKSFILE
agent.channels.ch-0.encryption.keyProvider.keyStoreFile = /path/to/my.keystore
agent.channels.ch-0.encryption.keyProvider.keyStorePasswordFile = /path/to/my.keystore.password
agent.channels.ch-0.encryption.keyProvider.keys = key-0
agent.channels.ch-0.encryption.keyProvider.keys.key-0.passwordFile = /path/to/key-0.password
In the next example, a second key (key-1) is added and made the active key; key-0 and key-1 each have their own password file:
agent.channels.ch-0.type = file
agent.channels.ch-0.capacity = 10000
agent.channels.ch-0.encryption.cipherProvider = AESCTRNOPADDING
agent.channels.ch-0.encryption.activeKey = key-1
agent.channels.ch-0.encryption.keyProvider = JCEKSFILE
agent.channels.ch-0.encryption.keyProvider.keyStoreFile = /path/to/my.keystore
agent.channels.ch-0.encryption.keyProvider.keyStorePasswordFile = /path/to/my.keystore.password
agent.channels.ch-0.encryption.keyProvider.keys = key-0 key-1
agent.channels.ch-0.encryption.keyProvider.keys.key-0.passwordFile = /path/to/key-0.password
agent.channels.ch-0.encryption.keyProvider.keys.key-1.passwordFile = /path/to/key-1.password
Troubleshooting
If the unlimited strength JCE policy files are not installed, an error similar to the following is printed in the flume.log:
07 Sep 2012 23:22:42,232 ERROR [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.encryption.AESCTRNoPaddingProvider.getCipher:137)
Unable to load key using transformation: AES/CTR/NoPadding; Warning: Maximum allowed key length = 128 with the available JCE security policy files. Have you installed the JCE unlimited strength jurisdiction policy files?
java.security.InvalidKeyException: Illegal key size
at javax.crypto.Cipher.a(DashoA13*..)
at javax.crypto.Cipher.a(DashoA13*..)
at javax.crypto.Cipher.a(DashoA13*..)
at javax.crypto.Cipher.init(DashoA13*..)
at javax.crypto.Cipher.init(DashoA13*..)
at org.apache.flume.channel.file.encryption.AESCTRNoPaddingProvider.getCipher(AESCTRNoPaddingProvider.java:120)
at org.apache.flume.channel.file.encryption.AESCTRNoPaddingProvider.access$200(AESCTRNoPaddingProvider.java:35)
at org.apache.flume.channel.file.encryption.AESCTRNoPaddingProvider$AESCTRNoPaddingDecryptor.<init>(AESCTRNoPaddingProvider.java:94)
at org.apache.flume.channel.file.encryption.AESCTRNoPaddingProvider$AESCTRNoPaddingDecryptor.<init>(AESCTRNoPaddingProvider.java:91)
at org.apache.flume.channel.file.encryption.AESCTRNoPaddingProvider$DecryptorBuilder.build(AESCTRNoPaddingProvider.java:66)
at org.apache.flume.channel.file.encryption.AESCTRNoPaddingProvider$DecryptorBuilder.build(AESCTRNoPaddingProvider.java:62)
at org.apache.flume.channel.file.encryption.CipherProviderFactory.getDecrypter(CipherProviderFactory.java:47)
at org.apache.flume.channel.file.LogFileV3$SequentialReader.<init>(LogFileV3.java:257)
at org.apache.flume.channel.file.LogFileFactory.getSequentialReader(LogFileFactory.java:110)
at org.apache.flume.channel.file.ReplayHandler.replayLog(ReplayHandler.java:258)
at org.apache.flume.channel.file.Log.replay(Log.java:339)
at org.apache.flume.channel.file.FileChannel.start(FileChannel.java:260)
at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:236)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Sqoop Installation
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. You can use Sqoop to import data from external structured datastores into the Hadoop Distributed File System (HDFS) or related systems such as Hive and HBase. Conversely, you can use Sqoop to extract data from Hadoop and export it to external structured datastores such as relational databases and enterprise data warehouses.

Note: To see which version of Sqoop is shipping in CDH4, check the CDH Version and Packaging Information. For important information on new and changed components, see the CDH4 Release Notes.

See the following sections for information and instructions:
Upgrading from CDH3
Upgrading from an Earlier CDH4 Release
Packaging
Prerequisites
Installing Packages
Installing a Tarball
Installing the JDBC Drivers
Setting HADOOP_MAPRED_HOME for YARN
Apache Sqoop Documentation
Warning: If you are upgrading an Ubuntu or Debian system from CDH3u3 or earlier, you must use apt-get purge (rather than apt-get remove) to make sure the re-install succeeds, but be aware that apt-get purge removes all your configuration data. If you have modified any configuration files, DO NOT PROCEED before backing them up. To remove Sqoop on a SLES system:
$ sudo zypper remove sqoop
Sqoop Packaging
The packaging options for installing Sqoop are:
RPM packages
Debian packages
Tarball
Sqoop Prerequisites
An operating system supported by CDH4
Oracle JDK
If you have already configured CDH on your system, there is no further configuration necessary for Sqoop. You can start using Sqoop by using commands such as:
$ sqoop help
$ sqoop version
$ sqoop import
Important: Make sure you have read and understood How Packaging Affects CDH4 Deployment before you proceed with a tarball installation. To install Sqoop from the tarball, unpack the tarball in a convenient location. Once it is unpacked, add the bin directory to the shell path for easy access to Sqoop commands. Documentation for users and developers can be found in the docs directory. To install the Sqoop tarball on Linux-based systems: Run the following command:
$ (cd /usr/local/ && sudo tar -zxvf _<path_to_sqoop.tar.gz>_)
Note: When installing Sqoop from the tarball package, you must make sure that the environment variables JAVA_HOME and HADOOP_MAPRED_HOME are configured correctly. The variable HADOOP_MAPRED_HOME should point to the root directory of Hadoop installation. Optionally, if you intend to use any Hive or HBase related functionality, you must also make sure that they are installed and the variables HIVE_HOME and HBASE_HOME are configured correctly to point to the root directory of their respective installation.
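For example, on a package-based CDH4 installation the exports might look like the following; the JDK path is illustrative and depends on where your Oracle JDK is installed:

$ export JAVA_HOME=/usr/java/default
$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce      # YARN
# For MRv1, point HADOOP_MAPRED_HOME at /usr/lib/hadoop-0.20-mapreduce instead.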
Note: At the time of publication, the version was 5.1.25, but it may have changed by the time you read this. If you installed Sqoop via Cloudera Manager, using parcels, copy the .jar file to $HADOOP_CLASSPATH instead of /usr/lib/sqoop/lib/.
Installing the Oracle JDBC Driver
You can download the JDBC Driver from the Oracle website. You must accept the license agreement before you can download the driver. Download the ojdbc6.jar file and copy it to the /usr/lib/sqoop/lib/ directory:
$ sudo cp ojdbc6.jar /usr/lib/sqoop/lib/
Sqoop 2 Installation
The following sections describe how to install and configure Sqoop 2:
About Sqoop 2
Installing Sqoop 2
Configuring Sqoop 2
Starting, Stopping and Using the Server
Apache Documentation
About Sqoop 2
Sqoop 2 is a server-based tool designed to transfer data between Hadoop and relational databases. You can use Sqoop 2 to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data with Hadoop MapReduce, and then export it back into an RDBMS. To find out more about Sqoop 2, see Viewing the Sqoop 2 Documentation. To install Sqoop 2, follow the directions on this page.
Sqoop 2 Packaging
There are three packaging options for installing Sqoop 2:
Tarball (.tgz) that contains both the Sqoop 2 server and the client
Separate RPM packages for the Sqoop 2 server (sqoop2-server) and client (sqoop2-client)
Separate Debian packages for the Sqoop 2 server (sqoop2-server) and client (sqoop2-client)
Installing Sqoop 2
Sqoop 2 is distributed as two separate packages: a client package (sqoop2-client) and a server package (sqoop2-server). Install the server package on one node in the cluster; because the Sqoop 2 server acts as a MapReduce client, this node must have Hadoop installed and configured. Install the client package on each node that will act as a client. A Sqoop 2 client always connects to the Sqoop 2 server to perform any actions, so Hadoop does not need to be installed on the client nodes. Depending on what you are planning to install, choose the appropriate package and install it using your preferred package manager application.

Note: The Sqoop 2 packages cannot be installed on the same machines as Sqoop 1 packages. However, you can use both versions in the same Hadoop cluster by installing Sqoop 1 and Sqoop 2 on different nodes.

To install the Sqoop 2 server package on a Red Hat-compatible system:
Important: If you have not already done so, install Cloudera's yum, zypper/YaST or apt repository before using the following commands to install Sqoop 2. For instructions, see CDH4 Installation.
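For example, using the package names given above, you might run the following; substitute zypper install on SLES, or apt-get install on Ubuntu and Debian systems:

$ sudo yum install sqoop2-server      # on the Sqoop 2 server node
$ sudo yum install sqoop2-client      # on each client node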
Note: Installing the sqoop2-server package creates a sqoop-server service configured to start Sqoop 2 at system startup time. You are now ready to configure Sqoop 2. See the next section.
Configuring Sqoop 2
This section explains how to configure the Sqoop 2 server.
Example /etc/defaults/sqoop2-server content to work with MRv1:
CATALINA_BASE=/usr/lib/sqoop2/sqoop-server-0.20
Note:
/etc/defaults/sqoop2-server is loaded only once when the Sqoop 2 server starts. You must restart the Sqoop 2 server after making changes for them to take effect.
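For example, assuming the service name created by the server package (sqoop2-server):

$ sudo service sqoop2-server restart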
Note: At the time of publication, the version was 5.1.25, but it may have changed by the time you read this.

Installing the Oracle JDBC Driver
You can download the JDBC Driver from the Oracle website, for example http://www.oracle.com/technetwork/database/enterprise-edition/jdbc-112010-090769.html. You must accept the license agreement before you can download the driver. Download the ojdbc6.jar file and copy it to the /var/lib/sqoop2/ directory:
$ sudo cp ojdbc6.jar /var/lib/sqoop2/
Installing the Microsoft SQL Server JDBC Driver
Download the Microsoft SQL Server JDBC driver from http://www.microsoft.com/en-us/download/details.aspx?displaylang=en&id=11774 and copy it to the /var/lib/sqoop2/ directory. For example:
$ curl -L 'http://download.microsoft.com/download/0/2/A/02AAE597-3865-456C-AE7F-613F99F850A8/sqljdbc_4.0.2206.100_enu.tar.gz' | tar xz
$ sudo cp sqljdbc_4.0/enu/sqljdbc4.jar /var/lib/sqoop2/
Installing the PostgreSQL JDBC Driver
Download the PostgreSQL JDBC driver from http://jdbc.postgresql.org/download.html and copy it to the /var/lib/sqoop2/ directory. For example:
$ curl -L 'http://jdbc.postgresql.org/download/postgresql-9.2-1002.jdbc4.jar' -o postgresql-9.2-1002.jdbc4.jar
$ sudo cp postgresql-9.2-1002.jdbc4.jar /var/lib/sqoop2/
You should get a text fragment in JSON format similar to the following:
{"version":"1.99.1-cdh4.4.0",...}
Identify the host where your server is running (we will use localhost in this example):
sqoop:000> set server --host localhost
Test the connection by running the command show version --all to obtain the version number from server. You should see output similar to the following:
sqoop:000> show version --all
server version:
  Sqoop 1.99.1-cdh4.4.0 revision ...
  Compiled by jenkins on ...
client version:
  Sqoop 1.99.1-cdh4.4.0 revision ...
  Compiled by jenkins on ...
Protocol version:
  [1]
Hue Installation
Hue is a suite of applications that provide web-based access to CDH components and a platform for building custom applications. The following figure illustrates how Hue works. Hue Server is a "container" web application that sits in between your CDH installation and the browser. It hosts the Hue applications and communicates with various servers that interface with CDH components.
The Hue Server uses a database to manage session, authentication, and Hue application data. For example, the Job Designer application stores job designs in the database. Some Hue applications run Hue-specific daemon processes. For example, Beeswax runs a daemon (Beeswax Server) that keeps track of query states. Hue applications communicate with these daemons either by using Thrift or by exchanging state through the database. In a CDH cluster, the Hue Server runs on a special node. For optimal performance, this should be one of the nodes within your cluster, though it can be a remote node as long as there are no overly restrictive firewalls. For small clusters of less than 10 nodes, you can use your existing master node as the Hue Server. In a pseudo-distributed installation, the Hue Server runs on the same machine as the rest of your CDH services. Note: Install Cloudera Repository Before using the instructions on this page to install or upgrade, install the Cloudera yum, zypper/YaST or apt repository, and install or upgrade CDH4 and make sure it is functioning correctly. For instructions, see CDH4 Installation and the instructions for upgrading to CDH4 or upgrading from an earlier CDH4 release.
Note: Running Services When starting, stopping and restarting CDH components, always use the service(8) command rather than running scripts in /etc/init.d directly. This is important because service sets the current working directory to / and removes most environment variables (passing only LANG and TERM) so as to create a predictable environment in which to administer the service. If you run the scripts in /etc/init.d, any environment variables you have set remain in force, and could produce unpredictable results. (If you install CDH from packages, service will be installed as part of the Linux Standard Base (LSB).)

Follow the instructions in the following sections to upgrade, install, configure, and administer Hue:
Supported Browsers
Upgrading Hue
Installing Hue
Configuring CDH Components for Hue
Configuring Your Firewall for Hue
Configuring Hue
Starting and Stopping the Server
Administering Hue
Hue User Guide
Supported Browsers
The Hue UI is supported on the following browsers:
Windows: Chrome, Firefox 3.6+, Internet Explorer 8+, Safari 5+
Linux: Chrome, Firefox 3.6+
Mac: Chrome, Firefox 3.6+, Safari 5+
Upgrading Hue
Note: To see which version of Hue is shipping in CDH4, check the Version and Packaging Information. For important information on new and changed components, see the CDH4 Release Notes.
On SLES systems:
$ sudo zypper remove hue
Warning: If you are upgrading an Ubuntu or Debian system from CDH3u3 or earlier, you must use apt-get purge (rather than apt-get remove) to make sure the re-install succeeds, but be aware that apt-get purge removes all your configuration data. If you have modified any configuration files, DO NOT PROCEED before backing them up.
Step 3: Install Hue 2.x
Follow the instructions under Installing Hue.

Important: During uninstall, the package manager renames any configuration files you have modified from <file> to <file>.rpmsave. During re-install, the package manager creates a new <file> with applicable defaults. You are responsible for applying any changes captured in the original CDH3 configuration file to the new CDH4 configuration file. In the case of Ubuntu and Debian upgrades, a file will not be installed if there is already a version of that file on the system, and you will be prompted to resolve conflicts; for details, see Automatic handling of configuration files by dpkg.
Step 4: Start the Hue Server See Starting and Stopping the Hue Server.
Upgrading Hue from an Earlier CDH4 Release to the Latest CDH4 Release
You can upgrade Hue either as part of an overall upgrade to the latest CDH4 release (see Upgrading from an Earlier CDH4 Release) or independently. To upgrade Hue from an earlier CDH4 release to the latest CDH4 release, proceed as follows.

Step 1: Stop the Hue Server
See Starting and Stopping the Hue Server.

Warning: You must stop Hue. If Hue is running during the upgrade, the new version will not work correctly.
Step 2: Install the New Version of Hue
Follow the instructions under Installing Hue.

Important: During package upgrade, the package manager renames any configuration files you have modified from <file> to <file>.rpmsave, and creates a new <file> with applicable defaults. You are responsible for applying any changes captured in the original configuration file to the new configuration file. In the case of Ubuntu and Debian upgrades, you will be prompted if you have made changes to a file for which there is a new version; for details, see Automatic handling of configuration files by dpkg.
Step 3: Start the Hue Server See Starting and Stopping the Hue Server.
Installing Hue
This section describes Hue installation and configuration on a cluster. The steps in this section apply whether you are installing on a single machine in pseudo-distributed mode, or on a cluster.
On Red Hat-compatible systems: On the Hue Server machine, install the hue meta-package and the hue-server package:

$ sudo yum install hue hue-server

For MRv1: on the system that hosts the JobTracker, if different from the Hue Server machine, install the hue-plugins package:
$ sudo yum install hue-plugins
On SLES systems: On the Hue Server machine, install the hue meta-package and the hue-server package:
$ sudo zypper install hue hue-server
For MRv1: on the system that hosts the JobTracker, if different from the Hue server machine, install the hue-plugins package:
$ sudo zypper install hue-plugins
On Ubuntu or Debian systems: On the Hue Server machine, install the hue meta-package and the hue-server package:
$ sudo apt-get install hue hue-server
For MRv1: on the system that hosts the JobTracker, if different from the Hue server machine, install the hue-plugins package:
$ sudo apt-get install hue-plugins
Hue Dependencies
The following list shows the components that are dependencies for the different Hue applications, whether each component is required, and (where noted) the Hue applications that use it. See the installation guides for the required components that are not installed by default.

HDFS: required
MRv1: not required
YARN: not required
Oozie: required
Hive: required
Cloudera Impala: not required (used by the Cloudera Impala UI application)
HBase: not required
Pig: not required
Cloudera Search: not required (used by the Solr Search application)
Sqoop: not required (used by the Oozie application)
Sqoop 2: not required (used by the Sqoop and Shell applications)
The Beeswax Server writes into a local directory on the Hue machine that is specified by hadoop.tmp.dir to unpack its JARs. That directory needs to be writable by the hue user, which is the default user who starts Beeswax Server, or else Beeswax Server will not start. You may also make that directory world-writable. For more information, see hadoop.tmp.dir.
b. Restart your HDFS cluster.
2. Configure Hue as a proxy user for all other users and groups, meaning it may submit a request on behalf of any other user.
WebHDFS: Add to core-site.xml:
<!-- Hue WebHDFS proxy user setting -->
<property>
  <name>hadoop.proxyuser.hue.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hue.groups</name>
  <value>*</value>
</property>
If the configuration is not present, add it to /etc/hadoop-httpfs/conf/httpfs-site.xml and restart the HttpFS daemon.
3. Verify that core-site.xml has the following configuration:
<property>
  <name>hadoop.proxyuser.httpfs.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.httpfs.groups</name>
  <value>*</value>
</property>
If the configuration is not present, add it to /etc/hadoop/conf/core-site.xml and restart Hadoop.
4. With root privileges, update hadoop.hdfs_clusters.default.webhdfs_url in hue.ini to point to the address of either WebHDFS or HttpFS.
[hadoop]
  [[hdfs_clusters]]
    [[[default]]]
      # Use WebHdfs/HttpFs as the communication mechanism.
WebHDFS:
... webhdfs_url=http://FQDN:50070/webhdfs/v1/
HttpFS:
... webhdfs_url=http://FQDN:14000/webhdfs/v1/
Note: If the webhdfs_url is uncommented and explicitly set to the empty value, Hue falls back to using the Thrift plugin used in Hue 1.x. This is not recommended.
MRv1 Configuration
Hue communicates with the JobTracker via the Hue plugin, which is a .jar file that you place in your MapReduce lib directory. If your JobTracker and Hue Server are located on the same host, copy the file over. If you are currently using CDH3, your MapReduce library directory might be in /usr/lib/hadoop/lib.
$ cd /usr/share/hue
$ cp desktop/libs/hadoop/java-lib/hue-plugins-*.jar /usr/lib/hadoop-0.20-mapreduce/lib
If your JobTracker runs on a different host, scp the Hue plugins .jar file to the JobTracker host. Add the following properties to mapred-site.xml:
<property>
  <name>jobtracker.thrift.address</name>
  <value>0.0.0.0:9290</value>
</property>
<property>
  <name>mapred.jobtracker.plugins</name>
  <value>org.apache.hadoop.thriftfs.ThriftJobTrackerPlugin</value>
  <description>Comma-separated list of jobtracker plug-ins to be activated.</description>
</property>
You can confirm that the plugins are running correctly by tailing the daemon logs:
$ tail --lines=500 /var/log/hadoop-0.20-mapreduce/hadoop*jobtracker*.log | grep ThriftPlugin
2009-09-28 16:30:44,337 INFO org.apache.hadoop.thriftfs.ThriftPluginServer: Starting Thrift server
2009-09-28 16:30:44,419 INFO org.apache.hadoop.thriftfs.ThriftPluginServer: Thrift server listening on 0.0.0.0:9290
Note: If you enable ACLs in the JobTracker, you must add users to the JobTracker mapred.queue.default.acl-administer-jobs property in order to allow Hue to display jobs in the Job Browser application. For example, to give the hue user access to the JobTracker, you would add the following property:
<property>
  <name>mapred.queue.default.acl-administer-jobs</name>
  <value>hue</value>
</property>
Repeat this for every user that requires access to the job details displayed by the JobTracker. If you have any mapred queues besides "default", you must add a property for each queue:
<property>
  <name>mapred.queue.default.acl-administer-jobs</name>
  <value>hue</value>
</property>
<property>
  <name>mapred.queue.queue1.acl-administer-jobs</name>
  <value>hue</value>
</property>
<property>
  <name>mapred.queue.queue2.acl-administer-jobs</name>
  <value>hue</value>
</property>
Oozie Configuration
In order to run DistCp, Streaming, Pig, Sqoop, and Hive jobs in Job Designer or the Oozie Editor/Dashboard application, you must make sure the Oozie shared libraries are installed for the correct version of MapReduce (MRv1 or YARN). See Installing the Oozie ShareLib in Hadoop HDFS for instructions.
Hive Configuration
The Beeswax application helps you use Hive to query your data and depends on a Hive installation on your system. The Cloudera Impala application also depends on Hive. Note: When using Beeswax and Hive configured with the embedded metastore which is the default with Hue, the metastore DB should be owned by Hue (recommended) or writable to everybody:
sudo chown hue:hue -R /var/lib/hive/metastore/metastore_db
sudo chmod -R 777 /var/lib/hive/metastore/metastore_db
If not, Beeswax won't start, and the Hue Beeswax application will show 'Exception communicating with Hive Metastore Server at localhost:8003'.
Permissions
See File System Permissions in the Hive Installation section.

No Existing Hive Installation
Familiarize yourself with the configuration options in hive-site.xml. See Hive Installation. Having a hive-site.xml is optional but often useful, particularly for setting up a metastore. You can instruct Beeswax to locate it using the hive_conf_dir configuration variable.

Existing Hive Installation
In the Hue configuration file hue.ini, modify hive_conf_dir to point to the directory containing hive-site.xml.
If you set HADOOP_CLASSPATH in your hadoop-env.sh, make sure your additions preserve any existing value, for example:

Correct:

HADOOP_CLASSPATH=<your_additions>:$HADOOP_CLASSPATH

Incorrect:

HADOOP_CLASSPATH=<your_additions>
This enables certain components of Hue to add to Hadoop's classpath using the environment variable.

hadoop.tmp.dir
If your users are likely to be submitting jobs both using Hue and from the same machine via the command line interface, they will be doing so as the hue user when they are using Hue and via their own user account when they are using the command line. This leads to some contention on the directory specified by hadoop.tmp.dir, which defaults to /tmp/hadoop-${user.name}. Specifically, hadoop.tmp.dir is used to unpack JARs in /usr/lib/hadoop. One workaround is to set hadoop.tmp.dir to /tmp/hadoop-${user.name}-${hue.suffix} in the core-site.xml file:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-${user.name}-${hue.suffix}</value>
</property>
Unfortunately, when the hue.suffix variable is unset, you'll end up with directories named /tmp/hadoop-user.name-${hue.suffix} in /tmp. Despite that, Hue will still work. Important: The Beeswax Server writes into a local directory on the Hue machine that is specified by hadoop.tmp.dir to unpack its jars. That directory needs to be writable by the hue user, which is the default user who starts Beeswax Server, or else Beeswax Server will not start. You may also make that directory world-writable.
Hue Configuration
This section describes configuration you perform in the Hue configuration file hue.ini. The location of the Hue configuration file varies depending on how Hue is installed. The location of the configuration file is displayed when you view the Hue configuration. Note: Only the root user can edit the Hue configuration file.
Hue defaults to using the Spawning web server, which is necessary for the Shell application. To revert to the CherryPy web server, use the following setting in the Hue configuration file:
use_cherrypy_server=true
Setting this to false causes Hue to use the Spawning web server.

Specifying the Secret Key
For security, you should specify the secret key that is used for secure hashing in the session store:
1. Open the Hue configuration file.
2. In the [desktop] section, set the secret_key property to a long series of random characters (30 to 60 characters is recommended). For example,
secret_key=qpbdxoewsqlkhztybvfidtvwekftusgdlofbcfghaswuicmqp
Note: If you don't specify a secret key, your session cookies will not be secure. Hue will run but it will also display error messages telling you to set the secret key.
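Any random-string generator will do; for example, the following command produces a 60-character value that is suitable for secret_key:

$ openssl rand -base64 45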
Authentication
By default, the first user who logs in to Hue can choose any username and password and automatically becomes an administrator. This user can create other user and administrator accounts. Hue users should correspond to the Linux users who will use Hue; make sure you use the same name as the Linux username. By default, user information is stored in the Hue database. However, the authentication system is pluggable. You can configure authentication to use an LDAP directory (Active Directory or OpenLDAP) to perform the authentication, or you can import users and groups from an LDAP directory. See Configuring an LDAP Server for User Admin on page 162. For more information, see Hue SDK.

Configuring the Hue Server for SSL
You can optionally configure Hue to serve over HTTPS. To do so, you must install pyOpenSSL within Hue's context and configure your keys. To install pyOpenSSL, be sure that gcc, python-devel, and either libssl-dev (Debian and Ubuntu) or openssl-devel (RHEL, CentOS, and SLES) are installed, then perform the following steps from the root of your Hue installation path:
1. Do one of the following depending on whether your Hue node is connected to the Internet: Connected to Internet Run
$ sudo -H -u hue ./build/env/bin/easy_install pyOpenSSL
Not Connected to Internet Download https://launchpad.net/pyopenssl/main/0.11/+download/pyOpenSSL-0.11.tar.gz. Then move the tarball to your Hue node and run
$ sudo -H -u hue ./build/env/bin/easy_install pyOpenSSL-0.11.tar.gz
2. Configure Hue to use your private key by adding the following options to the Hue configuration file:
ssl_certificate=/path/to/certificate ssl_private_key=/path/to/key
3. On a production system, you should have an appropriate key signed by a well-known Certificate Authority. If you're just testing, you can create a self-signed key using the openssl command that may be installed on your system:
# Create a key
$ openssl genrsa 1024 > host.key
# Create a self-signed certificate
$ openssl req -new -x509 -nodes -sha1 -key host.key > host.cert
Note: Uploading files using the Hue File Browser over HTTPS requires using a proper SSL Certificate. Self-signed certificates don't work.
Beeswax Configuration
In the [beeswax] section of the configuration file, you can optionally specify the following:
beeswax_server_host
The hostname or IP address of the Hive server. Default: localhost, and therefore only serves local IPC clients.
beeswax_server_port
The port of the Hive server. If server_interface is set to hiveserver2, this should be set to the port that HiveServer2 is running on, which defaults to 10000. Default: 8002.
hive_home_dir
hive_conf_dir
beeswax_server_heapsize
server_interface
The type of Hive server that the application uses: beeswax or hiveserver2. Default: beeswax.
By default, Beeswax allows any user to see the saved queries of all other Beeswax users. You can restrict this by changing the following property:
share_saved_queries
Set to false to restrict viewing of saved queries to the owner of the query or an administrator.
Cloudera Impala Query UI Configuration

In the [impala] section of the configuration file, you can optionally specify the following:

server_port
The port of the Impalad Server. Default: When using the beeswax interface, 21000. When using the HiveServer2 interface, 21050.
server_interface
The type of interface to use to communicate with the Impalad Server: beeswax or hiveserver2. Default: hiveserver2.
Sqoop Configuration
In the [sqoop] section of the configuration file, you can optionally specify the following:
server_url
Indicate that jobs should be shared with all users. If set to false, they will be visible only to the owner and administrators.
Job Designer
In the [jobsub] section of the configuration file, you can optionally specify the following:
remote_data_dir
Location in HDFS where the Job Designer examples and templates are stored.
Indicate that workflows, coordinators, and bundles should be shared with all users. If set to false, they will be visible only to the owner and administrators.
oozie_jobs_count
Maximum number of Oozie workflows or coordinators or bundles to retrieve in one API call.
remote_data_dir
Search Configuration
In the [search] section of the configuration file, you can optionally specify the following:
security_enabled
empty_query
solr_url
Hue Shell Configuration Properties To add or remove shells, modify the Hue Shell configuration in the [shell] section of the Hue configuration file:
shell_buffer_amount
Optional. Amount of output to buffer for each shell in bytes. Defaults to 524288 (512 KiB) if not specified.
shell_timeout
Optional. Amount of time to keep shell subprocesses open when no open browsers refer to them. Defaults to 600 seconds if not specified.
shell_write_buffer_limit
Optional. Amount of pending commands to buffer for each shell, in bytes. Defaults to 10000 (10 KB) if not specified.
shell_os_read_amount
Optional. Number of bytes to specify to the read system call when reading subprocess output. Defaults to 40960 (usually 10 pages, since pages are usually 4096 bytes) if not specified.
shell_delegation_token_dir
Optional. If this instance of Hue is running with a Hadoop cluster with Kerberos security enabled, it must acquire the appropriate delegation tokens to execute subprocesses securely. The value under this key specifies the directory in which these delegation tokens are to be stored. Defaults to /tmp/hue_shell_delegation_tokens if not specified.
[[ shelltypes ]]
This section title is a key name that also begins the configuration parameters for a specific shell type ("pig" in this example). You can use any name, but it must be unique for each shell specified in the configuration file. Each key name denotes the beginning of a shell configuration section; each section can contain the parameters described below: nice_name, command, help, and environment (which in turn contains the name and value of each environment variable to set).
command
The command to run to start the specified shell. The path to the binary must be an absolute path.

environment
Optional. A section to specify environment variables to be set for subprocesses of this shell type. Each environment variable is itself another sub-section, as described below.

The name of each environment variable sub-section is the name of the environment variable to set; for example, Pig requires JAVA_HOME to be set. The value key within that sub-section holds the variable's value:

value = /usr/lib/jvm/java-6-sun
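Putting these parameters together, a shell entry might look like the following sketch; the section names, nice_name, command path, and JAVA_HOME value shown here are illustrative only:

[[ shelltypes ]]
  [[[ pig ]]]
  nice_name = "Pig Shell (Grunt)"
  command = "/usr/bin/pig -l /dev/null"
  help = "The command-line interpreter for Pig"
    [[[[ environment ]]]]
      [[[[[ JAVA_HOME ]]]]]
      value = /usr/lib/jvm/java-6-sun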
Restrictions
While almost any process that exports a command-line interface can be included in the Shell application, processes that redraw the window, such as vim or top, cannot be exposed in this way.

Unix User Accounts
To properly isolate subprocesses so as to guarantee security, each Hue user who is using the Shell subprocess must have a Unix user account. The link between Hue users and Unix user accounts is the username, and so every Hue user who wants to use the Shell application must have a Unix user account with the same name on the server that runs Hue. Also, there is a binary called setuid which provides a binary wrapper, allowing the subprocess to be run as the appropriate user. In order to work properly for all users of Hue, this binary must be owned by root and must have the setuid bit set. To make sure that these two requirements are satisfied, navigate to the directory with the setuid binary (apps/shell/src/shell/build) and execute one of the following command sequences in a terminal:

Using sudo:
$ sudo chown root:hue setuid
$ sudo chmod 4750 setuid
$ exit

As root:
$ su
# chown root:hue setuid
# chmod 4750 setuid
# exit
Important: If you are running Hue Shell against a secure cluster, see Running Hue Shell against a Secure Cluster for security configuration information for Hue Shell.

Hue Server Configuration
Older versions of Hue shipped with the CherryPy web server as the default Hue server. This is no longer the case starting with CDH3 Update 1. In order to configure the default Hue server, you must modify the Hue configuration file and modify the value for use_cherrypy_server. This value must either be set to false or not specified in order for the Shell application to work.
HBase Configuration
In the [hbase] section of the configuration file, you can optionally specify the following:
truncate_limit
Hard limit of rows or columns per row fetched before truncating. Default: 500
hbase_clusters
Comma-separated list of HBase Thrift servers for clusters in the format of "(name|host:port)". Default: (Cluster|localhost:9090)
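For example, the corresponding hue.ini entries might look like the following; the cluster nickname and Thrift server address are placeholders for your own values:

[hbase]
  # (name|host:port) pairs, comma-separated
  hbase_clusters=(Cluster|localhost:9090)
  truncate_limit=500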
The name of the group to which a manually created user is automatically assigned. Default: default.
base_dn
nt_domain
nt_domain=mycompany.com
ldap_url
ldap_url=ldap://auth.mycompany.com
ldap_cert
bind_dn
Distinguished name of the user to bind as; this is not necessary if the LDAP server supports anonymous searches. Example: bind_dn="CN=ServiceAccount,DC=mycompany,DC=com"
2. Configure the following properties in the [[[users]]] section:
user_filter="objectclass=*"
3. Configure the following properties in the [[[groups]]] section:
group_filter="objectclass=*"
Note: If you provide a TLS certificate, it must be signed by a Certificate Authority that is trusted by the LDAP server.

Enabling the LDAP Server for User Authentication
You can configure User Admin to use an LDAP server as the authentication back end, which means users logging in to Hue will authenticate to the LDAP server, rather than against usernames and passwords managed by User Admin.
Important: Be aware that when you enable the LDAP back end for user authentication, user authentication by User Admin will be disabled. This means there will be no superuser accounts to log into Hue unless you take one of the following actions:
Import one or more superuser accounts from Active Directory and assign them superuser permission.
If you have already enabled the LDAP authentication back end, log into Hue using the LDAP back end, which will create an LDAP user. Then disable the LDAP authentication back end and use User Admin to give the superuser permission to the new LDAP user. After assigning the superuser permission, enable the LDAP authentication back end.

To enable the LDAP server for user authentication:
1. In the Hue configuration file, configure the following properties in the [[ldap]] section:
ldap_url
nt_domain
The NT domain over which the user connects (not strictly necessary if using ldap_username_pattern). Example: nt_domain=mycompany.com
ldap_username_pattern
Pattern for searching for usernames. Use <username> for the username parameter. For use when using LdapBackend for Hue authentication.
2. If you are using TLS or secure ports, add the following property to specify the path to a TLS certificate file:

ldap_cert
Path to certificate for authentication over TLS. Example: ldap_cert=/mycertsdir/myTLScert
Note: If you provide a TLS certificate, it must be signed by a Certificate Authority that is trusted by the LDAP server.
Then, in the [[auth]] sub-section of the Hue configuration file, change the backend setting to
backend=desktop.auth.backend.LdapBackend
Hadoop Configuration
The following configuration variables are under the [hadoop] section in the Hue configuration file. HDFS Cluster Configuration Hue currently supports only one HDFS cluster, which you define under the [[hdfs_clusters]] sub-section. The following properties are supported:
[[[default]]]
fs_defaultfs
webhdfs_url
The HttpFS URL. The default value is the HTTP port on the NameNode.
hadoop_hdfs_home
The home of your Hadoop HDFS installation. It is the root of the Hadoop untarred directory, or usually /usr/lib/hadoop-hdfs.
hadoop_bin
hadoop_conf_dir
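A minimal sketch of this sub-section, assuming a NameNode host named namenode-host and the default WebHDFS port shown earlier in this chapter:

[hadoop]
  [[hdfs_clusters]]
    [[[default]]]
      fs_defaultfs=hdfs://namenode-host:8020
      webhdfs_url=http://namenode-host:50070/webhdfs/v1/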
MapReduce (MRv1) and YARN (MRv2) Cluster Configuration Job Browser can display both MRv1 and MRv2 jobs, but must be configured to display one type at a time by specifying either [[mapred_clusters]] or [[yarn_clusters]] sections in the Hue configuration file. The following MapReduce cluster properties are defined under the [[mapred_clusters]] sub-section:
[[[default]]]
jobtracker_host
jobtracker_port
submit_to
If your Oozie is configured to use a 0.20 MapReduce service, then set this to true. Indicates that Hue should submit jobs to this MapReduce cluster.
hadoop_mapred_home
The home directory of the Hadoop MapReduce installation. For CDH packages, this is the root of the Hadoop MRv1 untarred directory, /usr/lib/hadoop-0.20-mapreduce (for MRv1). If submit_to is true, this becomes the $HADOOP_MAPRED_HOME for the Beeswax Server and child shell processes.
hadoop_bin
hadoop_conf_dir
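A minimal sketch of this sub-section, assuming a JobTracker host named jobtracker-host and the commonly used CDH JobTracker port:

[[mapred_clusters]]
  [[[default]]]
    jobtracker_host=jobtracker-host
    jobtracker_port=8021
    submit_to=true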
The following YARN cluster properties are defined under the [[yarn_clusters]] sub-section:
[[[default]]]
resourcemanager_host
resourcemanager_port
submit_to
If your Oozie is configured to use a YARN cluster, then set this to true. Indicate that Hue should submit jobs to this YARN cluster.
hadoop_mapred_home
The home of the Hadoop MapReduce installation. For CDH packages, the root of the Hadoop 2.0 untarred directory, /usr/lib/hadoop-mapreduce (for YARN). If submit_to is true, the $HADOOP_MAPRED_HOME for the Beeswax Server and child shell processes.
hadoop_bin
hadoop_conf_dir
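A minimal sketch of this sub-section, assuming a ResourceManager host named resourcemanager-host and the default ResourceManager port:

[[yarn_clusters]]
  [[[default]]]
    resourcemanager_host=resourcemanager-host
    resourcemanager_port=8032
    submit_to=true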
Liboozie Configuration
In the [liboozie] section of the configuration file, you can optionally specify the following:
security_enabled
remote_deployement_dir
The location in HDFS where the workflows and coordinators are deployed when submitted by a non-owner.
oozie_url
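For example, assuming the Oozie server runs on a host named oozie-host on its default port:

[liboozie]
  oozie_url=http://oozie-host:11000/oozie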
Administering Hue
The following sections contain details about managing and operating a Hue installation.
Processes
A script called supervisor manages all Hue processes. The supervisor is a watchdog process; its only purpose is to spawn and monitor other processes. A standard Hue installation starts and monitors the following processes:
runcpserver: a web server that provides the core web functionality of Hue
beeswax server: a daemon that manages concurrent Hive queries
If you have installed other applications into your Hue instance, you may see other daemons running under the supervisor as well. You can see the supervised processes running in the output of ps -f -u hue. Note that the supervisor automatically restarts these processes if they fail for any reason. If the processes fail repeatedly within a short time, the supervisor itself shuts down.
Logs
You can view the Hue logs in the /var/log/hue directory, where you can find:
An access.log file, which contains a log for all requests against the Hue Server.
A supervisor.log file, which contains log information for the supervisor process.
A supervisor.out file, which contains the stdout and stderr for the supervisor process.
A .log file for each supervised process described above, which contains the logs for that process. A .out file for each supervised process described above, which contains the stdout and stderr for that process. If users on your cluster have problems running Hue, you can often find error messages in these log files. Viewing Recent Log Messages In addition to logging INFO level messages to the logs directory, the Hue Server keeps a small buffer of log messages at all levels in memory. The DEBUG level messages can sometimes be helpful in troubleshooting issues. In the Hue UI you can view these messages by selecting the Server Logs tab in the About application. You can also view these logs by visiting http://myserver:port/logs.
Hue Database
The Hue server requires an SQL database to store small amounts of data, including user account information as well as history of job submissions and Hive queries. The Hue server supports a lightweight embedded database and several types of external databases. If you elect to configure Hue to use an external database, upgrades may require more manual steps.
Embedded Database
By default, Hue is configured to use the embedded database SQLite for this purpose, and should require no configuration or management by the administrator. Inspecting the Embedded Hue Database The default SQLite database used by Hue is located in /usr/share/hue/desktop/desktop.db. You can inspect this database from the command line using the sqlite3 program. For example:
# sqlite3 /usr/share/hue/desktop/desktop.db
SQLite version 3.6.22
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> select username from auth_user;
admin
test
sample
sqlite>
Important: It is strongly recommended that you avoid making any modifications to the database directly using sqlite3, though sqlite3 is useful for management or troubleshooting. Backing up the Embedded Hue Database If you use the default embedded SQLite database, copy the desktop.db file to another node for backup. It is recommended that you back it up on a regular schedule, and also that you back it up before any upgrade to a new version of Hue.
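For example, with the Hue server stopped so the copy is consistent (the destination path is illustrative):

$ cp /usr/share/hue/desktop/desktop.db /var/backups/desktop.db.$(date +%Y%m%d)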
External Database
Although SQLite is the default database, some advanced users may prefer to have Hue access an external database. Hue supports MySQL, PostgreSQL, and Oracle. See Supported Databases for the supported versions.
Note: In the instructions that follow, dumping the database and editing the JSON objects is only necessary if you have data in SQLite that you need to migrate. If you don't need to migrate data from SQLite, you can skip those steps.

Configuring the Hue Server to Store Data in MySQL
1. Shut down the Hue server if it is running.
2. Dump the existing database data to a text file. Note that using the .json extension is required.
$ sudo -u hue <HUE_HOME>/build/env/bin/hue dumpdata > <some-temporary-file>.json
3. Open <some-temporary-file>.json and remove all JSON objects with useradmin.userprofile in the model field.
4. Start the Hue server.
5. Install the MySQL client developer package. On RHEL systems:
$ sudo yum install mysql-devel
8. Change the /etc/my.cnf file as follows:
[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
bind-address=<ip-address>
default-storage-engine=MyISAM
10. Configure MySQL to use a strong password. In the following procedure, your current root password is blank. Press the Enter key when you're prompted for the root password.
$ sudo /usr/bin/mysql_secure_installation
[...]
Enter current password for root (enter for none):
OK, successfully used password, moving on...
[...]
Set root password? [Y/n] y
New password:
Re-enter new password:
Remove anonymous users? [Y/n] Y
[...]
Disallow root login remotely? [Y/n] N
[...]
Remove test database and access to it [Y/n] Y
[...]
Reload privilege tables now? [Y/n] Y
All done!
12. Create the Hue database and grant privileges to a hue user to manage the database.
mysql> create database hue;
Query OK, 1 row affected (0.01 sec)
mysql> grant all on hue.* to 'hue'@'localhost' identified by '<secretpassword>';
Query OK, 0 rows affected (0.00 sec)
13. Open the Hue configuration file in a text editor.
14. Directly below the [database] section under the [desktop] line, add the following options (and modify accordingly for your MySQL setup):
host=localhost
port=3306
engine=mysql
user=hue
password=<secretpassword>
name=hue
15. As the hue user, load the existing data and create the necessary database tables.
$ sudo -u hue <HUE_HOME>/build/env/bin/hue syncdb --noinput
$ mysql -uhue -p<secretpassword>
mysql > SHOW CREATE TABLE auth_permission;
Configuring the Hue Server to Store Data in PostgreSQL
1. Shut down the Hue server if it is running.
2. Dump the existing database data to a text file. Note that using the .json extension is required.
$ sudo -u hue <HUE_HOME>/build/env/bin/hue dumpdata > <some-temporary-file>.json
3. Open <some-temporary-file>.json and remove all JSON objects with useradmin.userprofile in the model field.
4. Install the required packages. On RHEL systems:
$ sudo yum install postgresql-devel gcc python-devel
5. Install the module that provides the connector to PostgreSQL.
sudo -u hue <HUE_HOME>/build/env/bin/pip install setuptools
sudo -u hue <HUE_HOME>/build/env/bin/pip install psycopg2
8. Configure client authentication.
a. Edit /var/lib/pgsql/data/pg_hba.conf.
b. Set the authentication methods for local to trust and for host to password, and add the following line at the end.
host hue hue 0.0.0.0/0 md5
10. Configure PostgreSQL to listen on all network interfaces. Edit /var/lib/pgsql/data/postgresql.conf and set listen_addresses:
listen_addresses = 0.0.0.0 # Listen on all addresses
11. Create the hue database and grant privileges to a hue user to manage the database.
# psql -U postgres
postgres=# create database hue;
postgres=# \c hue;
You are now connected to database 'hue'.
postgres=# create user hue with password '<secretpassword>';
postgres=# grant all privileges on database hue to hue;
postgres=# \q
13. Verify connectivity.
psql -h localhost -U hue -d hue
Password for user hue: <secretpassword>
15. Open the Hue configuration file in a text editor.
16. Directly below the [database] section under the [desktop] line, add the following options (and modify accordingly for your PostgreSQL setup).
host=localhost
port=5432
engine=postgresql_psycopg2
user=hue
password=<secretpassword>
name=hue
17. As the hue user, configure Hue to load the existing data and create the necessary database tables.
$ sudo -u hue <HUE_HOME>/build/env/bin/hue syncdb --noinput
19. Drop the foreign key that you retrieved in the previous step.
postgres=# ALTER TABLE auth_permission DROP CONSTRAINT content_type_id_refs_id_<XXXXXX>;
Configuring the Hue Server to Store Data in Oracle
1. Ensure Python 2.6 or newer is installed on the server Hue is running on.
2. Download the Oracle client libraries at Instant Client for Linux x86-64 Version 11.1.0.7.0, Basic and SDK (with headers) zip files to the same directory.
3. Unzip the zip files.
4. Set environment variables to reference the libraries.
$ export ORACLE_HOME=<download directory>
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$ORACLE_HOME
9. Open the Hue configuration file in a text editor. 10. Directly below the [database] section under the [desktop] line, add the following options (and modify accordingly for your Oracle setup):
host=localhost port=1521 engine=oracle user=hue password=<secretpassword> name=XE
11. As the hue user, configure Hue to load the existing data and create the necessary database tables.
$ sudo -u hue <HUE_HOME>/build/env/bin/hue syncdb --noinput
15. Load the data.
$ sudo -u hue <HUE_HOME>/build/env/bin/hue loaddata <some-temporary-file>.json
Pig Installation
Apache Pig enables you to analyze large amounts of data using Pig's query language called Pig Latin. Pig Latin queries run in a distributed way on a Hadoop cluster. Important: If you have not already done so, install Cloudera's yum, zypper/YaST or apt repository before using the following commands to install or upgrade Pig. For instructions, see CDH4 Installation. Use the following sections to install or upgrade Pig: Upgrading Pig Installing Pig Using Pig with HBase Installing DataFu Apache Pig Documentation
Upgrading Pig
Note: To see which version of Pig is shipping in CDH4, check the Version and Packaging Information. For important information on new and changed components, see the Release Notes.
Step 1: Remove Pig 1. Exit the Grunt shell and make sure no Pig scripts are running. 2. Remove the CDH3 version of Pig. To remove Pig on Red Hat-compatible systems:
$ sudo yum remove hadoop-pig
To remove Pig on Ubuntu and other Debian systems:
$ sudo apt-get purge hadoop-pig
Warning: If you are upgrading an Ubuntu or Debian system from CDH3u3 or earlier, you must use apt-get purge (rather than apt-get remove) to make sure the re-install succeeds, but be aware that apt-get purge removes all your configuration data. If you have modified any configuration files, DO NOT PROCEED before backing them up.
Step 2: Install the new version Follow the instructions in the next section, Installing Pig. Important: During uninstall, the package manager renames any configuration files you have modified from <file> to <file>.rpmsave. During re-install, the package manager creates a new <file> with applicable defaults. You are responsible for applying any changes captured in the original CDH3 configuration file to the new CDH4 configuration file. In the case of Ubuntu and Debian upgrades, a file will not be installed if there is already a version of that file on the system, and you will be prompted to resolve conflicts; for details, see Automatic handling of configuration files by dpkg.
Incompatible Changes as of the Pig 0.7.0 Release Pig 0.7.0 contained several changes that are not backward-compatible with versions prior to 0.7.0; if you have scripts from a version of Pig prior to 0.7.0, you may need to modify your user-defined functions (UDFs) so that they work with the current Pig release. In particular, the Load and Store functions were changed. For information about updating your UDFs, see LoadStoreMigrationGuide and Pig070LoadStoreHowTo. For a list of all backward-incompatible changes, see this page.
Installing Pig
To install Pig on Red Hat-compatible systems:
$ sudo yum install pig
Note: Pig automatically uses the active Hadoop configuration (whether standalone, pseudo-distributed mode, or distributed). After installing the Pig package, you can start the grunt shell. To start the Grunt Shell (MRv1): Use commands similar to the following, replacing the <component_version> strings with the current HBase, ZooKeeper and CDH versions.
$ export PIG_CONF_DIR=/usr/lib/pig/conf
$ export PIG_CLASSPATH=/usr/lib/hbase/hbase-<HBase_version>-cdh<CDH_version>-security.jar:/usr/lib/zookeeper/zookeeper-<ZooKeeper_version>-cdh<CDH_version>.jar
$ pig
2012-02-08 23:39:41,819 [main] INFO org.apache.pig.Main - Logging error messages to: /home/arvind/pig-0.9.2-cdh4b1/bin/pig_1328773181817.log
2012-02-08 23:39:41,994 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost/
...
grunt>
Here's an example.
$ export PIG_CONF_DIR=/usr/lib/pig/conf $ export PIG_CLASSPATH=/usr/lib/hbase/hbase-0.94.6-cdh4.4.0-security.jar:/usr/lib/zookeeper/zookeeper-3.4.5-cdh4.4.0.jar $ pig 2012-02-08 23:39:41,819 [main] INFO org.apache.pig.Main - Logging error messages to: /home/arvind/pig-0.9.2-cdh4b1/bin/pig_1328773181817.log 2012-02-08 23:39:41,994 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost/ ... grunt>
To start the Grunt Shell (YARN): Important: For each user who will be submitting MapReduce jobs using MapReduce v2 (YARN), or running Pig, Hive, or Sqoop in a YARN installation, set the HADOOP_MAPRED_HOME environment variable as follows:
$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
Use the following commands, replacing the <component_version> string with the current HBase, ZooKeeper and CDH version numbers.
$ export PIG_CONF_DIR=/usr/lib/pig/conf
$ export PIG_CLASSPATH=/usr/lib/hbase/hbase-<HBase_version>-cdh<CDH_version>-security.jar:/usr/lib/zookeeper/zookeeper-<ZooKeeper_version>-cdh<CDH_version>.jar
$ pig
...
grunt>
For example,
$ export PIG_CONF_DIR=/usr/lib/pig/conf $ export PIG_CLASSPATH=/usr/lib/hbase/hbase-0.94.6-cdh4.4.0-security.jar:/usr/lib/zookeeper/zookeeper-3.4.5-cdh4.4.0.jar $ pig ... grunt>
To verify that the input and output directories from the example grep job exist (see Installing CDH4 on a Single Linux Node in Pseudo-distributed Mode), list an HDFS directory from the Grunt Shell:
grunt> ls
hdfs://localhost/user/joe/input <dir>
hdfs://localhost/user/joe/output <dir>
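As a quick end-to-end check, you can run a trivial Pig Latin query against that input directory. This is only a sketch; the path and the filter pattern follow the pseudo-distributed grep example above, so adjust them to your own data:

grunt> A = LOAD 'hdfs://localhost/user/joe/input' USING TextLoader() AS (line:chararray);
grunt> B = FILTER A BY line MATCHES '.*dfs.*';
grunt> DUMP B;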
To check the status of your job while it is running, look at the JobTracker web console http://localhost:50030/.
For example,
register /usr/lib/zookeeper/zookeeper-3.4.5-cdh4.4.0.jar register /usr/lib/hbase/hbase-0.94.6-cdh4.4.0-security.jar
Installing DataFu
DataFu is a collection of Apache Pig UDFs (User-Defined Functions) for statistical evaluation that were developed by LinkedIn and have now been open sourced under an Apache 2.0 license. To use DataFu:
1. Install the DataFu package using the install command appropriate to your operating system (Red-Hat-compatible, SLES, or Debian/Ubuntu); an example set of commands is sketched below.
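The commands differ only in the package manager used. As a sketch, assuming the CDH4 package is named pig-udf-datafu (verify the exact package name in your repository listing):

$ sudo yum install pig-udf-datafu        # Red-Hat-compatible systems
$ sudo zypper install pig-udf-datafu     # SLES systems
$ sudo apt-get install pig-udf-datafu    # Debian or Ubuntu systems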
This puts the datafu-0.0.4-cdh4.4.0.jar file in /usr/lib/pig. 2. Register the JAR. Replace the <component_version> string with the current DataFu and CDH version numbers.
REGISTER /usr/lib/pig/datafu-<DataFu_version>-cdh<CDH_version>.jar
For example,
REGISTER /usr/lib/pig/datafu-0.0.4-cdh4.4.0.jar
Oozie Installation
About Oozie Packaging Prerequisites Upgrading Oozie Installing Oozie Configuring Oozie Starting, Stopping, and Using the Server Configuring Failover Apache Oozie Documentation
About Oozie
Apache Oozie Workflow Scheduler for Hadoop is a workflow and coordination service for managing Apache Hadoop jobs: Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions; actions are typically Hadoop jobs (MapReduce, Streaming, Pipes, Pig, Hive, Sqoop, etc). Oozie Coordinator jobs trigger recurrent Workflow jobs based on time (frequency) and data availability. Oozie Bundle jobs are sets of Coordinator jobs managed as a single job. Oozie is an extensible, scalable and data-aware service that you can use to orchestrate dependencies among jobs running on Hadoop. To find out more about Oozie, see http://archive.cloudera.com/cdh4/cdh/4/oozie/. To install or upgrade Oozie, follow the directions on this page. Note: Running Services When starting, stopping and restarting CDH components, always use the service (8) command rather than running scripts in /etc/init.d directly. This is important because service sets the current working directory to / and removes most environment variables (passing only LANG and TERM) so as to create a predictable environment in which to administer the service. If you run the scripts in/etc/init.d, any environment variables you have set remain in force, and could produce unpredictable results. (If you install CDH from packages, service will be installed as part of the Linux Standard Base (LSB).)
Oozie Packaging
There are two packaging options for installing Oozie: Separate RPM packages for the Oozie server (oozie) and client (oozie-client) Separate Debian packages for the Oozie server (oozie) and client (oozie-client) You can also download an Oozie tarball.
Oozie Prerequisites
Prerequisites for installing Oozie server: An operating system supported by CDH4 Oracle JDK A supported database if you are not planning to use the default (Derby). Prerequisites for installing Oozie client: Oracle JDK Note: To see which version of Oozie is shipping in CDH4, check the Version and Packaging Information. For important information on new and changed components, see the CDH4 Release Notes. The following CDH4 versions of Hadoop work with the CDH4 version of Oozie: Beta 2, CDH4.0.0.
Upgrading Oozie
Follow these instructions to upgrade Oozie to CDH4 from RPM or Debian Packages. Before you start: Make sure there are no workflows in RUNNING or SUSPENDED status; otherwise the database upgrade will fail and you will have to reinstall Oozie CDH3 to complete or kill those running workflows.
Important: Ubuntu and Debian upgrades When you uninstall CDH3 Oozie on Ubuntu and Debian systems, the contents of /var/lib/oozie are removed, leaving a bare directory. This can cause the Oozie upgrade to CDH4 to fail. To prevent this, either copy the database files to another location and restore them after the uninstall, or recreate them after the uninstall. Make sure you do this before starting the re-install.
Step 1: Remove Oozie 1. Back up the Oozie configuration files in /etc/oozie and the Oozie database. For convenience you may want to save Oozie configuration files in your home directory; you will need them after installing the new version of Oozie. 2. Stop the Oozie Server. To stop the Oozie Server:
sudo service oozie stop
3. Uninstall Oozie. To uninstall Oozie, run the appropriate command on each host: On Red Hat-compatible systems:
$ sudo yum remove oozie-client
On SLES systems:
$ sudo zypper remove oozie-client
Warning: If you are upgrading an Ubuntu or Debian system from CDH3u3 or earlier, you must use apt-get purge (rather than apt-get remove) to make sure the re-install succeeds, but be aware that apt-get purge removes all your configuration data. If you have modified any configuration files, DO NOT PROCEED before backing them up.
Step 2: Install Oozie Follow the procedure under Installing Oozie and then proceed to Configuring Oozie after Upgrading from CDH3. For packaging information, see Oozie Packaging. Important: During uninstall, the package manager renames any configuration files you have modified from <file> to <file>.rpmsave. During re-install, the package manager creates a new <file> with applicable defaults. You are responsible for applying any changes captured in the original CDH3 configuration file to the new CDH4 configuration file. In the case of Ubuntu and Debian upgrades, a file will not be installed if there is already a version of that file on the system, and you will be prompted to resolve conflicts; for details, see Automatic handling of configuration files by dpkg.
Upgrading Oozie from an Earlier CDH4 Release to the Latest CDH4 Release
The steps that follow assume you are upgrading Oozie as part of an overall upgrade to the latest CDH4 release and have already performed the steps under Upgrading from an Earlier CDH4 Release. To upgrade Oozie to the latest CDH4 release, proceed as follows.
Step 1: Back Up the Configuration Back up the Oozie configuration files in /etc/oozie and the Oozie database. For convenience you may want to save Oozie configuration files in your home directory; you will need them after installing the new version of Oozie. Step 2: Stop the Oozie Server. To stop the Oozie Server:
sudo service oozie stop
Step 3: Install Oozie Follow the procedure under Installing Oozie and then proceed to Configuring Oozie after Upgrading from an Earlier CDH4 Release. Important: During package upgrade, the package manager renames any configuration files you have modified from <file> to <file>.rpmsave, and creates a new <file> with applicable defaults. You are responsible for applying any changes captured in the original configuration file to the new configuration file. In the case of Ubuntu and Debian upgrades, you will be prompted if you have made changes to a file for which there is a new version; for details, see Automatic handling of configuration files by dpkg.
Installing Oozie
Oozie is distributed as two separate packages; a client package (oozie-client) and a server package (oozie). Depending on what you are planning to install, choose the appropriate packages and install them using your preferred package manager application. Note: The Oozie server package, oozie, is preconfigured to work with MRv1. To configure the Oozie server to work with Hadoop MapReduce YARN, see Configuring the Hadoop Version to Use.
Important: If you have not already done so, install Cloudera's yum, zypper/YaST or apt repository before using the following commands to install Oozie. For instructions, see CDH4 Installation. To install the Oozie server package on Ubuntu and other Debian systems:
$ sudo apt-get install oozie
To install the Oozie client package on Ubuntu and other Debian systems:
$ sudo apt-get install oozie-client
To install the Oozie server package on a Red Hat-compatible system:
$ sudo yum install oozie
Note: Installing the oozie package creates an oozie service configured to start Oozie at system startup time. You are now ready to configure Oozie. See the next section.
Configuring Oozie
This section explains how to configure which Hadoop version to use, and provides separate procedures for each of the following: configuring Oozie after upgrading from CDH3, configuring Oozie after upgrading from an earlier CDH4 release, and configuring Oozie after a fresh install.
Step 1: Update Configuration Files
1. Edit the new Oozie CDH4 oozie-site.xml, and set all customizable properties to the values you set in the CDH3 oozie-site.xml. Important: DO NOT copy over the CDH3 configuration files into the CDH4 configuration directory. The configuration property names for the database settings have changed between Oozie CDH3 and Oozie CDH4: the prefix for these names has changed from oozie.service.StoreService.* to oozie.service.JPAService.*. Make sure you use the new prefix.
2. If necessary do the same for the oozie-log4j.properties, oozie-env.sh and the adminusers.txt files.
Step 2: Upgrade the Database
Important: Do not proceed before you have edited the configuration files as instructed in Step 1. Before running the database upgrade tool, copy or symlink the MySQL JDBC driver JAR into the /var/lib/oozie/ directory.
Oozie CDH4 provides a command-line tool to perform the database schema and data upgrade that is required when you upgrade Oozie from CDH3 to CDH4. The tool uses Oozie configuration files to connect to the database and perform the upgrade. The database upgrade tool works in two modes: it can do the upgrade in the database or it can produce an SQL script that a database administrator can run manually. If you use the tool to perform the upgrade, you must do it as a database user who has permissions to run DDL operations in the Oozie database.
To run the Oozie database upgrade tool against the database:
Important: This step must be done as the oozie Unix user, otherwise Oozie may fail to start or work properly because of incorrect file permissions.
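For example, assuming the default installation path (adjust the path if Oozie is installed elsewhere), run the in-place upgrade as follows:

$ sudo -u oozie /usr/lib/oozie/bin/ooziedb.sh upgrade -run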
DONE Oozie DB has been upgraded to Oozie version '3.3.2-cdh4.4.0' The SQL commands have been written to: /tmp/ooziedb-5737263881793872034.sql
To create the upgrade script: Important: This step must be done as the oozie Unix user, otherwise Oozie may fail to start or work properly because of incorrect file permissions.
For example:
$ bin/ooziedb.sh upgrade -sqlfile oozie-upgrade.sql
Important: If you used the -sqlfile option instead of -run, Oozie database schema has not been upgraded. You need to run the oozie-upgrade script against your database.
Step 3: Upgrade the Oozie Sharelib Important: This step is required; CDH4 Oozie does not work with CDH3 shared libraries. CDH4 Oozie has a new shared library which bundles CDH4 JAR files for streaming, DistCp and for Pig, Hive and Sqoop.
The Oozie installation bundles two shared libraries, one for MRv1 and one for YARN. Make sure you install the right one for the MapReduce version you are using: The shared library file for MRv1 is oozie-sharelib.tar.gz. The shared library file for YARN is oozie-sharelib-yarn.tar.gz. 1. Delete the Oozie shared libraries from HDFS. For example:
$ sudo -u oozie hadoop fs -rmr /user/oozie/share
Note: If Kerberos is enabled, do not use commands in the form sudo -u <user> <command>; they will fail with a security error. Instead, use the following commands: $ kinit <user> (if you are using a password) or $ kinit -kt <keytab> <principal> (if you are using a keytab) and then, for each command executed by this user, $ <command> 2. Expand the Oozie CDH4 shared libraries in a local temp directory and copy them to HDFS. For example:
$ mkdir /tmp/ooziesharelib
$ cd /tmp/ooziesharelib
$ tar xzf /usr/lib/oozie/oozie-sharelib.tar.gz
$ sudo -u oozie hadoop fs -put share /user/oozie/share
Important: If you are installing Oozie to work with YARN use oozie-sharelib-yarn.tar.gz instead.
Note: If the current shared libraries are in another location, make sure you use this other location when you run the above commands, and if necessary edit the oozie-site.xml configuration file to point to the right location.
Step 4: Start the Oozie Server Now you can start Oozie:
$ sudo service oozie start
Check Oozie's oozie.log to verify that Oozie has started successfully. Step 5: Upgrade the Oozie Client Although older Oozie clients work with the new Oozie server, you need to install the new version of the Oozie client in order to use all the functionality of the Oozie server. To upgrade the Oozie client, if you have not already done so, follow the steps under Installing Oozie.
Step 1: Update Configuration Files
1. Edit the new Oozie CDH4 oozie-site.xml, and set all customizable properties to the values you set in the previous oozie-site.xml.
2. If necessary do the same for the oozie-log4j.properties, oozie-env.sh and the adminusers.txt files.
Step 2: Upgrade the Oozie Sharelib
Important: This step is required; the current version of Oozie does not work with shared libraries from an earlier version.
The Oozie installation bundles two shared libraries, one for MRv1 and one for YARN. Make sure you install the right one for the MapReduce version you are using: The shared library file for MRv1 is oozie-sharelib.tar.gz. The shared library file for YARN is oozie-sharelib-yarn.tar.gz.
1. Delete the Oozie shared libraries from HDFS. For example:
$ sudo -u oozie hadoop fs -rmr /user/oozie/share
Note: If Kerberos is enabled, do not use commands in the form sudo -u <user> <command>; they will fail with a security error. Instead, use the following commands: $ kinit <user> (if you are using a password) or $ kinit -kt <keytab> <principal> (if you are using a keytab) and then, for each command executed by this user, $ <command> 2. Expand the Oozie CDH4 shared libraries in a local temp directory and copy them to HDFS. For example:
$ mkdir /tmp/ooziesharelib
$ cd /tmp/ooziesharelib
$ tar xzf /usr/lib/oozie/oozie-sharelib.tar.gz
$ sudo -u oozie hadoop fs -put share /user/oozie/share
Important: If you are installing Oozie to work with YARN use oozie-sharelib-yarn.tar.gz instead.
Note: If the current shared libraries are in another location, make sure you use this other location when you run the above commands, and if necessary edit the oozie-site.xml configuration file to point to the right location.
Step 3: Start the Oozie Server Now you can start Oozie:
$ sudo service oozie start
Check Oozie's oozie.log to verify that Oozie has started successfully. Step 4: Upgrade the Oozie Client Although older Oozie clients work with the new Oozie server, you need to install the new version of the Oozie client in order to use all the functionality of the Oozie server. To upgrade the Oozie client, if you have not already done so, follow the steps under Installing Oozie.
Type of File        Where Installed
binaries            /usr/lib/oozie/
configuration       /etc/oozie/conf/
documentation
examples TAR.GZ
sharelib TAR.GZ     /usr/lib/oozie/
data                /var/lib/oozie/
logs                /var/log/oozie/
temp                /var/tmp/oozie/
PID file            /var/run/oozie/
Deciding which Database to Use Oozie has a built-in Derby database, but Cloudera recommends that you use a Postgres, MySQL, or Oracle database instead, for the following reasons: Derby runs in embedded mode and it is not possible to monitor its health. It is not clear how to implement a live backup strategy for the embedded Derby database, though it may be possible. Under load, Cloudera has observed locks and rollbacks with the embedded Derby database which don't happen with server-based databases. Configuring Oozie to Use Postgres Use the procedure that follows to configure Oozie to use PostgreSQL instead of Apache Derby. Step 1: Install PostgreSQL 8.4.x or 9.0.x. Note: See CDH4 Requirements and Supported Versions for tested versions. Step 2: Create the Oozie user and Oozie database. For example, using the Postgres psql command-line tool:
$ psql -U postgres Password for user postgres: ***** postgres=# CREATE ROLE oozie LOGIN ENCRYPTED PASSWORD 'oozie' NOSUPERUSER INHERIT CREATEDB NOCREATEROLE; CREATE ROLE postgres=# CREATE DATABASE "oozie" WITH OWNER = oozie ENCODING = 'UTF8' TABLESPACE = pg_default LC_COLLATE = 'en_US.UTF8' LC_CTYPE = 'en_US.UTF8'
Step 3: Configure Postgres to accept network connections for user oozie . Edit the Postgres data/pg_hba.conf file as follows:
host oozie oozie 0.0.0.0/0 md5
Step 5: Configure Oozie to use Postgres. Edit the oozie-site.xml file as follows:
... <property> <name>oozie.service.JPAService.jdbc.driver</name> <value>org.postgresql.Driver</value> </property> <property> <name>oozie.service.JPAService.jdbc.url</name> <value>jdbc:postgresql://localhost:5432/oozie</value> </property> <property> <name>oozie.service.JPAService.jdbc.username</name> <value>oozie</value> </property> <property> <name>oozie.service.JPAService.jdbc.password</name> <value>oozie</value> </property> ...
Note: In the JDBC URL property, replace localhost with the hostname where Postgres is running. In the case of Postgres, unlike MySQL or Oracle, there is no need to download and install the JDBC driver separately, as it is license-compatible with Oozie and bundled with it.
Configuring Oozie to Use MySQL Use the procedure that follows to configure Oozie to use MySQL instead of Apache Derby. Step 1: Install and start MySQL 5.x Note: See CDH4 Requirements and Supported Versions for tested versions. Step 2: Create the Oozie database and Oozie MySQL user. For example, using the MySQL mysql command-line tool:
$ mysql -u root -p Enter password: ****** mysql> create database oozie;
Query OK, 1 row affected (0.03 sec) mysql> grant all privileges on oozie.* to 'oozie'@'localhost' identified by 'oozie'; Query OK, 0 rows affected (0.03 sec) mysql> grant all privileges on oozie.* to 'oozie'@'%' identified by 'oozie'; Query OK, 0 rows affected (0.03 sec) mysql> exit Bye
Step 3: Configure Oozie to use MySQL. Edit properties in the oozie-site.xml file as follows:
... <property> <name>oozie.service.JPAService.jdbc.driver</name> <value>com.mysql.jdbc.Driver</value> </property> <property> <name>oozie.service.JPAService.jdbc.url</name> <value>jdbc:mysql://localhost:3306/oozie</value> </property> <property> <name>oozie.service.JPAService.jdbc.username</name> <value>oozie</value> </property> <property> <name>oozie.service.JPAService.jdbc.password</name> <value>oozie</value> </property> ...
Note: In the JDBC URL property, replace localhost with the hostname where MySQL is running. Step 4: Add the MySQL JDBC driver JAR to Oozie. Copy or symlink the MySQL JDBC driver JAR into the /var/lib/oozie/ directory. Note: You must manually download the MySQL JDBC driver JAR file.
Configuring Oozie to use Oracle Use the procedure that follows to configure Oozie to use Oracle 11g instead of Apache Derby. Note: See CDH4 Requirements and Supported Versions for tested versions. Step 1: Install and start Oracle 11g. Step 2: Create the Oozie Oracle user. For example, using the Oracle sqlplus command-line tool:
$ sqlplus system@localhost Enter password: ******
SQL> create user oozie identified by oozie default tablespace users temporary tablespace temp; User created. SQL> grant all privileges to oozie; Grant succeeded. SQL> exit $
Step 3: Configure Oozie to use Oracle. Edit the oozie-site.xml file as follows:
... <property> <name>oozie.service.JPAService.jdbc.driver</name> <value>oracle.jdbc.driver.OracleDriver</value> </property> <property> <name>oozie.service.JPAService.jdbc.url</name> <value>jdbc:oracle:thin:@localhost:1521:oozie</value> </property> <property> <name>oozie.service.JPAService.jdbc.username</name> <value>oozie</value> </property> <property> <name>oozie.service.JPAService.jdbc.password</name> <value>oozie</value> </property> ...
Note: In the JDBC URL property, replace localhost with the hostname where Oracle is running and replace oozie with the TNS name of the Oracle database. Step 4: Add the Oracle JDBC driver JAR to Oozie. Copy or symlink the Oracle JDBC driver JAR into the /var/lib/oozie/ directory. Note: You must manually download the Oracle JDBC driver JAR file.
Creating the Oozie Database Schema After configuring Oozie database information and creating the corresponding database, create the Oozie database schema. Oozie provides a database tool for this purpose. Note: The Oozie database tool uses Oozie configuration files to connect to the database to perform the schema creation; before you use the tool, make sure you have created a database and configured Oozie to work with it as described above. The Oozie database tool works in 2 modes: it can create the database, or it can produce an SQL script that a database administrator can run to create the database manually. If you use the tool to create the database schema, you must have the permissions needed to execute DDL operations.
To run the Oozie database tool against the database: Important: This step must be done as the oozie Unix user, otherwise Oozie may fail to start or work properly because of incorrect file permissions.
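For example, to create the schema directly in the configured database (assuming the default installation path; adjust if Oozie is installed elsewhere):

$ sudo -u oozie /usr/lib/oozie/bin/ooziedb.sh create -run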
To create the schema creation script: Important: This step must be done as the oozie Unix user, otherwise Oozie may fail to start or work properly because of incorrect file permissions. Run /usr/lib/oozie/bin/ooziedb.sh create -sqlfile <SCRIPT>. For example:
$ sudo -u oozie /usr/lib/oozie/bin/ooziedb.sh create -sqlfile oozie-create.sql
Important: If you used the -sqlfile option instead of -run, Oozie database schema has not been created. You need to run the oozie-create.sql script against your database.
Enabling the Oozie Web Console
To enable Oozie's web console, you must download and add the ExtJS library to the Oozie server. If you have not already done this, proceed as follows.
Step 1: Download the Library
Download the ExtJS version 2.2 library from http://archive.cloudera.com/gplextras/misc/ext-2.2.zip and place it in a convenient location.
Step 2: Install the Library
Extract the ext-2.2.zip file into /var/lib/oozie.
Configuring Oozie with Kerberos Security
To configure Oozie with Kerberos security, see Oozie Security Configuration.
Installing the Oozie ShareLib in Hadoop HDFS
The Oozie installation bundles Oozie ShareLib, which contains all of the necessary JARs to enable workflow jobs to run streaming, DistCp, Pig, Hive, and Sqoop actions. The Oozie installation bundles two shared libraries, one for MRv1 and one for YARN. Make sure you install the right one for the MapReduce version you are using: The shared library file for MRv1 is oozie-sharelib.tar.gz. The shared library file for YARN is oozie-sharelib-yarn.tar.gz.
Important: If Hadoop is configured with Kerberos security enabled, you must first configure Oozie with Kerberos Authentication. For instructions, see Oozie Security Configuration. Before running the commands in the following instructions, you must run the sudo -u oozie kinit -k -t /etc/oozie/oozie.keytab and kinit -k hdfs commands. Then, instead of using commands in the form sudo -u <user> <command>, use just <command>; for example, $ hadoop fs -mkdir
/user/oozie
To install Oozie ShareLib in Hadoop HDFS in the oozie user home directory:
$ sudo -u hdfs hadoop fs -mkdir /user/oozie
$ sudo -u hdfs hadoop fs -chown oozie:oozie /user/oozie
$ mkdir /tmp/ooziesharelib
$ cd /tmp/ooziesharelib
$ tar xzf /usr/lib/oozie/oozie-sharelib.tar.gz
$ sudo -u oozie hadoop fs -put share /user/oozie/share
Important: If you are installing Oozie to work with YARN use oozie-sharelib-yarn.tar.gz instead.
Configuring Support for Oozie Uber JARs An uber JAR is a JAR that contains other JARs with dependencies in a lib/ folder inside the JAR. Beginning with CDH4.1, you can configure the cluster to handle uber JARs properly for the MapReduce action (as long as it does not include any streaming or pipes) by setting the following property in the oozie-site.xml file:
...
<property>
    <name>oozie.action.mapreduce.uber.jar.enable</name>
    <value>true</value>
</property>
...
When this property is set, users can use the oozie.mapreduce.uber.jar configuration property in their MapReduce workflows to notify Oozie that the specified JAR file is an uber JAR. Configuring Oozie to Run against a Federated Cluster To run Oozie against a federated HDFS cluster using ViewFS, configure the
oozie.service.HadoopAccessorService.supported.filesystems property in oozie-site.xml as
follows:
<property> <name>oozie.service.HadoopAccessorService.supported.filesystems</name> <value>hdfs,viewfs</value> </property>
If you see the message Oozie System ID [oozie-oozie] started in the oozie.log log file, the system has started successfully. Note: By default, Oozie server runs on port 11000 and its URL is http://<OOZIE_HOSTNAME>:11000/oozie.
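You can also confirm that the server is responding by querying its status with the Oozie command-line client, described below; this is only a quick check and assumes the default port:

$ oozie admin -oozie http://localhost:11000/oozie -status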
To make it convenient to use this utility, set the environment variable OOZIE_URL to point to the URL of the Oozie server. Then you can skip the -oozie option.
For example, if you want to invoke the client on the same machine where the Oozie server is running, set the OOZIE_URL to http://localhost:11000/oozie.
$ export OOZIE_URL=http://localhost:11000/oozie $ oozie admin -version Oozie server build version: 3.1.3-cdh4.0.0
Important: If Oozie is configured with Kerberos Security enabled: You must have a Kerberos session running. For example, you can start a session by running the kinit command. Do not use localhost as in the above examples. As with every service that uses Kerberos, Oozie has a Kerberos principal in the form <SERVICE>/<HOSTNAME>@<REALM>. In a Kerberos configuration, you must use the <HOSTNAME> value in the Kerberos principal to specify the Oozie server; for example, if the <HOSTNAME> in the principal is myoozieserver.mydomain.com, set OOZIE_URL as follows:
$ export OOZIE_URL=http://myoozieserver.mydomain.com:11000/oozie
If you use an alternate hostname or the IP address of the service, Oozie will not work properly.
Points to note
The Virtual IP Address or Load Balancer can be used to periodically check the health of the hot server. If something is wrong, you can shut down the hot server, start the cold server, and redirect the Virtual IP Address or Load Balancer to the new hot server. This can all be automated with a script, but a false positive indicating the hot server is down will cause problems, so test your script carefully. There will be no data loss. Any running workflows will continue from where they left off. It takes only about 15 seconds to start the Oozie server. See also Configuring Oozie to use HDFS HA.
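As an illustration only, such a script might poll the Oozie admin status endpoint and trigger your failover actions when the hot server stops responding. This sketch assumes the Oozie Web Services v1 API on the default port, uses a hypothetical OOZIE_HOST variable, and leaves the actual failover commands to you:

#!/bin/bash
# Hypothetical health check: OOZIE_HOST is the address of the current hot server.
if ! curl -sf "http://${OOZIE_HOST}:11000/oozie/v1/admin/status" | grep -q NORMAL; then
  echo "Oozie on ${OOZIE_HOST} appears unhealthy; starting failover" >&2
  # Stop the hot server, start the cold server, and repoint the Virtual IP
  # Address or Load Balancer here (site-specific commands).
fi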
Hive Installation
Note: Install Cloudera Repository Before using the instructions on this page to install or upgrade, install the Cloudera yum, zypper/YaST or apt repository, and install or upgrade CDH4 and make sure it is functioning correctly. For instructions, see CDH4 Installation and the instructions for upgrading to CDH4 or upgrading from an earlier CDH4 release.
Note: Running Services When starting, stopping and restarting CDH components, always use the service (8) command rather than running scripts in /etc/init.d directly. This is important because service sets the current working directory to / and removes most environment variables (passing only LANG and TERM) so as to create a predictable environment in which to administer the service. If you run the scripts in/etc/init.d, any environment variables you have set remain in force, and could produce unpredictable results. (If you install CDH from packages, service will be installed as part of the Linux Standard Base (LSB).) Use the following sections to install, update, and configure Hive: About Hive Upgrading Hive Installing Hive Configuring the Metastore Configuring HiveServer2 Starting the Metastore File System Permissions Starting, Stopping, and Using HiveServer2 Starting the Hive Console Using Hive with HBase Installing the JDBC on Clients Setting HADOOP_MAPRED_HOME for YARN Configuring the Metastore for HDFS HA Troubleshooting Apache Hive Documentation
About Hive
Apache Hive is a powerful data warehousing application built on top of Hadoop; it enables you to access your data using Hive QL, a language that is similar to SQL. Install Hive on your client machine(s) from which you submit jobs; you do not need to install it on the nodes in your Hadoop cluster.
HiveServer2
As of CDH4.1, you can deploy HiveServer2, an improved version of HiveServer that supports a new Thrift API tailored for JDBC and ODBC clients, Kerberos authentication, and multi-client concurrency. There is also a new CLI for HiveServer2 named BeeLine. Cloudera recommends you install HiveServer2, and use it whenever possible. (You can still use the original HiveServer when you need to, and run it concurrently with HiveServer2; see Configuring HiveServer2).
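For example, once HiveServer2 is running you can connect from BeeLine roughly as follows; this is a sketch, where 10000 is the default HiveServer2 port and the user name and password placeholders depend on your configuration:

$ /usr/lib/hive/bin/beeline
beeline> !connect jdbc:hive2://localhost:10000 <username> <password> org.apache.hive.jdbc.HiveDriver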
Upgrading Hive
Upgrade Hive on all the hosts on which it is running: servers and clients. Note: To see which version of Hive is shipping in CDH4, check the Version and Packaging Information. For important information on new and changed components, see the CDH4 Release Notes.
Step 1: Remove Hive Warning: You must make sure no Hive processes are running. If Hive processes are running during the upgrade, the new version will not work correctly. 1. Exit the Hive console and make sure no Hive scripts are running. 2. Stop any HiveServer processes that are running. If HiveServer is running as a daemon, use the following command to stop it:
$ sudo service hive-server stop
If HiveServer is running from the command line, stop it with <CTRL>-c. 3. Stop the metastore. If the metastore is running as a daemon, use the following command to stop it:
$ sudo service hive-metastore stop
If the metastore is running from the command line, stop it with <CTRL>-c. 4. Remove Hive:
Note: The following examples show how to uninstall Hive packages on a CDH3 system. Note that CDH3 and CDH4 use different names for the Hive packages: in CDH3 Hive, package names begin with the prefix hadoop-hive, while in CDH4 they begin with the prefix hive. (If you are already running CDH4 and upgrading to the latest version, you do not need to remove Hive: see Upgrading Hive from an Earlier Version of CDH4.)
Warning: If you are upgrading an Ubuntu or Debian system from CDH3u3 or earlier, you must use apt-get purge (rather than apt-get remove) to make sure the re-install succeeds, but be aware that apt-get purge removes all your configuration data. If you have modified any configuration files, DO NOT PROCEED before backing them up.
Step 2: Install the new Hive version on all hosts (Hive servers and clients) See Installing Hive. Important: During uninstall, the package manager renames any configuration files you have modified from <file> to <file>.rpmsave. During re-install, the package manager creates a new <file> with applicable defaults. You are responsible for applying any changes captured in the original CDH3 configuration file to the new CDH4 configuration file. In the case of Ubuntu and Debian upgrades, a file will not be installed if there is already a version of that file on the system, and you will be prompted to resolve conflicts; for details, see Automatic handling of configuration files by dpkg.
Step 3: Configure the Hive Metastore You must configure the Hive metastore and initialize the service before you start the Hive Console. See Configuring the Hive Metastore for detailed instructions. Step 4: Upgrade the Metastore Schema The current version of CDH4 includes changes in the Hive metastore schema. If you have been using Hive 0.9 or earlier, you must upgrade the Hive metastore schema after you install the new version of Hive but before you start Hive. To do this, run the appropriate schema upgrade scripts in /usr/lib/hive/scripts/metastore/upgrade/: Schema upgrade scripts from 0.7 to 0.8 and from 0.8 to 0.9 for Derby, MySQL, and PostgreSQL 0.8 and 0.9 schema scripts for Oracle, but no upgrade scripts (you will need to create your own) Schema upgrade scripts from 0.9 to 0.10 for Derby, MySQL, PostgreSQL and Oracle
Note: To upgrade Hive from CDH3 to CDH4, you must upgrade the schema to 0.8, then to 0.9, and then to 0.10. Important: Cloudera strongly encourages you to make a backup copy of your metastore database before running the upgrade scripts. You will need this backup copy if you run into problems during the upgrade or need to downgrade to a previous version. You must upgrade the metastore schema before starting Hive after the upgrade. Failure to do so may result in metastore corruption. To run a script, you must first cd to the directory that script is in: that is /usr/lib/hive/scripts/metastore/upgrade/<database>. For more information about upgrading the schema, see the README in /usr/lib/hive/scripts/metastore/upgrade/. Step 5: Configure HiveServer2 HiveServer2 is an improved version of the original HiveServer (HiveServer1). Cloudera recommends using HiveServer2 instead of HiveServer1 as long as you do not depend directly on HiveServer1's Thrift API. Some configuration is required before you initialize HiveServer2; see Configuring HiveServer2 for details. Note: If you need to run HiveServer1 You can continue to run HiveServer1 on CDH4.1 and later if you need it for backward compatibility; for example, you may have existing Perl and Python scripts that use the native HiveServer1 Thrift bindings. You can install and run HiveServer1 and HiveServer2 concurrently on the same system; see Running HiveServer2 and HiveServer Concurrently.
Step 6: Upgrade Scripts, etc., for HiveServer2 (if necessary) If you have been running HiveServer1, you may need to make some minor modifications to your client-side scripts and applications when you upgrade: HiveServer1 does not support concurrent connections, so many customers run a dedicated instance of HiveServer1 for each client. These can now be replaced by a single instance of HiveServer2. HiveServer2 uses a different connection URL and driver class for the JDBC driver. If you have existing scripts that use JDBC to communicate with HiveServer1, you can modify these scripts to work with HiveServer2 by changing the JDBC driver URL from jdbc:hive://hostname:port to jdbc:hive2://hostname:port, and by changing the JDBC driver class name from org.apache.hadoop.hive.jdbc.HiveDriver to org.apache.hive.jdbc.HiveDriver. Step 7: Start the Metastore, HiveServer2, and BeeLine See: Starting the Metastore Starting HiveServer2 Using BeeLine
Remove this property before you proceed; otherwise Hive queries spawned from MapReduce jobs will fail with a null pointer exception (NPE). To upgrade Hive from an earlier version of CDH4, proceed as follows. Step 1: Stop all Hive Processes and Daemons Warning: You must make sure no Hive processes are running. If Hive processes are running during the upgrade, the new version will not work correctly. 1. Exit the Hive console and make sure no Hive scripts are running. 2. Stop any HiveServer processes that are running. If HiveServer is running as a daemon, use the following command to stop it:
$ sudo service hive-server stop
If HiveServer is running from the command line, stop it with <CTRL>-c. 3. Stop any HiveServer2 processes that are running. If HiveServer2 is running as a daemon, use the following command to stop it:
$ sudo service hive-server2 stop
If HiveServer2 is running from the command line, stop it with <CTRL>-c. 4. Stop the metastore. If the metastore is running as a daemon, use the following command to stop it:
$ sudo service hive-metastore stop
If the metastore is running from the command line, stop it with <CTRL>-c. Step 2: Install the new Hive version on all hosts (Hive servers and clients) See Installing Hive. Step 3: Verify that the Hive Metastore is Properly Configured See Configuring the Hive Metastore for detailed instructions.
Step 4: Upgrade the Metastore Schema The current version of CDH4 includes changes in the Hive metastore schema. If you have been using Hive 0.9 or earlier, you must upgrade the Hive metastore schema after you install the new version of Hive but before you start Hive. To do this, run the appropriate schema upgrade scripts in /usr/lib/hive/scripts/metastore/upgrade/: Schema upgrade scripts from 0.7 to 0.8 and from 0.8 to 0.9 for Derby, MySQL, and PostgreSQL 0.8 and 0.9 schema scripts for Oracle, but no upgrade scripts (you will need to create your own) Schema upgrade scripts from 0.9 to 0.10 for Derby, MySQL, PostgreSQL and Oracle Important: Cloudera strongly encourages you to make a backup copy of your metastore database before running the upgrade scripts. You will need this backup copy if you run into problems during the upgrade or need to downgrade to a previous version. You must upgrade the metastore schema before starting Hive. Failure to do so may result in metastore corruption. To run a script, you must first cd to the directory that script is in: that is /usr/lib/hive/scripts/metastore/upgrade/<database>. For more information about upgrading the schema, see the README in /usr/lib/hive/scripts/metastore/upgrade/. Step 5: Configure HiveServer2 HiveServer2 is an improved version of the original HiveServer (HiveServer1). Cloudera recommends using HiveServer2 instead of HiveServer1 in most cases. Some configuration is required before you initialize HiveServer2; see Configuring HiveServer2 for details. Note: If you need to run HiveServer1 You can continue to run HiveServer1 on CDH4.1 and later if you need it for backward compatibility; for example, you may have existing Perl and Python scripts that use the native HiveServer1 Thrift bindings. You can install and run HiveServer1 and HiveServer2 concurrently on the same systems; see Running HiveServer2 and HiveServer Concurrently.
Step 6: Upgrade Scripts, etc., for HiveServer2 (if necessary) If you have been running HiveServer1, you may need to make some minor modifications to your client-side scripts and applications when you upgrade: HiveServer1 does not support concurrent connections, so many customers run a dedicated instance of HiveServer1 for each client. These can now be replaced by a single instance of HiveServer 2. HiveServer2 uses a different connection URL and driver class for the JDBC driver; scripts may need to be modified to use the new version. Perl and Python scripts that use the native HiveServer1 Thrift bindings may need to be modified to use the HiveServer2 Thrift bindings. Step 7: Start the Metastore, HiveServer2, and BeeLine See: Starting the Metastore
Starting HiveServer2 Using BeeLine The upgrade is now complete.
Installing Hive
Install the appropriate Hive packages using the appropriate command for your distribution. RedHat and CentOS systems
$ sudo yum install <pkg1> <pkg2> ...
SLES systems
hive: base package that provides the complete language and runtime (required)
hive-metastore: provides scripts for running the metastore as a standalone service (optional)
hive-server: provides scripts for running the original HiveServer as a standalone service (optional)
hive-server2: provides scripts for running the new HiveServer2 as a standalone service (optional)
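For example, to install the base package plus the metastore and HiveServer2 service scripts on a Red Hat-compatible system:

$ sudo yum install hive hive-metastore hive-server2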
Embedded Mode Cloudera recommends using this mode for experimental purposes only.
This is the default metastore deployment mode for CDH. In this mode the metastore uses a Derby database, and both the database and the metastore service run embedded in the main HiveServer process. Both are started for you when you start the HiveServer process. This mode requires the least amount of effort to configure, but it can support only one active user at a time and is not certified for production use. Local Mode
In this mode the Hive metastore service runs in the same process as the main HiveServer process, but the metastore database runs in a separate process, and can be on a separate host. The embedded metastore service communicates with the metastore database over JDBC. Remote Mode Cloudera recommends that you use this mode.
In this mode the Hive metastore service runs in its own JVM process; HiveServer2, HCatalog, Cloudera Impala, and other processes communicate with it via the Thrift network API (configured via the hive.metastore.uris property). The metastore service communicates with the metastore database over JDBC (configured via the javax.jdo.option.ConnectionURL property). The database, the HiveServer process, and the metastore service can all be on the same host, but running the HiveServer process on a separate host provides better availability and scalability. The main advantage of Remote mode over Local mode is that Remote mode does not require the administrator to share JDBC login information for the metastore database with each Hive user. HCatalog requires this mode.
Configuring a remote MySQL database for the Hive Metastore Cloudera recommends you configure a database for the metastore on one or more remote servers (that is, on a host or hosts separate from the HiveServer1 or HiveServer2 process). MySQL is the most popular database to use. Proceed as follows. Step 1: Install and start MySQL if you have not already done so To install MySQL on a Red Hat system:
$ sudo yum install mysql-server
After using the command to install MySQL, you may need to respond to prompts to confirm that you do want to complete the installation. After installation completes, start the mysql daemon. On Red Hat systems
$ sudo service mysqld start
Step 2: Configure the MySQL Service and Connector Before you can run the Hive metastore with a remote MySQL database, you must configure a connector to the remote MySQL database, set up the initial database schema, and configure the MySQL user account for the Hive user. To install the MySQL connector on a Red Hat 6 system: Install mysql-connector-java and symbolically link the file into the /usr/lib/hive/lib/ directory.
$ sudo yum install mysql-connector-java $ ln -s /usr/share/java/mysql-connector-java.jar /usr/lib/hive/lib/mysql-connector-java.jar
To install the MySQL connector on a Red Hat 5 system: Download the MySQL JDBC driver from http://www.mysql.com/downloads/connector/j/5.1.html. You will need to sign up for an account if you don't already have one, and log in, before you can download it. Then copy it to the /usr/lib/hive/lib/ directory. For example:
$ sudo cp mysql-connector-java-version/mysql-connector-java-version-bin.jar /usr/lib/hive/lib/
Note: At the time of publication, version was 5.1.25, but the version may have changed by the time you read this. To install the MySQL connector on a SLES system:
Install mysql-connector-java and symbolically link the file into the /usr/lib/hive/lib/ directory.
$ sudo zypper install mysql-connector-java $ ln -s /usr/share/java/mysql-connector-java.jar /usr/lib/hive/lib/mysql-connector-java.jar
To install the MySQL connector on a Debian/Ubuntu system: Install mysql-connector-java and symbolically link the file into the /usr/lib/hive/lib/ directory. Note: For Ubuntu Precise systems, the name of the connector JAR file is mysql.jar. Modify the second command given below accordingly.
$ sudo apt-get install libmysql-java $ ln -s /usr/share/java/libmysql-java.jar /usr/lib/hive/lib/libmysql-java.jar
Configure MySQL to use a strong password and to start at boot. Note that in the following procedure, your current root password is blank. Press the Enter key when you're prompted for the root password. To set the MySQL root password:
$ sudo /usr/bin/mysql_secure_installation [...] Enter current password for root (enter for none): OK, successfully used password, moving on... [...] Set root password? [Y/n] y New password: Re-enter new password: Remove anonymous users? [Y/n] Y [...] Disallow root login remotely? [Y/n] N [...] Remove test database and access to it [Y/n] Y [...] Reload privilege tables now? [Y/n] Y All done!
To make sure the MySQL server starts at boot: On Red Hat systems:
$ sudo /sbin/chkconfig mysqld on
$ sudo /sbin/chkconfig --list mysqld
mysqld          0:off   1:off   2:on    3:on    4:on    5:on    6:off
On SLES systems:
$ sudo chkconfig --add mysql
On Debian/Ubuntu systems:
$ sudo chkconfig mysql on
Step 3. Create the Database and User The instructions in this section assume you are using Remote mode, and that the MySQL database is installed on a separate host from the metastore service, which is running on a host named metastorehost in the example.
Note: If the metastore service will run on the host where the database is installed, replace 'metastorehost' in the CREATE USER example with 'localhost'. Similarly, the value of javax.jdo.option.ConnectionURL in /etc/hive/conf/hive-site.xml (discussed in the next step) must be jdbc:mysql://localhost/metastore. For more information on adding MySQL users, see http://dev.mysql.com/doc/refman/5.5/en/adding-users.html. Create the initial database schema using the hive-schema-0.10.0.mysql.sql file located in the /usr/lib/hive/scripts/metastore/upgrade/mysql directory. Example
$ mysql -u root -p Enter password: mysql> CREATE DATABASE metastore; mysql> USE metastore; mysql> SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-0.10.0.mysql.sql;
You also need a MySQL user account for Hive to use to access the metastore. It is very important to prevent this user account from creating or altering tables in the metastore database schema. Important: If you fail to restrict the ability of the metastore MySQL user account to create and alter tables, it is possible that users will inadvertently corrupt the metastore schema when they use older or newer versions of Hive. Example
mysql> CREATE USER 'hive'@'metastorehost' IDENTIFIED BY 'mypassword'; ... mysql> REVOKE ALL PRIVILEGES, GRANT OPTION FROM 'hive'@'metastorehost'; mysql> GRANT SELECT,INSERT,UPDATE,DELETE,LOCK TABLES,EXECUTE ON metastore.* TO 'hive'@'metastorehost'; mysql> FLUSH PRIVILEGES; mysql> quit;
Step 4: Configure the Metastore Service to Communicate with the MySQL Database This step shows the configuration properties you need to set in hive-site.xml to configure the metastore service to communicate with the MySQL database, and provides sample settings. Though you can use the same hive-site.xml on all hosts (client, metastore, HiveServer), hive.metastore.uris is the only property that must be configured on all of them; the others are used only on the metastore host. Given a MySQL database running on myhost and the user account hive with the password mypassword, set the configuration as follows (overwriting any existing values). Note: The hive.metastore.local property is no longer supported as of Hive 0.10; setting hive.metastore.uris is sufficient to indicate that you are using a remote metastore.
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://myhost/metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>mypassword</value>
</property>
<property>
  <name>datanucleus.autoCreateSchema</name>
  <value>false</value>
</property>
<property>
  <name>datanucleus.fixedDatastore</name>
  <value>true</value>
</property>
<property>
  <name>datanucleus.autoStartMechanism</name>
  <value>SchemaTable</value>
</property>
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://<n.n.n.n>:9083</value>
  <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>
Configuring a remote PostgreSQL database for the Hive Metastore Before you can run the Hive metastore with a remote PostgreSQL database, you must configure a connector to the remote PostgreSQL database, set up the initial database schema, and configure the PostgreSQL user account for the Hive user. Step 1: Install and start PostgreSQL if you have not already done so To install PostgreSQL on a Red Hat system:
$ sudo yum install postgresql-server
After using the command to install PostgreSQL, you may need to respond to prompts to confirm that you do want to complete the installation. In order to finish installation on Red Hat compatible systems, you need to initialize the database. Please note that this operation is not needed on Ubuntu and SLES systems as it's done automatically on first start: To initialize database files on Red Hat compatible systems
$ sudo service postgresql initdb
To ensure that your PostgreSQL server will be accessible over the network, you need to do some additional configuration. First you need to edit the postgresql.conf file. Set the listen_addresses property to '*' to make sure that the PostgreSQL server starts listening on all your network interfaces. Also make sure that the standard_conforming_strings property is set to off. You can check that you have the correct values as follows: On Red-Hat-compatible systems:
$ sudo cat /var/lib/pgsql/data/postgresql.conf | grep -e listen -e standard_conforming_strings
listen_addresses = '*'
standard_conforming_strings = off
On SLES systems:
$ sudo cat /var/lib/pgsql/data/postgresql.conf | grep -e listen -e standard_conforming_strings
listen_addresses = '*'
standard_conforming_strings = off
You also need to configure authentication for your network in pg_hba.conf. You need to make sure that the PostgreSQL user that you will create in the next step will have access to the server from a remote host. To do this, add a new line into pg_hba.con that has the following information:
host <database> <user> <network address> <mask> md5
The following example allows all users to connect from all hosts to all your databases:
host all all 0.0.0.0 0.0.0.0 md5
Note: This configuration is applicable only for a network listener. Using this configuration won't open all your databases to the entire world; the user must still supply a password to authenticate himself, and privilege restrictions configured in PostgreSQL will still be applied. After completing the installation and configuration, you can start the database server: Start PostgreSQL Server
$ sudo service postgresql start
Use the chkconfig utility to ensure that your PostgreSQL server will start at boot time. For example:
chkconfig postgresql on
You can use the chkconfig utility to verify that PostgreSQL server will be started at boot time, for example:
chkconfig --list postgresql
Step 2: Install the Postgres JDBC Driver Before you can run the Hive metastore with a remote PostgreSQL database, you must configure a JDBC driver to the remote PostgreSQL database, set up the initial database schema, and configure the PostgreSQL user account for the Hive user. To install the PostgreSQL JDBC Driver on a Red Hat 6 system: Install postgresql-jdbc package and create symbolic link to the /usr/lib/hive/lib/ directory. For example:
$ sudo yum install postgresql-jdbc
$ ln -s /usr/share/java/postgresql-jdbc.jar /usr/lib/hive/lib/postgresql-jdbc.jar
To install the PostgreSQL connector on a Red Hat 5 system: You need to manually download the PostgreSQL connector from http://jdbc.postgresql.org/download.html and move it to the /usr/lib/hive/lib/ directory. For example:
$ wget http://jdbc.postgresql.org/download/postgresql-9.2-1002.jdbc4.jar
$ mv postgresql-9.2-1002.jdbc4.jar /usr/lib/hive/lib/
Note: You may need to use a different version if you have a different version of Postgres. You can check the version as follows:
$ sudo rpm -qa | grep postgres
To install the PostgreSQL JDBC Driver on a SLES system: Install postgresql-jdbc and symbolically link the file into the /usr/lib/hive/lib/ directory.
$ sudo zypper install postgresql-jdbc
$ ln -s /usr/share/java/postgresql-jdbc.jar /usr/lib/hive/lib/postgresql-jdbc.jar
To install the PostgreSQL JDBC Driver on a Debian/Ubuntu system: Install libpostgresql-jdbc-java and symbolically link the file into the /usr/lib/hive/lib/ directory.
$ sudo apt-get install libpostgresql-jdbc-java
$ ln -s /usr/share/java/postgresql-jdbc4.jar /usr/lib/hive/lib/postgresql-jdbc4.jar
Step 3: Create the metastore database and user account Proceed as in the following example:
$ sudo -u postgres psql
postgres=# CREATE USER hiveuser WITH PASSWORD 'mypassword';
postgres=# CREATE DATABASE metastore;
postgres=# \c metastore;
You are now connected to database 'metastore'.
postgres=# \i /usr/lib/hive/scripts/metastore/upgrade/postgres/hive-schema-0.10.0.postgres.sql
SET
SET
...
Now you need to grant permission for all metastore tables to user hiveuser. PostgreSQL does not have statements to grant the permissions for all tables at once; you'll need to grant the permissions one table at a time. You could automate the task with the following SQL script:
$ sudo -u postgres psql
metastore=# \o /tmp/grant-privs
metastore=# SELECT 'GRANT SELECT,INSERT,UPDATE,DELETE ON "' || schemaname || '"."' || tablename || '" TO hiveuser ;'
metastore-# FROM pg_tables
metastore-# WHERE tableowner = CURRENT_USER and schemaname = 'public';
metastore=# \o
metastore=# \i /tmp/grant-privs
You can verify the connection from the machine where you'll be running the metastore service as follows:
$ psql -h myhost -U hiveuser -d metastore
metastore=#
Step 4: Configure the Metastore Service to Communicate with the PostgreSQL Database This step shows the configuration properties you need to set in hive-site.xml to configure the metastore service to communicate with the PostgreSQL database. Though you can use the same hive-site.xml on all hosts (client, metastore, HiveServer), hive.metastore.uris is the only property that must be configured on all of them; the others are used only on the metastore host. Given a PostgreSQL database running on host myhost under the user account hive with the password mypassword, you would set configuration properties as follows. Note: The instructions in this section assume you are using Remote mode, and that the PostgreSQL database is installed on a separate host from the metastore server. The hive.metastore.local property is no longer supported as of Hive 0.10; setting hive.metastore.uris is sufficient to indicate that you are using a remote metastore.
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:postgresql://myhost/metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.postgresql.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>mypassword</value>
</property>
<property>
  <name>datanucleus.autoCreateSchema</name>
  <value>false</value>
</property>
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://<n.n.n.n>:9083</value>
  <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>
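Once this hive-site.xml is in place on the metastore host, one quick way to confirm the configuration (assuming the schema was loaded as shown above and the Hive packages described earlier in this guide are installed) is to start the metastore service and run a trivial query from a Hive client that points at it:

$ sudo service hive-metastore start
$ hive -e "SHOW DATABASES;"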
Configuring a remote Oracle database for the Hive Metastore Before you can run the Hive metastore with a remote Oracle database, you must configure a connector to the remote Oracle database, set up the initial database schema, and configure the Oracle user account for the Hive user. Step 1: Install and start Oracle The Oracle database is not part of any Linux distribution and must be purchased, downloaded and installed separately. You can use the Express Edition, which can be downloaded free from the Oracle website. Step 2: Install the Oracle JDBC Driver You must download the Oracle JDBC Driver from the Oracle website and put the ojdbc6.jar file into the /usr/lib/hive/lib/ directory. The driver is available for download here.
$ sudo mv ojdbc6.jar /usr/lib/hive/lib/
Step 3: Create the Metastore database and user account Connect to your Oracle database as an administrator and create the user that will use the Hive metastore.
$ sqlplus "sys as sysdba"
SQL> create user hiveuser identified by mypassword;
SQL> grant connect to hiveuser;
SQL> grant all privileges to hiveuser;
Connect as the newly created hiveuser user and load the initial schema:
$ sqlplus hiveuser
SQL> @/usr/lib/hive/scripts/metastore/upgrade/oracle/hive-schema-0.10.0.oracle.sql
Connect back as an administrator and remove the power privileges from user hiveuser. Then grant limited access to all the tables:
$ sqlplus "sys as sysdba"
SQL> revoke all privileges from hiveuser;
SQL> BEGIN
  2    FOR R IN (SELECT owner, table_name FROM all_tables WHERE owner='HIVEUSER') LOOP
  3      EXECUTE IMMEDIATE 'grant SELECT,INSERT,UPDATE,DELETE on '||R.owner||'.'||R.table_name||' to hiveuser';
  4    END LOOP;
  5  END;
  6  /
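As a quick, optional sanity check that the schema load succeeded (DBS is one of the tables created by the schema script), you can reconnect as hiveuser and query one of the metastore tables:

$ sqlplus hiveuser
SQL> SELECT COUNT(*) FROM DBS;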
Step 4: Configure the Metastore Service to Communicate with the Oracle Database This step shows the configuration properties you need to set in hive-site.xml to configure the metastore service to communicate with the Oracle database, and provides sample settings. Though you can use the same hive-site.xml on all hosts (client, metastore, HiveServer), hive.metastore.uris is the only property that must be configured on all of them; the others are used only on the metastore host. Example Given an Oracle database running on myhost and the user account hiveuser with the password mypassword, set the configuration as follows (overwriting any existing values):
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:oracle:thin:@//myhost/xe</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>oracle.jdbc.OracleDriver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>mypassword</value>
</property>
<property>
  <name>datanucleus.autoCreateSchema</name>
  <value>false</value>
</property>
<property>
  <name>datanucleus.fixedDatastore</name>
  <value>true</value>
</property>
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://<n.n.n.n>:9083</value>
  <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>
Configuring HiveServer2
You must make the following configuration changes before using HiveServer2. Failure to do so may result in unpredictable behavior.
Enable the lock manager by setting properties in /etc/hive/conf/hive-site.xml as follows (substitute your actual ZooKeeper node names for those in the example):
<property>
  <name>hive.support.concurrency</name>
  <description>Enable Hive's Table Lock Manager Service</description>
  <value>true</value>
</property>
<property>
  <name>hive.zookeeper.quorum</name>
  <description>Zookeeper quorum used by Hive's Table Lock Manager</description>
  <value>zk1.myco.com,zk2.myco.com,zk3.myco.com</value>
</property>
Important: Enabling the Table Lock Manager without specifying a list of valid Zookeeper quorum nodes will result in unpredictable behavior. Make sure that both properties are properly configured.
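Before relying on the Table Lock Manager, it can also help to confirm that each ZooKeeper node is reachable from the HiveServer2 host. The four-letter ruok command should return imok; the host name and port below are the example values used above, so substitute your own:

$ echo ruok | nc zk1.myco.com 2181
imok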
hive.zookeeper.client.port
If ZooKeeper is not using the default value for ClientPort, you need to set hive.zookeeper.client.port in /etc/hive/conf/hive-site.xml to the same value that ZooKeeper is using. Check /etc/zookeeper/conf/zoo.cfg to find the value for ClientPort. If ClientPort is set to any value other than 2181 (the default), set hive.zookeeper.client.port to the same value. For example, if ClientPort is set to 2222, set hive.zookeeper.client.port to 2222 as well:
<property>
  <name>hive.zookeeper.client.port</name>
  <value>2222</value>
  <description>The port at which the clients will connect.</description>
</property>
JDBC driver
The connection URL format and the driver class are different for HiveServer2 and HiveServer1:

HiveServer version   Connection URL                Driver Class
HiveServer2          jdbc:hive2://<host>:<port>    org.apache.hive.jdbc.HiveDriver
HiveServer1          jdbc:hive://<host>:<port>     org.apache.hadoop.hive.jdbc.HiveDriver
Authentication
HiveServer2 can be configured to authenticate all connections; by default, it allows any client to connect. HiveServer2 supports either Kerberos or LDAP authentication; configure this in the hive.server2.authentication property in the hive-site.xml file. You can also configure pluggable authentication, which allows you to use a custom authentication provider for HiveServer2; and impersonation, which allows users to execute queries and access HDFS files as the connected user rather than the super user who started the HiveServer2 daemon. For more information, see Hive Security Configuration.
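As an illustrative sketch only (LDAP is used as the example value here; the full set of LDAP- and Kerberos-related properties for your release is listed in Hive Security Configuration), the authentication mode is selected in hive-site.xml like this:

<property>
  <name>hive.server2.authentication</name>
  <value>LDAP</value>
</property>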
HiveServer version   Environment variables for port and bind host
HiveServer2          HIVE_SERVER2_THRIFT_PORT, HIVE_SERVER2_THRIFT_BIND_HOST
HiveServer1          HIVE_PORT
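For example, to run HiveServer2 on a non-default Thrift port, you can export these variables in the environment that starts the service and then restart it; the values shown and the idea of placing them in /etc/default/hive-server2 are assumptions to adapt to your own packaging:

export HIVE_SERVER2_THRIFT_PORT=10001
export HIVE_SERVER2_THRIFT_BIND_HOST=0.0.0.0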
Use Ctrl-c to stop the metastore process running from the command line. To run the metastore as a daemon, the command is:
$ sudo service hive-metastore start
To stop HiveServer2:
$ sudo service hive-server2 stop
To confirm that HiveServer2 is working, start the beeline CLI and use it to execute a SHOW TABLES query on the HiveServer2 process:
$ /usr/lib/hive/bin/beeline
beeline> !connect jdbc:hive2://localhost:10000 username password org.apache.hive.jdbc.HiveDriver
0: jdbc:hive2://localhost:10000> SHOW TABLES;
show tables;
+-----------+
| tab_name  |
Note: If you are using HiveServer2 on a cluster that does not have Kerberos security enabled, the password is arbitrary in the command for starting BeeLine. At present the best source for documentation on BeeLine is the original SQLLine documentation.
To confirm that Hive is working, issue the show tables; command to list the Hive tables; be sure to use a semi-colon after the command:
hive> show tables;
OK
Time taken: 10.345 seconds
(You can find current version numbers for CDH dependencies such as Guava in CDH's root pom.xml file for the current release, for example cdh-root-4.4.0.pom.)
ADD JAR /usr/lib/hive/lib/zookeeper.jar;
ADD JAR /usr/lib/hive/lib/hive-hbase-handler-<Hive-HBase-Handler_version>-cdh<CDH_version>.jar;
ADD JAR /usr/lib/hive/lib/guava-<Guava_version>.jar;
For example,
ADD JAR /usr/lib/hive/lib/zookeeper.jar;
ADD JAR /usr/lib/hive/lib/hive-hbase-handler-0.10.0-cdh4.4.0.jar;
ADD JAR /usr/lib/hive/lib/guava-11.0.2.jar;
On SLES systems:
$ sudo zypper install hive-jdbc
2. Add /usr/lib/hive/lib/*.jar and /usr/lib/hadoop/*.jar to your classpath. You are now ready to run your JDBC client. For more information see the Hive Client document.
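As an illustration only (not taken from the Hive Client document), a minimal HiveServer2 JDBC client might look like the following sketch; it assumes HiveServer2 is listening on localhost:10000 with no authentication configured, so the user name and password are arbitrary:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcClient {
  public static void main(String[] args) throws Exception {
    // Register the HiveServer2 JDBC driver.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // Connect to HiveServer2; with no authentication enabled the credentials are arbitrary.
    Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "");
    Statement stmt = con.createStatement();
    ResultSet rs = stmt.executeQuery("SHOW TABLES");
    while (rs.next()) {
      System.out.println(rs.getString(1));
    }
    rs.close();
    stmt.close();
    con.close();
  }
}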
Troubleshooting
This section provides guidance on problems you may encounter while installing, upgrading, or running Hive.
HCatalog Prerequisites
An operating system supported by CDH4
Oracle JDK
The Hive metastore and its database. The Hive metastore must be running in remote mode (as a service).
To install the WebHCat REST server components on an Ubuntu or other Debian system:
$ sudo apt-get install webhcat-server
Note: It is not necessary to install WebHCat if you will not be using the REST API. Pig and MapReduce do not need it. You can change the default port 50111 by creating or editing the following file and restarting WebHCat:
/etc/webhcat/conf/webhcat-site.xml
To uninstall WebHCat you must remove two packages: webhcat-server and webhcat.
where <hostname> is the host where the HCatalog server components are running, for example hive.examples.com.
See the HCatalog documentation for information on using the HCatalog command-line application.
public class UseHCat extends Configured implements Tool {

    public static class Map extends Mapper<WritableComparable, HCatRecord, Text, IntWritable> {
        String groupname;

        @Override
        protected void map(WritableComparable key, HCatRecord value,
                org.apache.hadoop.mapreduce.Mapper<WritableComparable, HCatRecord,
                Text, IntWritable>.Context context)
                throws IOException, InterruptedException {
            // The group table from /etc/group has name, 'x', id
            groupname = (String) value.get(0);
            int id = (Integer) value.get(2);
            // Just select and emit the name and ID
            context.write(new Text(groupname), new IntWritable(id));
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, WritableComparable, HCatRecord> {
        protected void reduce(Text key, java.lang.Iterable<IntWritable> values,
                org.apache.hadoop.mapreduce.Reducer<Text, IntWritable,
                WritableComparable, HCatRecord>.Context context)
                throws IOException, InterruptedException {
            // Only expecting one ID per group name
            Iterator<IntWritable> iter = values.iterator();
            IntWritable iw = iter.next();
            int id = iw.get();
            // Emit the group name and ID as a record
            HCatRecord record = new DefaultHCatRecord(2);
            record.set(0, key.toString());
            record.set(1, id);
            context.write(null, record);
        }
    }

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        args = new GenericOptionsParser(conf, args).getRemainingArgs();

        // Get the input and output table names as arguments
        String inputTableName = args[0];
        String outputTableName = args[1];
        // Assume the default database
        String dbName = null;

        Job job = new Job(conf, "UseHCat");
        HCatInputFormat.setInput(job, InputJobInfo.create(dbName, inputTableName, null));
        job.setJarByClass(UseHCat.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        // An HCatalog record as input
        job.setInputFormatClass(HCatInputFormat.class);

        // Mapper emits a string as key and an integer as value
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Ignore the key for the reducer output; emitting an HCatalog record as value
        job.setOutputKeyClass(WritableComparable.class);
        job.setOutputValueClass(DefaultHCatRecord.class);
        job.setOutputFormatClass(HCatOutputFormat.class);

        HCatOutputFormat.setOutput(job, OutputJobInfo.create(dbName, outputTableName, null));
        HCatSchema s = HCatOutputFormat.getTableSchema(job);
        System.err.println("INFO: output schema explicitly set for writing:" + s);
        HCatOutputFormat.setSchema(job, s);
        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new UseHCat(), args);
        System.exit(exitCode);
    }
}
Load data from the local file system into the groups table:
$ hive -e "load data local inpath '/etc/group' overwrite into table groups"
After compiling and creating a JAR file, set up the environment needed for copying required JAR files to HDFS, and then run the job, for example: Note: You can find current version numbers for CDH dependencies in CDH's root pom.xml file for the current release, for example cdh-root-4.4.0.pom.
$ export HCAT_HOME=/usr/lib/hcatalog
$ export HIVE_HOME=/usr/lib/hive
$ HCATJAR=$HCAT_HOME/share/hcatalog/hcatalog-core-0.5.0-cdh4.4.0.jar
$ HCATPIGJAR=$HCAT_HOME/share/hcatalog/hcatalog-pig-adapter-0.5.0-cdh4.4.0.jar
$ HIVE_VERSION=0.10.0-cdh4.4.0
$ export HADOOP_CLASSPATH=$HCATJAR:$HCATPIGJAR:$HIVE_HOME/lib/hive-exec-$HIVE_VERSION.jar:\
$HIVE_HOME/lib/hive-metastore-$HIVE_VERSION.jar:$HIVE_HOME/lib/jdo2-api-2.3-ec.jar:\
$HIVE_HOME/lib/libfb303-0.9.0.jar:$HIVE_HOME/lib/libthrift-0.9.0.jar:\
$HIVE_HOME/lib/slf4j-api-1.6.4.jar:$HIVE_HOME/conf:/etc/hadoop/conf
$ LIBJARS=`echo $HADOOP_CLASSPATH | sed -e 's/:/,/g'`
$ export LIBJARS=$LIBJARS,$HIVE_HOME/lib/antlr-runtime-3.4.jar
$ hadoop jar target/UseHCat-1.0.jar com.cloudera.test.UseHCat -files $HCATJAR -libjars $LIBJARS groups groupids
Output:
A: {name: chararray,placeholder: chararray,id: int}
Example output:
{"columns":[{"name":"name","type":"string"},{"name":"placeholder","type":"string"},{"name":"id","type":"int"}],"database":"default","table":"grouptable"}
HBase Installation
Apache HBase provides large-scale tabular storage for Hadoop using the Hadoop Distributed File System (HDFS). Cloudera recommends installing HBase in a standalone mode before you try to run it on a whole cluster. Note: Install Cloudera Repository Before using the instructions on this page to install or upgrade, install the Cloudera yum, zypper/YaST or apt repository, and install or upgrade CDH4 and make sure it is functioning correctly. For instructions, see CDH4 Installation and the instructions for upgrading to CDH4 or upgrading from an earlier CDH4 release.
Note: Running Services When starting, stopping and restarting CDH components, always use the service (8) command rather than running scripts in /etc/init.d directly. This is important because service sets the current working directory to / and removes most environment variables (passing only LANG and TERM) so as to create a predictable environment in which to administer the service. If you run the scripts in /etc/init.d, any environment variables you have set remain in force, and could produce unpredictable results. (If you install CDH from packages, service will be installed as part of the Linux Standard Base (LSB).) Use the following sections to install, update, and configure HBase: Upgrading HBase Installing HBase Configuration Settings Starting HBase in Standalone Mode Configuring HBase in Pseudo-Distributed Mode Deploying HBase in a Cluster Using the HBase Shell Using MapReduce with HBase Troubleshooting Apache HBase Documentation HBase Replication Configuring Snapshots
Upgrading HBase
Note: To see which version of HBase is shipping in CDH4, check the Version and Packaging Information. For important information on new and changed components, see the CDH4 Release Notes.
Important: Check the Known Issues and Work Arounds in CDH4 and Incompatible Changes for HBase before proceeding.
Step 1: Perform a Graceful Cluster Shutdown and Remove HBase To shut down the CDH3 version of HBase gracefully: 1. Stop the Thrift server and clients, then stop the cluster. a. Stop the Thrift server and clients:
sudo service hadoop-hbase-thrift stop
b. Stop the cluster by shutting down the master and the region servers: a. Use the following command on the master node:
sudo service hadoop-hbase-master stop
2. Stop the ZooKeeper Server:
$ sudo service hadoop-zookeeper-server stop
Note: Depending on your platform and release, you may need to use
$ sudo /sbin/service hadoop-zookeeper-server stop
or
$ sudo /sbin/service hadoop-zookeeper stop
3. It is a good idea to back up the /hbase znode before proceeding. By default, this is in /var/zookeeper. 4. If you have not already done so, remove the CDH3 version of ZooKeeper. See Upgrading ZooKeeper from CDH3 to CDH4. 5. Remove HBase: To remove HBase on Red-Hat-compatible systems:
$ sudo yum remove hadoop-hbase
Warning: If you are upgrading an Ubuntu or Debian system from CDH3u3 or earlier, you must use apt-get purge (rather than apt-get remove) to make sure the re-install succeeds, but be aware that apt-get purge removes all your configuration data. If you have modified any configuration files, DO NOT PROCEED before backing them up.
Step 2: Install the new version of HBase Follow directions under Installing HBase. Important: During uninstall, the package manager renames any configuration files you have modified from <file> to <file>.rpmsave. During re-install, the package manager creates a new <file> with applicable defaults. You are responsible for applying any changes captured in the original CDH3 configuration file to the new CDH4 configuration file. In the case of Ubuntu and Debian upgrades, a file will not be installed if there is already a version of that file on the system, and you will be prompted to resolve conflicts; for details, see Automatic handling of configuration files by dpkg.
Step 3: Upgrade the LZO Plugin Note: Skip this step if you were not using the LZO plugin in CDH3. You need to make sure that the proper libraries (native and .jar) are on the classpath and the library path; otherwise your CDH3 LZO files will not be readable. Do this after installing the new version of HBase, but before starting the HBase Master or Region Servers. Proceed as follows. 1. Add the LZO JAR file from /usr/lib/hadoop-0.20/lib/ to the HBase classpath:
echo 'export HBASE_CLASSPATH=$HBASE_CLASSPATH:/usr/lib/hadoop-0.20/lib/hadoop-lzo-20101122174751.20101122171345.552b3f9.jar' >> /etc/hbase/conf/hbase-env.sh
2. Copy the native libraries from the old location to the default location specified in java.library.path. For example, on a Red Hat 6 or CentOS 6 system:
cp /usr/lib/hadoop-0.20/lib/native/Linux-amd64-64/* /usr/lib/hadoop/lib/native/
Note: The old location may be different on SLES, Ubuntu, or Debian systems.
b. Stop the cluster by shutting down the master and the region servers: a. Use the following command on the master node:
sudo service hbase-master stop
Step 2: Install the new version of HBase Note: You may want to take this opportunity to upgrade ZooKeeper, but you do not have to upgrade Zookeeper before upgrading HBase; the new version of HBase will run with the older version of Zookeeper. For instructions on upgrading ZooKeeper, see Upgrading ZooKeeper to CDH4. It is a good idea to back up the /hbase znode before proceeding. By default, this is in /var/lib/zookeeper. To install the new version of HBase, follow directions in the next section, Installing HBase. Important: During package upgrade, the package manager renames any configuration files you have modified from <file> to <file>.rpmsave, and creates a new <file> with applicable defaults. You are responsible for applying any changes captured in the original configuration file to the new configuration file. In the case of Ubuntu and Debian upgrades, you will be prompted if you have made changes to a file for which there is a new version; for details, see Automatic handling of configuration files by dpkg.
Installing HBase
To install HBase on Red Hat-compatible systems:
$ sudo yum install hbase
Note: See also Starting HBase in Standalone Mode on page 239, Configuring HBase in Pseudo-Distributed Mode on page 241, and Deploying HBase in a Distributed Cluster on page 243. To list the installed files on Ubuntu and Debian systems:
$ dpkg -L hbase
You can see that the HBase package has been configured to conform to the Linux Filesystem Hierarchy Standard. (To learn more, run man hier). You are now ready to enable the server daemons you want to use with Hadoop. You can also enable Java-based client access by adding the JAR files in /usr/lib/hbase/ and /usr/lib/hbase/lib/ to your Java class path.
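As a sketch of such a client (the table name test and column family cf are assumptions; create them first, for example from the HBase shell), a simple put-and-get round trip against the HBase 0.94 client API shipped in CDH4 might look like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
  public static void main(String[] args) throws Exception {
    // Reads hbase-site.xml from the classpath to locate the cluster.
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "test");
    // Write one cell, then read it back.
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("col1"), Bytes.toBytes("value1"));
    table.put(put);
    Result result = table.get(new Get(Bytes.toBytes("row1")));
    System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col1"))));
    table.close();
  }
}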
Configuring ulimit for HBase Cloudera recommends increasing the maximum number of file handles to more than 10,000. Note that increasing the file handles for the user who is running the HBase process is an operating system configuration, not an HBase configuration. Also, a common mistake is to increase the number of file handles for a particular user but, for whatever reason, HBase will be running as a different user. HBase prints the ulimit it is using on the first line in the logs. Make sure that it is correct. If you are using ulimit, you must make the following configuration changes: 1. In the /etc/security/limits.conf file, add the following lines:
hdfs  -  nofile  32768
hbase -  nofile  32768
Note: Only the root user can edit this file. If this change does not take effect, check other configuration files in the /etc/security/limits.d directory for lines containing the hdfs or hbase user and the nofile value. Such entries may be overriding the entries in /etc/security/limits.conf. To apply the changes in /etc/security/limits.conf on Ubuntu and Debian systems, add the following line in the /etc/pam.d/common-session file:
session required pam_limits.so
Be sure to restart HDFS after changing the value for dfs.datanode.max.xcievers. If you don't change that value as described, strange failures can occur and an error message about exceeding the number of xcievers will be added to the DataNode logs. Other error messages about missing blocks are also logged, such as:
10/12/08 20:10:31 INFO hdfs.DFSClient: Could not obtain block blk_XXXXXXXXXXXXXXXXXXXXXX_YYYYYYYY from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
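The property itself is set in hdfs-site.xml on each DataNode. A typical entry looks like the following; 4096 is a commonly used value rather than a requirement, so tune it for your workload:

<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>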
To install the HBase Master on Ubuntu and Debian systems:
$ sudo apt-get install hbase-master
On Ubuntu systems (using Debian packages) the HBase Master starts when the HBase package is installed. To verify that the standalone installation is operational, visit http://localhost:60010. The list of Region Servers at the bottom of the page should include one entry for your local machine. Note: Although you have just started the master process, in standalone mode this same process is also internally running a region server and a ZooKeeper peer. In the next section, you will break out these components into separate JVMs. If you see this message when you start the HBase standalone master:
Starting Hadoop HBase master daemon: starting master, logging to /usr/lib/hbase/logs/hbase-hbase-master/cloudera-vm.out
Couldnt start ZK at requested address of 2181, instead got: 2182. Aborting. Why?
Because clients (eg shell) wont be able to find this ZK quorum hbase-master.
you will need to stop the hadoop-zookeeper-server (or zookeeper-server) or uninstall the hadoop-zookeeper-server (or zookeeper) package. See also Accessing HBase by using the HBase Shell, Using MapReduce with HBase and Troubleshooting.
You can use the service command to run an init.d script, /etc/init.d/hbase-rest, to start the REST server; for example:
$ sudo service hbase-rest start
The script starts the server by default on port 8080. This is a commonly used port and so may conflict with other applications running on the same host. If you need to change the port for the REST server, configure it in hbase-site.xml, for example:
<property> <name>hbase.rest.port</name> <value>60050</value> </property>
Note: You can use HBASE_REST_OPTS in hbase-env.sh to pass other settings (such as heap size and GC parameters) to the REST server JVM.
Pseudo-distributed mode differs from standalone mode in that each of the component processes runs in a separate JVM.
Note: Before you start This section assumes you have already installed the HBase master and gone through the standalone configuration steps. If the HBase master is already running in standalone mode, stop it as follows before continuing with pseudo-distributed configuration: To stop the CDH3 version: sudo service hadoop-hbase-master stop, or To stop the CDH4 version if that version is already running: sudo service hbase-master stop
Note: If Kerberos is enabled, do not use commands in the form sudo -u <user> <command>; they will fail with a security error. Instead, use the following commands: $ kinit <user> (if you are using a password) or $ kinit -kt <keytab> <principal> (if you are using a keytab) and then, for each command executed by this user, $ <command>
Starting an HBase RegionServer The RegionServer is the part of HBase that actually hosts data and processes requests. The region server typically runs on all of the slave nodes in a cluster, but not the master node. To enable the HBase RegionServer On Red Hat-compatible systems:
$ sudo yum install hbase-regionserver
To start the RegionServer:
$ sudo service hbase-regionserver start
Verifying the Pseudo-Distributed Operation After you have started ZooKeeper, the Master, and a RegionServer, the pseudo-distributed cluster should be up and running. You can verify that each of the daemons is running using the jps tool from the Oracle JDK, which you can obtain from here. If you are running a pseudo-distributed HDFS installation and a pseudo-distributed HBase installation on one machine, jps will show the following output:
$ sudo jps
32694 Jps
30674 HRegionServer
29496 HMaster
28781 DataNode
28422 NameNode
30348 QuorumPeerMain
You should also be able to navigate to http://localhost:60010 and verify that the local region server has registered with the master.
See also Accessing HBase by using the HBase Shell on page 244, Using MapReduce with HBase on page 245 and Troubleshooting on page 245.
Note: Before you start This section assumes that you have already installed the HBase Master (see Installing the HBase Master on page 239) and the HBase Region Server and gone through the steps for standalone and pseudo-distributed configuration. You are now about to distribute the processes across multiple hosts; see Choosing Where to Deploy the Processes on page 244.
To start the cluster, start the services in the following order: 1. The ZooKeeper Quorum Peer 2. The HBase Master 3. Each of the HBase RegionServers After the cluster is fully started, you can view the HBase Master web interface on port 60010 and verify that each of the slave nodes has registered properly with the master. See also Accessing HBase by using the HBase Shell on page 244, Using MapReduce with HBase on page 245 and Troubleshooting on page 245. For instructions on improving the performance of local reads, see Tips and Guidelines on page 112.
Version: 0.89.20100621+17, r, Mon Jun 28 10:13:32 PDT 2010
hbase(main):001:0> status 'detailed'
version 0.89.20100621+17
0 regionsInTransition
1 live servers
    my-machine:59719 1277750189913
        requests=0, regions=2, usedHeap=24, maxHeap=995
        .META.,,1
            stores=2, storefiles=0, storefileSizeMB=0, memstoreSizeMB=0, storefileIndexSizeMB=0
        -ROOT-,,0
            stores=1, storefiles=1, storefileSizeMB=0, memstoreSizeMB=0, storefileIndexSizeMB=0
0 dead servers
This distributes the JAR files to the cluster along with your job and adds them to the job's classpath, so that you do not need to edit the MapReduce configuration. You can find more information about addDependencyJars in the documentation listed under Viewing the HBase Documentation on page 246. When getting a Configuration object for an HBase MapReduce job, instantiate it using the HBaseConfiguration.create() method.
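As a sketch of that pattern (the table name mytable and the row-counting mapper are illustrative assumptions, not part of the original example), a job setup that uses HBaseConfiguration.create() and addDependencyJars might look like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class RowCountSketch {
  // Emits one count per row key; purely illustrative.
  static class RowMapper extends TableMapper<Text, IntWritable> {
    @Override
    protected void map(ImmutableBytesWritable key, Result value, Context context)
        throws java.io.IOException, InterruptedException {
      context.write(new Text(key.get()), new IntWritable(1));
    }
  }

  public static void main(String[] args) throws Exception {
    // Picks up hbase-site.xml from the classpath.
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "hbase-rowcount-sketch");
    job.setJarByClass(RowCountSketch.class);
    TableMapReduceUtil.initTableMapperJob("mytable", new Scan(), RowMapper.class,
        Text.class, IntWritable.class, job);
    // Ships the HBase, ZooKeeper, and related JARs with the job so you do not
    // need to edit the MapReduce configuration.
    TableMapReduceUtil.addDependencyJars(job);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}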
Troubleshooting
The Cloudera HBase packages have been configured to place logs in /var/log/hbase. Cloudera recommends tailing the .log files in this directory when you start HBase to check for any error messages or failures.
HBase Replication
HBase replication provides a means of copying the data from one HBase cluster to another (typically distant) HBase cluster. It is designed for data recovery rather than failover. The cluster receiving the data from user applications is called the master cluster, and the cluster receiving the replicated data from the master is called the slave cluster.
Types of Replication
You can implement any of the following replication models:
Master-slave replication
Master-master replication
Cyclic replication
In all cases, the principle of replication is similar to that of MySQL master/slave replication in which each transaction on the master cluster is replayed on the slave cluster. In the case of HBase, the Write-Ahead Log (WAL) or HLog records all the transactions (Put/Delete) and the master cluster Region Servers ship the edits to the slave cluster Region Servers. This is done asynchronously, so having the slave cluster in a distant data center does not cause high latency at the master cluster.
Master-Slave Replication
This is the basic replication model, in which transactions on the master cluster are replayed on the slave cluster, as described above. For instructions on configuring master-slave replications, see Deploying HBase Replication.
Master-Master Replication
In this case, the slave cluster in one relationship can act as the master in a second relationship, and the slave in the second relationship can act as master in a third relationship, and so on.
Cyclic Replication
In the cyclic replication model, the slave cluster acts as master cluster for the original master. This sort of replication is useful when both the clusters are receiving data from different sources and you want each of these clusters to have the same data.
Important: The normal configuration for cyclic replication is two clusters; you can configure more, but in that case loop detection is not guaranteed.
In the case of master-master replication, you make the changes on both sides. Replication works at the table-column-family level. The family should exist on all the slaves. (You can have additional, non-replicating families on both sides.) The timestamps of the replicated HLog entries are kept intact. In case of a collision (two entries identical as to row key, column family, column qualifier, and timestamp) only the entry that arrives later will be read. Increment Column Values (ICVs) are treated as simple puts when they are replicated. In the master-master case, this may be undesirable, creating identical counters that overwrite one another. (See https://issues.apache.org/jira/browse/HBase-2804.) Make sure the master and slave clusters are time-synchronized with each other. Cloudera recommends you use Network Time Protocol (NTP).
Requirements
Before configuring replication, make sure your environment meets the following requirements:
You must manage ZooKeeper yourself. It must not be managed by HBase, and must be available throughout the deployment.
Each host in both clusters must be able to reach every other host, including those in the ZooKeeper cluster.
Both clusters should have the same HBase and Hadoop major revision. For example, having 0.90.1 on the master and 0.90.0 on the slave is supported, but 0.90.1 on one cluster and 0.89.20100725 on the other is not.
Every table that contains families that are scoped for replication must exist on each cluster and have exactly the same name.
HBase version 0.92 or greater is required for multiple slaves, master-master, or cyclic replication. This version ships with CDH4.0.0.
2. Push hbase-site.xml to all nodes.
3. Restart HBase if it was running.
4. Run the following command in the HBase master's shell while it's running:
add_peer
This will show you the help for setting up the replication stream between the clusters. The command takes the form:
add_peer '<n>', "slave.zookeeper.quorum:zookeeper.clientport:zookeeper.znode.parent"
where <n> is the peer ID; it should not be more than two characters (longer IDs may work, but have not been tested). Example:
hbase> add_peer '1', "zk.server.com:2181:/hbase"
Note: If both clusters use the same Zookeeper cluster, you need to use a different zookeeper.znode.parent for each so that they don't write to the same directory. 5. Once you have a peer, enable replication on your column families. One way to do this is to alter the table and set the scope like this:
disable 'your_table'
alter 'your_table', {NAME => 'family_name', REPLICATION_SCOPE => '1'}
enable 'your_table'
Currently, a scope of 0 (default) means that the data will not be replicated and a scope of 1 means that it will. This could change in the future. 6. To list all configured peers, run the following command in the master's shell:
list_peers
You can confirm that your setup works by looking at any Region Server's log on the master cluster; look for the lines such as the following:
Considering 1 rs, with ratio 0.1
Getting 1 rs from peer cluster # 0
Choosing peer 170.22.64.15:62020
This indicates that one Region Server from the slave cluster was chosen for replication. Deploying Master-Master or Cyclic Replication For master-master or cyclic replication, repeat the above steps on each master cluster: add the hbase.replication property and set it to true, push the resulting hbase-site.xml to all nodes of this master cluster, use add_peer to add the slave cluster, and enable replication on the column families.
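The hbase.replication property mentioned above is a simple boolean in hbase-site.xml; on each cluster taking part in replication the entry looks like this:

<property>
  <name>hbase.replication</name>
  <value>true</value>
</property>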
To re-enable peer 1:
enable_peer("1")
Warning: Do this only in case of a serious problem; it may cause data loss. To stop replication in an emergency: Open the shell on the master cluster and use the stop_replication command. For example:
hbase(main):001:0> stop_replication
Already queued edits will be replicated after you use the stop_replication command, but new entries will not. To start replication again, use the start_replication command.
Replicating Pre-existing Data in a Master-Master Deployment In the case of master-master replication, run the copyTable job before starting the replication. (If you start the job after enabling replication, the second master will re-send the data to the first master, because copyTable does not edit the clusterId in the mutation objects.) Proceed as follows:
1. Run the copyTable job and note the start timestamp of the job.
2. Start replication.
3. Run the copyTable job again with a start time equal to the start time you noted in step 1.
This results in some data being pushed back and forth between the two clusters; but it minimizes the amount of data.
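As a sketch of steps 1 and 3, a copyTable run scoped by start time might look like the following; the timestamp, the peer ZooKeeper address, and the table name are placeholders, and you should check the CopyTable usage output for the options supported by your HBase version:

$ hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1379030400000 --peer.adr=zk1.remote.example.com:2181:/hbase your_table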
The verification job takes a peer ID, a table name, and a column family. Other options allow you to specify a time range and specific families. This job's short name is verifyrep; provide that name when pointing hadoop jar to the HBase JAR file. The command has the following form:
hadoop jar $HBASE_HOME/hbase-<version>.jar verifyrep [--starttime=timestamp1] [--stoptime=timestamp2] [--families=comma separated list of families] <peerId> <tablename>
The command prints out GOODROWS and BADROWS counters; these correspond to replicated and non-replicated rows respectively.
Caveats
Two variables govern replication: hbase.replication as described above under Deploying HBase Replication, and a replication znode. Stopping replication (using stop_replication as above) sets the znode to false. Two problems can result:
If you add a new Region Server to the master cluster while replication is stopped, its current log will not be added to the replication queue, because the replication znode is still set to false. If you restart replication at this point (using start_replication), entries in the log will not be replicated.
Similarly, if a log rolls on an existing Region Server on the master cluster while replication is stopped, the new log will not be replicated, because the replication znode was set to false when the new log was created.
Loop detection is not guaranteed if you use cyclic replication among more than two clusters.
In the case of a long-running, write-intensive workload, the slave cluster may become unresponsive if its meta-handlers are blocked while performing the replication. CDH4.1 introduces three new properties to deal with this problem:
hbase.regionserver.replication.handler.count - the number of replication handlers in the slave cluster (default is 3). Replication is now handled by separate handlers in the slave cluster to avoid the above-mentioned sluggishness. Increase it to a high value if the ratio of master to slave RegionServers is high.
replication.sink.client.retries.number - the number of times the HBase replication client at the sink cluster should retry writing the WAL entries (default is 1).
replication.sink.client.ops.timeout - the timeout for the HBase replication client at the sink cluster (default is 20 seconds).
HBase Snapshots allow you to clone a table without making data copies, and with minimal impact on Region Servers. Exporting the table to another cluster should not have any impact on the region servers.
Use Cases
Recovery from user or application errors
Useful because it may be some time before the database administrator notices the error
Note: The database administrator needs to schedule the intervals at which to take and delete snapshots. Use a script or your preferred management tool for this; it is not built into HBase.
The database administrator may want to save a snapshot right before a major application upgrade or change.
Note: Snapshots are not primarily used for system upgrade protection because they would not roll back binaries, and would not necessarily be proof against bugs or errors in the system or the upgrade.
Sub-cases for recovery:
Rollback to previous snapshot and merge in reverted data
View previous snapshots and selectively merge them into production
Backup
Capture a copy of the database and store it outside HBase for disaster recovery
Capture previous versions of data for compliance, regulation, archiving
Export from snapshot on live system provides a more consistent view of HBase than Copy Table and Export Table
Audit and/or report view of data at a specific time
Capture monthly data for compliance
Use for end-of-day/month/quarter reports
Use for Application testing
Test schema or application changes on like production data from snapshot and then throw away
For example: take a snapshot; create a new table from the snapshot content (schema plus data); manipulate the new table by changing the schema, adding and removing rows, and so on (the original table, the snapshot, and the new table remain independent of each other)
Offload work
Capture, copy, and restore data to another site
Export data to another cluster
Where Snapshots Are Stored
The snapshot metadata is stored in the .snapshot directory under the hbase root directory (/hbase/.snapshot). Each snapshot has its own directory that includes all the references to the hfiles, logs, and metadata needed to restore the table.
hfiles needed by the snapshot are in the traditional /hbase/<tableName>/<regionName>/<familyName>/
location if the table is still using them; otherwise they will be placed in
/hbase/.archive/<tableName>/<regionName>/<familyName>/
Zero-copy Restore and Clone Table
From a snapshot you can create a new table (clone operation) or restore the original table. These two operations do not involve data copies; instead a link is created to point to the original hfiles. Changes to a cloned or restored table do not affect the snapshot or (in the case of a clone) the original table. If you want to clone a table to another cluster, you need to export the snapshot to the other cluster and then execute the clone operation; see Exporting a Snapshot to Another Cluster.
Reverting to a Previous HBase Version
Snapshots don't affect HBase backward compatibility if they are not used. If you do use the snapshot capability, backward compatibility is affected as follows:
If you only take snapshots, you can still go back to a previous HBase version
If you have used restore or clone, you cannot go back to a previous version unless the cloned or restored tables have no links (there is no automated way to check; you would need to inspect the file system manually).
Storage Considerations
Since the hfiles are immutable, a snapshot consists of a reference to the files that are in the table at the moment the snapshot is taken. No copies of the data are made during the snapshot operation, but copies may be made when a compaction or deletion is triggered. In this case, if a snapshot has a reference to the files to be removed, the files are moved to an archive folder, instead of being deleted. This allows the snapshot to be restored in full. Because no copies are performed, multiple snapshots share the same hfiles, but in the worst case scenario, each snapshot could have a different set of hfiles (tables with lots of updates, and compactions).
To disable snapshots after you have enabled them, set hbase.snapshot.enabled to false. Note: If you have taken snapshots and then decide to disable snapshots, you must delete the snapshots before restarting the HBase master; the HBase master will not start if snapshots are disabled and snapshots exist. Snapshots don't affect HBase performance if they are not used.
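The property lives in hbase-site.xml; for example, to keep snapshots enabled:

<property>
  <name>hbase.snapshot.enabled</name>
  <value>true</value>
</property>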
Shell Commands
You can manage snapshots by using the HBase shell or the HBaseAdmin Java API. The following table shows actions you can take from the shell:
Action: Take a snapshot of a table (offline or online)
Comments: Snapshots can be taken while a table is disabled, or while a table is online and serving traffic. If a table is disabled (via disable <table>) an offline snapshot is taken. This snapshot is driven by the master and fully consistent with the state when the table was disabled. This is the simplest and safest method, but it involves a service interruption since the table must be disabled to take the snapshot. In an online snapshot, the table remains available while the snapshot is taken, and should incur minimal noticeable performance degradation of normal read/write loads. This snapshot is coordinated by the master and run on the region servers. The current implementation - simple-flush snapshots - provides no causal consistency guarantees. Despite this shortcoming, it offers the same degree of consistency as Copy Table and overall is a huge improvement over Copy Table.

Action: Restore snapshot snapshotX (it will replace the source table content)
Shell command: restore_snapshot snapshotX
Comments: Restoring a snapshot attempts to replace the current version of a table with another version of the table. To run this command, you must disable the target table. The restore command takes a snapshot of the table (appending a timestamp code), and then essentially clones data into the original data and removes data not in the snapshot. If the operation succeeds, the target table will be enabled. Use this capability only in an emergency; see Restrictions.

Action: List all available snapshots
Shell command: list_snapshots

Action: List all available snapshots starting with mysnapshot_ (regular expression)
Shell command: list_snapshots my_snapshot_*

Action: Delete snapshot snapshotX
Shell command: delete_snapshot snapshotX

Action: Clone a snapshot into a new table
Comments: Cloning a snapshot creates a new read/write table that can serve the data kept at the time of the snapshot. The original table and the cloned table can be modified independently; new data written to one table will not show up on the other.
To export the snapshot and change the ownership of the files during the copy:
hbase class org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot MySnapshot -copy-to hdfs://srv2:8082/hbase -chuser MyUser -chgroup MyGroup -chmod 700 -mappers 16
You can also use the Java -D option in many tools to specify MapReduce or other configuration properties.
Restrictions
Warning: Do not use merge in combination with snapshots. Merging two regions can cause data loss if snapshots or cloned tables exist for this table. The merge is likely to corrupt the snapshot and any tables cloned from the snapshot. In addition, if the table has been restored from a snapshot, the merge may also corrupt the table. The snapshot may survive intact if the regions being merged are not in the snapshot, and clones may survive if they do not share files with the original table or snapshot. You can use the Snapinfo tool (see Information and Debugging on page 257) to check the status of the snapshot. If the status is BROKEN, the snapshot is unusable.
All the Masters and Region Servers must be running at least CDH4.2.
If you have enabled the AccessController Coprocessor for HBase, only a global administrator can take, clone, or restore a snapshot, and these actions do not capture the ACL rights. This means that restoring a table preserves the ACL rights of the existing table, while cloning a table creates a new table that has no ACL rights until the administrator adds them.
Do not take, clone, or restore a snapshot during a rolling restart. Snapshots rely on the Region Servers being up; otherwise the snapshot will fail. Note: This restriction also applies to rolling upgrade, which can currently be done only via Cloudera Manager.
If you are using HBase Replication and you need to restore a snapshot, be aware that the replicas will be out of sync when you do so. Do this only in an emergency.
Important: Snapshot restore is an emergency tool; you need to disable the table and table replication to get to an earlier state, and you may lose data in the process. If you need to restore a snapshot, proceed as follows:
1. Disable the table that is the restore target, and stop the replication.
2. Remove the table from both the master and slave clusters.
3. Restore the snapshot on the master cluster.
4. Create the table on the slave cluster and use Copy Table to initialize it.
Note: If this is not an emergency (for example, if you know that you have lost just a set of rows such as the rows starting with "xyz"), you can create a clone from the snapshot and create a MapReduce job to copy the data that you've lost. In this case you don't need to stop replication or disable your main table.
Snapshot Failures
Region moves, splits, and other metadata actions that happen while a snapshot is in progress will probably cause the snapshot to fail; the software detects and rejects corrupted snapshot attempts.
ZooKeeper Installation
Note: Running Services When starting, stopping and restarting CDH components, always use the service (8) command rather than running scripts in /etc/init.d directly. This is important because service sets the current working directory to / and removes most environment variables (passing only LANG and TERM) so as to create a predictable environment in which to administer the service. If you run the scripts in /etc/init.d, any environment variables you have set remain in force, and could produce unpredictable results. (If you install CDH from packages, service will be installed as part of the Linux Standard Base (LSB).) Apache ZooKeeper is a highly reliable and available service that provides coordination between distributed processes. Note: For More Information From the Apache ZooKeeper site: "ZooKeeper is a high-performance coordination service for distributed applications. It exposes common services such as naming, configuration management, synchronization, and group services - in a simple interface so you don't have to write them from scratch. You can use it off-the-shelf to implement consensus, group management, leader election, and presence protocols. And you can build on it for your own, specific needs." To learn more about Apache ZooKeeper, visit http://zookeeper.apache.org/.
Note: To see which version of ZooKeeper is shipping in CDH4, check the Version and Packaging Information. For important information on new and changed components, see the CDH4 Release Notes. Use the following sections to install, upgrade and administer ZooKeeper: Upgrading from CDH3 Upgrading from an Earlier CDH4 Release Installing ZooKeeper Maintaining the Server Apache ZooKeeper Documentation
To upgrade ZooKeeper from CDH3 to CDH4, uninstall the CDH3 version (if you have not already done so) and then install the CDH4 version. Proceed as follows. Note: If you have already performed the steps to uninstall CDH3 described under Upgrading from CDH3 to CDH4, you can skip Step 1 below and proceed with Step 2.
1. Stop the ZooKeeper server:
$ sudo service hadoop-zookeeper-server stop
or
$ sudo service hadoop-zookeeper stop
depending on the platform and release. 2. Remove CDH3 ZooKeeper To remove ZooKeeper on Red Hat-compatible systems:
$ sudo yum remove hadoop-zookeeper-server $ sudo yum remove hadoop-zookeeper
Warning: If you are upgrading an Ubuntu or Debian system from CDH3u3 or earlier, you must use apt-get purge (rather than apt-get remove) to make sure the re-install succeeds, but be aware that apt-get purge removes all your configuration data. If you have modified any configuration files, DO NOT PROCEED before backing them up. To remove ZooKeeper on SLES systems:
$ sudo zypper remove hadoop-zookeeper-server $ sudo zypper remove hadoop-zookeeper
Important: During uninstall, the package manager renames any configuration files you have modified from <file> to <file>.rpmsave. During re-install, the package manager creates a new <file> with applicable defaults. You are responsible for applying any changes captured in the original CDH3 configuration file to the new CDH4 configuration file. In the case of Ubuntu and Debian upgrades, a file will not be installed if there is already a version of that file on the system, and you will be prompted to resolve conflicts; for details, see Automatic handling of configuration files by dpkg.
In any case, after installing the new CDH4 ZooKeeper packages, verify that the dataDir (and potentially dataLogDir) specified in the CDH4 /etc/zookeeper/conf/zoo.cfg point to a valid ZooKeeper directory. If you previously modified your CDH3 zoo.cfg configuration file (/etc/zookeeper.dist/zoo.cfg), RPM uninstall and re-install (using yum remove as in Step 1) renames and preserves a copy of your modified zoo.cfg as /etc/zookeeper.dist/zoo.cfg.rpmsave. You should compare this to the new /etc/zookeeper/conf/zoo.cfg and resolve any differences that should be carried forward (typically where you have changed property value defaults). If your CDH3 zoo.cfg file has not been modified since installation, it will be auto-deleted when the CDH3 ZooKeeper package is removed.
then restarting the server. The server will automatically rejoin the quorum, update its internal state with the current ZooKeeper leader, and begin serving client sessions. This method allows you to upgrade ZooKeeper without any interruption in the service, and also lets you monitor the ensemble as the upgrade progresses, and roll back if necessary if you run into problems. The instructions that follow assume that you are upgrading ZooKeeper as part of a CDH4 upgrade, and have already performed the steps under Upgrading from an Earlier CDH4 Release.
Step 2: Install the ZooKeeper Base Package on the First Node See Installing the ZooKeeper Base Package. Step 3: Install the ZooKeeper Server Package on the First Node See Installing the ZooKeeper Server Package. Important: During package upgrade, the package manager renames any configuration files you have modified from <file> to <file>.rpmsave, and creates a new <file> with applicable defaults. You are responsible for applying any changes captured in the original configuration file to the new configuration file. In the case of Ubuntu and Debian upgrades, you will be prompted if you have made changes to a file for which there is a new version; for details, see Automatic handling of configuration files by dpkg.
Step 4: Restart the Server See Installing the ZooKeeper Server Package for instructions on starting the server. The upgrade is now complete on this server and you can proceed to the next. Step 5: Upgrade the Remaining Nodes Repeat Steps 1-4 above on each of the remaining nodes. The ZooKeeper upgrade is now complete.
The zookeeper-server package contains the init.d scripts necessary to run ZooKeeper as a daemon process. Because zookeeper-server depends on zookeeper, installing the server package automatically installs the base package. Note: If you have not already done so, install Cloudera's yum, zypper/YaST or apt repository before using the following commands to install ZooKeeper. For instructions, see CDH4 Installation.
Installing the ZooKeeper Server Package and Starting ZooKeeper on a Single Server
The instructions provided here deploy a single ZooKeeper server in "standalone" mode. This is appropriate for evaluation, testing and development purposes, but may not provide sufficient reliability for a production application. See Installing ZooKeeper in a Production Environment for more information. To install the ZooKeeper Server On Red Hat-compatible systems:
$ sudo yum install zookeeper-server
To start ZooKeeper Note: ZooKeeper may start automatically on installation on Ubuntu and other Debian systems. This automatic start will happen only if the data directory exists; otherwise you will be prompted to initialize as shown below. To start ZooKeeper after an upgrade from CDH3 or CDH4:
Important: If you are upgrading from CDH3, do not proceed with restarting the server until you have completed Step 4 of the upgrade procedure.
Note: If you are deploying multiple ZooKeeper servers after a fresh install, you need to create a myid file in the data directory. You can do this by means of an init command option:
$ sudo service zookeeper-server init --myid=1
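For example, on a hypothetical three-server ensemble you would run the init command once on each server, giving each server the ID that matches its server.N entry in zoo.cfg:
# On the first server
$ sudo service zookeeper-server init --myid=1
# On the second server
$ sudo service zookeeper-server init --myid=2
# On the third server
$ sudo service zookeeper-server init --myid=3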
Although ZooKeeper is highly reliable because a persistent copy is replicated on each server, recovering from backups may still be necessary if a catastrophic failure or user error occurs. When you use the default configuration, the ZooKeeper server does not remove the snapshots and log files, so they will accumulate over time. You will need to clean up this directory occasionally, taking into account your backup schedules and processes. To automate the cleanup, a zkCleanup.sh script is provided in the bin directory of the zookeeper base package. Modify this script as necessary for your situation. In general, you want to run this as a cron task based on your backup schedule. The data directory is specified by the dataDir parameter in the ZooKeeper configuration file, and the data log directory is specified by the dataLogDir parameter. For more information, see Ongoing Data Directory Cleanup.
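As a sketch only (assuming the script's usual CDH4 location under /usr/lib/zookeeper/bin and a policy of keeping the five most recent snapshots), a nightly crontab entry might look like this:
# Hypothetical crontab entry: purge old snapshots and transaction logs at 3 a.m.,
# retaining the five most recent snapshots
0 3 * * * /usr/lib/zookeeper/bin/zkCleanup.sh -n 5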
Whirr Installation
Apache Whirr is a set of libraries for running cloud services. You can use Whirr to run CDH4 clusters on cloud providers' infrastructure, such as Amazon Elastic Compute Cloud (Amazon EC2). There's no need to install the RPMs for CDH4 or do any configuration; a working cluster will start immediately with one command. It's ideal for running temporary Hadoop clusters to carry out a proof of concept, or to run a few one-time jobs. When you are finished, you can destroy the cluster and all of its data with one command.
Important: If you have not already done so, install Cloudera's yum, zypper/YaST or apt repository before using the following commands to install or update Whirr. For instructions, see CDH4 Installation.
Use the following sections to install, upgrade, and deploy Whirr: Upgrading Whirr, Installing Whirr, Generating an SSH Key Pair, Defining a Cluster, Launching a Cluster, and Apache Whirr Documentation.
Upgrading Whirr
Note: To see which version of Whirr is shipping in CDH4, check the Version and Packaging Information. For important information on new and changed components, see the CDH4 Release Notes.
Step 1: Remove Whirr
1. Stop the Whirr proxy. Kill the hadoop-proxy.sh process by pressing Control-C.
2. Destroy the Cluster. Whirr clusters are normally short-lived. If you have a running cluster, destroy it: see Destroying a cluster.
3. Uninstall the CDH3 version of Whirr:
On Red Hat-compatible systems:
$ sudo yum remove whirr
On SLES systems:
$ sudo zypper remove whirr
Warning: If you are upgrading an Ubuntu or Debian system from CDH3u3 or earlier, you must use apt-get purge (rather than apt-get remove) to make sure the re-install succeeds, but be aware that apt-get purge removes all your configuration data. If you have modified any configuration files, DO NOT PROCEED before backing them up.
4. Update the Properties File. Edit the configuration file, called hadoop.properties in these instructions, and save it. For Hadoop, configure the following properties as shown:
For MRv1:
whirr.env.repo=cdh4
whirr.hadoop.install-function=install_cdh_hadoop
whirr.hadoop.configure-function=configure_cdh_hadoop
For YARN: see Defining a Cluster. For HBase, configure the following properties as shown:
whirr.env.repo=cdh4
whirr.hadoop.install-function=install_cdh_hadoop
whirr.hadoop.configure-function=configure_cdh_hadoop
whirr.hbase.install-function=install_cdh_hbase
whirr.hbase.configure-function=configure_cdh_hbase
whirr.zookeeper.install-function=install_cdh_zookeeper
whirr.zookeeper.configure-function=configure_cdh_zookeeper
See Defining a Whirr Cluster for a sample file. Important: If you are upgrading from Whirr version 0.3.0, and are using an explicit image (AMI), make sure it comes from one of the supplied Whirr recipe files.
Step 2: Install the new Version See the next section, Installing Whirr.
The upgrade is now complete. For more information, see Defining a Whirr Cluster, Launching a Cluster, and Viewing the Whirr Documentation.
Upgrading Whirr from an Earlier CDH4 Release to the Latest CDH4 Release
Step 1: Stop the Whirr Proxy
Kill the hadoop-proxy.sh process by pressing Control-C.
Step 2: Destroy the Cluster
Whirr clusters are normally short-lived. If you have a running cluster, destroy it: see Destroying a cluster.
Step 3: Install the New Version of Whirr
See Installing Whirr.
The upgrade is now complete. For more information, see Defining a Whirr Cluster, Launching a Cluster, and Viewing the Whirr Documentation.
Installing Whirr
To install Whirr on an Ubuntu or other Debian system:
$ sudo apt-get install whirr
To install Whirr on another system: Download a Whirr tarball from here.
To verify Whirr is properly installed:
$ whirr version
Note: If you specify a non-standard location for the key files in the ssh-keygen command (that is, not ~/.ssh/id_rsa), then you must specify the location of the private key file in the whirr.private-key-file property and the public key file in the whirr.public-key-file property. For more information, see the next section.
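For example (the key path below is only illustrative), you could generate a dedicated, passphrase-less key pair and point the two properties at it:
# Generate an RSA key pair with no passphrase in a non-default location
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa_whirr
# Then reference it in hadoop.properties:
#   whirr.private-key-file=${sys:user.home}/.ssh/id_rsa_whirr
#   whirr.public-key-file=${sys:user.home}/.ssh/id_rsa_whirr.pub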
MRv1 Cluster
The following file defines a cluster with a single machine for the NameNode and JobTracker, and another machine for a DataNode and TaskTracker.
whirr.cluster-name=myhadoopcluster
whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,1 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
whirr.identity=<cloud-provider-identity>
whirr.credential=<cloud-provider-credential>
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub
whirr.env.repo=cdh4
whirr.hadoop-install-function=install_cdh_hadoop
whirr.hadoop-configure-function=configure_cdh_hadoop
whirr.hardware-id=m1.large
whirr.image-id=us-east-1/ami-ccb35ea5
whirr.location-id=us-east-1
YARN Cluster
The following configuration provides the essentials for a YARN cluster. Change the number of instances for hadoop-datanode+yarn-nodemanager from 2 to a larger number if you need to.
whirr.cluster-name=myhadoopcluster
whirr.instance-templates=1 hadoop-namenode+yarn-resourcemanager+mapreduce-historyserver,2 hadoop-datanode+yarn-nodemanager
whirr.provider=aws-ec2
whirr.identity=<cloud-provider-identity>
whirr.credential=<cloud-provider-credential>
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub
whirr.env.mapreduce_version=2
whirr.env.repo=cdh4
whirr.hadoop.install-function=install_cdh_hadoop
whirr.hadoop.configure-function=configure_cdh_hadoop
whirr.mr_jobhistory.start-function=start_cdh_mr_jobhistory
whirr.yarn.configure-function=configure_cdh_yarn
whirr.yarn.start-function=start_cdh_yarn
whirr.hardware-id=m1.large
whirr.image-id=us-east-1/ami-ccb35ea5
whirr.location-id=us-east-1
Launching a Cluster
To launch a cluster:
$ whirr launch-cluster --config hadoop.properties
As the cluster starts up, messages are displayed in the console. You can see debug-level log messages in a file named whirr.log in the directory where you ran the whirr command. After the cluster has started, a message appears in the console showing the URL you can use to access the web UI for Whirr.
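Whirr also writes a proxy script for the cluster under your home directory. If you want to route client traffic through the cluster's SOCKS proxy, a typical way to start it (the path assumes the cluster name myhadoopcluster used in the examples above) is shown below; press Control-C to stop it later:
$ . ~/.whirr/myhadoopcluster/hadoop-proxy.sh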
2. If you are using an Ubuntu, Debian, or SLES system, type these commands:
$ sudo update-alternatives --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.whirr 50
$ update-alternatives --display hadoop-conf
For YARN:
$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
$ hadoop fs -mkdir input
$ hadoop fs -put $HADOOP_MAPRED_HOME/CHANGES.txt input
$ hadoop jar $HADOOP_MAPRED_HOME/hadoop-mapreduce-examples.jar wordcount input output
$ hadoop fs -cat output/part-* | head
Destroying a cluster
When you are finished using a cluster, you can terminate the instances and clean up the resources using the commands shown in this section.
Warning: All data will be deleted when you destroy the cluster.
To destroy a cluster:
1. Run the following command to destroy a cluster:
$ whirr destroy-cluster --config hadoop.properties
2. Shut down the SSH proxy to the cluster if you started one earlier.
Snappy Installation
Snappy is a compression/decompression library. It aims for very high speeds and reasonable compression, rather than maximum compression or compatibility with other compression libraries. Use the following sections to install, upgrade, and use Snappy: Upgrading Snappy, Installing Snappy, Using Snappy for MapReduce Compression, Using Snappy for Pig Compression, Using Snappy for Hive Compression, Using Snappy Compression in Sqoop Imports, Using Snappy Compression with HBase, and Apache Snappy Documentation.
Snappy is provided in the hadoop package along with the other native libraries (such as native gzip compression). To take advantage of Snappy compression you need to set certain configuration properties, which are explained in the following sections.
For MRv1:
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
For YARN:
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
You can also set these properties on a per-job basis. Use the properties in the following table to compress the final output of a MapReduce job. These are usually set on a per-job basis.

MRv1 Property: mapred.output.compress
YARN Property: mapreduce.output.fileoutputformat.compress
Description: Whether the final job outputs should be compressed.

MRv1 Property: mapred.output.compression.codec
YARN Property: mapreduce.output.fileoutputformat.compress.codec
Description: If the final job outputs are to be compressed, which codec should be used. Set to org.apache.hadoop.io.compress.SnappyCodec for Snappy compression.

MRv1 Property: mapred.output.compression.type
YARN Property: mapreduce.output.fileoutputformat.compress.type
Description: For SequenceFile outputs, what type of compression should be used (NONE, RECORD, or BLOCK). BLOCK is recommended.
Note: The MRv1 property names are also supported (though deprecated) in MRv2 (YARN), so it's not mandatory to update them in this release.
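As an illustrative sketch (the jar path shown is the usual CDH4 MRv1 location and may differ on your system), you could enable Snappy compression of map output for a single MRv1 job from the command line:
# Hypothetical per-job settings passed as -D options to the wordcount example
$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar wordcount \
    -Dmapred.compress.map.output=true \
    -Dmapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
    input output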
It is a good idea to use the --as-sequencefile option with this compression option.
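A sketch of such an import follows; the connection string and table name are placeholders:
$ sqoop import --connect jdbc:mysql://db.example.com/sales --table orders \
    --as-sequencefile \
    --compression-codec org.apache.hadoop.io.compress.SnappyCodec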
Mahout Installation
Apache Mahout is a machine-learning tool. By enabling you to build machine-learning libraries that are scalable to "reasonably large" datasets, it aims to make building intelligent applications easier and faster. The main use cases for Mahout are:
Recommendation mining, which tries to identify things users will like on the basis of their past behavior (for example shopping or online-content recommendations)
Clustering, which groups similar items (for example, documents on similar topics)
Classification, which learns from existing categories what members of each category have in common, and on that basis tries to categorize new items
Frequent item-set mining, which takes a set of item-groups (such as terms in a query session, or shopping-cart content) and identifies items that usually appear together
Important: If you have not already done so, install Cloudera's yum, zypper/YaST or apt repository before using the instructions below to install Mahout. For instructions, see CDH4 Installation.
Use the following sections to install, update and use Mahout: Upgrading Mahout, Installing Mahout, The Mahout Executable, Getting Started, and The Apache Mahout Wiki.
Upgrading Mahout
Note: To see which version of Mahout is shipping in CDH4, check the Version and Packaging Information. For important information on new and changed components, see the CDH4 Release Notes.
To remove Mahout on a SLES system:
$ sudo zypper remove mahout
Warning: If you are upgrading an Ubuntu or Debian system from CDH3u3 or earlier, you must use apt-get purge (rather than apt-get remove) to make sure the re-install succeeds, but be aware that apt-get purge removes all your configuration data. If you have modified any configuration files, DO NOT PROCEED before backing them up.
Step 2: Install CDH4 Mahout
See Installing Mahout.
Important: During uninstall, the package manager renames any configuration files you have modified from <file> to <file>.rpmsave. During re-install, the package manager creates a new <file> with applicable defaults. You are responsible for applying any changes captured in the original CDH3 configuration file to the new CDH4 configuration file. In the case of Ubuntu and Debian upgrades, a file will not be installed if there is already a version of that file on the system, and you will be prompted to resolve conflicts; for details, see Automatic handling of configuration files by dpkg.
Upgrading Mahout from an Earlier CDH4 Release to the Latest CDH4 Release
To upgrade Mahout to the latest release, simply install the new version; see Installing Mahout. Important: During package upgrade, the package manager renames any configuration files you have modified from <file> to <file>.rpmsave, and creates a new <file> with applicable defaults. You are responsible for applying any changes captured in the original configuration file to the new configuration file. In the case of Ubuntu and Debian upgrades, you will be prompted if you have made changes to a file for which there is a new version; for details, see Automatic handling of configuration files by dpkg.
Installing Mahout
You can install Mahout from an RPM or Debian package, or from a tarball. Installing from packages is more convenient than installing the tarball because the packages:
Handle dependencies
Provide for easy upgrades
Automatically install resources to conventional locations
These instructions assume that you will install from packages if possible.
To install Mahout on a Red Hat system:
$ sudo yum install mahout
To install Mahout on a system for which packages are not available: Download a Mahout tarball from here
HttpFS Installation
Note: Running Services
When starting, stopping and restarting CDH components, always use the service(8) command rather than running scripts in /etc/init.d directly. This is important because service sets the current working directory to / and removes most environment variables (passing only LANG and TERM) so as to create a predictable environment in which to administer the service. If you run the scripts in /etc/init.d, any environment variables you have set remain in force, and could produce unpredictable results. (If you install CDH from packages, service will be installed as part of the Linux Standard Base (LSB).)
Use the following sections to install and configure HttpFS: About HttpFS, Packaging, Prerequisites, Installing HttpFS, Configuring HttpFS, Starting the Server, Stopping the Server, and Using the Server with curl.
About HttpFS
Apache Hadoop HttpFS is a service that provides HTTP access to HDFS. HttpFS has a REST HTTP API supporting all HDFS File System operations (both read and write). Common HttpFS use cases are: Read and write data in HDFS using HTTP utilities (such as curl or wget) and HTTP libraries from languages other than Java (such as Perl). Transfer data between HDFS clusters running different versions of Hadoop (overcoming RPC versioning issues), for example using Hadoop DistCp. Read and write data in HDFS in a cluster behind a firewall. (The HttpFS server acts as a gateway and is the only system that is allowed to send and receive data through the firewall). HttpFS supports Hadoop pseudo-authentication, HTTP SPNEGO Kerberos, and additional authentication mechanisms via a plugin API. HttpFS also supports Hadoop proxy user functionality. The webhdfs client file system implementation can access HttpFS via the Hadoop filesystem command (hadoop fs), by using Hadoop DistCp, and from Java applications using the Hadoop file system Java API. The HttpFS HTTP REST API is interoperable with the WebHDFS REST HTTP API. For more information about HttpFS, see http://archive.cloudera.com/cdh4/cdh/4/hadoop/hadoop-hdfs-httpfs/index.html.
HttpFS Packaging
There are two packaging options for installing HttpFS: The hadoop-httpfs RPM package The hadoop-httpfs Debian package You can also download a Hadoop tarball, which includes HttpFS, from here.
HttpFS Prerequisites
Prerequisites for installing HttpFS are: A Unix-like system: see CDH4 Requirements and Supported Versions for details Java: see Java Development Kit Installation for details Note: To see which version of HttpFS is shipping in CDH4, check the Version and Packaging Information. For important information on new and changed components, see the CDH4 Release Notes. CDH4 Hadoop works with the CDH4 version of HttpFS.
Installing HttpFS
HttpFS is distributed in the hadoop-httpfs package. To install it, use your preferred package manager application. Install the package on the system that will run the HttpFS server. Important: If you have not already done so, install Cloudera's Yum, zypper/YaST or Apt repository before using the following commands to install HttpFS. For instructions, see CDH4 Installation. To install the HttpFS package on a Red Hat-compatible system:
$ sudo yum install hadoop-httpfs
Note: Installing the hadoop-httpfs package creates an httpfs service configured to start HttpFS at system startup time.
You are now ready to configure HttpFS. See the next section.
Configuring HttpFS
When you install HttpFS from an RPM or Debian package, HttpFS creates all configuration, documentation, and runtime files in the standard Unix directories, as follows.

Type of File     Where Installed
Binaries         /usr/lib/hadoop-httpfs/
Configuration    /etc/hadoop-httpfs/conf/
Documentation    /usr/share/doc/packages/hadoop-httpfs/ (on SLES)
Data             /var/lib/hadoop-httpfs/
Logs             /var/log/hadoop-httpfs/
Temp             /var/tmp/hadoop-httpfs/
PID file         /var/run/hadoop-httpfs/
If you see the message Server httpfs started!, status NORMAL in the httpfs.log log file, the system has started successfully. Note: By default, HttpFS server runs on port 14000 and its URL is http://<HTTPFS_HOSTNAME>:14000/webhdfs/v1.
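For example, a request for a user's home directory might look like the following; the hostname is a placeholder, and the user name babu matches the sample output below:
$ curl -i "http://<HTTPFS_HOSTNAME>:14000/webhdfs/v1?op=gethomedirectory&user.name=babu"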
You should see output such as this:
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Set-Cookie: hadoop.auth="u=babu&p=babu&t=simple&e=1332977755010&s=JVfT4T785K4jeeLNWXK68rc/0xI="; Version=1; Path=/
Content-Type: application/json
Transfer-Encoding: chunked
Date: Wed, 28 Mar 2012 13:35:55 GMT

{"Path":"\/user\/babu"}
See the WebHDFS REST API web page for complete documentation of the API.
Avro Usage
Apache Avro is a serialization system. Avro supports rich data structures, a compact binary encoding, and a container file for sequences of Avro data (often referred to as "Avro data files"). Avro is designed to be language-independent and there are several language bindings for it, including Java, C, C++, Python, and Ruby. Avro does not rely on generated code, which means that processing data imported from Flume or Sqoop is simpler than using Hadoop Writables in Sequence Files, where you have to take care that the generated classes are on the processing job's classpath. Furthermore, Pig and Hive cannot easily process Sequence Files with custom Writables, so users often revert to using text, which has disadvantages from a compactness and compressibility point of view (compressed text is not generally splittable, making it difficult to process efficiently using MapReduce). All components in CDH4 that produce or consume files support Avro data files as a file format. But bear in mind that because uniform Avro support is new, there may be some rough edges or missing features. The following sections contain brief notes on how to get started using Avro in the various CDH4 components: Avro Data Files, Compression, Flume, Sqoop, MapReduce, Streaming, Pig, Hive, and Avro Tools.
Compression
By default Avro data files are not compressed, but it is generally advisable to enable compression to reduce disk usage and increase read and write performance. Avro data files support Deflate and Snappy compression. Snappy is faster, while Deflate is slightly more compact. You do not need to do any additional configuration to read a compressed Avro data file rather than an uncompressed one. However, to write an Avro data file you need to specify the type of compression to use. How you specify compression depends on the component being used, as explained in the sections below.
Flume
The HDFSEventSink that is used to serialize event data onto HDFS supports plugin implementations of the EventSerializer interface. Implementations of this interface have full control over the serialization format and can be used in cases where the default serialization format provided by the sink does not suffice. An abstract implementation of the EventSerializer interface, called AbstractAvroEventSerializer, is provided with Flume. This class can be extended to support custom schemas for Avro serialization over HDFS. A simple implementation that maps the events to a representation of a String header map and byte payload in Avro is provided by the class FlumeEventAvroEventSerializer, which can be used by setting the serializer property of the sink as follows:
<agent-name>.sinks.<sink-name>.serializer = AVRO_EVENT
Sqoop
On the command line, use the following option to import to Avro data files:
--as-avrodatafile
Sqoop will automatically generate an Avro schema that corresponds to the database table being imported. To enable Snappy compression, add the following option:
--compression-codec snappy
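Putting the two options together, a hypothetical import command might be:
$ sqoop import --connect jdbc:postgresql://db.example.com/corp --table customers \
    --as-avrodatafile --compression-codec snappy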
MapReduce
The Avro MapReduce API is an Avro module for running MapReduce programs which produce or consume Avro data files. If you are using Maven, simply add the following dependency to your POM:
<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-mapred</artifactId>
  <version>1.7.3</version>
  <classifier>hadoop2</classifier>
</dependency>
Then write your program using the Avro MapReduce javadoc for guidance. At runtime, include the avro and avro-mapred JARs in the HADOOP_CLASSPATH; and the avro, avro-mapred and paranamer JARs in -libjars. To enable Snappy compression on output files call AvroJob.setOutputCodec(job, "snappy") when configuring the job. You will also need to include the snappy-java JAR in -libjars.
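The following sketch shows what a typical invocation might look like. The job JAR, driver class, and Avro JAR locations are placeholders, and the driver is assumed to use ToolRunner so that -libjars is honored:
# Make the Avro classes available to the client; paths are illustrative
$ export HADOOP_CLASSPATH=avro-1.7.3.jar:avro-mapred-1.7.3-hadoop2.jar
$ hadoop jar my-avro-job.jar com.example.MyAvroDriver \
    -libjars avro-1.7.3.jar,avro-mapred-1.7.3-hadoop2.jar,paranamer-2.3.jar \
    input output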
Streaming
To read from Avro data files from a streaming program, specify org.apache.avro.mapred.AvroAsTextInputFormat as the input format. This input format will convert each datum in the Avro data file to a string. For a "bytes" schema, this will be the raw bytes, while in the general case it will be a single-line JSON representation of the datum. To write to Avro data files from a streaming program, specify org.apache.avro.mapred.AvroTextOutputFormat as the output format. This output format will create Avro data files with a "bytes" schema, where each datum is a tab-delimited key-value pair. At runtime specify the avro, avro-mapred and paranamer JARs in -libjars in the streaming command. To enable Snappy compression on output files, set the property avro.output.codec to snappy. You will also need to include the snappy-java JAR in -libjars.
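A hedged example of such a streaming job follows; the streaming jar path is the usual CDH4 YARN location, and the mapper and reducer scripts and Avro JAR paths are placeholders:
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -libjars avro-1.7.3.jar,avro-mapred-1.7.3-hadoop2.jar,paranamer-2.3.jar \
    -input mydata.avro -output out \
    -inputformat org.apache.avro.mapred.AvroAsTextInputFormat \
    -mapper mymapper.py -reducer myreducer.py \
    -file mymapper.py -file myreducer.py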
Pig
CDH provides AvroStorage for Avro integration in Pig. To use it, first register the piggybank JAR file and supporting libraries:
REGISTER piggybank.jar
REGISTER lib/avro-1.7.3.jar
REGISTER lib/json-simple-1.1.jar
REGISTER lib/snappy-java-1.0.4.1.jar
Pig maps the Avro schema to a corresponding Pig schema. You can store data in Avro data files with:
store b into 'output' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
In the case of store, Pig generates an Avro schema from the Pig schema. It is possible to override the Avro schema, either by specifying it literally as a parameter to AvroStorage, or by using the same schema as an existing Avro data file. See the Pig wiki for details. To store two relations in one script, specify an index to each store function. Here is an example:
set1 = load 'input1.txt' using PigStorage() as ( ... );
store set1 into 'set1' using org.apache.pig.piggybank.storage.avro.AvroStorage('index', '1');
set2 = load 'input2.txt' using PigStorage() as ( ... );
store set2 into 'set2' using org.apache.pig.piggybank.storage.avro.AvroStorage('index', '2');
For more information, see the AvroStorage wiki; look for "index". To enable Snappy compression on output files do the following before issuing the STORE statement:
SET mapred.output.compress true
SET mapred.output.compression.codec org.apache.hadoop.io.compress.SnappyCodec
SET avro.output.codec snappy
There is some additional documentation on the Pig wiki. Note, however, that the version numbers of the JAR files to register are different on that page, so you should adjust them as shown above.
Hive
The following example demonstrates how to create a Hive table that is backed by Avro data files:
CREATE TABLE doctors
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='{
  "namespace": "testing.hive.avro.serde",
  "name": "doctors",
  "type": "record",
  "fields": [
    { "name":"number", "type":"int", "doc":"Order of playing the role" },
    { "name":"first_name", "type":"string", "doc":"first name of actor playing role" },
    { "name":"last_name", "type":"string", "doc":"last name of actor playing role" },
    { "name":"extra_field", "type":"string", "doc":"an extra field not in the original file", "default":"fishfingers and custard" }
  ]
}');

LOAD DATA LOCAL INPATH '/usr/share/doc/hive-0.7.1+42.55/examples/files/doctors.avro' INTO TABLE doctors;
You could also create an Avro-backed Hive table by using an Avro schema file:
CREATE TABLE my_avro_table(notused INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
  'avro.schema.url'='file:///tmp/schema.avsc')
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';
The avro.schema.url value is a URL (here a file:// URL) pointing to an Avro schema file that is used for reading and writing. It could also be an HDFS URL, for example hdfs://hadoop-namenode-uri/examplefile. To enable Snappy compression on output files, run the following before writing to the table:
SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;
You will also need to include the snappy-java JAR in --auxpath. The snappy-java JAR is located at:
/usr/lib/hive/lib/snappy-java-1.0.4.1.jar
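For example, you might start the Hive shell as follows so that the codec is on the auxiliary path:
$ hive --auxpath /usr/lib/hive/lib/snappy-java-1.0.4.1.jar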
The Haivvreo SerDe has been merged into Hive as AvroSerDe, and it is no longer supported in its original form. schema.url and schema.literal have been changed to avro.schema.url and avro.schema.literal as a result of the merge. If you were using the Haivvreo SerDe, you can use the new Hive AvroSerDe with tables created with the Haivvreo SerDe. For example, if you have a table my_avro_table that uses the Haivvreo SerDe, you can do the following to make the table use the new AvroSerDe:
ALTER TABLE my_avro_table SET SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe';
ALTER TABLE my_avro_table SET FILEFORMAT
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';
Avro Tools
Avro provides a set of tools for working with Avro data files and schemas. The tools are not (currently) packaged with CDH, but you can download the tools JAR from an Apache mirror, and run it as follows to get a list of commands:
java -jar avro-tools-1.7.3.jar
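For example, two commonly used subcommands print a data file's schema and dump its records as JSON; the file name here is a placeholder:
$ java -jar avro-tools-1.7.3.jar getschema mydata.avro
$ java -jar avro-tools-1.7.3.jar tojson mydata.avro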
See also RecordBreaker for information on turning text data into structured Avro data.
Sentry Installation
Sentry enables role-based, fine-grained authorization for HiveServer2 and Cloudera Impala. It provides classic database-style authorization for Hive and Impala. For more information, and instructions on configuring Sentry for Hive, see Configuring Sentry.
Installing Sentry
Important: If you have not already done so, install Cloudera's yum, zypper/YaST or apt repository before using the following commands. For instructions, see CDH4 Installation. 1. Install Sentry as follows, depending on your operating system: On Red Hat and similar systems:
$ sudo yum install sentry
On SLES systems:
$ sudo zypper install sentry
Start the services in the following order.

1. ZooKeeper. Cloudera recommends starting ZooKeeper before starting HDFS; this is a requirement in a high-availability (HA) deployment. In any case, always start ZooKeeper before HBase. For instructions, see Installing the ZooKeeper Server Package and Starting ZooKeeper on a Single Server; Installing ZooKeeper in a Production Environment; HDFS High Availability Initial Deployment; Configuring High Availability for the JobTracker (MRv1).
2. HDFS. Start HDFS before all other services except ZooKeeper. If you are using HA, see the CDH4 High Availability Guide for instructions. For instructions, see Deploying HDFS on a Cluster; Configuring HDFS High Availability.
3. HttpFS. For instructions, see HttpFS Installation.
4a. MRv1. Start MapReduce before Hive or Oozie. Do not start MRv1 if YARN is running. For instructions, see Deploying MapReduce v1 (MRv1) on a Cluster; Configuring High Availability for the JobTracker (MRv1).
4b. YARN. Start YARN before Hive or Oozie. Do not start YARN if MRv1 is running. For instructions, see Deploying MapReduce v2 (YARN) on a Cluster.
5. HBase.
6. Hive. Start the Hive metastore before starting HiveServer2 and the Hive console. For instructions, see Installing Hive on page 209.
7. Oozie.
8. Flume 1.x. For instructions, see Running Flume.
9. Sqoop.
10. Hue. For instructions, see Hue Installation.
To start system services at boot time and on restarts, enable their init scripts on the systems on which the services will run, using the appropriate tool: chkconfig is included in the Red Hat and CentOS distributions. Debian and Ubuntu users can install the chkconfig package. update-rc.d is included in the Debian and Ubuntu distributions.
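For example, to have the HDFS NameNode start at boot time (the service name shown is the one created by the CDH4 packages), you might run one of the following:
# On Red Hat and CentOS systems
$ sudo chkconfig hadoop-hdfs-namenode on
# On Ubuntu and Debian systems
$ sudo update-rc.d hadoop-hdfs-namenode defaults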
The start commands run on the nodes where each daemon is installed: for MRv1, on the NameNode, the JobTracker, each TaskTracker, and each DataNode; for YARN, on the NameNode, the ResourceManager, each NodeManager, and each DataNode. The remaining components are started on their own servers: the Hue server, the Oozie server, the HBase master and HBase slaves, the Hive server, the ZooKeeper server, and the HttpFS server.
Stopping Services
Run the following command on every host in the cluster to shut down all Hadoop Common system services that are started by init in the cluster:
$ for x in `cd /etc/init.d ; ls hadoop-*` ; do sudo service $x stop ; done
To verify that no Hadoop processes are running, issue the following command on each host:
# ps -aef | grep java
Hue
Run the following on the Hue Server machine to stop Hue:
sudo service hue stop
Sqoop
Flume 0.9
Stop the Flume Node processes on each node where they are running:
sudo service flume-node stop
Stop the Flume Master:
sudo service flume-master stop
Flume 1.x
There is no Flume master. Stop the Flume node processes on each node where they are running:
sudo service flume-ng-agent stop
Oozie
Hive
To stop Hive, exit the Hive console and make sure no Hive scripts are running. Shut down HiveServer2:
sudo service hiveserver2 stop
If the metastore is running from the command line, use Ctrl-c to shut it down.
HBase
Stop the Thrift server and clients, then shut down the cluster.
To stop the Thrift server and clients:
sudo service hbase-thrift stop
To shut down the cluster, use this command on the master node:
sudo service hbase-master stop
8a. MapReduce v1
Stop Hive and Oozie before stopping MapReduce. To stop MapReduce, stop the JobTracker service, and stop the TaskTracker on all nodes where it is running. Use the following commands:
sudo service hadoop-0.20-mapreduce-jobtracker stop
sudo service hadoop-0.20-mapreduce-tasktracker stop
8b. YARN
Stop Hive and Oozie before stopping YARN. To stop YARN, stop the MapReduce JobHistory service, the ResourceManager service, and the NodeManager on all nodes where they are running. Use the following commands:
sudo service hadoop-mapreduce-historyserver stop
sudo service hadoop-yarn-resourcemanager stop
sudo service hadoop-yarn-nodemanager stop
9. HttpFS
10. HDFS
11. ZooKeeper
To stop the ZooKeeper server, use one of the following commands on each ZooKeeper node:
sudo service zookeeper-server stop
or
sudo service zookeeper stop
Operating System      Commands                          Comments
Red Hat-compatible    yum remove
SLES                  zypper remove
Ubuntu or Debian      apt-get remove or apt-get purge   Use apt-get with the remove option to remove only the installed packages, or with the purge option to remove packages and configuration.
Remove each of the following components, using the appropriate command for your operating system: HttpFS, Mahout, Whirr, Hue, Pig, Sqoop, Flume, the Oozie client, the Oozie server, Hive, HBase, the ZooKeeper server, and the ZooKeeper client.
Additional clean-up
The uninstall commands may not remove all traces of Hadoop from your system. The apt-get purge commands available for Debian and Ubuntu systems delete more files than the commands that use the remove option, but are still not comprehensive. If you want to remove all vestiges of Hadoop from your system, look for the following and remove them manually:
log files
modified system configuration files
SSH
It is a good idea to use SSH for remote administration purposes (instead of rlogin, for example). But note that it is not used to secure communication among the elements in a Hadoop cluster (DataNode, NameNode, JobTracker or YARN ResourceManager, TaskTracker or YARN NodeManager, or the /etc/init.d scripts that start daemons locally). The Hadoop components use SSH in the following cases:
The sshfencer component of High Availability Hadoop configurations uses SSH; the shell fencing method does not require SSH.
Whirr uses SSH to enable secure communication with the Whirr cluster in the Cloud. See the Whirr Installation instructions.
HTTPS
Some communication within Hadoop can be configured to use HTTPS. Implementing this requires generating valid certificates and configuring clients to use those certificates. The HTTPS functionality that can be configured in CDH4 is: Encrypted MapReduce Shuffle (both MRv1 and YARN). Encrypted Web UIs; the same configuration parameters that enable Encrypted MapReduce Shuffle implement Encrypted Web UIs. These features are discussed under Configuring Encrypted Shuffle, Encrypted Web UIs, and Encrypted HDFS Transport.
Mountable HDFS
CDH4 includes a FUSE (Filesystem in Userspace) interface into HDFS. FUSE enables you to write a normal userland application as a bridge for a traditional filesystem interface. The hadoop-hdfs-fuse package enables you to use your HDFS cluster as if it were a traditional filesystem on Linux. It is assumed that you have a working HDFS cluster and know the hostname and port that your NameNode exposes.
To install fuse-dfs
On Red Hat-compatible systems:
$ sudo yum install hadoop-hdfs-fuse
You now have everything you need to begin mounting HDFS on Linux. To set up and test your mount point:
$ mkdir -p <mount_point> $ hadoop-fuse-dfs dfs://<name_node_hostname>:<namenode_port> <mount_point>
You can now run operations as if they are on your mount point. Press Ctrl+C to end the fuse-dfs program, and umount the partition if it is still mounted. Note: If you are using SLES 11 with the Oracle JDK 6u26 package, hadoop-fuse-dfs may exit immediately because ld.so can't find libjvm.so. To work around this issue, add /usr/java/latest/jre/lib/amd64/server to the LD_LIBRARY_PATH. To clean up your test:
$ umount <mount_point>
You can now add a permanent HDFS mount which persists through reboots. To add a system mount: 1. Open /etc/fstab and add lines to the bottom similar to these:
hadoop-fuse-dfs#dfs://<name_node_hostname>:<namenode_port> <mount_point> fuse allow_other,usetrash,rw 2 0
For example:
hadoop-fuse-dfs#dfs://localhost:8020 /mnt/hdfs fuse allow_other,usetrash,rw 2 0
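For example, assuming the /mnt/hdfs mount point from the line above, you could then mount and list it:
$ sudo mkdir -p /mnt/hdfs
$ sudo mount /mnt/hdfs
$ ls /mnt/hdfs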
Your system is now configured to allow you to use the ls command and use that mount point as if it were a normal system disk.
You can change the JVM maximum heap size. By default, the CDH4 package installation creates the /etc/default/hadoop-fuse file with the following default JVM maximum heap size setting; to change the maximum heap size, edit this line:
export LIBHDFS_OPTS="-Xmx128m"
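For example, to raise the maximum heap to 512 MB, you might edit /etc/default/hadoop-fuse so that the line reads:
export LIBHDFS_OPTS="-Xmx512m"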
JAVA_HOME should be set to point to the directory where the JDK is installed, as shown in the example below. On systems on which sudo clears or restricts environment variables, you also need to add the following line to the /etc/sudoers file:
Defaults env_keep+=JAVA_HOME
You may be able to install the Oracle JDK with your package manager, depending on your choice of operating system. See this section for installation instructions.
where <jdk-install-dir> might be something like /usr/java/jdk1.6.0_31, depending on the system configuration and where the JDK is actually installed.
3. On the same computer as in the previous steps, download the yum repository into a temporary location. On Red Hat/CentOS 6, you can use a command such as:
reposync -r cloudera-cdh4
Note:
cloudera-cdh4 is the name of the repository on your system; the name is usually in square brackets on the first line of the repo file, which in this example is /etc/yum.repos.d/cloudera-cdh4.repo.
4. Put all the RPMs into a directory served by your web server. For this example, we'll call it /var/www/html/cdh/4/RPMS/noarch/ (or x86_64 or i386 instead of noarch). Make sure you can remotely access the files in the directory you just created (the URL should look like http://<yourwebserver>/cdh/4/RPMS/).
5. On your web server, go to /var/www/html/cdh/4/ and type the following command:
createrepo .
This will create or update the necessary metadata so yum can understand this new repository (you will see a new directory named repodata).
Important: Check the permissions of the subdirectories and files under /var/www/html/cdh/4/. Make sure they are all readable by your web server user.
6. Edit the repo file you got from Cloudera (see Before You Start) and replace the line starting with baseurl= or mirrorlist= with baseurl=http://<yourwebserver>/cdh/4/
7. Save this modified repo file in /etc/yum.repos.d/, and check that you can install CDH through yum. Example:
yum update && yum install hadoop
Once you have confirmed that your internal mirror works, you can distribute this modified repo file to all your machines, and they should all be able to install CDH without needing access to the Internet. Follow the instructions under CDH4 Installation.
For detailed information for each CDH component, by release, see Using the CDH4 Maven Repository.
Prerequisites
Oracle Java Development Kit (JDK) version 6.
Apache Ant version 1.7 or later.
Apache Maven 3.0 or later.
The following environment variables must be set: JAVA_HOME, JAVA5_HOME, FORREST_HOME, and ANT_HOME.
Your PATH must include the JAVA_HOME, ANT_HOME, FORREST_HOME and maven bin directories.
If you are using Red Hat or CentOS systems, the rpmdevtools package is required for the rpmdev-setuptree command used below.
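As an illustrative sketch (every path below is an example; substitute the actual locations on your build machine):
$ export JAVA_HOME=/usr/java/jdk1.6.0_31
$ export JAVA5_HOME=/usr/java/jdk1.5.0_22
$ export ANT_HOME=/usr/local/apache-ant
$ export FORREST_HOME=/usr/local/apache-forrest
$ export MAVEN_HOME=/usr/local/apache-maven
$ export PATH=$JAVA_HOME/bin:$ANT_HOME/bin:$FORREST_HOME/bin:$MAVEN_HOME/bin:$PATH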
SLES systems
Users of these systems can run the following command to set up their environment:
$ mkdir -p ~/rpmbuild/{BUILD,RPMS,S{OURCE,PEC,RPM}S}
$ echo "%_topdir $HOME/rpmbuild" > ~/.rpmmacros
Building an RPM
Download SRPMs from archive.cloudera.com. The source RPMs for CDH4 reside at http://archive.cloudera.com/cdh4/redhat/5/x86_64/cdh/4/SRPMS/, http://archive.cloudera.com/cdh4/sles/11/x86_64/cdh/4/SRPMS/ or
Getting Support
This section describes how to get support for CDH4: Cloudera Support, Community Support, Report Issues, and Get Announcements about New Releases.
Cloudera Support
Cloudera can help you install, configure, optimize, tune, and run Hadoop for large scale data processing and analysis. Cloudera supports Hadoop whether you run our distribution on servers in your own data center, or on hosted infrastructure services such as Amazon EC2, Rackspace, SoftLayer, or VMware's vCloud. If you are a Cloudera customer, you can: Create a Cloudera Support Ticket. Visit the Cloudera Knowledge Base. Learn how to register for an account to create a support ticket at the support site. If you are not a Cloudera customer, learn how Cloudera can help you.
Community Support
Register for the Cloudera Users groups. If you have any questions or comments about CDH, you can send a message to the CDH user's list: [email protected] If you have any questions or comments about using Cloudera Manager, you can send a message to the Cloudera Manager user's list: [email protected]
Report Issues
Cloudera tracks software and documentation bugs and enhancement requests for CDH on issues.cloudera.org. Your input is appreciated, but before filing a request, please search the Cloudera issue tracker for existing issues and send a message to the CDH user's list, [email protected], or the CDH developer's list, [email protected]. If you would like to report or view software issues and enhancement requests for Cloudera Manager, visit this site: https://issues.cloudera.org/browse/CM
Apache License
All software developed by Cloudera for CDH is released with an Apache 2.0 license. Please let us know if you find any file that doesn't explicitly state the Apache license at the top and we'll immediately fix it. Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ Copyright 2010-2013 Cloudera Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Third-Party Licenses
For a list of third-party licenses associated with CDH, see Third-Party Licenses.