Cloudera Hive
Important Notice
© 2010-2021 Cloudera, Inc. All rights reserved.
Hadoop and the Hadoop elephant logo are trademarks of the Apache Software
Foundation. All other trademarks, registered trademarks, product names and company
names or logos mentioned in this document are the property of their respective owners.
Reference to any products, services, processes or other information, by trade name,
trademark, manufacturer, supplier or otherwise does not constitute or imply
endorsement, sponsorship or recommendation thereof by us.
Complying with all applicable copyright laws is the responsibility of the user. Without
limiting the rights under copyright, no part of this document may be reproduced, stored
in or introduced into a retrieval system, or transmitted in any form or by any means
(electronic, mechanical, photocopying, recording, or otherwise), or for any purpose,
without the express written permission of Cloudera.
The information in this document is subject to change without notice. Cloudera shall
not be liable for any damages resulting from technical errors or omissions which may
be present in this document, or from use of this document.
Cloudera, Inc.
395 Page Mill Road
Palo Alto, CA 94306
[email protected]
US: 1-888-789-1488
Intl: 1-650-362-0488
www.cloudera.com
Embedded Mode
Cloudera recommends using this mode for experimental purposes only.
Embedded mode is the default metastore deployment mode for CDH. In this mode, the metastore uses a Derby
database, and both the database and the metastore service are embedded in the main HiveServer2 process. Both are
started for you when you start the HiveServer2 process. This mode requires the least amount of effort to configure,
but it can support only one active user at a time and is not certified for production use.
Local Mode
In Local mode, the Hive metastore service runs in the same process as the main HiveServer2 process, but the metastore
database runs in a separate process, and can be on a separate host. The embedded metastore service communicates
with the metastore database over JDBC.
Remote Mode
Cloudera recommends that you use this mode.
In Remote mode, the Hive metastore service runs in its own JVM process. HiveServer2, HCatalog, Impala, and other
processes communicate with it using the Thrift network API (configured using the hive.metastore.uris property).
The metastore service communicates with the metastore database over JDBC (configured using the
javax.jdo.option.ConnectionURL property). The database, the HiveServer2 process, and the metastore service
can all be on the same host, but running the HiveServer2 process on a separate host provides better availability and
scalability.
The main advantage of Remote mode over Local mode is that Remote mode does not require the administrator to
share JDBC login information for the metastore database with each Hive user. HCatalog requires this mode.
Important: These numbers are general guidance only, and can be affected by factors such as number
of columns, partitions, complex joins, and client activity. Based on your anticipated deployment, refine
through testing to arrive at the best values for your environment.
For information on configuring heap for the Hive metastore, as well as HiveServer2 and Hive clients, see Tuning Apache
Hive in CDH on page 63.
[mysqld]
datadir=/var/lib/mysql
max_connections=8192
. . .
[Service]
LimitNOFILE=24000
. . .
hive.server2.async.exec.threads 8192
hive.server2.async.exec.wait.queue.size 8192
hive.server2.thrift.max.worker.threads 8192
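Outside Cloudera Manager, these settings can be applied directly in hive-site.xml; a minimal sketch using the values above:
<property>
<name>hive.server2.async.exec.threads</name>
<value>8192</value>
</property>
<property>
<name>hive.server2.async.exec.wait.queue.size</name>
<value>8192</value>
</property>
<property>
<name>hive.server2.thrift.max.worker.threads</name>
<value>8192</value>
</property>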
• Set datanucleus.connectionPool.maxPoolSize for your applications. For example, if poolSize = 100, with
3 HMS instances (one dedicated to compaction), and with 4 pools per server, you can accommodate 1200
connections.
Note: For information about additional configuration that may be needed in a secure cluster, see
Hive Authentication.
After using the command to install MySQL, you may need to respond to prompts to confirm that you do want to
complete the installation. After installation completes, start the mysql daemon.
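The exact service name depends on the platform and MySQL version; for example, on RHEL-compatible systems a typical invocation is:
$ sudo service mysqld start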
On RHEL systems
$ sudo /usr/bin/mysql_secure_installation
[...]
Enter current password for root (enter for none):
OK, successfully used password, moving on...
[...]
Set root password? [Y/n] y
New password:
Re-enter new password:
Remove anonymous users? [Y/n] Y
[...]
Disallow root login remotely? [Y/n] N
[...]
Remove test database and access to it [Y/n] Y
[...]
Reload privilege tables now? [Y/n] Y
All done!
• On SLES systems:
• On Debian/Ubuntu systems:
Note: If the metastore service will run on the host where the database is installed, replace
'metastorehost' in the CREATE USER example with 'localhost'. Similarly, the value of
javax.jdo.option.ConnectionURL in /etc/hive/conf/hive-site.xml (discussed in
the next step) must be jdbc:mysql://localhost/metastore. For more information on
adding MySQL users, see http://dev.mysql.com/doc/refman/5.5/en/adding-users.html.
Create the initial database schema. Cloudera recommends using the Metastore Schema Tool to do this.
If for some reason you decide not to use the schema tool, you can use the hive-schema-n.n.n.mysql.sql
file instead; that file is located in the /usr/lib/hive/scripts/metastore/upgrade/mysql/ directory. (n.n.n
is the current Hive version, for example 1.1.0.) Proceed as follows if you decide to use
hive-schema-n.n.n.mysql.sql.
Note: Do this only if you are not using the Hive schema tool.
$ mysql -u root -p
Enter password:
mysql> CREATE DATABASE metastore;
mysql> USE metastore;
mysql> SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-n.n.n.mysql.sql;
You also need a MySQL user account for Hive to use to access the metastore. It is very important to prevent this
user account from creating or altering tables in the metastore database schema.
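The exact statements depend on your environment; a sketch that creates the account and grants it only the privileges it needs (replace 'metastorehost' and the password as described in the note above):
$ mysql -u root -p
mysql> CREATE USER 'hive'@'metastorehost' IDENTIFIED BY 'mypassword';
mysql> GRANT SELECT,INSERT,UPDATE,DELETE,LOCK TABLES,EXECUTE ON metastore.* TO 'hive'@'metastorehost';
mysql> FLUSH PRIVILEGES;
mysql> quit;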
Important: To prevent users from inadvertently corrupting the metastore schema when they
use lower or higher versions of Hive, set the hive.metastore.schema.verification property
to true in /usr/lib/hive/conf/hive-site.xml on the metastore host.
Example
Note: The hive.metastore.local property is no longer supported (as of Hive 0.10); setting
hive.metastore.uris is sufficient to indicate that you are using a remote metastore.
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://myhost/metastore</value>
<description>the URL of the MySQL database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>mypassword</value>
</property>
<property>
<name>datanucleus.autoCreateSchema</name>
<value>false</value>
</property>
<property>
<name>datanucleus.fixedDatastore</name>
<value>true</value>
</property>
<property>
<name>datanucleus.autoStartMechanism</name>
<value>SchemaTable</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://<n.n.n.n>:9083</value>
<description>IP address (or fully-qualified domain name) and port of the metastore
host</description>
</property>
<property>
<name>hive.metastore.schema.verification</name>
<value>true</value>
</property>
After using the command to install PostgreSQL, you may need to respond to prompts to confirm that you do want
to complete the installation. To finish the installation on RHEL-compatible systems, you must initialize the
database. This step is not needed on Ubuntu and SLES systems because it is performed automatically on first
start:
To initialize database files on RHEL compatible systems
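The exact command depends on the PostgreSQL packaging in use; a typical invocation is:
$ sudo service postgresql initdb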
To ensure that your PostgreSQL server will be accessible over the network, you need to do some additional
configuration.
First you need to edit the postgresql.conf file. Set the listen_addresses property to *, to make sure that
the PostgreSQL server starts listening on all your network interfaces. Also make sure that the
standard_conforming_strings property is set to off.
You can check that you have the correct values as follows:
On Red-Hat-compatible systems:
On SLES systems:
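The location of postgresql.conf varies by platform and PostgreSQL version; a typical check (the path shown is an example) is:
$ grep -e listen_addresses -e standard_conforming_strings /var/lib/pgsql/data/postgresql.conf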
You also need to configure authentication for your network in pg_hba.conf. You must make sure that the
PostgreSQL user that you create later in this procedure has access to the server from a remote host. To
do this, add a new line to pg_hba.conf with the following information:
The following example allows all users to connect from all hosts to all your databases:
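For example, a line of the following form grants that access (the md5 authentication method and the 0.0.0.0/0 address range are typical choices, not requirements):
host    all    all    0.0.0.0/0    md5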
Note: This configuration is applicable only for a network listener. Using this configuration does
not open all your databases to the entire world; the user must still supply a password to
authenticate himself, and privilege restrictions configured in PostgreSQL will still be applied.
After completing the installation and configuration, you can start the database server:
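For example, on systems that use the service utility:
$ sudo service postgresql start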
Use the chkconfig utility to ensure that your PostgreSQL server starts at boot time. For example:
chkconfig postgresql on
You can also use the chkconfig utility to verify that the PostgreSQL server will be started at boot time, for example:
chkconfig --list postgresql
Now you need to grant permission for all metastore tables to user hiveuser. PostgreSQL does not have statements
to grant the permissions for all tables at once; you'll need to grant the permissions one table at a time. You could
automate the task with the following SQL script:
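One possible approach, assuming the metastore database is named metastore and the user is hiveuser (both taken from the surrounding steps), is to generate the GRANT statements from the catalog and then execute them:
$ sudo -u postgres psql
postgres=# \c metastore
metastore=# \pset tuples_only on
metastore=# \o /tmp/grant-privs
metastore=# SELECT 'GRANT SELECT,INSERT,UPDATE,DELETE ON "' || schemaname || '"."' || tablename || '" TO hiveuser;'
metastore-# FROM pg_tables
metastore-# WHERE tableowner = CURRENT_USER AND schemaname = 'public';
metastore=# \o
metastore=# \pset tuples_only off
metastore=# \i /tmp/grant-privs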
Note: If you are running these commands interactively and are still in the Postgres session
initiated at the beginning of this step, you do not need to repeat sudo -u postgres psql.
You can verify the connection from the machine where you'll be running the metastore service as follows:
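For example (the host name and credentials are placeholders):
$ psql -h myhost -U hiveuser -d metastore
metastore=#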
Note:
• The instructions in this section assume you are using Remote mode, and that the PostgreSQL
database is installed on a separate host from the metastore server.
• The hive.metastore.local property is no longer supported as of Hive 0.10; setting
hive.metastore.uris is sufficient to indicate that you are using a remote metastore.
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:postgresql://myhost/metastore</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.postgresql.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hiveuser</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>mypassword</value>
</property>
<property>
<name>datanucleus.autoCreateSchema</name>
<value>false</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://<n.n.n.n>:9083</value>
<description>IP address (or fully-qualified domain name) and port of the metastore
host</description>
</property>
<property>
<name>hive.metastore.schema.verification</name>
<value>true</value>
</property>
Note: This URL was correct at the time of publication, but can change.
Connect as the newly created hiveuser user and load the initial schema, as in the following example. Use the
appropriate script for the current release (for example hive-schema-1.1.0.oracle.sql) in
/usr/lib/hive/scripts/metastore/upgrade/oracle/ :
$ sqlplus hiveuser
SQL> @/usr/lib/hive/scripts/metastore/upgrade/oracle/hive-schema-n.n.n.oracle.sql
Connect back as an administrator and remove the power privileges from user hiveuser. Then grant limited access
to all the tables:
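A sketch of this step, run as an administrative user in SQL*Plus (the generated GRANT loop assumes the metastore tables are owned by HIVEUSER):
SQL> REVOKE connect, resource FROM hiveuser;
SQL> GRANT create session TO hiveuser;
SQL> BEGIN
       FOR R IN (SELECT owner, table_name FROM all_tables WHERE owner = 'HIVEUSER') LOOP
         EXECUTE IMMEDIATE 'GRANT SELECT,INSERT,UPDATE,DELETE ON ' || R.owner || '.' || R.table_name || ' TO hiveuser';
       END LOOP;
     END;
     /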
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:oracle:thin:@//myhost/xe</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>oracle.jdbc.OracleDriver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hiveuser</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>mypassword</value>
</property>
<property>
<name>datanucleus.autoCreateSchema</name>
<value>false</value>
</property>
<property>
<name>datanucleus.fixedDatastore</name>
<value>true</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://<n.n.n.n>:9083</value>
<description>IP address (or fully-qualified domain name) and port of the metastore
host</description>
</property>
<property>
<name>hive.metastore.schema.verification</name>
<value>true</value>
</property>
Warning:
This configuration setting is intended for advanced database users only. Be aware that when using
this override, the following properties are overwritten (in other words, their values will not be used):
• Hive Metastore Database Name
• Hive Metastore Database Host
• Hive Metastore Database Port
• Enable TLS/SSL to the Hive Metastore Database
Prerequisites
• The required default user role is Configurator.
• When using the Hive Metastore Database JDBC URL Override, you must still provide the following properties to
connect to the database:
– Hive Metastore Database Type
– Hive Metastore Database User
– Hive Metastore Database Password
PostgreSQL    jdbc:postgresql://<host>:<port>/<metastore_db>?key=value
Important: Formats are dependent on the JDBC driver version that you are using and subject to
change between releases. Refer to your database product documentation to confirm JDBC formats
for the specific database version you are using.
Important: Because of concurrency and security issues, HiveServer1 was deprecated in CDH 5.3 and
has been removed from CDH 6.
Important: These numbers are general guidance only, and can be affected by factors such as number
of columns, partitions, complex joins, and client activity. Based on your anticipated deployment, refine
through testing to arrive at the best values for your environment.
For information on configuring heap for HiveServer2, as well as Hive metastore and Hive clients, see Tuning Apache
Hive in CDH on page 63 and the following video:
Figure 1: Troubleshooting HiveServer2 Service Crashes
hive.zookeeper.client.port
If ZooKeeper is not using the default value for ClientPort, you need to set hive.zookeeper.client.port in
/etc/hive/conf/hive-site.xml to the same value that ZooKeeper is using. Check
/etc/zookeeper/conf/zoo.cfg to find the value for ClientPort. If ClientPort is set to any value other than
2181 (the default), set hive.zookeeper.client.port to the same value. For example, if ClientPort is set to
2222, set hive.zookeeper.client.port to 2222 as well:
<property>
<name>hive.zookeeper.client.port</name>
<value>2222</value>
<description>
The port at which the clients will connect.
</description>
</property>
JDBC driver
The connection URL format and the driver class for HiveServer2:
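A minimal summary (the driver class also appears in the Beeline examples later on this page):
Connection URL format: jdbc:hive2://<host>:<port>
Driver class: org.apache.hive.jdbc.HiveDriver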
Authentication
HiveServer2 can be configured to authenticate all connections; by default, it allows any client to connect. HiveServer2
supports either Kerberos or LDAP authentication; configure this in the hive.server2.authentication property
in the hive-site.xml file. You can also configure Pluggable Authentication, which allows you to use a custom
authentication provider for HiveServer2; and HiveServer2 Impersonation, which allows users to execute queries and
access HDFS files as the connected user rather than the super user who started the HiveServer2 daemon. For more
information, see Hive Security Configuration.
Running HiveServer2
Important: Because of concurrency and security issues, HiveServer1 was deprecated in CDH 5.3 and
has been removed from CDH 6. The Hive CLI is deprecated and will be removed in a future release.
Cloudera recommends you migrate to Beeline and HiveServer2 as soon as possible. The Hive CLI is
not needed if you are using Beeline with HiveServer2.
HiveServer2 binds to port 10000 by default. Set the port for HiveServer2 in the hive.server2.thrift.port property
in the hive-site.xml file. For example:
<property>
<name>hive.server2.thrift.port</name>
<value>10001</value>
<description>TCP port number to listen on, default 10000</description>
</property>
You can also specify the port and the host IP address for HiveServer2 by setting these environment variables:
HIVE_SERVER2_THRIFT_PORT
HIVE_SERVER2_THRIFT_BIND_HOST
Important:
If you are running the metastore in Remote mode, you must start the metastore before starting
HiveServer2.
After installing and configuring the Hive metastore, you can start the service.
To run the metastore as a daemon, the command is:
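On a package-based (non-Cloudera Manager) installation, a typical command is:
$ sudo service hive-metastore start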
Important: If you are using Sentry, do not follow the instructions on this page. See Before Enabling
the Sentry Service for information on how to set up the Hive warehouse directory permissions for use
with Sentry.
In addition, each user submitting queries must have an HDFS home directory. /tmp (on the local file system) must be
world-writable, as Hive makes extensive use of it.
HiveServer2 Impersonation allows users to execute queries and access HDFS files as the connected user.
If you do not enable impersonation, HiveServer2 by default executes all Hive tasks as the user ID that starts the Hive
server; for clusters that use Kerberos authentication, this is the ID that maps to the Kerberos principal used with
HiveServer2. Setting permissions to 1777, as recommended above, allows this user access to the Hive warehouse
directory.
You can change this default behavior by setting hive.metastore.execute.setugi to true on both the server and
client. This setting causes the metastore server to use the client's user and group permissions.
Warning:
If you are running the metastore in Remote mode, you must start the Hive metastore before you start
HiveServer2. HiveServer2 tries to communicate with the metastore as part of its initialization bootstrap.
If it is unable to do this, it fails with an error.
Note that because of concurrency and security issues, HiveServer1 was deprecated in CDH 5.3 and
has been removed from CDH 6.
To start HiveServer2:
To stop HiveServer2:
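On a package-based installation, typical commands are:
$ sudo service hive-server2 start
$ sudo service hive-server2 stop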
To confirm that HiveServer2 is working, start the beeline CLI and use it to execute a SHOW TABLES query on the
HiveServer2 process:
$ /usr/lib/hive/bin/beeline
beeline> !connect jdbc:hive2://localhost:10000 username password
org.apache.hive.jdbc.HiveDriver
0: jdbc:hive2://localhost:10000> SHOW TABLES;
show tables;
+-----------+
| tab_name |
+-----------+
+-----------+
No rows selected (0.238 seconds)
0: jdbc:hive2://localhost:10000>
Note:
Cloudera does not currently support using the Thrift HTTP protocol to connect Beeline to HiveServer2
(meaning that you cannot set hive.server2.transport.mode=http). Use the Thrift TCP protocol.
Use the following commands to start beeline and connect to a running HiveServer2 process. In this example the
HiveServer2 process is running on localhost at port 10000:
$ beeline
beeline> !connect jdbc:hive2://localhost:10000 username password
org.apache.hive.jdbc.HiveDriver
0: jdbc:hive2://localhost:10000>
Note:
If you are using HiveServer2 on a cluster that does not have Kerberos security enabled, then the
password is arbitrary in the command for starting Beeline.
If you are using HiveServer2 on a cluster that does have Kerberos security enabled, see HiveServer2
Security Configuration.
If you use TLS/SSL encryption, discussed later, the JDBC URL must include
ssl=true;sslTrustStore=<path_to_truststore>. Truststore password requirements depend on the version of Java running
in the cluster:
• Java 11: the truststore format has changed to PKCS and the truststore password is required; otherwise, the
connection fails.
• Java 8: The trust store password does not need to be specified.
The syntax for the JDBC URL is:
jdbc:hive2://#<host>:#<port>/#<dbName>;ssl=true;sslTrustStore=#<ssl_truststore_path>;trustStorePassword=#<truststore_password>;#<otherSessionConfs>?#<hiveConfs>#<hiveVars>
For example:
$ beeline
beeline> !connect
jdbc:hive2://<host>:8443/;ssl=true;transportMode=http;httpPath=gateway/cdp-proxy-api/hive;sslTrustStore=/<path>/bin/certs/gateway-client-trust.jks;trustStorePassword=changeit
• If Kerberos is not enabled, assign the following roles to the gateway node:
– Hive Gateway Role
– HDFS Gateway Role
2. From the Cloudera Manager home page, click the Hive service.
3. On the Hive service page, select the Configuration tab.
4. On the Hive service Configuration page, type hbase into the search text box.
5. Locate the HBase Service configuration property on the page, select the HBase instance that you want to associate
with Hive, and click Save Changes.
6. Redeploy the client configuration for the Hive service and restart all stale services.
The HBase service is now associated with the Hive service, and your Hive scripts can use HBase.
Note:
If you are using Cloudera Manager to manage your clusters, the Metastore schematool is also
available in the Hive service page to validate or upgrade the metastore:
1. From the Cloudera Manager Admin console, select the Hive service.
2. • To validate the schema, on the Hive service page, click Actions, and select Validate Hive
Metastore Schema.
• To upgrade the schema:
1. On the Hive service page, click Actions, and select Stop to stop the service.
2. Still on the Hive service page, click Actions, and select Upgrade Hive Database Metastore
Schema.
3. After the upgrade completes, restart the service.
...
Caused by: MetaException(message:Version information not found in metastore. )
at org.apache.hadoop.hive.metastore.ObjectStore.checkSchema(ObjectStore.java:5638)
...
Use Hive schematool to repair the condition that causes this error by either initializing the schema or upgrading it.
Using schematool
Use the Metastore schematool to initialize the metastore schema for the current Hive version or to upgrade the
schema from an older version. The tool tries to find the current schema from the metastore if it is available there.
The schematool determines the SQL scripts that are required to initialize or upgrade the schema and then executes
those scripts against the metastore database. The metastore database connection information such as JDBC URL, JDBC
driver, and database credentials are extracted from the Hive configuration. You can provide alternate database
credentials if needed.
The following options are available as part of the schematool package:
$ schematool -help
usage: schemaTool
-dbType <databaseType> Metastore database type
-dryRun List SQL scripts (no execute)
The dbType option must always be specified and can be one of the following:
derby|mysql|postgres|oracle
Prerequisite Configuration
Before you can use the schematool, you must add the following properties to the /etc/hive/conf/hive-site.xml
file:
• javax.jdo.option.ConnectionURL
• javax.jdo.option.ConnectionDriverName
For example, the following hive-site.xml entries are made if you are using a MySQL database as your Hive metastore
and hive1 is the database user name:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://my_cluster.com:3306/hive1?useUnicode=true&amp;characterEncoding=UTF-8</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
Note: The Hive schematool does not support using TLS/SSL encryption to the HMS database.
Usage Examples
To use the schematool command-line tool, navigate to the directory where it is located:
• If you installed CDH using parcels, schematool is usually located at:
/opt/cloudera/parcels/CDH/lib/hive/bin/schematool
• If you installed CDH using packages, schematool is usually located at:
/usr/lib/hive/bin/schematool
After you locate the executable, you can use schematool to perform the following actions:
• Initialize your metastore to the current schema for a new Hive setup using the initSchema option.
• If you attempt to get schema information from older metastores that did not store version information or if the
schema is not initialized, the tool reports an error as follows.
• You can upgrade the schema from a specific release by specifying the -upgradeSchemaFrom option. The
-upgradeSchemaFrom option requires the Hive version and not the CDH version. See CDH 6 Packaging Information
for information about which Hive version ships with each CDH release. The following example shows how to
upgrade from CDH 5.2/Hive 0.13.1:
• Use the -validate option to verify the metastore schema. The following example shows the types of validations
that are performed against the metastore schema when you use this option with schematool:
• If you want to find out all the required scripts for a schema upgrade, use the dryRun option.
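A combined sketch of the invocations described in the preceding list, against a MySQL metastore (the Hive version shown follows the CDH 5.2/Hive 0.13.1 example above):
$ schematool -dbType mysql -initSchema
$ schematool -dbType mysql -info
$ schematool -dbType mysql -upgradeSchemaFrom 0.13.1
$ schematool -dbType mysql -validate
$ schematool -dbType mysql -upgradeSchemaFrom 0.13.1 -dryRun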
Setting HADOOP_MAPRED_HOME
• For each user who will be submitting MapReduce jobs using MapReduce v2 (YARN), or running Pig, Hive, or Sqoop
in a YARN installation, make sure that the HADOOP_MAPRED_HOME environment variable is set correctly, as follows:
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
• For each user who will be submitting MapReduce jobs using MapReduce v1 (MRv1), or running Pig, Hive, or Sqoop
in an MRv1 installation, set the HADOOP_MAPRED_HOME environment variable as follows:
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce
export HIVE_CONF_DIR=/var/run/cloudera-scm-agent/process/4595-hive-HIVEMETASTORE
export HADOOP_CREDSTORE_PASSWORD=abcdefg1234...
export AUX_CLASSPATH=/opt/cloudera/parcels/CDH-5.13.3-1.cdh5.13.3.p0.2....
5. Run the following command to connect to the database and list FS roots:
hive --service metatool -listFSRoot
Listing FS Roots..
hdfs://[hostname]:8020/user/hive/warehouse
Alternatively, instead of setting and exporting environment variables, open the hive-site.xml file in /etc/hive/conf/.
Add the following properties to the hive-site.xml:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://[ENTER BACKEND DATABASE HOSTNAME]:[ENTER PORT]/[ENTER HIVE BACKEND
DATABASE NAME]?useUnicode=true&amp;characterEncoding=UTF-8</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>[ENTER BACKEND DATABASE USERNAME]</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>[ENTER BACKEND DATABASE PASSWORD]</value>
</property>
To determine the backend database host, port, username, or password, in Cloudera Manager go to Hive >
Configuration and set the Category Filter to Hive Metastore Database. The password is not exposed in plaintext.
To configure other CDH components to use HDFS high availability, see Configuring Other CDH Components to Use
HDFS HA.
Hive Roles
Hive is implemented in three roles:
• Hive metastore - Provides metastore services when Hive is configured with a remote metastore.
Cloudera recommends using a remote Hive metastore. Because the remote metastore is recommended, Cloudera
Manager treats the Hive Metastore Server as a required role for all Hive services. A remote metastore provides
the following benefits:
– The Hive metastore database password and JDBC drivers do not need to be shared with every Hive client;
only the Hive Metastore Server needs them. Sharing passwords with many hosts is a security issue.
– You can control activity on the Hive metastore database. To stop all activity on the database, stop the Hive
Metastore Server. This makes it easy to back up and upgrade, which require all Hive activity to stop.
See Configuring the Hive Metastore.
For information about configuring a remote Hive metastore database with Cloudera Manager, see Step 4: Install
and Configure Databases. To configure high availability for the Hive metastore, see Configuring HMS High Availability
in CDH on page 81.
• HiveServer2 - Enables remote clients to run Hive queries, and supports a Thrift API tailored for JDBC and ODBC
clients, Kerberos authentication, and multi-client concurrency. A CLI named Beeline is also included. See HiveServer2
documentation for more information.
• WebHCat - HCatalog is a table and storage management layer for Hadoop that makes the same table information
available to Hive, Pig, MapReduce, and Sqoop. Table definitions are maintained in the Hive metastore, which
HCatalog requires. WebHCat allows you to access HCatalog using an HTTP (REST style) interface.
set hive.execution.engine=spark;
set hive.execution.engine;
7. Click the icon that is next to any stale services to invoke the cluster restart wizard.
8. Click Restart Stale Services.
9. Click Restart Now.
10. Click Finish.
Important:
The configuration property serialization.null.format is set in Hive and Impala engines as
SerDes or table properties to specify how to serialize/deserialize NULL values into a storage format.
This configuration option is suitable for text file formats only. If used with binary storage formats such
as RCFile or Parquet, the option causes compatibility, complexity and efficiency issues.
See Using Avro Data Files in Hive for details about using Avro to ingest data into Hive tables and about using Snappy
compression on the output files.
CDH lets you use the component of your choice with the Parquet file format for each phase of data processing. For
example, you can read and write Parquet files using Pig and MapReduce jobs. You can convert, transform, and query
Parquet tables through Hive, Impala, and Spark. And you can interchange data files between all of these components.
Note:
• Once you create a Parquet table, you can query it or insert into it through other components
such as Impala and Spark.
• Set dfs.block.size to 256 MB in hdfs-site.xml.
• To enhance performance on Parquet tables in Hive, see Enabling Query Vectorization.
If the table will be populated with data files generated outside of Impala and Hive, you can create the table as an
external table pointing to the location where the files will be created:
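The exact schema depends on your data; a minimal sketch (the table name, columns, and location are placeholders):
hive> CREATE EXTERNAL TABLE parquet_table_name (x INT, y STRING)
      STORED AS PARQUET
      LOCATION '/user/etl/parquet_table_dir';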
To populate the table with an INSERT statement, and to read the table with a SELECT statement, see Loading Data
into Parquet Tables.
To set the compression type to use when writing data, configure the parquet.compression property:
SET parquet.compression=GZIP;
INSERT OVERWRITE TABLE tinytable SELECT * FROM texttable;
For more information on how to reserve YARN cores and memory that will be used by Spark
executors, refer to Tuning Apache Hive on Spark in CDH on page 70.
SET hive.auto.convert.join=true;
SET hive.auto.convert.join.noconditionaltask.size=<number_in_megabytes>;
When you are using HoS and the tables involved in a join query trigger a map join, two Spark jobs are launched and
perform the following actions:
• the first job scans the smaller table, creates a hash table, and writes it to HDFS,
• the second job runs the join and the rest of the query, scanning the larger table.
If DPP is enabled and is also triggered, the two Spark jobs perform the following actions:
• the first Spark job creates the hash table from the small table and identifies the partitions that should be scanned
from the large table,
• the second Spark job then scans the relevant partitions from the large table that are to be used in the join.
After these actions are performed, the query proceeds normally with the map join.
Important: Cloudera does not support nor recommend setting the property
hive.spark.dynamic.partition.pruning to true in production environments. This property
enables DPP for all joins, both map joins and common joins. The property
hive.spark.dynamic.partition.pruning.map.only, which enables DPP for map joins only in
Hive on Spark is the only supported implementation of DPP for Hive on Spark in CDH.
hive.spark.dynamic.partition.pruning — Enables dynamic partition pruning for all joins, including shuffle joins
and map joins. Default: false (turned off).
Important: Setting this property to true is not supported in CDH.
SET hive.spark.dynamic.partition.pruning.map.join.only=true;
SET hive.execution.engine=spark;
SET hive.spark.dynamic.partition.pruning.map.join.only=true;
Then run the following commands, which tell Hive to use the testing_example_db database and to show (EXPLAIN)
the query plan for the query that follows:
USE testing_example_db;
EXPLAIN
SELECT dt.d_year
,item.i_brand_id brand_id
,item.i_brand brand
,sum(ss_ext_sales_price) sum_agg
FROM date_dim dt
,store_sales
,item
WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
AND store_sales.ss_item_sk = item.i_item_sk
AND item.i_manufact_id = 436
AND dt.d_moy=12
GROUP BY dt.d_year
,item.i_brand
,item.i_brand_id
ORDER BY dt.d_year
,sum_agg desc
,brand_id
LIMIT 100;
The EXPLAIN command returns the query plan for the TPC-DS query. An excerpt from that query plan is included
below. Look for the Spark HashTable Sink Operator and the Spark Partition Pruning Sink Operator,
which are in bold font in the following output. The presence of these sink operators in the query plan indicates that
DPP is being triggered for the query.
+----------------------------------------------------+--+
| Explain |
+----------------------------------------------------+--+
| STAGE DEPENDENCIES: |
| Stage-2 is a root stage |
| Stage-1 depends on stages: Stage-2 |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-2 |
| Spark |
| DagName: hive_20170908151313_f478b7d3-89b8-4c6d-b98c-4ef3b8e25bf7:964 |
| Vertices: |
| Map 1 |
| Map Operator Tree: |
| TableScan |
| alias: dt |
| filterExpr: (d_date_sk is not null and (d_moy = 12)) (type: boolean)
|
| Statistics: Num rows: 73049 Data size: 2045372 Basic stats: COMPLETE
Column stats: NONE |
| Filter Operator |
| predicate: (d_date_sk is not null and (d_moy = 12)) (type: boolean)
|
| Statistics: Num rows: 18262 Data size: 511336 Basic stats: COMPLETE
Column stats: NONE |
| Spark HashTable Sink Operator |
| keys: |
| 0 d_date_sk (type: bigint) |
| 1 ss_sold_date_sk (type: bigint) |
| Select Operator |
| expressions: d_date_sk (type: bigint) |
| outputColumnNames: _col0 |
| Statistics: Num rows: 18262 Data size: 511336 Basic stats:
COMPLETE Column stats: NONE |
| Group By Operator |
| keys: _col0 (type: bigint) |
| mode: hash |
| outputColumnNames: _col0 |
| Statistics: Num rows: 18262 Data size: 511336 Basic stats:
COMPLETE Column stats: NONE |
| Spark Partition Pruning Sink Operator |
| partition key expr: ss_sold_date_sk |
| tmp Path:
hdfs://<server_name>.<domain>.com:8020/tmp/hive/hive/a8939414-8311-4b06-bbd6-5afc9c3b2d3d/hive_2017-09-08_15-13-54_861_527211251736847122-4/-mr-10003/2/1
|
| Statistics: Num rows: 18262 Data size: 511336 Basic stats:
COMPLETE Column stats: NONE |
| target column name: ss_sold_date_sk |
| target work: Map 2 |
| Local Work: |
| Map Reduce Local Work |
| Map 5 |
Note: There are a few map join patterns that are not supported by DPP. For DPP to be triggered, the
Spark Partition Pruning Sink Operator must have a target Map Work in a child stage. For
example, in the above query plan, the Spark Partition Pruning Sink Operator resides in
Stage-2 and has a target work: Map 2. So for DPP to be triggered, Map 2 must reside in either
Stage 1 or Stage 0 because both are dependent on Stage 2, thus they are both children of Stage
2. See the STAGE DEPENDENCIES at the top of the query plan to see the stage hierarchy. If Map 2
resides in Stage 2, DPP is not triggered because Stage 2 is the root stage and therefore cannot be
a child stage.
Queries That Trigger and Benefit from Dynamic Partition Pruning in Hive on Spark
When tables are created in Hive, it is common practice to partition them. Partitioning breaks large tables into horizontal
slices of data. Each partition typically corresponds to a separate folder on HDFS. Tables can be partitioned when the
data has a "natural" partitioning column, such as a date column. Hive queries that read from partitioned tables typically
filter on the partition column in order to avoid reading all partitions from the table. For example, if you have a partitioned
table called date_partitioned_table that is partitioned on the datetime column, the following query only reads
partitions that are created after January 1, 2017:
SELECT *
FROM date_partitioned_table
WHERE datetime > '2017-01-01';
If the date_partitioned_table table has partitions for dates that extend to 2010, this WHERE clause filter can
significantly decrease the amount of data that needs to be read by the query. This query is easy for Hive to optimize.
When it is compiled, only partitions where datetime is greater than 2017-01-01 need to be read. This form of
partition pruning is known as static partition pruning.
However, when queries become more complex, the filter on the partitioned column cannot be evaluated at compile time.
For example, consider this query:
SELECT *
FROM date_partitioned_table
WHERE datetime IN (SELECT * FROM non_partitioned_table);
With this type of query, it is difficult for the Hive compiler to optimize its execution because the rows that are returned
by the sub query SELECT * FROM non_partitioned_table are unknown. In this situation, dynamic partition
pruning (DPP) optimizes the query. Hive can dynamically prune partitions from the scan of non_partitioned_table
by eliminating partitions while the query is running. Queries that use this pattern can see performance improvements
when DPP is enabled. Note that this query contains an IN clause which triggers a join between the
date_partitioned_table and the non_partitioned_table. DPP is only triggered when there is a join on a
partitioned column.
DPP might provide performance benefits for Hive data warehouses that use the star or snowflake schema. Performance
improvements are possible for Hive queries that join a partitioned fact table on the partitioned column of a dimension
table if DPP is enabled. The TPC-DS benchmark is a good example where many of its queries benefit from DPP. The
query example from the TPC-DS benchmark listed in the above section with EXPLAIN triggers DPP:
SELECT dt.d_year
,item.i_brand_id brand_id
,item.i_brand brand
,sum(ss_ext_sales_price) sum_agg
FROM date_dim dt
,store_sales
,item
WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
AND store_sales.ss_item_sk = item.i_item_sk
AND item.i_manufact_id = 436
AND dt.d_moy=12
GROUP BY dt.d_year
,item.i_brand
,item.i_brand_id
ORDER BY dt.d_year
,sum_agg desc
,brand_id
LIMIT 100;
This query performs a join between the partitioned store_sales table and the non-partitioned date_dim table. The
join is performed against the partition column for store_sales, which is what triggers DPP. The join must be against
a partitioned column for DPP to be triggered.
DPP is only supported for map joins. It is not supported for common joins, those that require a shuffle phase. A single
query may have multiple joins, some of which are map joins and some of which are common joins. Only the join on
the partitioned column must be a map join for DPP to be triggered.
For example, if the following message appears in the HiveServer2 log, it means that DPP will be triggered and that
partitions will be dynamically pruned from the partitioned_table table, which is in bold text in the following
example:
INFO org.apache.hadoop.hive.ql.optimizer.DynamicPartitionPruningOptimization:
[HiveServer2-Handler-Pool: Thread-xx]: Dynamic partitioning:
default@partitioned_table.partition_column
To access these log files in Cloudera Manager, select Hive > HiveServer2 > Log Files > Role Log File.
• Hive on Spark Remote Driver Logs
The Hive on Spark (HoS) Remote Driver logs print debugging information from the Java class
SparkDynamicPartitionPruner. This class does the actual pruning of the partitioned table. Because pruning
happens at runtime, the logs for this class are located in the HoS Remote Driver logs instead of the HiveServer2
logs. These logs print which partitions are pruned from the partitioned table, which can be very useful for
troubleshooting.
For example, if the following message appears in the HoS Remote Driver log, it means that the partition
partition_column=1 is being pruned from the table partitioned_table, both of which are in bold text in
the following example:
To access these log files in Cloudera Manager, select SPARK_ON_YARN > History Server Web UI >
<select_an_application> > Executors > executor id = driver > stderr.
For example, to apply the custom UDF addfunc10 to the salary column of the sample_07 table in the default
database that ships with CDH, use the following syntax:
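The statement itself is not shown here; a query of the following form matches that description (assuming addfunc10 takes a single numeric argument):
SELECT addfunc10(salary) FROM sample_07 LIMIT 10;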
The above HiveQL statement returns only 10 rows from the sample_07 table.
To use Hive built-in UDFs, see the LanguageManual UDF on the Apache wiki. To create custom UDFs in Hive, see
Managing Apache Hive User-Defined Functions on page 47.
Symptom
The first query after starting a new Hive on Spark session might be delayed due to the start-up time for the Spark on
YARN cluster.
Cause
The query waits for YARN containers to initialize.
Solution
No action required. Subsequent queries will be faster.
Symptom
In the HiveServer2 log you see the following exception: Error:
org.apache.thrift.transport.TTransportException (state=08S01,code=0)
Cause
HiveServer2 memory is set too small. For more information, see stdout for HiveServer2.
Solution
1. Go to the Hive service.
2. Click the Configuration tab.
3. Search for Java Heap Size of HiveServer2 in Bytes, and increase the value. Cloudera recommends a minimum value
of 2 GB.
4. Enter a Reason for change, and then click Save Changes to commit the changes.
5. Restart HiveServer2.
Out-of-memory error
Symptom
In the log you see an out-of-memory error similar to the following:
Cause
The Spark driver does not have enough off-heap memory.
Solution
Increase the driver memory spark.driver.memory and ensure that spark.yarn.driver.memoryOverhead is
at least 20% of the driver memory.
Symptom
Cluster resources are consumed by Spark applications.
Cause
This can occur if you run multiple Hive on Spark sessions concurrently.
Solution
Manually terminate the Hive on Spark applications:
1. Go to the YARN service.
2. Click the Applications tab.
3. In the row containing the Hive on Spark application, select > Kill.
Configurable Properties
HiveServer2 web UI properties, with their default values in Cloudera Hadoop, are:
hive.server2.webui.max.threads=50
hive.server2.webui.host=0.0.0.0
hive.server2.webui.port=10002
hive.server2.webui.use.ssl=false
hive.server2.webui.keystore.path=""
hive.server2.webui.keystore.password=""
hive.server2.webui.max.historic.queries=25
hive.server2.webui.use.spnego=false
hive.server2.webui.spnego.keytab=""
hive.server2.webui.spnego.principal=<dynamically sets special string, _HOST, as
hive.server2.webui.host or host name>
Tip: To disable the HiveServer2 web UI, set the port to 0 or a negative number.
Note: By default, newly created CDH 5.7 (and higher) clusters have the HiveServer2 web UI enabled,
and if using Kerberos, are configured for SPNEGO. Clusters upgraded from an earlier CDH version must
have the UI enabled with the port property; other default values can be preserved in most cases.
Configure the HiveServer2 web UI properties in Cloudera Manager on the Configuration tab.
1. Go to the Hive service.
2. Click the Configuration tab.
3. Select Scope > HiveServer2.
Important: This UDF procedure supports the Serializer/Deserializer interface. For example, you can
reference SerDes JAR files in table properties by registering the SerDes JAR in the same way as UDF
JAR files.
You configure the cluster in one of several ways to find the JAR containing your UDF code, and then you register the
UDF in Hive.
1. Assuming you just built your Java project in IntelliJ, navigate to the JAR in the /target directory of the project.
2. Choose one of the following methods for configuring the cluster to find the JAR, and then follow the respective
step-by-step procedure in sections below:
• Direct JAR reference configuration
Straightforward, but recommended for development only. Does not support Sentry.
• Hive aux JARs directory configuration
Prevents accidental overwriting of files or functions. Recommended for tested, stable UDFs to prevent
accidental overwriting of files or functions. Does not support Sentry.
• Reloadable aux JAR configuration
Avoids HiveServer restarts. Recommended if you anticipate making frequent changes to the UDF logic.
Supports Sentry.
If you connect to HiveServer through the load balancer, issuing the RELOAD command loads the JAR file only
to the connected HiveServer. Consequently, if you have multiple HiveServer instances behind a load balancer,
you must install the JAR file on each node. You also need to connect to each HS2 instance to issue the RELOAD
command.
3. After configuring the cluster to find the JAR, use Beeline to start Hive.
• On the command line of a node that runs the HiveServer2 role, type beeline.
• Use the FQDN of the HiveServer in your cluster to replace myhiveserver.com and enter the database user
name and database password, or use the default hive user. For example: beeline -u
jdbc:hive2://myhiveserver.com:10000 -n hive -p
4. Run one of the following CREATE FUNCTION commands that corresponds to your configuration:
• Direct JAR reference
Where the <fully_qualified_class_name> is the full path to the Java class in your JAR file. For example,
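For a Direct JAR reference, the statement includes a USING JAR clause that points at the JAR location; a sketch using the class and JAR path from the examples on this page:
hive> CREATE FUNCTION udftypeof AS 'com.mycompany.hiveudf.Typeof01'
      USING JAR 'hdfs:///user/max/udf/hiveudf-1.0-SNAPSHOT.jar';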
Restart HiveServer2.
• Reloadable Aux JAR
If you use the Reloadable Aux JAR method, RELOAD uploads the JAR to the cluster.
hive> RELOAD;
hive> CREATE FUNCTION <function_name> AS '<fully_qualified_class_name>';
For example,
hive> RELOAD;
hive> CREATE FUNCTION udftypeof AS 'com.mycompany.hiveudf.Typeof01';
$ sudo su - hdfs
$ hdfs dfs -put TypeOf-1.0-SNAPSHOT.jar /user/max/udf/hiveudf-1.0-SNAPSHOT.jar
2. Give the hive user read, write, and execute access to the directory.
3. If the Hive Metastore runs on a different host or hosts, create the same directory as you created on the HiveServer2
on every Hive Metastore host. For example, create /opt/local/hive/lib/ on the Hive Metastore host. You do not
need to copy the JAR file to the Hive Metastore host directory. Hive takes care of that when you register the UDF.
If the directory is not present on the Hive Metastore host, Hive Metastore service does not start.
4. In Cloudera Manager Admin Console > Hive service > Configuration > Filters > Advanced, click Hive (Service-Wide)
scope.
5. In Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml, add the following property:
• name = hive.reloadable.aux.jars.path
• value = the path to the JAR file
6. Save changes.
7. In Cloudera Manager Admin Console > Hive service > Actions, redeploy the Hive client configuration.
8. Restart the Hive service.
This step is only necessary initially. Subsequently, you can add or remove JARs using RELOAD.
9. If you use Sentry, on the Hive command line grant privileges on the JAR files to the roles that require access.
10. After configuring the cluster, register the UDF as described above.
• When you register the UDF, use a CREATE FUNCTION statement that includes the USING JAR clause (either the
Direct JAR reference or the Hive aux JARs directory method).
• Other requirements described in Impala documentation.
The CREATE FUNCTION statement must include the JAR location; otherwise, Impala does not load the function. Impala relies on the
location you provide during function creation. The JAR, which contains the UDF code, must reside on HDFS, which makes
the JAR automatically available to all the Impala nodes. You do not need to manually copy any UDF-related files between
servers.
If you cannot register the UDF, which you want to call from Impala, in Hive because, for example, you use Sentry, then
register the UDF in Impala. Do not name an Impala-registered UDF the same as any Hive-registered UDF.
Configuring Transient Apache Hive ETL Jobs to Use the Amazon S3 Filesystem in CDH
Apache Hive is a popular choice for batch extract-transform-load (ETL) jobs such as cleaning, serializing, deserializing,
and transforming data. In on-premise deployments, ETL jobs operate on data stored in a permanent Hadoop cluster
that runs HDFS on local disks. However, ETL jobs are frequently transient and can benefit from cloud deployments
where cluster nodes can be quickly created and torn down as needed. This approach can translate to significant cost
savings.
Important:
• Cloudera components writing data to S3 are constrained by the inherent limitation of Amazon
S3 known as "eventual consistency." For more information, see Data Storage Considerations.
• Hive on MapReduce1 is not supported on Amazon S3 in the CDH distribution. Only Hive on
MapReduce2/YARN is supported on S3.
For information about how to set up a shared Amazon Relational Database Service (RDS) as your Hive metastore, see
Configuring a Shared Amazon RDS as an HMS for CDH on page 54. For information about tuning Hive read and write
performance to the Amazon S3 file system, see Tuning Apache Hive Performance on the Amazon S3 Filesystem in CDH
on page 74.
Data residing on Amazon S3 and the node running Altus Director are the only persistent components. The computing
nodes and local storage come and go with each transient workload.
Clusters can be deployed in a public or a private subnet. If you deploy in a public subnet, each cluster needs direct connectivity. Inbound connections should
be limited to traffic from private IPs within the VPC and SSH access through port 22 to the gateway nodes from approved
IP addresses. For details about using Altus Director to perform these steps, see Setting up the AWS Environment.
Data Access
Create an IAM role that gives the cluster access to S3 buckets. Using IAM roles is a more secure way to provide access
to S3 than adding the S3 keys to Cloudera Manager by configuring core-site.xml safety valves.
AWS Placement Groups
To improve performance, place worker nodes in an AWS placement group. See Placement Groups in the AWS
documentation set.
If Altus Director server is running in a separate instance from the Altus Director client, you must run:
set -x -e
sudo -u hdfs hadoop fs -mkdir /user/ec2-user
sudo -u hdfs hadoop fs -chown ec2-user:ec2-user /user/ec2-user
hive -f query.q
exit 0
Where query.q contains the Hive query. After you create the job wrapper script, test it to make sure it runs without
errors.
Log Collection
Save all relevant log files in S3 because they disappear when you terminate the transient cluster. Use these log files to
debug any failed jobs after the cluster is terminated. To save the log files, add an additional step to your job wrapper
shell script.
Example for copying Hive logs from a transient cluster node to S3:
# Set Credentials
export AWS_ACCESS_KEY_ID=[]
export AWS_SECRET_ACCESS_KEY=[]
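Then copy the logs before the cluster is terminated; the bucket name and paths below are placeholders, and the AWS CLI must be installed on the node:
# Copy Hive logs from the cluster node to S3
aws s3 cp /var/log/hive/ s3://my-bucket/transient-cluster-logs/hive/ --recursive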
CDH clusters run directly on a shared object store, such as Amazon S3, making it possible for the data to live across
multiple clusters and beyond the lifespan of any cluster. In this scenario, clusters need to regenerate and coordinate
metadata for the underlying shared data individually.
From CDH 5.10 and later, clusters running in the AWS cloud can share a single persistent instance of the Amazon
Relational Database Service (RDS) as the HMS backend database. This enables persistent sharing of metadata beyond
a cluster's life cycle so that subsequent clusters need not regenerate metadata as they had to before.
How To Configure Amazon RDS as the Backend Database for a Shared Hive Metastore
The following instructions assume that you have an Amazon AWS account and that you are familiar with AWS services.
1. Create a MySQL instance with Amazon RDS. See Creating a MySQL DB Instance... and Creating an RDS Database
Tutorial in Amazon documentation. This step is performed only once. Subsequent clusters that use an existing
RDS instance do not need this step because the RDS is already set up.
2. Configure a remote MySQL Hive metastore database as part of the Cloudera Manager installation procedure,
using the hostname, username, and password configured during your RDS setup. See Configuring a Remote MySQL
Database for the Hive Metastore.
3. Configure Hive, Impala, and Spark to use Amazon S3:
• For Hive, see Tuning Hive on S3.
• For Impala, see Using Impala with the Amazon S3 Filesystem.
• For Spark, see Accessing Data Stored in Amazon S3 through Spark.
Supported Scenarios
The following limitations apply to the jobs you run when you use an RDS server as a remote backend database for Hive
metastore.
• No overlapping data or metadata changes to the same data sets across clusters.
• No reads during data or metadata changes to the same data sets across clusters.
• Overlapping data or metadata changes are defined as when multiple clusters concurrently:
– Make updates to the same table or partitions within the table located on S3.
– Add or change the same parent schema or database.
Important: If you are running a shared RDS, Cloudera Support will help licensed customers repair
any unexpected metadata issues, but will not do "root-cause" analysis.
Microsoft Azure Data Lake Store (ADLS) is a massively scalable distributed file system that can be accessed through an
HDFS-compatible API. ADLS acts as a persistent storage layer for CDH clusters running on Azure. In contrast to Amazon
S3, ADLS more closely resembles native HDFS behavior, providing consistency, file directory structure, and
POSIX-compliant ACLs. See the ADLS documentation for conceptual details.
CDH supports using ADLS as a storage layer for MapReduce2 (MRv2 or YARN), Hive, Hive on Spark, Spark 2.1 and higher,
and Spark 1.6. Other applications are not supported and may not work, even if they use MapReduce or Spark as their
execution engine. Use the steps in this topic to set up a data store to use with these CDH components.
Note the following limitations:
• ADLS is not supported as the default filesystem. Do not set the default file system property (fs.defaultFS) to
an adl:// URI. You can still use ADLS as secondary filesystem while HDFS remains the primary filesystem.
• Hadoop Kerberos authentication is supported, but it is separate from the Azure user used for ADLS authentication.
Important:
While you are creating the service principal, write down the following values, which you will need
in step 4:
• The client id.
• The client secret.
• The refresh URL. To get this value, in the Azure portal, go to Azure Active Directory > App
registrations > Endpoints. In the Endpoints region, copy the OAUTH 2.0 TOKEN ENDPOINT.
This is the value you need for the refresh_URL in step 4.
3. Grant the service principal permission to access the ADLS account. See the Microsoft documentation on
Authorization and access control. Review the section, "Using ACLs for operations on file systems" for information
about granting the service principal permission to access the account.
You can skip the section on RBAC (role-based access control) because RBAC is used for management and you only
need data access.
4. Configure your CDH cluster to access your ADLS account. To access ADLS storage from a CDH cluster, you provide
values for the following properties when submitting jobs:
• Client ID: dfs.adls.oauth2.client.id
• Client Secret: dfs.adls.oauth2.credential
• Refresh URL: dfs.adls.oauth2.refresh.url
There are several methods you can use to provide these properties to your jobs. There are security and other
considerations for each method. Select one of the following methods to access data in ADLS:
• User-Supplied Key for Each Job on page 57
• Single Master Key for Cluster-Wide Access on page 58
• User-Supplied Key stored in a Hadoop Credential Provider on page 58
• Create a Hadoop Credential Provider and reference it in a customized copy of the core-site.xml file for the
service on page 59
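For example, to test your configuration you can list the root of your ADLS account (a hedged sketch; substitute your own account name):
hadoop fs -ls adl://your_account.azuredatalakestore.net/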
If your configuration is correct, this command lists the files in your account.
2. After successfully testing your configuration, you can access the ADLS account from MRv2, Hive, Hive on Spark,
Spark 1.6, Spark 2.1 and higher, or HBase by using the following URI:
adl://your_account.azuredatalakestore.net
For additional information and examples of using ADLS access with Hadoop components:
• Spark: See Accessing Data Stored in Azure Data Lake Store (ADLS) through Spark
• distcp: See Using DistCp with Microsoft Azure (ADLS).
• TeraGen:
export HADOOP_CONF_DIR=path_to_working_directory
export HADOOP_CREDSTORE_PASSWORD=hadoop_credstore_password
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
teragen 1000 adl://jzhugeadls.azuredatalakestore.net/tg
Important: Cloudera recommends that you only use this method for access to ADLS in development
environments or other environments where security is not a concern.
hadoop command
-Ddfs.adls.oauth2.access.token.provider.type=ClientCredential \
-Ddfs.adls.oauth2.client.id=CLIENT ID \
-Ddfs.adls.oauth2.credential='CLIENT SECRET' \
-Ddfs.adls.oauth2.refresh.url=REFRESH URL \
adl://<store>.azuredatalakestore.net/src hdfs://nn/tgt
Important: Cloudera recommends that you only use this method for access to ADLS in development
environments or other environments where security is not a concern.
1. Open the Cloudera Manager Admin Console and go to Cluster Name > Configuration > Advanced Configuration
Snippets.
2. Enter the following in the Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml:
<property>
<name>dfs.adls.oauth2.access.token.provider.type</name>
<value>ClientCredential</value>
</property>
<property>
<name>dfs.adls.oauth2.client.id</name>
<value>CLIENT ID</value>
</property>
<property>
<name>dfs.adls.oauth2.credential</name>
<value>CLIENT SECRET</value>
</property>
<property>
<name>dfs.adls.oauth2.refresh.url</name>
<value>REFRESH URL</value>
</property>
export HADOOP_CREDSTORE_PASSWORD=password
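A sketch of the corresponding hadoop credential create commands (the provider path matches the one referenced in the next step; replace the quoted placeholders with your own values):
hadoop credential create dfs.adls.oauth2.client.id -provider jceks://hdfs/user/USER_NAME/adls-cred.jceks -value 'CLIENT ID'
hadoop credential create dfs.adls.oauth2.credential -provider jceks://hdfs/user/USER_NAME/adls-cred.jceks -value 'CLIENT SECRET'
hadoop credential create dfs.adls.oauth2.refresh.url -provider jceks://hdfs/user/USER_NAME/adls-cred.jceks -value 'REFRESH URL'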
You can omit the -value option and its value and the command will prompt the user to enter the value.
For more details on the hadoop credential command, see Credential Management (Apache Software
Foundation).
2. Reference the Credential Provider on the command line when submitting jobs:
hadoop command
-Ddfs.adls.oauth2.access.token.provider.type=ClientCredential \
-Dhadoop.security.credential.provider.path=jceks://hdfs/user/USER_NAME/adls-cred.jceks \
adl://<store>.azuredatalakestore.net/
Create a Hadoop Credential Provider and reference it in a customized copy of the core-site.xml file for
the service
• Advantages: all users can access the ADLS storage
• Disadvantages: you must pass the path to the credential store on the command line.
1. Create a Credential Provider:
a. Create a password for the Hadoop Credential Provider and export it to the environment:
export HADOOP_CREDSTORE_PASSWORD=password
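A sketch of the create command for this method (the provider path matches the one used in the core-site.xml snippet later in this procedure; create entries for dfs.adls.oauth2.client.id and dfs.adls.oauth2.refresh.url in the same way):
hadoop credential create dfs.adls.oauth2.credential -provider jceks://hdfs/path_to_credential_store_file -value 'CLIENT SECRET'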
You can omit the -value option and its value and the command will prompt the user to enter the value.
For more details on the hadoop credential command, see Credential Management (Apache Software
Foundation).
2. Copy the contents of the /etc/service/conf directory to a working directory. The service can be one of the
following:
• yarn
• spark
• spark2
Use the --dereference option when copying the file so that symlinks are correctly resolved. For example:
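A sketch, using spark as the service and an arbitrary working directory name:
mkdir ~/adls-working-dir
cp -r --dereference /etc/spark/conf/* ~/adls-working-dir/
3. Add the following properties to the core-site.xml file in the working directory: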
<property>
<name>hadoop.security.credential.provider.path</name>
<value>jceks://hdfs/path_to_credential_store_file</value>
</property>
<property>
<name>dfs.adls.oauth2.access.token.provider.type</name>
<value>ClientCredential</value>
</property>
The value of the path_to_credential_store_file should be the same as the value for the --provider option in
the hadoop credential create command described in step 1.
4. Set the HADOOP_CONF_DIR environment variable to the location of the working directory:
export HADOOP_CONF_DIR=path_to_working_directory
export HADOOP_CREDSTORE_PASSWORD=password
Importing Data
Prerequisites
Before importing data make sure that the following prerequisites are satisfied:
• A properly configured user with permissions to execute CREATE TABLE and LOAD DATA INPATH statements in
Hive.
• Default ACLs defined for the temporary import folder so that the new folder, when created, inherits the ACLs of
the parent.
Steps
1. Create a temporary import folder with read, write, and execute permissions for the Hive user. For example:
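A sketch, assuming a hypothetical staging path /tmp/hive_import_staging:
hdfs dfs -mkdir /tmp/hive_import_staging
hdfs dfs -setfacl -m user:hive:rwx /tmp/hive_import_staging
hdfs dfs -setfacl -m default:user:hive:rwx /tmp/hive_import_staging
The default ACL entry ensures that folders created under the staging directory inherit the ACLs of the parent, as required by the prerequisites above.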
The LOAD DATA INPATH statement is executed by the Hive superuser; therefore, the temporary HDFS folder
that Sqoop imports into must also have read, write, and execute permission for the Hive user.
Important: Make sure that effective ACLs are not constrained for the Hive user by the
fs.permissions.umask-mode setting.
Important: These numbers are general guidance only, and can be affected by factors such as number
of columns, partitions, complex joins, and client activity. Based on your anticipated deployment, refine
through testing to arrive at the best values for your environment.
In addition, set the MaxMetaspaceSize option to put an upper limit on the amount of native memory used for class
metadata.
The following example sets the PermGen space to 512M, uses the new Parallel Collector, and disables the garbage
collection overhead limit:
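A sketch of such a setting, reconstructed from the flags shown elsewhere in this topic:
-XX:MaxPermSize=512M -XX:+UseParNewGC -XX:-UseGCOverheadLimit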
5. From the Actions drop-down menu, select Restart to restart the HiveServer2 service.
To configure heap size and garbage collection for the Hive metastore:
1. To set heap size, go to Home > Hive > Configuration > Hive Metastore > Resource Management.
2. Set Java Heap Size of Hive Metastore Server in Bytes to the desired value, and click Save Changes.
3. To set garbage collection, go to Home > Hive > Configuration > Hive Metastore Server > Advanced.
4. Set the PermGen space for Java garbage collection to 512M, the type of garbage collector used (ConcMarkSweepGC
or ParNewGC), and enable or disable the garbage collection overhead limit in Java Configuration Options for Hive
Metastore Server. For an example of this setting, see step 4 above for configuring garbage collection for HiveServer2.
5. From the Actions drop-down menu, select Restart to restart the Hive Metastore service.
To configure heap size and garbage collection for the Beeline CLI:
1. To set heap size, go to Home > Hive > Configuration > Gateway > Resource Management.
2. Set Client Java Heap Size in Bytes to at least 2 GiB and click Save Changes.
3. To set garbage collection, go to Home > Hive > Configuration > Gateway > Advanced.
4. Set the PermGen space for Java garbage collection to 512M in Client Java Configuration Options.
The following example sets the PermGen space to 512M and specifies IPv4:
-XX:MaxPermSize=512M -Djava.net.preferIPv4Stack=true
5. From the Actions drop-down menu, select Restart to restart the client service.
else
export HADOOP_OPTS="$HADOOP_OPTS -XX:NewRatio=12 -Xmx12288m -Xms12288m
-XX:MaxHeapFreeRatio=40 -XX:MinHeapFreeRatio=15 -XX:-UseGCOverheadLimit"
fi
fi
export HADOOP_HEAPSIZE=2048
You can use either the Concurrent Collector or the new Parallel Collector for garbage collection by passing
-XX:+UseConcMarkSweepGC or -XX:+UseParNewGC in the HADOOP_OPTS lines above. To enable the garbage
collection overhead limit, remove the -XX:-UseGCOverheadLimit setting.
Set the PermGen space for Java garbage collection to 512M for all services in the JAVA-OPTS environment variable. For example, include -XX:MaxPermSize=512M.
– Reduce the size of the result set returned by adding filters to queries. This minimizes memory pressure caused
by "dangling" sessions.
– Look for queries that load all table partitions in memory to execute. This can substantially add to memory
pressure. For example, a query that accesses a partitioned table with the following SELECT statement loads
all partitions of the target table to execute:
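For example (a hypothetical query; the point is that no partition filter is applied, so metadata for every partition must be loaded):
SELECT * FROM sales_partitioned LIMIT 10;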
How to resolve:
– Add partition filters to queries to reduce the total number of partitions that are accessed. To view all of
the partitions processed by a query, run the EXPLAIN DEPENDENCY clause, which is explained in the
Apache Hive Language Manual.
– In the Metastore Server Advanced Configuration Snippet (Safety Valve) for hive-site.xml, set the
hive.metastore.limit.partition.request parameter to 1000 to limit the maximum number of
partitions accessed from a single table in a query. See the Apache wiki for information about setting this
parameter. If this parameter is set, queries that access more than 1000 partitions fail with the following
error:
MetaException: Number of partitions scanned (=%d) on table '%s' exceeds limit (=%d)
Setting this parameter protects against bad workloads and identifies queries that need to be optimized.
To resolve the failed queries:
– Apply the appropriate partition filters.
– Increase the cluster-wide limit beyond 1000, if needed. This action adds memory pressure to
HiveServer2 and the Hive metastore.
– If the accessed table is not partitioned, see this Cloudera Engineering Blog post, which explains how to
partition Hive tables to improve query performance. Choose columns or dimensions for partitioning
based upon usage patterns. Partitioning tables too much causes data fragmentation, but partitioning
too little causes queries to read too much data. Either extreme makes querying inefficient. Typically, a
few thousand table partitions is fine.
Note: If dynamic partitioning is enabled, Hive implicitly enables the counters during data load.
By default, CDH restricts the number of MapReduce counters to 120. Hive queries that require more counters fail
with the "Too many counters" error.
How to resolve:
– For managed clusters:
1. In Cloudera Manager Admin Console, go to the MapReduce service.
2. Select the Configuration tab.
3. Type counters in the search box in the right panel.
4. Scroll down the right panel to locate the mapreduce.job.counters.max property and increase the Value.
5. Click Save Changes.
– For unmanaged clusters:
Set the mapreduce.job.counters.max property to a higher value in mapred-site.xml.
Minimum Required Role: Configurator (also provided by Cluster Administrator, Full Administrator)
Hive on Spark provides better performance than Hive on MapReduce while offering the same features. Running Hive
on Spark requires no changes to user queries. Specifically, user-defined functions (UDFs) are fully supported, and most
performance-related configurations work with the same semantics.
This topic describes how to configure and tune Hive on Spark for optimal performance. This topic assumes that your
cluster is managed by Cloudera Manager and that you use YARN as the Spark cluster manager.
The example described in the following sections assumes a 40-host YARN cluster, and each host has 32 cores and 120
GB memory.
YARN Configuration
The YARN properties yarn.nodemanager.resource.cpu-vcores and yarn.nodemanager.resource.memory-mb
determine how cluster resources can be used by Hive on Spark (and other YARN applications). The values for the two
properties are determined by the capacity of your host and the number of other non-YARN applications that coexist
on the same host. Most commonly, only YARN NodeManager and HDFS DataNode services are running on worker
hosts.
Configuring Cores
Allocate 1 core for each of the services and 2 additional cores for OS usage, leaving 28 cores available for YARN.
Configuring Memory
Allocate 20 GB of memory for these services and processes, leaving 100 GB of each host's 120 GB for YARN. To do so,
set yarn.nodemanager.resource.memory-mb=100 GB and yarn.nodemanager.resource.cpu-vcores=28.
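Expressed as a yarn-site.xml sketch (the memory property takes a value in megabytes, so 100 GB is written as 102400; on a Cloudera Manager cluster you would set the equivalent NodeManager properties in the Admin Console):
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>102400</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>28</value>
</property>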
Spark Configuration
After allocating resources to YARN, you define how Spark uses the resources: executor and driver memory, executor
allocation, and parallelism.
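The executor sizing that the arithmetic below assumes is a reconstruction, not a definitive recommendation: 4 cores per executor and 14 GB per executor, split into a 12 GB heap plus the 2 GB overhead set next:
set spark.executor.cores=4;
set spark.executor.memory=12g;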
set spark.yarn.executor.memoryOverhead=2g;
With these configurations, each host can run up to 7 executors at a time. Each executor can run up to 4 tasks (one per
core). So, each task has on average 3.5 GB (14 / 4) memory. All tasks running in an executor share the same heap space.
Make sure the sum of spark.yarn.executor.memoryOverhead and spark.executor.memory is less than
yarn.scheduler.maximum-allocation-mb.
Parallelism
For available executors to be fully utilized you must run enough tasks concurrently (in parallel). In most cases, Hive
determines parallelism automatically for you, but you may have some control in tuning concurrency. On the input side,
the number of map tasks is equal to the number of splits generated by the input format. For Hive on Spark, the input
format is CombineHiveInputFormat, which can group the splits generated by the underlying input formats as
required. You have more control over parallelism at the stage boundary. Adjust
hive.exec.reducers.bytes.per.reducer to control how much data each reducer processes, and Hive determines
an optimal number of partitions, based on the available executors, executor memory settings, the value you set for
the property, and other factors. Experiments show that Spark is less sensitive than MapReduce to the value you specify
for hive.exec.reducers.bytes.per.reducer, as long as enough tasks are generated to keep all available executors
busy. For optimal performance, pick a value for the property so that Hive generates enough tasks to fully use all available
executors.
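For example, to have each reducer process roughly 64 MB of input (the value that also appears in the recommended settings later in this topic):
set hive.exec.reducers.bytes.per.reducer=67108864;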
For more information on tuning Spark applications, see Tuning Apache Spark Applications.
Hive Configuration
Hive on Spark shares most if not all Hive performance-related configurations. You can tune those parameters much
as you would for MapReduce. However, hive.auto.convert.join.noconditionaltask.size, which is the
threshold for converting common join to map join based on statistics, can have a significant performance impact.
Although this configuration is used for both Hive on MapReduce and Hive on Spark, it is interpreted differently by each.
The size of data is described by two statistics:
• totalSize—Approximate size of data on disk
• rawDataSize—Approximate size of data in memory
Hive on MapReduce uses totalSize. When both are available, Hive on Spark uses rawDataSize. Because of
compression and serialization, a large difference between totalSize and rawDataSize can occur for the same
dataset. For Hive on Spark, you might need to specify a larger value for
hive.auto.convert.join.noconditionaltask.size to convert the same join to a map join. You can increase
the value for this parameter to make map join conversion more aggressive. Converting common joins to map joins can
improve performance. Alternatively, if this value is set too high, too much memory is used by data from small tables,
and tasks may fail because they run out of memory. Adjust this value according to your cluster environment.
You can control whether rawDataSize statistics should be collected, using the property
hive.stats.collect.rawdatasize. Cloudera recommends setting this to true in Hive (the default).
Cloudera also recommends setting two additional configuration properties, using a Cloudera Manager advanced
configuration snippet for HiveServer2:
• hive.stats.fetch.column.stats=true
• hive.optimize.index.filter=true
The following properties are generally recommended for Hive performance tuning, although they are not specific to
Hive on Spark:
hive.optimize.reducededuplication.min.reducer=4
hive.optimize.reducededuplication=true
hive.merge.mapfiles=true
hive.merge.mapredfiles=false
hive.merge.smallfiles.avgsize=16000000
hive.merge.size.per.task=256000000
hive.merge.sparkfiles=true
hive.auto.convert.join=true
hive.auto.convert.join.noconditionaltask=true
hive.auto.convert.join.noconditionaltask.size=20M (might need to increase to 200M for Spark)
hive.optimize.bucketmapjoin.sortedmerge=false
hive.map.aggr.hash.percentmemory=0.5
hive.map.aggr=true
hive.optimize.sort.dynamic.partition=false
hive.stats.autogather=true
hive.stats.fetch.column.stats=true
hive.compute.query.using.stats=true
hive.limit.pushdown.memory.usage=0.4 (MR and Spark)
hive.optimize.index.filter=true
hive.exec.reducers.bytes.per.reducer=67108864
hive.smbjoin.cache.rows=10000
hive.fetch.task.conversion=more
hive.fetch.task.conversion.threshold=1073741824
hive.optimize.ppd=true
Note: Pre-warming takes a few seconds and is a good practice for short-lived sessions, especially if
the query involves reduce stages. However, if the value of hive.prewarm.numcontainers is higher
than what is available in the cluster, the process can take a maximum of 30 seconds. Use pre-warming
with caution.
hive.blobstore.use.blobstore.as.scratchdir: When set to true, this parameter enables the use of scratch
directories directly on S3. Values: true | false. Default: false.
set hive.mv.files.thread=20
set hive.blobstore.use.blobstore.as.scratchdir=true
The S3A connector is used with a HiveServer2 instance, so different Hive queries can share the same connector instance,
and the same thread pool is used to issue upload and copy requests. This means that the fs.s3a parameters cannot be
set on a per-query basis. Instead, set them for each HiveServer2 instance. In contrast, the thread pool controlled by
hive.mv.files.thread is created for each query separately.
This behavior might occur more frequently if fs.s3a.blocking.executor.enabled is set to true. This
parameter is turned off by default in CDH.
2. S3 is an eventually consistent storage system. See the S3 documentation. This eventual consistency affects Hive
behavior on S3 and, in rare cases, can cause intermittent failures. Retrying the failed query usually works around
the issue.
Tuning Tips
Increase the value set for hive.load.dynamic.partitions.thread to improve dynamic partitioning query
performance on S3. However, do not set this parameter to values exceeding 25 to avoid placing an excessive load on
S3, which can lead to throttling issues.
Setting the Hive Dynamic Partition Loading Parameter on a Per-Query Basis
Optimize dynamic partitioning at the session level by using the Hive SET command in the query code.
For example, to set the thread pool to 25 threads:
set hive.load.dynamic.partitions.thread=25
Setting the Hive Dynamic Partition Loading Parameter as a Service-Wide Default with Cloudera Manager
Use Cloudera Manager to set hive.load.dynamic.partitions.thread as a service-wide default:
1. In the Cloudera Manager Admin Console, go to the Hive service.
2. In the Hive service page, click the Configuration tab.
3. On the Configuration page, click the HiveServer2 scope.
4. Click the Performance category.
5. Search for Load Dynamic Partitions Thread Count and enter the value you want to set as a service-wide default.
6. Click Save Changes.
Important: This optimization only applies to INSERT OVERWRITE queries that insert data into tables
or partitions where data already exists.
Tuning Tips
The hive.mv.files.thread parameter can be tuned for INSERT OVERWRITE performance in the same way it is
tuned for write performance. See Hive S3 Write Performance Tuning Parameters on page 74.
If setting the above parameter does not produce acceptable results, you can disable the HDFS trash feature by setting
the fs.trash.interval to 0 on the HDFS service. In Cloudera Manager, choose HDFS > Configuration > NameNode
> Main and set Filesystem Trash Interval to 0.
Warning: Disabling the trash feature of HDFS causes permanent data deletions, making the deleted
data unrecoverable.
Setting the Hive INSERT OVERWRITE Performance Tuning Parameter on a Per-Query Basis
Configure Hive to move data to the HDFS trash directory in parallel for INSERT OVERWRITE queries using the Hive
SET command.
set hive.mv.files.thread=30
Setting the Hive INSERT OVERWRITE Performance Tuning Parameter as a Service-Wide Default with Cloudera Manager
Use Cloudera Manager to set hive.mv.files.thread as a service-wide default:
1. In the Cloudera Manager Admin Console, go to the Hive service.
2. In the Hive service page, click the Configuration tab.
3. On the Configuration page, click the HiveServer2 scope.
4. Click the Performance category.
5. Search for Move Files Thread Count and enter the value you want to set as a service-wide default.
6. Click Save Changes.
These parameters can be set with Cloudera Manager at the service level or on a per-query basis using the Hive SET
command. See Setting Hive Table Partition Read Performance Tuning Parameters as Service-Wide Defaults with Cloudera
Manager on page 79.
Tuning Tips
If listing input files becomes a bottleneck for the Hive query, increase the values for
hive.exec.input.listing.max.threads and
mapreduce.input.fileinputformat.list-status.num-threads. This bottleneck might occur if the query
takes a long time to list input directories or to run split calculations when reading several thousand partitions. However,
do not set these parameters to values over 50 to avoid putting excessive load on S3, which might lead to throttling
issues.
Setting the Hive Table Partition Read Performance Tuning Parameters on a Per-Query Basis
Configure Hive to perform metadata collection in parallel when reading table partitions on S3 using the Hive SET
command.
For example, to set the maximum number of threads that Hive uses to list input files to 20 and the number of threads
used by the FileInputFormat class when listing and fetching block locations for input to 5:
set hive.exec.input.listing.max.threads=20
set mapreduce.input.fileinputformat.list-status.num-threads=5
Setting Hive Table Partition Read Performance Tuning Parameters as Service-Wide Defaults with Cloudera Manager
Use Cloudera Manager to set hive.exec.input.listing.max.threads and
mapreduce.input.fileinputformat.list-status.num-threads as service-wide defaults.
To set hive.exec.input.listing.max.threads:
1. In the Cloudera Manager Admin Console, go to the Hive service.
2. In the Hive service page, click the Configuration tab.
3. On the Configuration page, click the HiveServer2 scope.
4. Click the Performance category.
5. Search for Input Listing Max Threads and enter the value you want to set as a service-wide default.
6. Click Save Changes.
To set mapreduce.input.fileinputformat.list-status.num-threads:
1. In the Cloudera Manager Admin Console, go to the MapReduce service.
2. In the MapReduce service page, click the Configuration tab.
3. Search for MapReduce Service Advanced Configuration Snippet (Safety Valve) for mapred-site.xml and enter
the parameter, value, and description:
<property>
<name>mapreduce.input.fileinputformat.list-status.num-threads</name>
<value>number_of_threads</value>
<description>Number of threads used to list and fetch block locations for input paths
specified by FileInputFormat</description>
</property>
Tuning Tips
The hive.metastore.fshandler.threads parameter can be increased if the MSCK REPAIR TABLE command is
taking excessive time to scan S3 for potential partitions to add. Do not set this parameter to a value higher than 30 to
avoid putting excessive load on S3, which can lead to throttling issues.
Increase the value set for the hive.msck.repair.batch.size parameter if you receive a socket timeout exception from the Hive metastore client.
This exception is thrown by HiveServer2 when a metastore operation takes longer to complete than the time specified
for the hive.metastore.client.socket.timeout parameter. If you simply increase the timeout, it must be set
across all metastore operations and requires restarting the metastore service. It is preferable to increase the value set
for hive.msck.repair.batch.size, which specifies the number of partition objects that are added to the metastore
at one time. Increasing hive.msck.repair.batch.size to 3000 can help mitigate timeout exceptions returned
when running MSCK commands. Set to a lower value if you have multiple MSCK commands running in parallel.
set hive.msck.repair.batch.size=3000
Setting the Hive MSCK REPAIR TABLE Tuning Parameters as Service-Wide Defaults with Cloudera Manager
Use Cloudera Manager to set the hive.metastore.fshandler.threads and the hive.msck.repair.batch.size
parameters as service-wide defaults:
1. In the Cloudera Manager Admin Console, go to the Hive service.
2. In the Hive service page, click the Configuration tab.
3. On the Configuration page, search for each parameter to set them.
4. Click Save Changes.
Recommendations
Cloudera recommends that each instance of the metastore runs on a separate cluster host.
Warning:
• In the first step of enabling HiveServer2 high availability below, you enable Hive Delegation Token
Store implementation. Oozie needs this implementation for secure HS2 HA. Otherwise, the Oozie
server can get a delegation token from one HS2 server, but the actual query might run against
another HS2 server, which does not recognize the HS2 delegation token. Exception: If you enable
HMS HA, do not enable Hive Delegation Token Store; otherwise, Oozie job issues occur.
• HiveServer2 high availability does not automatically fail and retry long-running Hive queries. If
any of the HiveServer2 instances fail, all queries running on that instance fail and are not retried.
Instead, the client application must re-submit the queries.
• After you enable HiveServer2 high availability, existing Oozie jobs must be changed to reflect the
HiveServer2 address.
• On Kerberos-enabled clusters, you must use the load balancer's principal to connect to HS2
directly; otherwise, after you enable HiveServer2 high availability, direct connections to
HiveServer2 instances fail.
2. On the Add Role Instances to Hive page under the HiveServer2 column heading, click Select hosts, and select
the hosts that should have a HiveServer2 instance.
3. Click OK, and then click Continue. The Instances page appears where you can start the new HiveServer2
instances.
4. Click the Configuration tab.
5. Select Scope > HiveServer2.
6. Select Category > Main.
7. Locate the HiveServer2 Load Balancer property or search for it by typing its name in the Search box.
8. Enter values for <hostname>:<port number>. For example, hs2load_balancer.example.com:10015.
Note: When you set the HiveServer2 Load Balancer property, Cloudera Manager regenerates
the keytabs for HiveServer2 roles. The principal in these keytabs contains the load balancer
hostname. If there is a Hue service that depends on this Hive service, it also uses the load balancer
to communicate with Hive.
• If you are using Microsoft Active Directory for your KDC, see Microsoft documentation to create a principal
and keytab for Hive. The principal must be named
hive/<load_balancer_fully_qualified_domain_name> and the keytab must contain all of the Hive
host keytabs for your cluster.
For example, if your load balancer is hs2loadbalancer.example.com and you have two HiveServer2
instances on hosts hs2-host-1.example.com and hs2-host-2.example.com, and you run klist -ekt
hive-proxy.keytab, it should return the following:
2. While you are still connected to kadmin.local, list the hive/<hs2_hostname> principals:
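For example (a sketch of the kadmin.local command; the prompt is shown for context):
kadmin.local:  listprincs hive/*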
3. While you are still connected to kadmin.local, create a hive-proxy.keytab, which contains the load balancer
and all of the hive/<hs2_hostname> principals:
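A sketch using the example hostnames from this topic; the keytab path is an assumption:
kadmin.local:  xst -norandkey -k /tmp/hive-proxy.keytab hive/[email protected]
kadmin.local:  xst -norandkey -k /tmp/hive-proxy.keytab hive/[email protected]
kadmin.local:  xst -norandkey -k /tmp/hive-proxy.keytab hive/[email protected]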
Note that a single xst is used per entry, which appends each entry to the keytab. Also note that the -norandkey
parameter is specified. This is required so you do not break existing keytabs.
4. Validate the keytab by running klist:
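For example, assuming the keytab path used above:
klist -ekt /tmp/hive-proxy.keytab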
5. Distribute the hive-proxy.keytab to all HiveServer2 hosts. Make sure that /var/lib/hive exists on each
node and copy the hive-proxy.keytab to /var/lib/hive on each node. Then confirm that permissions are
set to hive:hive on the directory and the keytab:
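A sketch of this step on each HiveServer2 host, assuming the keytab was staged in /tmp:
mkdir -p /var/lib/hive
cp /tmp/hive-proxy.keytab /var/lib/hive/
chown -R hive:hive /var/lib/hive
ls -l /var/lib/hive/hive-proxy.keytab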
6. Configure HiveServer2 to use the new keytab and load balancer principal by setting the
hive.server2.authentication.kerberos.principal and the
hive.server2.authentication.kerberos.keytab properties in the hive-site.xml file. For example, to
set these properties for the examples used in the above steps, your hive-site.xml is set as follows:
<property>
<name>hive.server2.authentication.kerberos.principal</name>
<value>hive/[email protected]</value>
</property>
<property>
<name>hive.server2.authentication.kerberos.keytab</name>
<value>/var/lib/hive/hive-proxy.keytab</value>
</property>
3. Edit the HAProxy configuration file to listen on port 10000 and point to each HiveServer2 instance. Make sure to
configure for sticky sessions. Here is an example configuration file:
global
# To have these messages end up in /var/log/haproxy.log you will
# need to:
#
# 1) configure syslog to accept network log events. This is done
# by adding the '-r' option to the SYSLOGD_OPTIONS in
# /etc/sysconfig/syslog
#
# 2) configure local2 events to go to the /var/log/haproxy.log
# file. A line like the following can be added to
# /etc/sysconfig/syslog
#
# local2.* /var/log/haproxy.log
#
log 127.0.0.1 local0
#---------------------------------------------------------------------
# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
#
# You might need to adjust timing values to prevent timeouts.
#---------------------------------------------------------------------
defaults
mode http
log global
option httplog
option dontlognull
option http-server-close
option forwardfor except 127.0.0.0/8
option redispatch
retries 3
maxconn 3000
contimeout 5000
clitimeout 50000
srvtimeout 50000
#
# This sets up the admin page for HA Proxy at port 25002.
#
listen stats :25002
balance
mode http
stats enable
stats auth username:password
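The HiveServer2 listen section itself would look something like the following sketch; the hostnames are the examples used earlier in this topic, and balance source is one way to provide sticky sessions:
#
# This sets up the listener for HiveServer2 on port 10000.
#
listen hiveserver2 :10000
    mode tcp
    option tcplog
    balance source
    server hiveserver2_1 hs2-host-1.example.com:10000 check
    server hiveserver2_2 hs2-host-2.example.com:10000 check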
chkconfig haproxy on
5. Start HAProxy:
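For example:
service haproxy start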
When query vectorization is enabled, the query engine processes vectors of columns, which greatly improves CPU
utilization for typical query operations like scans, filters, aggregates, and joins.
Using Cloudera Manager to Enable or Disable Query Vectorization for Parquet Files on a Server-wide Basis
For managed clusters, open the Cloudera Manager Admin Console and perform the following steps:
1. Select the Hive service.
2. Click the Configuration tab.
3. Search for enable vectorization.
To view all the available vectorization properties for Hive, search for hiveserver2_vectorized. All the
vectorization properties are in the Performance category.
4. Select the Enable Vectorization Optimization option to enable query vectorization. To disable query vectorization,
uncheck the box that is adjacent to HiveServer2 Default Group.
5. To enable or disable Hive query vectorization for the Parquet file format, set the Exclude Vectorized Input Formats
property in Cloudera Manager as follows:
• To disable vectorization for Parquet files only, set this property to
org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
• To enable vectorization for all file formats including Parquet, set this property to Custom and leave the setting
blank.
6. Click Save Changes.
7. Click the Instances tab, and then restart the service (or the instance) for the changes to take effect.
Manually Enabling or Disabling Query Vectorization for Parquet Files on a Server-Wide Basis
To enable query vectorization for Parquet files on unmanaged clusters on a server-wide basis:
• Set the hive.vectorized.execution.enabled property to true in the hive-site.xml file:
<property>
<name>hive.vectorized.execution.enabled</name>
<value>true</value>
<description>Enables query vectorization.</description>
</property>
<property>
<name>hive.vectorized.input.format.excludes</name>
<value/>
<description>Does not exclude query vectorization on any file format including
Parquet.</description>
</property>
To disable query vectorization for Parquet files only on unmanaged clusters on a server-wide basis:
• Set the hive.vectorized.execution.enabled property to true in the hive-site.xml file:
<property>
<name>hive.vectorized.execution.enabled</name>
<value>true</value>
<description>Enables query vectorization.</description>
</property>
<property>
<name>hive.vectorized.input.format.excludes</name>
<value>org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat</value>
<description>Disables query vectorization on Parquet file formats only.</description>
</property>
To enable query vectorization for all file formats on unmanaged clusters on a server-wide basis, set the following in the hive-site.xml file:
<property>
<name>hive.vectorized.execution.enabled</name>
<value>true</value>
<description>Enables query vectorization.</description>
</property>
<property>
<name>hive.vectorized.input.format.excludes</name>
<value/>
<description>Does not exclude query vectorization on any file format.</description>
</property>
To disable query vectorization for all file formats on unmanaged clusters on a server-wide basis, set the following in the hive-site.xml file:
<property>
<name>hive.vectorized.execution.enabled</name>
<value>false</value>
<description>Disables query vectorization on all file formats.</description>
</property>
Enabling or Disabling Hive Query Vectorization for Parquet Files on a Session Basis
Use the Hive SET command to enable or disable query vectorization on an individual session. Enabling or disabling
query vectorization on a session basis is useful to test the effects of vectorization on the execution of specific sets of
queries.
To enable query vectorization for all file formats including Parquet on an individual session only:
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.input.format.excludes= ;
Setting hive.vectorized.input.format.excludes to a blank value ensures that this property is unset and that
no file formats are excluded from query vectorization.
To disable query vectorization for Parquet files only on an individual session only:
SET hive.vectorized.execution.enabled=true;
SET
hive.vectorized.input.format.excludes=org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat;
To enable query vectorization for all file formats on an individual session only:
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.input.format.excludes= ;
Setting hive.vectorized.input.format.excludes to a blank value ensures that this property is unset and that
no file formats are excluded from query vectorization.
To disable query vectorization for all file formats on an individual session only:
SET hive.vectorized.execution.enabled=false;
Important: Vectorized execution can still occur for an excluded input format based on whether
row SerDes or vector SerDes are enabled.
Recommendations: Use this property to automatically disable certain file formats from vectorized execution.
Cloudera recommends that you test your workloads on development clusters using vectorization and enable it in
production if you receive significant performance advantages. As an example, if you want to exclude vectorization
only on the ORC file format while keeping vectorization for all other file formats including the Parquet file format,
set this property to org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.
Default Setting: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat, which disables query
vectorization for the Parquet file format only.
hive.vectorized.use.checked.expressions
Description: To enhance performance, vectorized expressions operate using wide data types like long and double.
When wide data types are used, numeric overflows can occur during expression evaluation in a different manner
for vectorized expressions than they do for non-vectorized expressions. Consequently, different query results can
be returned for vectorized expressions compared to results returned for non-vectorized expressions. When enabled,
Hive uses vectorized expressions that handle numeric overflows in the same way that non-vectorized expressions do.
Recommendations: Keep this property set to true if you want results across vectorized and non-vectorized queries
to be consistent.
Default Setting: true
hive.vectorized.use.vectorized.input.format
Description: Enables Hive to take advantage of input formats that support vectorization when they are available.
Recommendations: Enable this property by setting it to true if you have Parquet or ORC workloads that you want
to be vectorized.
Default Setting: true
hive.vectorized.use.vector.serde.deserialize
Description: Enables Hive to use built-in vector SerDes to process text and SequenceFile tables for vectorized query
execution. In addition, this configuration also helps vectorization of intermediate tasks in multi-stage query execution.
Recommendations: Keep this set to false. Setting this property to true might help multi-stage workloads, but
when set to true, it enables text vectorization, which Cloudera does not support.
Table 2: Supported Data Types for Hive Query Vectorization on Parquet Tables
Supported/Unsupported Functions
Common arithmetic, boolean (for example AND, OR), comparison, mathematical (for example SIN, COS, LOG), date,
and type-cast functions are supported. Common aggregate functions such as MIN, MAX, COUNT, AVG, and SUM are
also supported. If a function is not supported, the vectorizer attempts to vectorize the function based on the configuration
value specified for hive.vectorized.adaptor.usage.mode. You can set this property to none or chosen. To set
this property in Cloudera Manager, search for the hive.vectorized.adaptor.usage.mode property on the
Configuration page for the Hive service, and set it to none or chosen as appropriate. For unmanaged clusters, set it
manually in the hive-site.xml file for server-wide scope. To set it on a session basis, use the Hive SET command
as described above.
DESCRIBE p_clients;
+------------------+------------+----------+
| col_name | data_type | comment |
+------------------+------------+----------+
| name | string | |
| symbol | string | |
| lastsale | double | |
| marketlabel | string | |
| marketamount | bigint | |
| ipoyear | int | |
| segment | string | |
| business | string | |
| quote | string | |
+------------------+------------+----------+
To get the query execution plan for a query, enter the following commands in a Beeline session:
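For example (a sketch; the query is hypothetical but uses the p_clients table described above):
SET hive.vectorized.execution.enabled=true;
EXPLAIN VECTORIZATION SELECT COUNT(*) FROM p_clients WHERE lastsale > 100;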
Figure 4: EXPLAIN VECTORIZATION Query Execution Plan for Hive Table Using the Parquet Format
By using the EXPLAIN VECTORIZATION statement with your queries, you can find out before you deploy them whether
vectorization will be triggered and what properties you must set to enable it.
Hive/Impala Replication
Note: This page contains references to CDH 5 components or features that have been removed from
CDH 6. These references are only applicable if you are managing a CDH 5 cluster with Cloudera Manager
6. For more information, see Deprecated Items.
Note: If your deployment includes tables backed by Kudu, BDR filters out Kudu tables for a Hive
replication in order to prevent data loss or corruption. Even though BDR does not replicate the data
in the Kudu tables, it might replicate the tables' metadata entries to the destination.
4. Add the HOST_WHITELIST property. Enter a comma-separated list of hostnames to use for Hive/Impala replication.
For example:
HOST_WHITELIST=host-1.mycompany.com,host-2.mycompany.com
5. Enter a Reason for change, and then click Save Changes to commit the changes.
Replication of Parameters
Hive replication replicates parameters of databases, tables, partitions, table column stats, indexes, views, and Hive
UDFs.
You can disable replication of parameters:
1. Log in to the Cloudera Manager Admin Console.
2. Go to the Hive service.
3. Click the Configuration tab.
4. Search for "Hive Replication Environment Advanced Configuration Snippet"
5. Add the following parameter:
REPLICATE_PARAMETERS=false
• If a Hive replication schedule is created to replicate a database, ensure all the HDFS paths for the tables in that
database are either snapshottable or under a snapshottable root. For example, if the database that is being
replicated has external tables, all the external table HDFS data locations should be snapshottable too. Failing to
do so will cause BDR to fail to generate a diff report. Without a diff report, BDR will not use snapshot diff.
• After every replication, BDR retains a snapshot on the source cluster. Using the snapshot copy on the source
cluster, BDR performs incremental backups for the next replication cycle. BDR retains snapshots on the source
cluster only if:
– The source and target clusters are managed by Cloudera Manager 5.15 or higher.
– The source and target clusters run CDH 5.13.3 or higher, 5.14.2 or higher, or 5.15 or higher.
To perform the replication, the destination cluster must be managed by Cloudera Manager 6.1.0 or higher. The source
cluster must run Cloudera Manager 5.14.0 or higher in order to be able to replicate to Cloudera Manager 6. For more
information about supported replication scenarios, see Supported Replication Scenarios.
Note: In replication scenarios where a destination cluster has multiple source clusters, all the source
clusters must either be secure or insecure. BDR does not support replication from a mixture of secure
and insecure source clusters.
To enable replication from an insecure cluster to a secure cluster, you need a user that exists on all the hosts on both
the source cluster and destination cluster. Specify this user in the Run As Username field when you create a replication
schedule.
The following steps describe how to add a user:
1. On a host in the source or destination cluster, add a user with the following command:
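One plausible form of this command (a sketch, using the example user milton referenced in the next step; it creates the user's directory in HDFS):
sudo -u hdfs hdfs dfs -mkdir -p /user/milton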
2. Set the permissions for the user directory with the following command:
For example, the following command makes milton the owner of the milton directory:
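A sketch consistent with the step above:
sudo -u hdfs hdfs dfs -chown milton /user/milton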
3. Create the supergroup group for the user you created in step 1 with the following command:
groupadd supergroup
4. Add the user you created in step 1 to the group you created:
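For example, assuming the user milton from the earlier steps:
usermod -G supergroup milton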
5. Repeat this process for all hosts in the source and destination clusters so that the user and group exists on all of
them.
After you complete this process, specify the user you created in the Run As Username field when you create a replication
schedule.
Note: If you are replicating to or from S3 or ADLS, follow the steps under Hive/Impala Replication
To and From Cloud Storage on page 106 before completing these steps.
a. Use the Name field to provide a unique name for the replication schedule.
b. Use the Source drop-down list to select the cluster with the Hive service you want to replicate.
c. Use the Destination drop-down list to select the destination for the replication. If there is only one Hive
service managed by Cloudera Manager available as a destination, this is specified as the destination. If more
than one Hive service is managed by this Cloudera Manager, select from among them.
d. Based on the type of destination cluster you plan to use, select one of these two options:
• Use HDFS Destination
• Use Cloud Destination
Note: For using cloud storage in the target cluster, you must set up a valid cloud storage
account and verify that the cloud storage has enough space to save the replicated data.
g. Select a Schedule:
• Immediate - Run the schedule Immediately.
• Once - Run the schedule one time in the future. Set the date and time.
• Recurring - Run the schedule periodically in the future. Set the date, time, and interval between runs.
h. To specify the user that should run the MapReduce job, use the Run As Username option. By default,
MapReduce jobs run as hdfs. To run the MapReduce job as a different user, enter the user name. If you are
using Kerberos, you must provide a user name here, and it must have an ID greater than 1000.
Note: The user running the MapReduce job should have read and execute permissions
on the Hive warehouse directory on the source cluster. If you configure the replication job
to preserve permissions, superuser privileges are required on the destination cluster.
i. Specify the Run on peer as Username option if the peer cluster is configured with a different superuser. This
is only applicable while working in a kerberized environment.
6. Select the Resources tab to configure the following:
• Scheduler Pool – (Optional) Enter the name of a resource pool in the field. The value you enter is used by
the MapReduce Service you specified when Cloudera Manager executes the MapReduce job for the replication.
The job specifies the value using one of these properties:
– MapReduce – Fair scheduler: mapred.fairscheduler.pool
– MapReduce – Capacity scheduler: queue.name
– YARN – mapreduce.job.queuename
• Maximum Map Slots and Maximum Bandwidth – Limits for the number of map slots and for bandwidth per
mapper. The default is 100 MB.
• Replication Strategy – Whether file replication should be static (the default) or dynamic. Static replication
distributes file replication tasks among the mappers up front to achieve a uniform distribution based on file
sizes. Dynamic replication distributes file replication tasks in small sets to the mappers, and as each mapper
processes its tasks, it dynamically acquires and processes the next unallocated set of tasks.
7. Select the Advanced tab to specify an export location, modify the parameters of the MapReduce job that will
perform the replication, and set other options. You can select a MapReduce service (if there is more than one in
your cluster) and change the following parameters:
• Uncheck the Replicate HDFS Files checkbox to skip replicating the associated data files.
• If both the source and destination clusters use CDH 5.7.0 or later up to and including 5.11.x, select the
Replicate Impala Metadata drop-down list and select No to avoid redundant replication of Impala metadata.
(This option only displays when supported by both source and destination clusters.) You can select the
following options for Replicate Impala Metadata:
– Yes – replicates the Impala metadata.
– No – does not replicate the Impala metadata.
– Auto – Cloudera Manager determines whether or not to replicate the Impala metadata based on the
CDH version.
To replicate Impala UDFs when the version of CDH managed by Cloudera Manager is 5.7 or lower, see
Replicating Data to Impala Clusters for information on when to select this option.
• The Force Overwrite option, if checked, forces overwriting data in the destination metastore if incompatible
changes are detected. For example, if the destination metastore was modified, and a new partition was added
to a table, this option forces deletion of that partition, overwriting the table with the version found on the
source.
Important: If the Force Overwrite option is not set, and the Hive/Impala replication process
detects incompatible changes on the source cluster, Hive/Impala replication fails. This
sometimes occurs with recurring replications, where the metadata associated with an existing
database or table on the source cluster changes over time.
process user of the HDFS service on the destination cluster. To override the default HDFS location for this
export file, specify a path in the Export Path field.
Note: In a Kerberized cluster, the HDFS principal on the source cluster must have read,
write, and execute access to the Export Path directory on the destination cluster.
• Number of concurrent HMS connections - The number of concurrent Hive Metastore connections. These
connections are used to concurrently import and export metadata from Hive. Increasing the number of
threads can improve BDR performance. By default, any new replication schedules will use 5 connections.
If you set the value to 1 or more, BDR uses multi-threading with the number of connections specified. If you
set the value to 0 or fewer, BDR uses single threading and a single connection.
Note that the source and destination clusters must run a Cloudera Manager version that supports concurrent
HMS connections, Cloudera Manager 5.15.0 or higher and Cloudera Manager 6.1.0 or higher.
• By default, Hive HDFS data files (for example, /user/hive/warehouse/db1/t1) are replicated to a location
relative to "/" (in this example, to /user/hive/warehouse/db1/t1). To override the default, enter a path
in the HDFS Destination Path field. For example, if you enter /ReplicatedData, the data files would be
replicated to /ReplicatedData/user/hive/warehouse/db1/t1.
• Select the MapReduce Service to use for this replication (if there is more than one in your cluster).
• Log Path - An alternative path for the logs.
• Description - A description for the replication schedule.
• Skip Checksum Checks - Whether to skip checksum checks, which are performed by default.
Checksums are used for two purposes:
• To skip replication of files that have already been copied. If Skip Checksum Checks is selected, the
replication job skips copying a file if the file lengths and modification times are identical between the
source and destination clusters. Otherwise, the job copies the file from the source to the destination.
• To redundantly verify the integrity of data. However, checksums are not required to guarantee accurate
transfers between clusters. HDFS data transfers are protected by checksums during transfer and storage
hardware also uses checksums to ensure that data is accurately stored. These two mechanisms work
together to validate the integrity of the copied data.
• Skip Listing Checksum Checks - Whether to skip checksum check when comparing two files to determine
whether they are same or not. If skipped, the file size and last modified time are used to determine if files
are the same or not. Skipping the check improves performance during the mapper phase. Note that if you
select the Skip Checksum Checks option, this check is also skipped.
• Abort on Error - Whether to abort the job on an error. By selecting the check box, files copied up to that
point remain on the destination, but no additional files will be copied. Abort on Error is off by default.
• Abort on Snapshot Diff Failures - If a snapshot diff fails during replication, BDR uses a complete copy to
replicate data. If you select this option, the BDR aborts the replication when it encounters an error instead.
• Delete Policy - Whether files that were on the source should also be deleted from the destination directory.
Options include:
– Keep Deleted Files - Retains the destination files even when they no longer exist at the source. (This is
the default.)
– Delete to Trash - If the HDFS trash is enabled, files are moved to the trash folder. (Not supported when
replicating to S3 or ADLS.)
– Delete Permanently - Uses the least amount of space; use with caution.
• Preserve - Whether to preserve the Block Size, Replication Count, and Permissions as they exist on the
source file system, or to use the settings as configured on the destination file system. By default, settings are
preserved on the source.
Note: You must be running as a superuser to preserve permissions. Use the "Run As
Username" option to ensure that is the case.
• Alerts - Whether to generate alerts for various state changes in the replication workflow. You can alert On
Failure, On Start, On Success, or On Abort (when the replication workflow is aborted).
8. Click Save Schedule.
The replication task appears as a row in the Replications Schedule table. See Viewing Replication Schedules on
page 102.
Note: If your replication job takes a long time to complete, and tables change before the replication
finishes, the replication may fail. Consider making the Hive Warehouse Directory and the directories
of any external tables snapshottable, so that the replication job creates snapshots of the directories
before copying the files. See Using Snapshots with Replication.
For previously-run replications, the number of replicated UDFs displays on the Replication History page:
Only one job corresponding to a replication schedule can occur at a time; if another job associated with that same
replication schedule starts before the previous one has finished, the second one is canceled.
You can limit the replication jobs that are displayed by selecting filters on the left. If you do not see an expected
schedule, adjust or clear the filters. Use the search box to search the list of schedules for path, database, or table
names.
The Replication Schedules columns are described in the following table.
Column Description
ID An internally generated ID number that identifies the schedule. Provides a convenient way to
identify a schedule.
Click the ID column label to sort the replication schedule table by ID.
Name The unique name you specify when you create a schedule.
Type The type of replication scheduled, either HDFS or Hive.
Source The source cluster for the replication.
Destination The destination cluster for the replication.
Throughput Average throughput per mapper/file of all the files written. Note that this value is not the
combined throughput of all mappers, and it does not include the time taken to perform a checksum
on a file after the file is written.
Progress The progress of the replication.
Last Run The date and time when the replication last ran. Displays None if the scheduled replication has
not yet been run. Click the date and time link to view the Replication History page for the
replication.
Displays one of the following icons:
• Successful - Displays the date and time of the last run replication.
• Failed - Displays the date and time of a failed replication.
• None - This scheduled replication has not yet run.
• Running - Displays a spinner and bar showing the progress of the replication.
Column Description
Click the Last Run column label to sort the Replication Schedules table by the last run date.
Next Run The date and time when the next replication is scheduled, based on the schedule parameters
specified for the schedule. Hover over the date to view additional details about the scheduled
replication.
Click the Next Run column label to sort the Replication Schedules table by the next run date.
Objects Displays on the bottom line of each row, depending on the type of replication:
• Hive - A list of tables selected for replication.
• HDFS - A list of paths selected for replication.
For example:
Actions The following items are available from the Action button:
• Show History - Opens the Replication History page for a replication. See Viewing Replication
History.
• Edit Configuration - Opens the Edit Replication Schedule page.
• Dry Run - Simulates a run of the replication task but does not actually copy any files or
tables. After a Dry Run, you can select Show History, which opens the Replication History
page where you can view any error messages and the number and size of files or tables
that would be copied in an actual replication.
• Click Collect Diagnostic Data to open the Send Diagnostic Data screen, which allows you
to collect replication-specific diagnostic data for the last 10 runs of the schedule:
1. Select Send Diagnostic Data to Cloudera to automatically send the bundle to Cloudera
Support. You can also enter a ticket number and comments when sending the bundle.
2. Click Collect and Send Diagnostic Data to generate the bundle and open the
Replications Diagnostics Command screen.
3. When the command finishes, click Download Result Data to download a zip file
containing the bundle.
• Run Now - Runs the replication task immediately.
• Disable | Enable - Disables or enables the replication schedule. No further replications are
scheduled for disabled replication schedules.
• Delete - Deletes the schedule. Deleting a replication schedule does not delete copied files
or tables.
• While a job is in progress, the Last Run column displays a spinner and progress bar, and each stage of the replication
task is indicated in the message beneath the job's row. Click the Command Details link to view details about the
execution of the command.
• If the job is successful, the number of files copied is indicated. If there have been no changes to a file at the source
since the previous job, then that file is not copied. As a result, after the initial job, only a subset of the files may
actually be copied, and this is indicated in the success message.
• If the job fails, the Failed icon displays.
• To view more information about a completed job, select Actions > Show History. See Viewing Replication History.
The Replication History page displays a table of previously run replication jobs with the following columns:
Column Description
Start Time Time when the replication job started.
Expand the display and show details of the replication. In this screen, you can:
• Click the View link to open the Command Details page, which displays details and
messages about each step in the execution of the command. Expand the display for a
Step to:
Column Description
– View the actual command string.
– View the Start time and duration of the command.
– Click the Context link to view the service status page relevant to the command.
– Select one of the tabs to view the Role Log, stdout, and stderr for the command.
See Viewing Running and Recent Commands.
• Click Collect Diagnostic Data to open the Send Diagnostic Data screen, which allows
you to collect replication-specific diagnostic data for this run of the schedule:
1. Select Send Diagnostic Data to Cloudera to automatically send the bundle to
Cloudera Support. You can also enter a ticket number and comments when sending
the bundle.
2. Click Collect and Send Diagnostic Data to generate the bundle and open the
Replications Diagnostics Command screen.
3. When the command finishes, click Download Result Data to download a zip file
containing the bundle.
• (HDFS only) Link to view details on the MapReduce Job used for the replication. See
Viewing and Filtering MapReduce Activities.
• (Dry Run only) View the number of Replicable Files. Displays the number of files that
would be replicated during an actual replication.
• (Dry Run only) View the number of Replicable Bytes. Displays the number of bytes that
would be replicated during an actual replication.
• Link to download a CSV file containing a Replication Report. This file lists the databases
and tables that were replicated.
• View the number of Errors that occurred during the replication.
• View the number of Impala UDFs replicated. (Displays only for Hive/Impala replications
where Replicate Impala Metadata is selected.)
• Click the link to download a CSV file containing a Download Listing. This file lists the files
and directories that were replicated.
• Click the link to download a CSV file containing Download Status.
• If a user was specified in the Run As Username field when creating the replication job,
the selected user displays.
• View messages returned from the replication job.
To replicate Hive/Impala data to or from cloud storage, you must configure credentials to access the S3 or ADLS
account. Additionally, you must create buckets in S3 or a data lake store in ADLS to store the replicated files.
When you replicate data to cloud storage with BDR, BDR also backs up file metadata, including extended attributes
and ACLs.
To configure Hive/Impala replication to or from S3 or ADLS:
1. Create AWS Credentials or Azure Credentials. See How to Configure AWS Credentials or Configuring ADLS Access
Using Cloudera Manager.
Important: If AWS S3 access keys are rotated, you must restart Cloudera Manager server;
otherwise, Hive replication fails.
s3a://S3_bucket_name/path
adl://<accountname>.azuredatalakestore.net/<path>
s3a://S3_bucket_name/path_to_metadata_file
adl://<accountname>.azuredatalakestore.net/<path_to_metadata_file>
6. Complete the configuration of the Hive/Impala replication schedule by following the steps under Configuring
Replication of Hive/Impala Data on page 98, beginning with step 5.f on page 99.
Ensure that the following basic permissions are available to provide read-write access to S3 through the S3A connector:
s3:Get*
s3:Delete*
s3:Put*
s3:ListBucket
s3:ListBucketMultipartUploads
s3:AbortMultipartUpload
Note: This page contains references to CDH 5 components or features that have been removed from
CDH 6. These references are only applicable if you are managing a CDH 5 cluster with Cloudera Manager
6. For more information, see Deprecated Items.
You can monitor the progress of a Hive/Impala replication schedule using performance data that you download as a
CSV file from the Cloudera Manager Admin console. This file contains information about the tables and partitions being
replicated, the average throughput, and other details that can help diagnose performance issues during Hive/Impala
replications. You can view this performance data for running Hive/Impala replication jobs and for completed jobs.
To view the performance data for a running Hive/Impala replication schedule:
1. Go to Backup > Replication Schedules.
2. Locate the row for the schedule.
3. Click Performance Reports and select one of the following options:
• HDFS Performance Summary – downloads a summary performance report of the HDFS phase of the running
Hive replication job.
• HDFS Performance Full – downloads a full performance report of the HDFS phase of the running Hive replication
job.
• Hive Performance – downloads a report of Hive performance.
4. To view the data, import the file into a spreadsheet program such as Microsoft Excel.
To view the performance data for a completed Hive/Impala replication schedule:
1. Go to Backup > Replication Schedules.
2. Locate the schedule and click Actions > Show History.
The Replication History page for the replication schedule displays.
3. Click to expand the display of the selected schedule.
4. To view performance of the Hive phase, click Download CSV next to the Hive Replication Report label and select
one of the following options:
• Results – download a listing of replicated tables.
• Performance – download a performance report for the Hive replication.
Note: The option to download the HDFS Replication Report might not appear if the HDFS phase
of the replication skipped all HDFS files because they have not changed, or if the Replicate HDFS
Files option (located on the Advanced tab when creating Hive/Impala replication schedules) is
not selected.
See Table 5: Hive Performance Report Columns on page 111 for a description of the data in the Hive performance
reports.
5. To view performance of the HDFS phase, click Download CSV next to the HDFS Replication Report label and select
one of the following options:
• Listing – a list of files and directories copied during the replication job.
• Status - full status report of files where the status of the replication is one of the following:
– ERROR – An error occurred and the file was not copied.
– DELETED – A deleted file.
– SKIPPED – A file where the replication was skipped because it was up-to-date.
• Error Status Only – full status report, filtered to show files with errors only.
• Deleted Status Only – full status report, filtered to show deleted files only.
• Skipped Status Only – full status report, filtered to show skipped files only.
• Performance – summary performance report.
• Full Performance – full performance report.
See Table 1 for a description of the data in the HDFS performance reports.
6. To view the data, import the file into a spreadsheet program such as Microsoft Excel.
The performance data is collected every two minutes. Therefore, no data is available during the initial execution of a
replication job because not enough samples are available to estimate throughput and other reported data.
Note the following limitations of the performance data returned by the CSV files downloaded from the Cloudera Manager Admin console:
• If you employ a proxy user with the form user@domain, performance data is not available through the links.
• If the replication job only replicates small files that can be transferred in less than a few minutes, no performance
statistics are collected.
• For replication schedules that specify the Dynamic Replication Strategy, statistics regarding the last file transferred
by a MapReduce job hide previous transfers performed by that MapReduce job.
• Only the last trace of each MapReduce job is reported in the CSV file.
Important: Cloudera does not support Apache Ranger or Hive's native authorization frameworks
for configuring access control in Hive. Use Cloudera-supported Apache Sentry instead.
• Encryption to secure the network connection between HiveServer2 and Hive clients.
In CDH 5.5 and later, encryption between HiveServer2 and its clients has been decoupled from Kerberos
authentication. (Prior to CDH 5.5, SASL QOP encryption for JDBC client drivers required connections authenticated
by Kerberos.) De-coupling the authentication process from the transport-layer encryption process means that
HiveServer2 can support two different approaches to encryption between the service and its clients (Beeline,
JDBC/ODBC) regardless of whether Kerberos is being used for authentication, specifically:
• SASL
• TLS/SSL
Unlike TLS/SSL, SASL QOP encryption does not require certificates and is aimed at protecting core Hadoop RPC
communications. However, SASL QOP may have performance issues when handling large amounts of data, so
depending on your usage patterns, TLS/SSL may be a better choice. See the following topics for details about
configuring HiveServer2 services and clients for TLS/SSL and SASL QOP encryption.
See Configuring Encrypted Communication Between HiveServer2 and Client Drivers on page 114 for details.
See the appropriate How-To guide from the above list for more information.
Property Description
Enable TLS/SSL for HiveServer2 - Click the checkbox to enable encrypted client-server communications between
HiveServer2 and its clients using TLS/SSL.
HiveServer2 TLS/SSL Server JKS Keystore File Location - Enter the path to the Java keystore on the host system.
For example:
/opt/cloudera/security/pki/server-name-server.jks
HiveServer2 TLS/SSL Server JKS Keystore File Password - Enter the password for the keystore that was passed at
the Java keytool command line when the key and keystore were created. As detailed in How To Obtain and Deploy
Keys and Certificates for TLS/SSL, the password for the keystore must be the same as the password for the key.
HiveServer2 TLS/SSL Certificate Trust Store File - Enter the path to the Java trust store on the host system. Cloudera
clusters are typically configured to use the alternative trust store, jssecacerts, set up at
$JAVA_HOME/jre/lib/security/jssecacerts.
The entry field for certificate trust store password has been left empty because the trust store is typically not
password protected—it contains no keys, only publicly available certificates that help establish the chain of trust
during the TLS/SSL handshake. In addition, reading the trust store does not require the password.
7. Click Save Changes.
8. Restart the Hive service.
Note: The trust store may have been password protected to prevent its contents from being modified.
However, a password-protected trust store can still be read without using the password.
The client needs the path to the trust store when attempting to connect to HiveServer2 using TLS/SSL. This can be
specified using two different approaches, as follows:
• Pass the path to the trust store each time you connect to HiveServer2 in the JDBC connection string:
jdbc:hive2://fqdn.example.com:10000/default;ssl=true;\
sslTrustStore=$JAVA_HOME/jre/lib/security/jssecacerts;trustStorePassword=extraneous
or,
• Set the path to the trust store one time in the Java system javax.net.ssl.trustStore property:
java -Djavax.net.ssl.trustStore=/usr/java/jdk1.7.0_67-cloudera/jre/lib/security/jssecacerts \
  -Djavax.net.ssl.trustStorePassword=extraneous \
  MyClass jdbc:hive2://fqdn.example.com:10000/default;ssl=true
To support encryption for communications between client and server processes, specify the QOP auth-conf setting
for the SASL QOP property in the HiveServer2 configuration file (hive-site.xml). For example,
<property>
<name>hive.server2.thrift.sasl.qop</name>
<value>auth-conf</value>
</property>
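After HiveServer2 is restarted with this setting, clients request the matching QOP in their connection string. The
following is a sketch only (the host name, realm, and database are placeholders); the note below refers to the _HOST
entry in such a connection string:
beeline> !connect jdbc:hive2://fqdn.example.com:10000/default;principal=hive/[email protected];sasl.qop=auth-conf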
The _HOST is a wildcard placeholder that gets automatically replaced with the fully qualified domain name (FQDN) of
the server running the HiveServer2 daemon process.
Important:
• When Sentry is enabled, you must use Beeline to execute Hive queries. Hive CLI is not supported
with Sentry and must be disabled. See Disabling Hive CLI for information on how to disable the
Hive CLI.
• There are some differences in syntax between Hive and the corresponding Impala SQL statements.
For Impala syntax, see SQL Statements.
• No privilege is required to drop a function. Any user can drop a function.
Sentry supports column-level authorization with the SELECT privilege. Information about column-level authorization
is in the Column-Level Authorization on page 124 section of this page.
See the sections below for details about the supported statements and privileges:
In Hive, the ALTER TABLE statement also sets the owner of a view. Use the following commands to grant the OWNER
privilege on a view:
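A minimal sketch, assuming the SET OWNER syntax available when Sentry object ownership is enabled (the view,
role, and user names are placeholders):
ALTER TABLE <view name> SET OWNER ROLE <role name>;
ALTER TABLE <view name> SET OWNER USER <user name>;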
Sentry only allows you to grant roles to groups that have alphanumeric characters and underscores (_) in the group
name. If the group name contains a non-alphanumeric character that is not an underscore, you can put the group
name in backticks (`) to execute the command. For example, Sentry will return an error for the following command:
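For instance, a sketch using a hypothetical role test and group test-group:
GRANT ROLE test TO GROUP test-group;
-- The same command executes when the group name is quoted in backticks:
GRANT ROLE test TO GROUP `test-group`;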
GRANT
<privilege> [, <privilege> ]
ON <object type> <object name>
TO ROLE <role name> [,ROLE <role name>]
The following table describes the privileges you can grant and the objects that they apply to:
Privilege Object
ALL Server, database, table, URI
CREATE Server, database
INSERT Server, database, table
REFRESH (Impala only) Server, database, table
SELECT Server, database, table, view, column
You can also grant the SELECT privilege on a specific column of a table with the following statement:
GRANT SELECT (<column name>) ON TABLE <table name> TO ROLE <role name>;
Sentry completes URIs that are missing a scheme by using the HDFS configuration provided in the fs.defaultFS
property. Using the same HDFS configuration, Sentry can also auto-complete URIs in case the URI is missing a scheme
and an authority component.
When a user attempts to access a URI, Sentry will check to see if the user has the required privileges. During the
authorization check, if the URI is incomplete, Sentry will complete the URI using the default HDFS scheme. Note that
Sentry does not check URI schemes for completion when they are being used to grant privileges. This is because users
can GRANT privileges on URIs that do not have a complete scheme or do not already exist on the filesystem.
For example, in CDH 5.8 and later, the following CREATE EXTERNAL TABLE statement works even though the statement
does not include the URI scheme.
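One plausible form, using placeholder names (the authority namenode:8020 is hypothetical); the location carries an
authority but no hdfs:// scheme:
CREATE EXTERNAL TABLE foo (id INT) LOCATION '//namenode:8020/path/to/table';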
Similarly, the following CREATE EXTERNAL TABLE statement works even though it is missing scheme and authority
components.
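A corresponding sketch in which the location is a bare path, so both the scheme and the authority are filled in from
fs.defaultFS:
CREATE EXTERNAL TABLE bar (id INT) LOCATION '/path/to/table';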
Since Sentry supports both HDFS and Amazon S3, in CDH 5.8 and later, Cloudera recommends that you specify the
fully qualified URI in GRANT statements to avoid confusion. If the underlying storage is a mix of S3 and HDFS, the risk
of granting the wrong privileges increases. The following are examples of fully qualified URIs:
• HDFS: hdfs://host:port/path/to/hdfs/table
• S3: s3a://bucket_name/path/to/s3/table
REVOKE
<privilege> [, <privilege> ]
ON <object type> <object name>
FROM ROLE <role name> [,ROLE <role name>]
For example, you can revoke previously-granted SELECT privileges on specific columns of a table with the following
statement:
REVOKE SELECT (<column name>) ON TABLE <table name> FROM ROLE <role name>;
GRANT
<privilege>
ON <object type> <object name>
TO ROLE <role name>
WITH GRANT OPTION
When you use the WITH GRANT OPTION clause, the ability to grant and revoke privileges applies to the object container
and all its children. For example, if you give GRANT privileges to a role at the database level, that role can grant and
revoke privileges to and from the database and all the tables in the database.
Only a role with the GRANT option on a privilege can revoke that privilege from other roles. And you cannot revoke
the GRANT privilege from a role without also revoking the privilege. To revoke the GRANT privilege, revoke the privilege
that it applies to and then grant that privilege again without the WITH GRANT OPTION clause.
You can use the WITH GRANT OPTION clause with the following privileges:
• ALL
• CREATE
• INSERT
• REFRESH (Impala only)
• SELECT
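For example, consider a grant such as the following (a sketch; the coffee_bean role and the coffee_database database
are the hypothetical names used in the discussion below):
GRANT SELECT ON DATABASE coffee_database TO ROLE coffee_bean WITH GRANT OPTION;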
The coffee_bean role can grant SELECT privileges to other roles on the coffee_database and all the tables within that
database.
When you revoke a privilege from a role, the GRANT privilege is also revoked from that role. For example, if you revoke
SELECT privileges from the coffee_bean role with this command:
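One plausible form of that command (a sketch consistent with the grant above):
REVOKE SELECT ON DATABASE coffee_database FROM ROLE coffee_bean;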
The coffee_bean role can no longer grant SELECT privileges on the coffee_database or its tables.
To remove the WITH GRANT OPTION privilege from the coffee_bean role and still allow the role to have SELECT privileges
on the coffee_database, you must run these two commands:
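A sketch of those two commands, revoking the privilege and then granting it again without the WITH GRANT OPTION
clause:
REVOKE SELECT ON DATABASE coffee_database FROM ROLE coffee_bean;
GRANT SELECT ON DATABASE coffee_database TO ROLE coffee_bean;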
SHOW Statement
• Lists the database(s) for which the current user has database, table, or column-level access:
SHOW DATABASES;
• Lists the table(s) for which the current user has table or column-level access:
SHOW TABLES;
• Lists the column(s) to which the current user has SELECT access:
• Lists all the roles in the system (only for Sentry admin users):
SHOW ROLES;
• Lists all the roles in effect for the current user session:
• Lists all the roles assigned to the given group name (only allowed for Sentry admin users and other users that
are part of the group specified by group name):
• The SHOW statement can also be used to list the privileges that have been granted to a role or all the grants given
to a role for a particular object.
It lists all the grants for the given <role name> (only allowed for Sentry admin users and other users that have
been granted the role specified by <role name>). The following command will also list any column-level privileges:
• Lists all the grants for a role or user on the given <object name> (only allowed for Sentry admin users and other
users that have been granted the role specified by <role name>). The following command will also list any
column-level privileges:
• Lists the roles and users that have grants on the Hive object. It does not show inherited grants from a parent
object. It only shows grants that are applied directly to the object. This command is only available for Hive.
• In Hive, this statement lists all the privileges the user has on objects. In Impala, this statement shows the privileges
the user has and the privileges the user's roles have on objects.
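For reference, the SHOW variants described above take roughly the following forms; this is a sketch, and the exact
syntax supported by Hive, Impala, and Sentry should be confirmed for your release:
SHOW COLUMNS (FROM|IN) <table name> [(FROM|IN) <database name>];
SHOW CURRENT ROLES;
SHOW ROLE GRANT GROUP <group name>;
SHOW GRANT ROLE <role name>;
SHOW GRANT ROLE <role name> ON <object type> <object name>;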
Privileges
Sentry supports the following privilege types:
CREATE
The CREATE privilege allows a user to create databases, tables, and functions. Note that to create a function, the user
also must have ALL permissions on the JAR where the function is located, i.e. GRANT ALL ON URI is required.
You can grant the CREATE privilege on a server or database with the following commands, respectively:
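A sketch with placeholder names:
GRANT CREATE ON SERVER <server name> TO ROLE <role name>;
GRANT CREATE ON DATABASE <database name> TO ROLE <role name>;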
You can use the GRANT CREATE statement with the WITH GRANT OPTION clause. The WITH GRANT OPTION clause
allows the granted role to grant the privilege to other roles on the system. See GRANT <Privilege> ... WITH GRANT
OPTION on page 120 for more information about how to use the clause.
The following table shows the CREATE privilege scope:
OWNER
The OWNER privilege gives a user or role special privileges on a database, table, or view in HMS. An object can only
have one owner at a time. For more information about the OWNER privilege, see Object Ownership.
The owner of an object can execute any action on the object, similar to the ALL privilege. However, the object owner
cannot transfer object ownership unless the ALL privileges with GRANT option is selected. You can specify the privileges
that an object owner has on the object with the OWNER Privileges for Sentry Policy Database Objects setting in
Cloudera Manager.
The following table shows the OWNER privilege scope:
Table / View - Any action allowed by the ALL privilege on the table or view, except transferring ownership of the
table or view.
WITH GRANT enabled: Allows the user or role to transfer ownership of the table or view as well as grant and revoke
privileges to other roles on the table or view.
For more information about the OWNER privilege, see Object Ownership.
REFRESH (Impala only)
You can use the GRANT REFRESH statement with the WITH GRANT OPTION clause. The WITH GRANT OPTION clause
allows the granted role to grant the privilege to other roles on the system. See GRANT <Privilege> ... WITH GRANT
OPTION on page 120 for more information about how to use the clause.
The following table shows the REFRESH privilege scope:
SELECT
The SELECT privilege allows a user to view table data and metadata. In addition, you can use the SELECT privilege to
provide column-level authorization. See Column-Level Authorization on page 124 below for details.
You can grant the SELECT privilege on a server, table, or database with the following commands, respectively:
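A sketch with placeholder names:
GRANT SELECT ON SERVER <server name> TO ROLE <role name>;
GRANT SELECT ON TABLE <table name> TO ROLE <role name>;
GRANT SELECT ON DATABASE <database name> TO ROLE <role name>;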
Column-Level Authorization
Sentry provides column-level authorization with the SELECT privilege. You can grant the SELECT privilege to a role
for a subset of columns in a table. If a new column is added to the table, the role will not have the SELECT privilege
on that column until it is explicitly granted.
You can grant and revoke the SELECT privilege on a set of columns with the following commands, respectively:
GRANT SELECT (<column name>) ON TABLE <table name> TO ROLE <role name>;
REVOKE SELECT (<column name>) ON TABLE <table name> FROM ROLE <role name>;
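For instance, using a hypothetical customers table, email column, and marketing role:
GRANT SELECT (email) ON TABLE customers TO ROLE marketing;
REVOKE SELECT (email) ON TABLE customers FROM ROLE marketing;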
Users with column-level authorization can execute the following commands on the columns that they have access to.
Note that the commands will only return data and metadata for the columns that the user's role has been granted
access to.
• SELECT <column name> FROM TABLE <table name>;
• SELECT COUNT <column name> FROM TABLE <table name>;
• SELECT <column name> FROM TABLE <table name> WHERE <column name> <operator> GROUP BY
<column name>;
• SHOW COLUMNS (FROM|IN) <table name> [(FROM|IN) <database name>];
As a rule, a user with SELECT access to only some columns in a table cannot perform table-level operations. However,
if a user has SELECT access to all the columns in a table, that user can also execute the following command:
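That command is most likely a full-table query, for example:
SELECT * FROM <table name>;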
Troubleshooting
HiveServer2 Service Crashes
If the HS2 service crashes frequently, confirm that the problem relates to HS2 heap exhaustion by inspecting the HS2
instance stdout log.
1. In Cloudera Manager, from the home page, go to Hive > Instances.
2. In the Instances page, click the link of the HS2 node that is down:
Figure 9: Link to the Stdout Log on the Cloudera Manager Processes Page
4. Still in Beeline, use the SHOW PARTITIONS command on the employee table that you just created:
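The command takes the following form (assuming the employee table from the preceding steps):
SHOW PARTITIONS employee;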
This command shows none of the partition directories you created in HDFS because the information about these
partition directories has not been added to the Hive metastore. Here is the output of SHOW PARTITIONS on
the employee table:
+------------+--+
| partition |
+------------+--+
+------------+--+
No rows selected (0.118 seconds)
5. Use MSCK REPAIR TABLE to synchronize the employee table with the metastore:
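A sketch of this step, assuming the same employee table, followed by re-running SHOW PARTITIONS so that the
output below can be produced:
MSCK REPAIR TABLE employee;
SHOW PARTITIONS employee;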
Now this command returns the partitions you created on the HDFS filesystem because the metadata has been
added to the Hive metastore:
+---------------+--+
| partition |
+---------------+--+
| dept=finance |
| dept=sales |
| dept=service |
+---------------+--+
3 rows selected (0.089 seconds)
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their
Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against
any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated
within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under
this License for that Work shall terminate as of the date such litigation is filed.
4. Redistribution.
You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You meet the following conditions:
1. You must give any other recipients of the Work or Derivative Works a copy of this License; and
2. You must cause any modified files to carry prominent notices stating that You changed the files; and
3. You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark,
and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part
of the Derivative Works; and
4. If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute
must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices
that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE
text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along
with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party
notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify
the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or
as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be
construed as modifying the License.
You may add Your own copyright statement to Your modifications and may provide additional or different license
terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as
a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated
in this License.
5. Submission of Contributions.
Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the
Licensor shall be under the terms and conditions of this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement
you may have executed with Licensor regarding such Contributions.
6. Trademarks.
This License does not grant permission to use the trade names, trademarks, service marks, or product names of the
Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing
the content of the NOTICE file.
7. Disclaimer of Warranty.
Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides
its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied,
including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or
FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or
redistributing the Work and assume any risks associated with Your exercise of permissions under this License.
8. Limitation of Liability.
In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required
by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable
to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising
as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss
of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even
if such Contributor has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability.
While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance
of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in
accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any
other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional
liability.
END OF TERMS AND CONDITIONS
http://www.apache.org/licenses/LICENSE-2.0