
CDH 5 Security Guide

Important Notice
(c) 2010-2015 Cloudera, Inc. All rights reserved.
Cloudera, the Cloudera logo, Cloudera Impala, and any other product or service
names or slogans contained in this document are trademarks of Cloudera and its
suppliers or licensors, and may not be copied, imitated or used, in whole or in part,
without the prior written permission of Cloudera or the applicable trademark holder.

Hadoop and the Hadoop elephant logo are trademarks of the Apache Software
Foundation. All other trademarks, registered trademarks, product names and
company names or logos mentioned in this document are the property of their
respective owners. Reference to any products, services, processes or other
information, by trade name, trademark, manufacturer, supplier or otherwise does
not constitute or imply endorsement, sponsorship or recommendation thereof by
us.
Complying with all applicable copyright laws is the responsibility of the user. Without
limiting the rights under copyright, no part of this document may be reproduced,
stored in or introduced into a retrieval system, or transmitted in any form or by any
means (electronic, mechanical, photocopying, recording, or otherwise), or for any
purpose, without the express written permission of Cloudera.
Cloudera may have patents, patent applications, trademarks, copyrights, or other
intellectual property rights covering subject matter in this document. Except as
expressly provided in any written license agreement from Cloudera, the furnishing
of this document does not give you any license to these patents, trademarks,
copyrights, or other intellectual property. For information about patents covering
Cloudera products, see http://tiny.cloudera.com/patents.
The information in this document is subject to change without notice. Cloudera
shall not be liable for any damages resulting from technical errors or omissions
which may be present in this document, or from use of this document.
Cloudera, Inc.
1001 Page Mill Road Bldg 2
Palo Alto, CA 94304
[email protected]
US: 1-888-789-1488
Intl: 1-650-362-0488
www.cloudera.com
Release Information
Version: 5.1.x
Date: April 16, 2014

Table of Contents
About this Guide ......................................................................................................11
Introduction to Hadoop Security............................................................................13
Hadoop Users in CDH 5...........................................................................................15
Configuring Hadoop Security in CDH 5..................................................................19
Step 1: Install CDH 5..................................................................................................................................20
Step 2: Verify User Accounts and Groups in CDH 5 Due to Security.....................................................20
Step 2a (MRv1 only): Verify User Accounts and Groups in MRv1......................................................................20
MRv1: Directory Ownership in the Local File System.........................................................................................21
MRv1: Directory Ownership on HDFS...................................................................................................................21
Step 2b (YARN only): Verify User Accounts and Groups in YARN......................................................................22
YARN: Directory Ownership in the Local File System.........................................................................................22
YARN: Directory Ownership on HDFS...................................................................................................................23

Step 3: If you are Using AES-256 Encryption, install the JCE Policy File..............................................24
Step 4: Create and Deploy the Kerberos Principals and Keytab Files..................................................25
When to Use kadmin.local and kadmin...............................................................................................................25
To create the Kerberos principals.........................................................................................................................26
To create the Kerberos keytab files......................................................................................................................27
To deploy the Kerberos keytab files.....................................................................................................................27

Step 5: Shut Down the Cluster.................................................................................................................28


Step 6: Enable Hadoop Security...............................................................................................................28
Step 7: Configure Secure HDFS.................................................................................................................29
To configure secure HDFS......................................................................................................................................30
To enable SSL for HDFS..........................................................................................................................................31

Optional Step 8: Configuring Security for HDFS High Availability........................................................31


Optional Step 9: Configure secure WebHDFS.........................................................................................31
Optional Step 10: Configuring a secure HDFS NFS Gateway................................................................32
Step 11: Set Variables for Secure DataNodes.........................................................................................32
Step 12: Start up the NameNode.............................................................................................................33
Information about the kinit Command................................................................................................................33

Step 13: Start up a DataNode...................................................................34


Step 14: Set the Sticky Bit on HDFS Directories.....................................................................................34
Step 15: Start up the Secondary NameNode (if used)...........................................................................35

Step 16: Configure Either MRv1 Security or YARN Security..................................................................35


Configuring MRv1 Security.......................................................................................................................36
Step 1: Configure Secure MRv1.............................................................................................................................36
Step 2: Start up the JobTracker..............................................................................................................................37
Step 3: Start up a TaskTracker...............................................................................................................................37
Step 4: Try Running a Map/Reduce Job...............................................................................................................38

Configuring YARN Security.......................................................................................................................38


Step 1: Configure Secure YARN.............................................................................................................................38
Step 2: Start up the ResourceManager................................................................................................................40
Step 3: Start up the NodeManager.......................................................................................................................40
Step 4: Start up the MapReduce Job History Server...........................................................................................40
Step 5: Try Running a Map/Reduce YARN Job....................................................................................................41

Enabling HDFS Extended ACLs................................................................................................................41


Enabling ACLs.........................................................................................................................................................41
Commands..............................................................................................................................................................41

Sentry Policy File Configuration.............................................................................43


Prerequisites..............................................................................................................................................43
Roles and Privileges..................................................................................................................................44
Privilege Model...........................................................................................................................................44
Users and Groups......................................................................................................................................45
User to Group Mapping..........................................................................................................................................45

Setup and Configuration...........................................................................................................................46


Installing and Upgrading Sentry...........................................................................................................................46
Policy file..................................................................................................................................................................47
Sample Configuration.............................................................................................................................................49
Enabling Sentry in HiveServer2 ............................................................................................................................51
Securing the Hive Metastore.................................................................................................................................51

Accessing Sentry-Secured Data Outside Hive/Impala..........................................................................52


Scenario One: Authorizing Jobs.............................................................................................................................52
Scenario Two: Authorizing Group Access to Databases.....................................................................................52

Debugging Failed Sentry Authorization Requests.................................................................................53


Appendix: Authorization Privilege Model for Hive and Impala.............................................................53
Object Hierarchy in Hive.........................................................................................................................................53

Sentry Service Configuration..................................................................................57


Prerequisites..............................................................................................................................................57
Privilege Model...........................................................................................................................................58
Users and Groups......................................................................................................................................58
User to Group Mapping..........................................................................................................................................58

Setup and Configuration...........................................................................................................................60


Installing and Upgrading Sentry...........................................................................................................................60

Starting the Sentry Service....................................................................................................................................60

Hive SQL Syntax.........................................................................................................................................61


Example: Using Grant/Revoke Statements to Match an Existing Policy File..................................................63

Configuring HiveServer2 for the Sentry Service.....................................................................................64


Configuring the Hive Metastore for the Sentry Service.........................................................................64
Configuring Impala for the Sentry Service..............................................................................................65
Appendix: Authorization Privilege Model for Hive and Impala.............................................................66
Object Hierarchy in Hive.........................................................................................................................................66

Flume Security Configuration.................................................................................71


Configuring Flume's Security Properties.................................................................................................71
Writing as a single user for all HDFS sinks in a given Flume agent.................................................................71
Writing as different users across multiple HDFS sinks in a single Flume agent............................................72
Limitations..............................................................................................................................................................72

Flume Account Requirements..................................................................................................................73


Testing the Flume HDFS Sink Configuration..........................................................................................73
Writing to a Secure HBase cluster...........................................................................................................73

Hue Security Configuration.....................................................................................75


Hue Security Enhancements....................................................................................................................75
Configuring Hue to Support Hadoop Security using Kerberos.............................................................76
Integrating Hue with LDAP.......................................................................................................................79
Importing LDAP Users and Groups.......................................................................................................................81
Synchronizing LDAP Users and Groups...............................................................................................................82
LDAPS/StartTLS support.......................................................................................................................................83

Configuring Hue for SAML........................................................................................................................83


Step 1: Install swig and openssl packages..........................................................................................................83
Step 2: Install libraries to support SAML in Hue.................................................................................................83
Step 3: Update the Hue configuration file............................................................................................................84
Step 4: Restart the Hue server..............................................................................................................................85

Oozie Security Configuration..................................................................................87


Configuring the Oozie Server to Support Kerberos Security.................................................................87
Configuring Oozie HA with Kerberos.......................................................................................................88
Configuring Oozie to use SSL (HTTPS).....................................................................................................89

HttpFS Security Configuration................................................................................93


Configuring the HttpFS Server to Support Kerberos Security..............................................................93
Using curl to access a URL Protected by Kerberos HTTP SPNEGO....................................94
Configuring HttpFS to use SSL (HTTPS)...................................................................................................95

HBase Security Configuration.................................................................................99


Configuring HBase Authentication..........................................................................................................99
Step 1: Configure HBase Servers to Authenticate with a Secure HDFS Cluster..............................................99
Step 2: Configure HBase Servers and Clients to Authenticate with a Secure ZooKeeper............................101

Configuring HBase Authorization..........................................................................................................102


Understanding HBase Access Levels.................................................................................................................102
Enable HBase Authorization...............................................................................................................................104
Configure Access Control Lists for Authorization.............................................................................................105

Configuring Secure HBase Replication..................................................................................................105


Configuring the HBase Client TGT Renewal Period..............................................................................106

Impala Security Configuration .............................................................................107


Security Guidelines for Impala...............................................................................................................107
Securing Impala Data and Log Files......................................................................................................108
Installation Considerations for Impala Security...................................................................................109
Securing the Hive Metastore Database................................................................................................109
Securing the Impala Web User Interface..............................................................................................109
Enabling SSL for Impala..........................................................................................................................109
Enabling Sentry Authorization for Impala............................................................................................110
The Sentry Privilege Model..................................................................................................................................110
Starting the impalad Daemon with Sentry Authorization Enabled................................................................111
Using Impala with the Sentry Service (CDH 5.1 or higher only).......................................................................112
Using Impala with the Sentry Policy File...........................................................................................................112
Setting Up Schema Objects for a Secure Impala Deployment.........................................................................117
Privilege Model and Object Hierarchy................................................................................................................118
Debugging Failed Sentry Authorization Requests............................................................................................121
Configuring Per-User Access for Hue.................................................................................................................121
Managing Sentry for Impala through Cloudera Manager................................................................................121
The DEFAULT Database in a Secure Deployment.............................................................................................122

Enabling Kerberos Authentication for Impala......................................................................................122


Requirements for Using Impala with Kerberos.................................................................................................122
Configuring Impala to Support Kerberos Security............................................................................................123
Enabling Kerberos for Impala with a Proxy Server...........................................................................................124
Using a Web Browser to Access a URL Protected by Kerberos HTTP SPNEGO.............................................124

Enabling LDAP Authentication for Impala............................................................................................124


Using Multiple Authentication Methods with Impala.........................................................................126
Auditing Impala Operations...................................................................................................................126
Durability and Performance Considerations for Impala Auditing...................................................................127
Format of the Audit Log Files.............................................................................................................................127
Which Operations Are Audited............................................................................................................................128
Reviewing the Audit Logs....................................................................................................................................128

Hive Security Configuration..................................................................................129


HiveServer2 Security Configuration.......................................................................................................129
Enabling Kerberos Authentication for HiveServer2..........................................................................................129
Encrypted Communication with Client Drivers.................................................................................................130
Using LDAP Username/Password Authentication with HiveServer2............................................................131
Configuring LDAPS Authentication with HiveServer2......................................................................................132
Pluggable Authentication....................................................................................................................................133
Trusted Delegation with HiveServer2.................................................................................................................134
HiveServer2 Impersonation.................................................................................................................................134
Securing the Hive Metastore...............................................................................................................................135
Disabling the Hive Security Configuration.........................................................................................................135

Hive Metastore Server Security Configuration.....................................................................................136


Using Hive to Run Queries on a Secure HBase Server........................................................................137

HCatalog Security Configuration..........................................................................139


Before You Start......................................................................................................................................139
Step 1: Create the HTTP keytab file ......................................................................................................139
Step 2: Configure WebHCat to Use Security.........................................................................................139
Step 3: Create Proxy Users.....................................................................................................................140
Step 4: Verify the Configuration.............................................................................................................140

Llama Security Configuration...............................................................................141


Configuring Llama to Support Kerberos Security.................................................................................141

ZooKeeper Security Configuration.......................................................................143


Configuring the ZooKeeper Server to Support Kerberos Security......................................................143
Configuring the ZooKeeper Client Shell to Support Kerberos Security..............................................144
Verifying the Configuration....................................................................................................................144

Search Security Configuration..............................................................................147


Configuring Search to Use Kerberos......................................................................................................147
Using Kerberos.........................................................................................................................................148
Configuring Sentry for Search................................................................................................................150
Roles and Collection-Level Privileges................................................................................................................151
Users and Groups.................................................................................................................................................151
Setup and Configuration......................................................................................................................................152
Policy File...............................................................................................................................................................152
Sample Configuration...........................................................................................................................................152
Enabling Sentry in Cloudera Search for CDH 5..................................................................................................153

Providing Document-Level Security Using Sentry............................................................................................154


Enabling Secure Impersonation..........................................................................................................................156
Debugging Failed Sentry Authorization Requests............................................................................................157
Appendix: Authorization Privilege Model for Search........................................................................................157

FUSE - Mountable HDFS Security Configuration...............................................161


Sqoop, Pig, and Whirr Security Support Status..................................................163
Configuring Encrypted Shuffle, Encrypted Web UIs, and Encrypted HDFS Transport............................165
Encrypted Shuffle and Encrypted Web UIs...........................................................................................165
Configuring Encrypted Shuffle and Encrypted Web UIs...................................................................................165
Activating Encrypted Shuffle...............................................................................................................................170
Client Certificates.................................................................................................................................................170
Reloading Truststores..........................................................................................................................................170
Debugging.............................................................................................................................................................171

HDFS Encrypted Transport.....................................................................................................................171

Integrating Hadoop Security with Active Directory............................................173


Configuring a Local MIT Kerberos Realm to Trust Active Directory...................................................173
On the Active Directory Server............................................................................................................................173
On the MIT KDC server.........................................................................................................................................174
On all of the cluster machines............................................................................................................................174

Integrating Hadoop Security with Alternate Authentication............................175


Step 1: Configure the AuthenticationFilter to use Kerberos...............................................................175
Step 2: Creating an AltKerberosAuthenticationHandler Subclass.....................................................175
Step 3: Enabling Your AltKerberosAuthenticationHandler Subclass.................................................176
Step 3a: Enabling Your AltKerberosAuthenticationHandler Subclass on Hadoop Web UIs.........................176
Step 3b: Enabling Your AltKerberosAuthenticationHandler Subclass on Oozie Web UI..............................176

Example Implementation for Oozie.......................................................................................................177

Appendix A - Troubleshooting.............................................................179
Sample Kerberos Configuration files: krb5.conf, kdc.conf, kadm5.acl...............................................179
Problem 1: Running any Hadoop command fails after enabling security. .......................................181
Problem 2: Java is unable to read the Kerberos credentials cache created by versions of MIT
Kerberos 1.8.1 or higher. ..................................................................................................................181
Problem 3: java.io.IOException: Incorrect permission.........................................................................182

Problem 4: A cluster fails to run jobs after security is enabled. ........................................................183


Problem 5: The NameNode does not start and KrbException Messages (906) and (31) are
displayed. ...........................................................................................................................................184
Problem 6: The NameNode starts but clients cannot connect to it and error message contains
enctype code 18. ................................................................................................................................185
(MRv1 Only) Problem 7: Jobs won't run and TaskTracker is unable to create a local mapred
directory. .............................................................................................................................................185
(MRv1 Only) Problem 8: Jobs won't run and TaskTracker is unable to create a Hadoop logs
directory. .............................................................................................................................................186
Problem 9: After you enable cross-realm trust, you can run Hadoop commands in the local
realm but not in the remote realm. .................................................................................................187
(MRv1 Only) Problem 10: Jobs won't run and can't access files in mapred.local.dir . ......................187
Problem 11: Users are unable to obtain credentials when running Hadoop jobs or commands.
..............................................................................................................................................................188
Problem 12: "Request is a replay" exceptions in the logs. ...............................................188

Appendix B - Information about Other Hadoop Security Programs................191


MRv1 and YARN: The jsvc Program.......................................................................................................191
MRv1 Only: The Linux TaskController Program....................................................................................191
YARN Only: The Linux Container Executor Program............................................................................191

Appendix C - Configuring the Mapping from Kerberos Principals to Short Names.................................193
Mapping Rule Syntax..............................................................................................................................193
Principal Translation............................................................................................................................................193
Acceptance Filter..................................................................................................................................................194
Short Name Substitution.....................................................................................................................................194
Converting Principal Names to Lowercase........................................................................................................194

Example Rules.........................................................................................................................................194
Default Rule..............................................................................................................................................195
Testing Mapping Rules...........................................................................................................................195

Appendix D - Enabling Debugging Output for the Sun Kerberos Classes.......197


Appendix E - Task-controller and Container-executor Error Codes................199
MRv1 ONLY: Task-controller Error Codes.............................................................................................199
YARN ONLY: Container-executor Error Codes......................................................................................201

Appendix F - Using kadmin to Create Kerberos Keytab Files...........................203


To create the Kerberos keytab files.......................................................................................................203

Appendix G - Setting Up a Gateway Node to Restrict Access...........................205


Installing and Configuring the Firewall and Gateway.........................................................................205
Accessing HDFS.......................................................................................................................................205
Submitting and Monitoring Jobs............................................................................................................206

Appendix H - Using a Web Browser to Access a URL Protected by Kerberos HTTP SPNEGO.....................................207
Appendix I - Configuring LDAP Group Mappings...............................................211
Appendix J - Before Logging a Support Case......................................................213
Kerberos Issues.......................................................................................................................................213
SSL/TLS Issues........................................................................................................................................213
LDAP Issues.............................................................................................................................................213

Appendix K - Authenticating Kerberos Principals in Java Code.......................215

About this Guide


This CDH 5 Security Guide is for Apache Hadoop developers and system administrators who want to implement
Kerberos security on a CDH 5 cluster. It is written for users of CDH 5 because it includes instructions
specific to configuring your system with the Cloudera packages.
This guide includes the following major topics:

Introduction to Hadoop Security


The security features in CDH 5 enable Hadoop to prevent malicious user impersonation. The Hadoop daemons
leverage Kerberos to perform user authentication on all remote procedure calls (RPCs). Group resolution is
performed on the Hadoop master nodes, NameNode, JobTracker and ResourceManager to guarantee that group
membership cannot be manipulated by users. Map tasks are run under the user account of the user who
submitted the job, ensuring isolation there. In addition to these features, new authorization mechanisms have
been introduced to HDFS and MapReduce to enable more control over user access to data.
The security features in CDH 5 meet the needs of most Hadoop customers because typically the cluster is
accessible only to trusted personnel. In particular, Hadoop's current threat model assumes that users cannot:
1. Have root access to cluster machines.
2. Have root access to shared client machines.
3. Read or modify packets on the network of the cluster.
Note:
CDH 5 supports encryption of all user data sent over the network. For configuration instructions, see
Configuring Encrypted Shuffle, Encrypted Web UIs, and Encrypted HDFS Transport.
Note also that there is no built-in support for on-disk encryption.

Hadoop Users in CDH 5


A number of special users are created by default when installing and using CDH and Cloudera Manager. Given
below is a list of users and groups as of the latest CDH 5.1.x release. Also listed below are the corresponding
Kerberos principals and keytab files that should be created when you configure Kerberos security on your cluster.
Table 1: CDH 5 Users & Groups
Project | Unix User ID | Primary Group | Group Members | Notes

Apache Avro | No special users.
Apache Flume | flume | flume | | The sink that writes to HDFS as this user must have write privileges.
Apache HBase | hbase | hbase | | The Master and the RegionServer processes run as this user.
HDFS | hdfs | hdfs | impala | The NameNode and DataNodes run as this user, and the HDFS root directory as well as the directories used for edit logs should be owned by it.
Apache Hive | hive | hive | impala | The HiveServer2 process and the Hive Metastore processes run as this user. A user must be defined for Hive access to its Metastore DB (e.g. MySQL or Postgres), but it can be any identifier and does not correspond to a Unix uid. This is javax.jdo.option.ConnectionUserName in hive-site.xml.
Apache HCatalog | hive | hive | | The WebHCat service (for REST access to Hive functionality) runs as the hive user. It is not configurable.
HttpFS | httpfs | httpfs | | The HttpFS service runs as this user. See HttpFS Security Configuration for instructions on how to generate the merged httpfs-http.keytab file.
Hue | hue | hue | | Hue runs as this user. It is not configurable.
Cloudera Impala | impala | impala | | An interactive query tool.
Llama | llama | llama | |
Apache Mahout | No special users.
MapReduce | mapred | mapred | | Without Kerberos, the JobTracker and tasks run as this user. The LinuxTaskController binary is owned by this user for Kerberos. It would be complicated to use a different user ID.
Apache Oozie | oozie | oozie | | The Oozie service runs as this user.
Parquet | No special users.
Apache Pig | No special users.
Cloudera Search | solr | solr | | The Solr process runs as this user. It is not configurable.
Apache Spark | spark | spark | | The Spark process runs as this user. It is not configurable.
Apache Sentry (incubating) | sentry | sentry | | The Sentry service runs as this user.
Apache Sqoop | sqoop | sqoop | | This user is only for the Sqoop1 Metastore, a configuration option that is not recommended.
Apache Sqoop2 | sqoop2 | sqoop | | The Sqoop2 service runs as this user.
Apache Whirr | No special users.
YARN | yarn | yarn | | Without Kerberos, all YARN services and applications run as this user. The LinuxContainerExecutor binary is owned by this user for Kerberos. It would be complicated to use a different user ID.
Apache ZooKeeper | zookeeper | zookeeper | | The ZooKeeper process runs as this user. It is not configurable.
Other | | hadoop | yarn, hdfs, mapred | This is a group with no associated Unix user ID or keytab.
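If you want to confirm that the expected accounts and groups exist on a given host after installation, a quick check such as the following can help. This is only an illustrative sketch; the set of users present depends on which CDH components are installed on that host.

# List the UID, primary group, and supplementary groups for a few service accounts.
$ for u in hdfs mapred yarn hive hbase flume oozie hue zookeeper; do id "$u"; done
# Show the members of the shared hadoop group, if it exists.
$ getent group hadoop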

Note:
The Kerberos principal names should be of the format
username/fully.qualified.domain.name@YOUR-REALM.COM, where the term username refers to
the username of an existing UNIX account, such as hdfs or mapred. The table below lists the usernames
to be used for the Kerberos principal names. For example, the Kerberos principal for Apache Flume
would be flume/fully.qualified.domain.name@YOUR-REALM.COM.
Table 2: CDH 5 Keytabs and Keytab File Permissions
Project (UNIX ID) | Service | Kerberos Principal Primary | Filename (.keytab) | Keytab File Owner | Keytab File Group | File Permission (octal)

Flume (flume) | flume-AGENT | flume | flume | flume | flume | 600
HBase (hbase) | hbase-MASTER, hbase-REGIONSERVER, hbase-HBASETHRIFTSERVER, hbase-HBASERESTSERVER | hbase | hbase | hbase | hbase | 600
HDFS (hdfs) | hdfs-NAMENODE, hdfs-DATANODE, hdfs-SECONDARYNAMENODE | hdfs (Secondary: Merge hdfs and HTTP) | hdfs | hdfs | hdfs | 600
Hive (hive) | hive-HIVESERVER2, hive-HIVEMETASTORE | hive | hive | hive | hive | 600
Hive (hive) | hive-WEBHCAT | HTTP | HTTP | hive | hive | 600
HttpFS (httpfs) | hdfs-HTTPFS | httpfs | httpfs | httpfs | httpfs | 600
Hue (hue) | hue-KT_RENEWER | hue | hue | hue | hue | 600
Impala (impala) | impala-STATESTORE, impala-CATALOGSERVER, impala-IMPALAD | impala | impala | impala | impala | 600
Llama (llama) | impala-LLAMA | llama (Secondary: Merge llama and HTTP) | llama | llama | llama | 600
MapReduce (mapred) | mapreduce-JOBTRACKER, mapreduce-TASKTRACKER | mapred (Secondary: Merge mapred and HTTP) | mapred | mapred | hadoop | 600
Oozie (oozie) | oozie-OOZIE_SERVER | oozie (Secondary: Merge oozie and HTTP) | oozie | oozie | oozie | 600
Search (solr) | solr-SOLR_SERVER | solr (Secondary: Merge solr and HTTP) | solr | solr | solr | 600
Sentry (sentry) | sentry-SENTRY_SERVER | sentry | sentry | sentry | sentry | 600
Spark (spark) | spark_on_yarn-SPARK_YARN_HISTORY_SERVER | spark | spark | spark | spark | 600
Sqoop (sqoop) | No keytab entries listed.
Sqoop2 (sqoop2) | No keytab entries listed.
YARN (yarn) | yarn-NODEMANAGER | yarn (Secondary: Merge yarn and HTTP) | yarn | yarn | hadoop | 644
YARN (yarn) | yarn-RESOURCEMANAGER | yarn (Secondary: Merge yarn and HTTP) | yarn | yarn | hadoop | 600
YARN (yarn) | yarn-JOBHISTORY | yarn (Secondary: Merge yarn and HTTP) | yarn | yarn | hadoop | 600
ZooKeeper (zookeeper) | zookeeper-server | zookeeper | zookeeper | zookeeper | zookeeper | 600
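Once the keytab files have been generated and deployed (see Step 4), their ownership and mode on each host should match the table above. The following sketch uses the Flume keytab as an example; the path /etc/flume-ng/conf/flume.keytab is illustrative only and depends on where you choose to deploy keytab files.

$ sudo chown flume:flume /etc/flume-ng/conf/flume.keytab
$ sudo chmod 600 /etc/flume-ng/conf/flume.keytab
# Confirm which principals the keytab contains and when the keys were created.
$ sudo klist -k -t /etc/flume-ng/conf/flume.keytab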

Configuring Hadoop Security in CDH 5


Important:
These instructions assume you know how to install and configure Kerberos, you already have a
working Kerberos Key Distribution Center (KDC) and realm setup, and that you've installed the Kerberos
user packages on all cluster machines and machines which will be used to access the cluster.
Furthermore, Oozie and Hue require that the realm support renewable tickets. For more information
about installing and configuring Kerberos, see:

MIT Kerberos Home
MIT Kerberos Documentation
Kerberos Explained
Microsoft Kerberos Overview
Microsoft Kerberos in Windows Server 2008
Microsoft Kerberos in Windows Server 2003

Here are the general steps for configuring secure Hadoop, each of which is described in more detail in the following
sections:
1. Install CDH 5.
2. Verify User Accounts and Groups in CDH 5 Due to Security.
3. If you are Using AES-256 Encryption, install the JCE Policy File.
4. Create and Deploy the Kerberos Principals and Keytab Files.
5. Shut Down the Cluster.
6. Enable Hadoop security.
7. Configure secure HDFS.
8. Optional: Configuring Security for HDFS High Availability.
9. Optional: Configuring secure WebHDFS.
10. Optional: Configuring secure NFS
11. Set Variables for Secure DataNodes.
12. Start up the NameNode.
13. Start up a DataNode.
14. Set the Sticky Bit on HDFS Directories.
15. Start up the Secondary NameNode (if used).
16. Configure Either MRv1 Security or YARN Security.
Note:
Kerberos security in CDH 5 has been tested with the following version of MIT Kerberos 5:
krb5-1.6.1 on Red Hat Enterprise Linux 5 and CentOS 5
Kerberos security in CDH 5 is supported with the following versions of MIT Kerberos 5:
krb5-1.6.3 on SUSE Linux Enterprise Server (SLES) 11 Service Pack 1
krb5-1.8.1 on Ubuntu
krb5-1.8.2 on Red Hat Enterprise Linux 6 and CentOS 6
krb5-1.9 on Red Hat Enterprise Linux 6.1
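To check which MIT Kerberos version is installed on a host, commands such as the following can be used; package names vary by distribution, so treat these as examples only.

# krb5-config is typically provided by the Kerberos development package.
$ krb5-config --version
# On Red Hat Enterprise Linux and CentOS:
$ rpm -q krb5-workstation krb5-libs
# On Ubuntu:
$ dpkg -l | grep krb5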



Note:
If you want to enable Kerberos SPNEGO-based authentication for the Hadoop web interfaces, see
the Hadoop Auth, Java HTTP SPNEGO Documentation.

Step 1: Install CDH 5


Cloudera strongly recommends that you set up a fully-functional CDH 5 cluster before you begin configuring it
to use Hadoop's security features. When a secure Hadoop cluster is not configured correctly, the resulting error
messages can be cryptic and hard to diagnose, so it is best to start implementing security only after you are sure
your Hadoop cluster is working properly without it.
For information about installing and configuring Hadoop and CDH 5 components, and deploying them on a cluster,
see the CDH 5 Installation Guide.

Step 2: Verify User Accounts and Groups in CDH 5 Due to Security


Note:
CDH 5 introduces a new version of MapReduce: MapReduce 2.0 (MRv2) built on the YARN framework.
In this document, we refer to this new version as YARN. CDH 5 also provides an implementation of
the previous version of MapReduce, referred to as MRv1 in this document.
If you are using MRv1, see Step 2a (MRv1 only): Verify User Accounts and Groups in MRv1 on page 20 for
configuration information. If you are using YARN, see Step 2b (YARN only): Verify User Accounts and Groups in
YARN for configuration information.

Step 2a (MRv1 only): Verify User Accounts and Groups in MRv1


Note:
If you are using YARN, skip this step and proceed to Step 2b (YARN only): Verify User Accounts and
Groups in YARN.
During CDH 5 package installation of MRv1, the following Unix user accounts are automatically created to support
security:
This User | Runs These Hadoop Programs
hdfs | HDFS: NameNode, DataNodes, Secondary NameNode (or Standby NameNode if you are using HA)
mapred | MRv1: JobTracker and TaskTrackers

The hdfs user also acts as the HDFS superuser.


The hadoop user no longer exists in CDH 5. If you currently use the hadoop user to run applications as an HDFS
super-user, you should instead use the new hdfs user, or create a separate Unix account for your application
such as myhadoopapp.
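For example, a dedicated application account can be created instead of reusing hdfs. The account name myhadoopapp is taken from the paragraph above and is purely illustrative; creating its HDFS home directory assumes the cluster is already running.

$ sudo useradd myhadoopapp
$ sudo -u hdfs hadoop fs -mkdir /user/myhadoopapp
$ sudo -u hdfs hadoop fs -chown myhadoopapp /user/myhadoopapp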



MRv1: Directory Ownership in the Local File System
Because the HDFS and MapReduce services run as different users, you must be sure to configure the correct
directory ownership of the following files on the local file system of each host:
File System | Directory | Owner | Permissions
Local | dfs.namenode.name.dir (dfs.name.dir is deprecated but will also work) | hdfs:hdfs | drwx------
Local | dfs.datanode.data.dir (dfs.data.dir is deprecated but will also work) | hdfs:hdfs | drwx------
Local | mapred.local.dir | mapred:mapred | drwxr-xr-x

See also Deploying MapReduce v1 (MRv1) on a Cluster.


You must also configure the following permissions for the HDFS and MapReduce log directories (the default
locations in /var/log/hadoop-hdfs and /var/log/hadoop-0.20-mapreduce), and the
$MAPRED_LOG_DIR/userlogs/ directory:
File System | Directory | Owner | Permissions
Local | HDFS_LOG_DIR | hdfs:hdfs | drwxrwxr-x
Local | MAPRED_LOG_DIR | mapred:mapred | drwxrwxr-x
Local | userlogs directory in MAPRED_LOG_DIR | mapred:anygroup | permissions will be set automatically at daemon start time
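The ownership and permissions above can be applied manually on each host. The following is a sketch only; replace the example paths with the actual values of dfs.name.dir (dfs.namenode.name.dir), dfs.data.dir (dfs.datanode.data.dir), and mapred.local.dir from your configuration, and with your actual log directories.

$ sudo chown -R hdfs:hdfs /data/1/dfs/nn /data/1/dfs/dn
$ sudo chmod 700 /data/1/dfs/nn /data/1/dfs/dn
$ sudo chown -R mapred:mapred /data/1/mapred/local
$ sudo chmod 755 /data/1/mapred/local
$ sudo chown hdfs:hdfs /var/log/hadoop-hdfs
$ sudo chown mapred:mapred /var/log/hadoop-0.20-mapreduce
$ sudo chmod 775 /var/log/hadoop-hdfs /var/log/hadoop-0.20-mapreduce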

MRv1: Directory Ownership on HDFS


The following directories on HDFS must also be configured as follows:
File System | Directory | Owner | Permissions
HDFS | mapreduce.jobtracker.system.dir (mapred.system.dir is deprecated but will also work) | mapred:hadoop | drwx------
HDFS | / (root directory) | hdfs:hadoop | drwxr-xr-x

In CDH 5, package installation and the Hadoop daemons will automatically configure the correct permissions
for you if you configure the directory ownership correctly as shown in the table above.
When starting up, MapReduce sets the permissions for the mapreduce.jobtracker.system.dir (or
mapred.system.dir) directory in HDFS, assuming the user mapred owns that directory.


MRv1: Changing the Directory Ownership on HDFS
If Hadoop security is enabled, obtain Kerberos credentials for the hdfs user by running the following command
before changing the directory ownership on HDFS:
$ sudo -u hdfs kinit -k -t hdfs.keytab hdfs/fully.qualified.domain.name@YOUR-REALM.COM

If kinit hdfs does not work initially, run kinit -R after running kinit to obtain credentials. (For more
information, see Problem 2 in Appendix A - Troubleshooting). To change the directory ownership on HDFS, run
the following commands. Replace the example /mapred/system directory in the commands below with the
HDFS directory specified by the mapreduce.jobtracker.system.dir (or mapred.system.dir) property in
the conf/mapred-site.xml file:
$ sudo -u hdfs hadoop fs -chown mapred:hadoop /mapred/system
$ sudo -u hdfs hadoop fs -chown hdfs:hadoop /
$ sudo -u hdfs hadoop fs -chmod -R 700 /mapred/system
$ sudo -u hdfs hadoop fs -chmod 755 /

In addition (whether or not Hadoop security is enabled) create the /tmp directory. For instructions on creating
/tmp and setting its permissions, see these instructions.
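A common way to create /tmp with the expected permissions is shown below; this is only a sketch, and the linked instructions remain the authoritative reference.

$ sudo -u hdfs hadoop fs -mkdir /tmp
$ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp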

Step 2b (YARN only): Verify User Accounts and Groups in YARN


Note:
If you are using MRv1, skip this step and proceed to Step 3: If you are Using AES-256 Encryption,
install the JCE Policy File.
During CDH 5 package installation of MapReduce 2.0 (YARN), the following Unix user accounts are automatically
created to support security:
This User | Runs These Hadoop Programs
hdfs | HDFS: NameNode, DataNodes, Standby NameNode (if you are using HA)
yarn | YARN: ResourceManager, NodeManager
mapred | YARN: MapReduce Job History Server

Important:
The HDFS and YARN daemons must run as different Unix users; for example, hdfs and yarn. The
MapReduce Job History server must run as user mapred. Having all of these users share a common
Unix group is recommended; for example, hadoop.
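To confirm on a host that hdfs, yarn, and mapred are distinct accounts that share a common group, a check along these lines can be used; the group name hadoop is the recommended example from the note above.

$ groups hdfs yarn mapred
$ getent group hadoop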

YARN: Directory Ownership in the Local File System


Because the HDFS and MapReduce services run as different users, you must be sure to configure the correct
directory ownership of the following files on the local file system of each host:
File System | Directory | Owner | Permissions
Local | dfs.namenode.name.dir (dfs.name.dir is deprecated but will also work) | hdfs:hdfs | drwx------
Local | dfs.datanode.data.dir (dfs.data.dir is deprecated but will also work) | hdfs:hdfs | drwx------
Local | yarn.nodemanager.local-dirs | yarn:yarn | drwxr-xr-x
Local | yarn.nodemanager.log-dirs | yarn:yarn | drwxr-xr-x
Local | container-executor | root:yarn | --Sr-s---
Local | conf/container-executor.cfg | root:yarn | r--------
You must also configure the following permissions for the HDFS, YARN and MapReduce log directories (the
default locations in /var/log/hadoop-hdfs, /var/log/hadoop-yarn and /var/log/hadoop-mapreduce):
File System | Directory | Owner | Permissions
Local | HDFS_LOG_DIR | hdfs:hdfs | drwxrwxr-x
Local | $YARN_LOG_DIR | yarn:yarn | drwxrwxr-x
Local | MAPRED_LOG_DIR | mapred:mapred | drwxrwxr-x
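As with MRv1, the ownership and modes above can be applied manually. The paths below are placeholders: substitute the actual values of yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs, and the actual location of the container-executor binary and its configuration file, which vary by installation (the locations shown are common for package installs).

$ sudo chown -R yarn:yarn /data/1/yarn/local /data/1/yarn/logs
$ sudo chmod 755 /data/1/yarn/local /data/1/yarn/logs
# --Sr-s--- corresponds to octal mode 6050; r-------- corresponds to 400.
$ sudo chown root:yarn /usr/lib/hadoop-yarn/bin/container-executor
$ sudo chmod 6050 /usr/lib/hadoop-yarn/bin/container-executor
$ sudo chown root:yarn /etc/hadoop/conf/container-executor.cfg
$ sudo chmod 400 /etc/hadoop/conf/container-executor.cfg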

YARN: Directory Ownership on HDFS


The following directories on HDFS must also be configured as follows:
File System | Directory | Owner | Permissions
HDFS | / (root directory) | hdfs:hadoop | drwxr-xr-x
HDFS | yarn.nodemanager.remote-app-log-dir | yarn:hadoop | drwxrwxrwxt
HDFS | mapreduce.jobhistory.intermediate-done-dir | mapred:hadoop | drwxrwxrwxt
HDFS | mapreduce.jobhistory.done-dir | mapred:hadoop | drwxr-x---

YARN: Changing the Directory Ownership on HDFS


If Hadoop security is enabled, obtain Kerberos credentials for the hdfs user by running
the following commands:
$ sudo -u hdfs kinit -k -t hdfs.keytab hdfs/fully.qualified.domain.name@YOUR-REALM.COM
$ hadoop fs -chown hdfs:hadoop /
$ hadoop fs -chmod 755 /

In CDH 5, package installation and the Hadoop daemons will automatically configure the correct permissions
for you if you configure the directory ownership correctly as shown in the two tables above. See also Deploying
MapReduce v2 (YARN) on a Cluster.


If kinit hdfs does not work initially, run kinit -R after running kinit to obtain credentials. (See Problem 2
in Appendix A - Troubleshooting.) To change the directory ownership on HDFS, run the following commands:
$ sudo -u hdfs hadoop fs -chown hdfs:hadoop /
$ sudo -u hdfs hadoop fs -chmod 755 /
$ sudo -u hdfs hadoop fs -chown yarn:hadoop [yarn.nodemanager.remote-app-log-dir]
$ sudo -u hdfs hadoop fs -chmod 1777 [yarn.nodemanager.remote-app-log-dir]
$ sudo -u hdfs hadoop fs -chown mapred:hadoop
[mapreduce.jobhistory.intermediate-done-dir]
$ sudo -u hdfs hadoop fs -chmod 1777 [mapreduce.jobhistory.intermediate-done-dir]
$ sudo -u hdfs hadoop fs -chown mapred:hadoop [mapreduce.jobhistory.done-dir]
$ sudo -u hdfs hadoop fs -chmod 750 [mapreduce.jobhistory.done-dir]

In addition (whether or not Hadoop security is enabled), create the /tmp directory and set its permissions, and
change the permissions on the /user/history directory; see the linked instructions for each. A minimal sketch of
both steps is shown below.
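A minimal sketch of those two steps, assuming the default /tmp and /user/history locations and the ownership
conventions used in this guide:
$ sudo -u hdfs hadoop fs -mkdir /tmp
$ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
$ sudo -u hdfs hadoop fs -mkdir -p /user/history
$ sudo -u hdfs hadoop fs -chmod -R 1777 /user/history
$ sudo -u hdfs hadoop fs -chown mapred:hadoop /user/history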

Step 3: If you are Using AES-256 Encryption, install the JCE Policy File
If you are using CentOS/Red Hat Enterprise Linux 5.6 or later, or Ubuntu, which use AES-256 encryption by
default for tickets, you must install the Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy
File on all cluster and Hadoop user machines. For JCE Policy File installation instructions, see the README.txt
file included in the jce_policy-x.zip file.
Alternatively, you can configure Kerberos to not use AES-256 by removing aes256-cts:normal from the
supported_enctypes field of the kdc.conf or krb5.conf file. Note that after changing the kdc.conf file, you'll
need to restart both the KDC and the kadmin server for those changes to take effect. You may also need to
recreate or change the password of the relevant principals, including potentially the Ticket Granting Ticket
principal (krbtgt/REALM@REALM). If AES-256 is still used after all of those steps, it's because the
aes256-cts:normal setting existed when the Kerberos database was created. To fix this, create a new Kerberos
database and then restart both the KDC and the kadmin server.
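For example, on an MIT KDC the relevant line in kdc.conf might look like the following after removing
aes256-cts:normal (a sketch only; keep whatever other encryption types your site already allows):
supported_enctypes = aes128-cts:normal des3-hmac-sha1:normal arcfour-hmac:normal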
To verify the type of encryption used in your cluster:
1. On the local KDC host, type this command to create a test principal:
$ kadmin -q "addprinc test"

2. On a cluster host, type this command to start a Kerberos session as test:


$ kinit test

3. On a cluster host, type this command to view the encryption type in use:
$ klist -e

If AES is being used, output like the following is displayed after you type the klist command (note that
AES-256 is included in the output):
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: test@SCM
Valid starting       Expires              Service principal
05/19/11 13:25:04    05/20/11 13:25:04    krbtgt/SCM@SCM
    Etype (skey, tkt): AES-256 CTS mode with 96-bit SHA-1 HMAC, AES-256 CTS mode
    with 96-bit SHA-1 HMAC


Step 4: Create and Deploy the Kerberos Principals and Keytab Files
A Kerberos principal is used in a Kerberos-secured system to represent a unique identity. Kerberos assigns
tickets to Kerberos principals to enable them to access Kerberos-secured Hadoop services. For Hadoop, the
principals should be of the format username/[email protected]. In this guide,
the term username in the username/[email protected] principal refers to
the username of an existing Unix account, such as hdfs or mapred.
A keytab is a file containing pairs of Kerberos principals and an encrypted copy of that principal's key. The keytab
files are unique to each host since their keys include the hostname. This file is used to authenticate a principal
on a host to Kerberos without human interaction or storing a password in a plain text file. Because having access
to the keytab file for a principal allows one to act as that principal, access to the keytab files should be tightly
secured. They should be readable by a minimal set of users, should be stored on local disk, and should not be
included in machine backups, unless access to those backups is as secure as access to the local machine.
Important:
For both MRv1 and YARN deployments: On every machine in your cluster, there must be a keytab file
for the hdfs user and a keytab file for the mapred user. The hdfs keytab file must contain entries for
the hdfs principal and a HTTP principal, and the mapred keytab file must contain entries for the mapred
principal and a HTTP principal. On each respective machine, the HTTP principal will be the same in
both keytab files.
In addition, for YARN deployments only: On every machine in your cluster, there must be a keytab file
for the yarn user. The yarn keytab file must contain entries for the yarn principal and a HTTP principal.
On each respective machine, the HTTP principal in the yarn keytab file will be the same as the HTTP
principal in the hdfs and mapred keytab files.

Note:
The following instructions illustrate an example of creating keytab files for MIT Kerberos. If you are
using another version of Kerberos, refer to your Kerberos documentation for instructions. You may
use either kadmin or kadmin.local to run these commands.

When to Use kadmin.local and kadmin


When creating the Kerberos principals and keytabs, you can use kadmin.local or kadmin depending on your
access and account:
If you have root access to the KDC machine, but you don't have a Kerberos admin account, use kadmin.local.
If you don't have root access to the KDC machine, but you do have a Kerberos admin account, use kadmin.
If you have both root access to the KDC machine and a Kerberos admin account, you can use either one.
To start kadmin.local (on the KDC machine) or kadmin from any machine, run this command:
$ sudo kadmin.local

OR:
$ kadmin

Note:
In this guide, kadmin is shown as the prompt for commands in the kadmin shell, but you can type
the same commands at the kadmin.local prompt in the kadmin.local shell.



Note:
Running kadmin.local may prompt you for a password because it is being run via sudo. You should
provide your Unix password. Running kadmin may prompt you for a password because you need
Kerberos admin privileges. You should provide your Kerberos admin password.

To create the Kerberos principals


Important:
If you plan to use Oozie, Impala, or the Hue Kerberos ticket renewer in your cluster, you must configure
your KDC to allow tickets to be renewed, and you must configure krb5.conf to request renewable
tickets. Typically, you can do this by adding the max_renewable_life setting to your realm in
kdc.conf, and by adding the renew_lifetime parameter to the libdefaults section of krb5.conf.
For more information about renewable tickets, see the Kerberos documentation.
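A minimal sketch of those two settings (the lifetimes shown are examples; choose values that match your site's
policy):
# kdc.conf, inside your realm's entry in the [realms] section
max_renewable_life = 7d
# krb5.conf, in the [libdefaults] section
renew_lifetime = 7d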
Do the following steps for every host in your cluster. Run the commands in the kadmin.local or kadmin shell,
replacing the fully.qualified.domain.name in the commands with the fully qualified domain name of each
host. Replace YOUR-REALM.COM with the name of the Kerberos realm your Hadoop cluster is in.
1. In the kadmin.local or kadmin shell, create the hdfs principal. This principal is used for the NameNode,
Secondary NameNode, and DataNodes.
kadmin: addprinc -randkey hdfs/[email protected]

Note:
If your Kerberos administrator or company has a policy about principal names that does not allow
you to use the format shown above, you can work around that issue by configuring the <kerberos
principal> to <short name> mapping that is built into Hadoop. For more information, see
Appendix C - Configuring the Mapping from Kerberos Principals to Short Names.
2. Create the mapred principal. If you are using MRv1, the mapred principal is used for the JobTracker and
TaskTrackers. If you are using YARN, the mapred principal is used for the MapReduce Job History Server.
kadmin: addprinc -randkey mapred/[email protected]

3. YARN only: Create the yarn principal. This principal is used for the ResourceManager and NodeManager.
kadmin: addprinc -randkey yarn/[email protected]

4. Create the HTTP principal.


kadmin: addprinc -randkey HTTP/[email protected]

Important:
The HTTP principal must be in the format
HTTP/[email protected]. The first component of the principal
must be the literal string "HTTP". This format is standard for HTTP principals in SPNEGO and is
hard-coded in Hadoop; it cannot be changed.



To create the Kerberos keytab files
Important:
The instructions in this section for creating keytab files require using the Kerberos norandkey option
in the xst command. If your version of Kerberos does not support the norandkey option, or if you
cannot use kadmin.local, then use these alternate instructions in Appendix F to create appropriate
Kerberos keytab files. After using those alternate instructions to create the keytab files, continue
with the next section To deploy the Kerberos keytab files.
Do the following steps for every host in your cluster. Run the commands in the kadmin.local or kadmin shell,
replacing the fully.qualified.domain.name in the commands with the fully qualified domain name of each
host:
1. Create the hdfs keytab file that will contain the hdfs principal and HTTP principal. This keytab file is used
for the NameNode, Secondary NameNode, and DataNodes.
kadmin: xst -norandkey -k hdfs.keytab hdfs/fully.qualified.domain.name
HTTP/fully.qualified.domain.name

2. Create the mapred keytab file that will contain the mapred principal and HTTP principal. If you are using MRv1,
the mapred keytab file is used for the JobTracker and TaskTrackers. If you are using YARN, the mapred keytab
file is used for the MapReduce Job History Server.
kadmin: xst -norandkey -k mapred.keytab mapred/fully.qualified.domain.name
HTTP/fully.qualified.domain.name

3. YARN only: Create the yarn keytab file that will contain the yarn principal and HTTP principal. This keytab
file is used for the ResourceManager and NodeManager.
kadmin: xst -norandkey -k yarn.keytab yarn/fully.qualified.domain.name
HTTP/fully.qualified.domain.name

4. Use klist to display the keytab file entries; a correctly-created hdfs keytab file should look something like
this:
$ klist -e -k -t hdfs.keytab
Keytab name: WRFILE:hdfs.keytab
slot KVNO Principal
---- ---- ---------------------------------------------------------------------
   1    7 HTTP/[email protected] (DES cbc mode with CRC-32)
   2    7 HTTP/[email protected] (Triple DES cbc mode with HMAC/sha1)
   3    7 hdfs/[email protected] (DES cbc mode with CRC-32)
   4    7 hdfs/[email protected] (Triple DES cbc mode with HMAC/sha1)

5. Continue with the next section To deploy the Kerberos keytab files.

To deploy the Kerberos keytab files


On every node in the cluster, repeat the following steps to deploy the hdfs.keytab and mapred.keytab files.
If you are using YARN, you will also deploy the yarn.keytab file.
1. On the host machine, copy or move the keytab files to a directory that Hadoop can access, such as
/etc/hadoop/conf.



a. If you are using MRv1:
$ sudo mv hdfs.keytab mapred.keytab /etc/hadoop/conf/

If you are using YARN:


$ sudo mv hdfs.keytab mapred.keytab yarn.keytab /etc/hadoop/conf/

b. Make sure that the hdfs.keytab file is only readable by the hdfs user, and that the mapred.keytab file
is only readable by the mapred user.
$ sudo chown hdfs:hadoop /etc/hadoop/conf/hdfs.keytab
$ sudo chown mapred:hadoop /etc/hadoop/conf/mapred.keytab
$ sudo chmod 400 /etc/hadoop/conf/*.keytab

Note:
To enable you to use the same configuration files on every host, Cloudera recommends that
you use the same name for the keytab files on every host.
c. YARN only: Make sure that the yarn.keytab file is only readable by the yarn user.
$ sudo chown yarn:hadoop /etc/hadoop/conf/yarn.keytab
$ sudo chmod 400 /etc/hadoop/conf/yarn.keytab

Important:
If the NameNode, Secondary NameNode, DataNode, JobTracker, TaskTrackers, HttpFS, or Oozie
services are configured to use Kerberos HTTP SPNEGO authentication, and two or more of these
services are running on the same host, then all of the running services must use the same
HTTP principal and keytab file used for their HTTP endpoints.

Step 5: Shut Down the Cluster


To enable security in CDH, you must stop all Hadoop daemons in your cluster and then change some configuration
properties. You must stop all daemons in the cluster because after one Hadoop daemon has been restarted
with the configuration properties set to enable security, daemons running without security enabled will be
unable to communicate with that daemon. This requirement to shut down all daemons makes it impossible to
do a rolling upgrade to enable security on a Hadoop cluster.
To shut down the cluster, run the following command on every node in your cluster (as root):
$ for x in `cd /etc/init.d ; ls hadoop-*` ; do sudo service $x stop ; done

Step 6: Enable Hadoop Security


Cloudera recommends that all of the Hadoop configuration files throughout the cluster have the same contents.
To enable Hadoop security, add the following properties to the core-site.xml file on every machine in the
cluster:
<property>
<name>hadoop.security.authentication</name>

<value>kerberos</value> <!-- A value of "simple" would disable security. -->
</property>
<property>
<name>hadoop.security.authorization</name>
<value>true</value>
</property>

Enabling Service-Level Authorization for Hadoop Services


The hadoop-policy.xml file maintains access control lists (ACL) for Hadoop services. Each ACL consists of
comma-separated lists of users and groups separated by a space. For example:
user_a,user_b group_a,group_b

If you only want to specify a set of users, add a comma-separated list of users followed by a blank space. Similarly,
to specify only authorized groups, use a blank space at the beginning. A * can be used to give access to all users.
For example, to give the users ann and bob, and the groups group_a and group_b, access to Hadoop's DataNodeProtocol
service, modify the security.datanode.protocol.acl property in hadoop-policy.xml. Similarly, to give all users
access to the InterTrackerProtocol service, modify security.inter.tracker.protocol.acl as follows:
<property>
<name>security.datanode.protocol.acl</name>
<value>ann,bob group_a,group_b</value>
<description>ACL for DatanodeProtocol, which is used by datanodes to
communicate with the namenode.</description>
</property>
<property>
<name>security.inter.tracker.protocol.acl</name>
<value>*</value>
<description>ACL for InterTrackerProtocol, which is used by tasktrackers to
communicate with the jobtracker.</description>
</property>

For more details, see Service-Level Authorization in Hadoop.
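If you later change hadoop-policy.xml on a running cluster, the ACLs can usually be reloaded without a restart; a
sketch, assuming the standard refresh commands in your CDH version (administrative Kerberos credentials are
required once security is enabled):
$ sudo -u hdfs hdfs dfsadmin -refreshServiceAcl
$ sudo -u mapred hadoop mradmin -refreshServiceAcl      # MRv1 JobTracker only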

Step 7: Configure Secure HDFS


When following the instructions in this section to configure the properties in the hdfs-site.xml file, keep the
following important guidelines in mind:
The properties for each daemon (NameNode, Secondary NameNode, and DataNode) must specify both the
HDFS and HTTP principals, as well as the path to the HDFS keytab file.
The Kerberos principals for the NameNode, Secondary NameNode, and DataNode are configured in the
hdfs-site.xml file. The same hdfs-site.xml file with all three of these principals must be installed on
every host machine in the cluster. That is, it is not sufficient to have the NameNode principal configured on
the NameNode host machine only. This is because, for example, the DataNode must know the principal name
of the NameNode in order to send heartbeats to it. Kerberos authentication is bi-directional.
The special string _HOST in the properties is replaced at run-time by the fully-qualified domain name of the
host machine where the daemon is running. This requires that reverse DNS is properly working on all the
hosts configured this way. You may use _HOST only as the entirety of the second component of a principal
name. For example, hdfs/[email protected] is valid, but [email protected] and
hdfs/[email protected] are not.
When performing the _HOST substitution for the Kerberos principal names, the NameNode determines its
own hostname based on the configured value of fs.default.name, whereas the DataNodes determine their
hostnames based on the result of reverse DNS resolution on the DataNode hosts. Likewise, the JobTracker
uses the configured value of mapred.job.tracker to determine its hostname whereas the TaskTrackers,
like the DataNodes, use reverse DNS.


The dfs.datanode.address and dfs.datanode.http.address port numbers for the DataNode must be
below 1024, because this provides part of the security mechanism to make it impossible for a user to run a
map task which impersonates a DataNode. The port numbers for the NameNode and Secondary NameNode
can be anything you want, but the default port numbers are good ones to use.

To configure secure HDFS


Add the following properties to the hdfs-site.xml file on every machine in the cluster. Replace the example
values shown below with the correct settings for your site: the path to the HDFS keytab, YOUR-REALM.COM, the fully
qualified domain name of the NameNode, and the fully qualified domain name of the Secondary NameNode.
<!-- General HDFS security config -->
<property>
<name>dfs.block.access.token.enable</name>
<value>true</value>
</property>
<!-- NameNode security config -->
<property>
<name>dfs.namenode.keytab.file</name>
<value>/etc/hadoop/conf/hdfs.keytab</value> <!-- path to the HDFS keytab -->
</property>
<property>
<name>dfs.namenode.kerberos.principal</name>
<value>hdfs/[email protected]</value>
</property>
<property>
<name>dfs.namenode.kerberos.internal.spnego.principal</name>
<value>HTTP/[email protected]</value>
</property>
<!-- Secondary NameNode security config -->
<property>
<name>dfs.secondary.namenode.keytab.file</name>
<value>/etc/hadoop/conf/hdfs.keytab</value> <!-- path to the HDFS keytab -->
</property>
<property>
<name>dfs.secondary.namenode.kerberos.principal</name>
<value>hdfs/[email protected]</value>
</property>
<property>
<name>dfs.secondary.namenode.kerberos.internal.spnego.principal</name>
<value>HTTP/[email protected]</value>
</property>
<!-- DataNode security config -->
<property>
<name>dfs.datanode.data.dir.perm</name>
<value>700</value>
</property>
<property>
<name>dfs.datanode.address</name>
<value>0.0.0.0:1004</value>
</property>
<property>
<name>dfs.datanode.http.address</name>
<value>0.0.0.0:1006</value>
</property>
<property>
<name>dfs.datanode.keytab.file</name>
<value>/etc/hadoop/conf/hdfs.keytab</value> <!-- path to the HDFS keytab -->
</property>
<property>
<name>dfs.datanode.kerberos.principal</name>
<value>hdfs/[email protected]</value>
</property>
<!-- Web Authentication config -->
<property>
<name>dfs.web.authentication.kerberos.principal</name>

<value>HTTP/[email protected]</value>
</property>

To enable SSL for HDFS


Add the following property to hdfs-site.xml on every machine in your cluster.
<property>
<name>dfs.http.policy</name>
<value>HTTPS_ONLY</value>
</property>

Optional Step 8: Configuring Security for HDFS High Availability


CDH 5 supports the HDFS High Availability (HA) feature with Kerberos security enabled. There are two use cases
that affect security for HA:
If you are not using Quorum-based Storage (see Software Configuration for Quorum-based Storage), then
no extra configuration for HA is necessary if automatic failover is not enabled. If automatic failover is enabled
then access to ZooKeeper should be secured. See the Software Configuration for Shared Storage Using NFS
documentation for details.
If you are using Quorum-based Storage, then you must configure security for Quorum-based Storage by
following the instructions in this section.
To configure security for Quorum-based Storage:
Add the following Quorum-based Storage configuration properties to the hdfs-site.xml file on all of the
machines in the cluster:
<property>
<name>dfs.journalnode.keytab.file</name>
<value>/etc/hadoop/conf/hdfs.keytab</value> <!-- path to the HDFS keytab -->
</property>
<property>
<name>dfs.journalnode.kerberos.principal</name>
<value>hdfs/[email protected]</value>
</property>
<property>
<name>dfs.journalnode.kerberos.internal.spnego.principal</name>
<value>HTTP/[email protected]</value>
</property>

Note:
If you already have principals and keytabs created for the machines where the JournalNodes are
running, then you should reuse those principals and keytabs in the configuration properties above.
You will likely have these principals and keytabs already created if you are collocating a JournalNode
on a machine with another HDFS daemon.

Optional Step 9: Configure secure WebHDFS


Note:
If you are not using WebHDFS, you can skip this step.
Security for WebHDFS is disabled by default. If you want to use WebHDFS with a secure cluster, this is the time to
enable and configure it.


To configure secure WebHDFS:
1. If you have not already done so, enable WebHDFS by adding the following property to the hdfs-site.xml
file on every machine in the cluster.
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>

2. Add the following properties to the hdfs-site.xml file on every machine in the cluster. Replace the example
values shown below with the correct settings for your site.
<property>
<name>dfs.web.authentication.kerberos.principal</name>
<value>HTTP/[email protected]</value>
</property>
<property>
<name>dfs.web.authentication.kerberos.keytab</name>
<value>/etc/hadoop/conf/HTTP.keytab</value> <!-- path to the HTTP keytab -->
</property>
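After the NameNode restarts with these settings, you can sanity-check secure WebHDFS from any host that holds a
valid Kerberos ticket. A sketch, assuming curl was built with SPNEGO/GSSAPI support, a hypothetical NameNode host,
and the default HTTP port (use https and port 50470 instead if you set dfs.http.policy to HTTPS_ONLY):
$ kinit [email protected]
$ curl --negotiate -u : "http://namenode.example.com:50070/webhdfs/v1/tmp?op=LISTSTATUS"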

Optional Step 10: Configuring a secure HDFS NFS Gateway


To deploy a Kerberized HDFS NFS gateway, add the following configuration properties to hdfs-site.xml on
the NFS server.
<property>
<name>dfs.nfs.keytab.file</name>
<value>/etc/hadoop/conf/hdfs.keytab</value> <!-- path to the HDFS or NFS gateway keytab
-->
</property>
<property>
<name>dfs.nfs.kerberos.principal</name>
<value>hdfs/[email protected]</value>
</property>

Step 11: Set Variables for Secure DataNodes


In order to allow DataNodes to start on a secure Hadoop cluster, you must set the following variables on all
DataNodes in /etc/default/hadoop-hdfs-datanode.
export HADOOP_SECURE_DN_USER=hdfs
export HADOOP_SECURE_DN_PID_DIR=/var/lib/hadoop-hdfs
export HADOOP_SECURE_DN_LOG_DIR=/var/log/hadoop-hdfs
export JSVC_HOME=/usr/lib/bigtop-utils/

Note:
Depending on the version of Linux you are using, you may not have the /usr/lib/bigtop-utils
directory on your system. If that is the case, set the JSVC_HOME variable to the
/usr/libexec/bigtop-utils directory by using this command:
export JSVC_HOME=/usr/libexec/bigtop-utils


Step 12: Start up the NameNode


You are now ready to start the NameNode. Use the service command to run the /etc/init.d script.
$ sudo service hadoop-hdfs-namenode start

You'll see some extra information in the logs such as:


10/10/25 17:01:46 INFO security.UserGroupInformation:
Login successful for user hdfs/[email protected] using keytab
file /etc/hadoop/conf/hdfs.keytab

and:
12/05/23 18:18:31 INFO http.HttpServer: Adding Kerberos (SPNEGO) filter to
getDelegationToken
12/05/23 18:18:31 INFO http.HttpServer: Adding Kerberos (SPNEGO) filter to
renewDelegationToken
12/05/23 18:18:31 INFO http.HttpServer: Adding Kerberos (SPNEGO) filter to
cancelDelegationToken
12/05/23 18:18:31 INFO http.HttpServer: Adding Kerberos (SPNEGO) filter to fsck
12/05/23 18:18:31 INFO http.HttpServer: Adding Kerberos (SPNEGO) filter to getimage
12/05/23 18:18:31 INFO http.HttpServer: Jetty bound to port 50070
12/05/23 18:18:31 INFO mortbay.log: jetty-6.1.26
12/05/23 18:18:31 INFO server.KerberosAuthenticationHandler: Login using keytab
/etc/hadoop/conf/hdfs.keytab, for principal
HTTP/[email protected]
12/05/23 18:18:31 INFO server.KerberosAuthenticationHandler: Initialized, principal
[HTTP/[email protected]] from keytab
[/etc/hadoop/conf/hdfs.keytab]

You can verify that the NameNode is working properly by opening a web browser to http://machine:50070/
where machine is the name of the machine where the NameNode is running.
Cloudera also recommends testing that the NameNode is working properly by performing a metadata-only
HDFS operation, which will now require correct Kerberos credentials. For example:
$ hadoop fs -ls

Information about the kinit Command


Important:
Running the hadoop fs -ls command will fail if you do not have a valid Kerberos ticket in your
credentials cache. You can examine the Kerberos tickets currently in your credentials cache by running
the klist command. You can obtain a ticket by running the kinit command and either specifying
a keytab file containing credentials, or entering the password for your principal. If you do not have a
valid ticket, you will receive an error such as:
11/01/04 12:08:12 WARN ipc.Client: Exception encountered while connecting to
the server : javax.security.sasl.SaslException:
GSS initiate failed [Caused by GSSException: No valid credentials provided
(Mechanism level: Failed to find any Kerberos tgt)]
Bad connection to FS. command aborted. exception: Call to nn-host/10.0.0.2:8020
failed on local exception: java.io.IOException:
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException:
No valid credentials provided (Mechanism level: Failed to find any Kerberos
tgt)]



Note:
The kinit command must either be on the path for user accounts running the Hadoop client, or else
the hadoop.kerberos.kinit.command parameter in core-site.xml must be manually configured
to the absolute path to the kinit command.

Note:
If you are running MIT Kerberos 1.8.1 or higher, a bug in versions of the Oracle JDK 6 Update 26 and
earlier causes Java to be unable to read the Kerberos credentials cache even after you have successfully
obtained a Kerberos ticket using kinit. To workaround this bug, run kinit -R after running kinit
initially to obtain credentials. Doing so will cause the ticket to be renewed, and the credentials cache
rewritten in a format which Java can read. For more information about this problem, see Problem 2
in Appendix A - Troubleshooting.

Step 13: Start up a DataNode


Begin by starting one DataNode only to make sure it can properly connect to the NameNode. Use the service
command to run the /etc/init.d script.
$ sudo service hadoop-hdfs-datanode start

You'll see some extra information in the logs such as:


10/10/25 17:21:41 INFO security.UserGroupInformation:
Login successful for user hdfs/[email protected] using keytab
file /etc/hadoop/conf/hdfs.keytab

If you can get a single DataNode running and you can see it registering with the NameNode in the logs, then
start up all the DataNodes. You should now be able to do all HDFS operations.

Step 14: Set the Sticky Bit on HDFS Directories


This step is optional but strongly recommended for security. In CDH 5, HDFS file permissions have support for
the sticky bit. The sticky bit can be set on directories, preventing anyone except the superuser, directory owner,
or file owner from deleting or moving the files within the directory. Setting the sticky bit for a file has no effect.
This is useful for directories such as /tmp which previously had to be set to be world-writable. To set the sticky
bit on the /tmp directory, run the following command:
$ sudo -u hdfs kinit -k -t hdfs.keytab hdfs/[email protected]
$ sudo -u hdfs hadoop fs -chmod 1777 /tmp

After running this command, the permissions on /tmp will appear as shown below. (Note the "t" instead of the
final "x".)
$ hadoop fs -ls /
Found 2 items
drwxrwxrwt - hdfs supergroup 0 2011-02-14 15:55 /tmp
drwxr-xr-x - hdfs supergroup 0 2011-02-14 14:01 /user


Step 15: Start up the Secondary NameNode (if used)


At this point, you should be able to start the Secondary NameNode if you are using one:
$ sudo service hadoop-hdfs-secondarynamenode start

Note:
If you are using HDFS HA, do not use the Secondary NameNode. See Configuring HDFS High Availability
for instructions on configuring and deploying the Standby NameNode.
You'll see some extra information in the logs such as:
10/10/26 12:03:18 INFO security.UserGroupInformation:
Login successful for user hdfs/fully.qualified.domain.name@YOUR-REALM using keytab file
/etc/hadoop/conf/hdfs.keytab

and:
12/05/23 18:33:06 INFO http.HttpServer: Adding Kerberos (SPNEGO) filter to getimage
12/05/23 18:33:06 INFO http.HttpServer: Jetty bound to port 50090
12/05/23 18:33:06 INFO mortbay.log: jetty-6.1.26
12/05/23 18:33:06 INFO server.KerberosAuthenticationHandler: Login using keytab
/etc/hadoop/conf/hdfs.keytab, for principal
HTTP/[email protected]
12/05/23 18:33:06 INFO server.KerberosAuthenticationHandler: Initialized, principal
[HTTP/[email protected]] from keytab
[/etc/hadoop/conf/hdfs.keytab]

You should make sure that the Secondary NameNode not only starts, but that it is successfully checkpointing.
If you're using the service command to start the Secondary NameNode from the /etc/init.d scripts,
Cloudera recommends setting the property fs.checkpoint.period in the hdfs-site.xml file to a very low
value (such as 5), and then monitoring the Secondary NameNode logs for a successful startup and checkpoint.
Once you are satisfied that the Secondary NameNode is checkpointing properly, you should reset the
fs.checkpoint.period to a reasonable value, or return it to the default, and then restart the Secondary
NameNode.
You can make the Secondary NameNode perform a checkpoint by doing the following:
$ sudo -u hdfs hdfs secondarynamenode -checkpoint force

Note that this will not cause a running Secondary NameNode to checkpoint, but rather will start up a Secondary
NameNode that will immediately perform a checkpoint and then shut down. This can be useful for debugging.
Note:
If you encounter errors during Secondary NameNode checkpointing, it may be helpful to enable
Kerberos debugging output. For instructions, see Appendix D - Enabling Debugging Output for the
Sun Kerberos Classes.

Step 16: Configure Either MRv1 Security or YARN Security


At this point, you are ready to configure either MRv1 Security or YARN Security.
If you are using MRv1, do the steps in Configuring MRv1 Security to configure, start, and test secure MRv1.
If you are using YARN, do the steps in Configuring YARN Security to configure, start, and test secure YARN.

Configuring MRv1 Security


If you are using YARN, skip this section and see Configuring YARN Security.
If you are using MRv1, do the following steps to configure, start, and test secure MRv1.
1. Step 1: Configure Secure MRv1 on page 36
2. Step 2: Start up the JobTracker on page 37
3. Step 3: Start up a TaskTracker on page 37
4. Step 4: Try Running a Map/Reduce Job on page 38

Step 1: Configure Secure MRv1


Keep the following important information in mind when configuring secure MapReduce:
The properties for Job Tracker and Task Tracker must specify the mapred principal, as well as the path to the
mapred keytab file.
The Kerberos principals for the Job Tracker and Task Tracker are configured in the mapred-site.xml file.
The same mapred-site.xml file with both of these principals must be installed on every host machine in
the cluster. That is, it is not sufficient to have the Job Tracker principal configured on the Job Tracker host
machine only. This is because, for example, the TaskTracker must know the principal name of the JobTracker
in order to securely register with the JobTracker. Kerberos authentication is bi-directional.
Do not use ${user.name} in the value of the mapred.local.dir or hadoop.log.dir properties in
mapred-site.xml. Doing so can prevent tasks from launching on a secure cluster.
Make sure that each user who will be running MRv1 jobs exists on all cluster nodes (that is, on every node
that hosts any MRv1 daemon).
Make sure the value specified for mapred.local.dir is identical in mapred-site.xml and
taskcontroller.cfg. If the values are different, this error message is returned.
Make sure the value specified in taskcontroller.cfg for hadoop.log.dir is the same as what the Hadoop
daemons are using, which is /var/log/hadoop-0.20-mapreduce by default and can be configured in
mapred-site.xml. If the values are different, this error message is returned.
To configure secure MapReduce:
1. Add the following properties to the mapred-site.xml file on every machine in the cluster:
<!-- JobTracker security configs -->
<property>
<name>mapreduce.jobtracker.kerberos.principal</name>
<value>mapred/[email protected]</value>
</property>
<property>
<name>mapreduce.jobtracker.keytab.file</name>
<value>/etc/hadoop/conf/mapred.keytab</value> <!-- path to the MapReduce keytab
-->
</property>
<!-- TaskTracker security configs -->
<property>
<name>mapreduce.tasktracker.kerberos.principal</name>
<value>mapred/[email protected]</value>
</property>
<property>
<name>mapreduce.tasktracker.keytab.file</name>
<value>/etc/hadoop/conf/mapred.keytab</value> <!-- path to the MapReduce keytab
-->
</property>
<!-- TaskController settings -->
<property>
<name>mapred.task.tracker.task-controller</name>
<value>org.apache.hadoop.mapred.LinuxTaskController</value>
</property>

<property>
<name>mapreduce.tasktracker.group</name>
<value>mapred</value>
</property>

2. Create a file called taskcontroller.cfg that contains the following information:


hadoop.log.dir=<Path to Hadoop log directory. Should be same value used to start
the TaskTracker. This is required to set proper permissions on the log files so
that they can be written to by the user's tasks and read by the TaskTracker for
serving on the web UI.>
mapreduce.tasktracker.group=mapred
banned.users=mapred,hdfs,bin
min.user.id=1000

Note:
In the taskcontroller.cfg file, the default setting for the banned.users property is mapred,
hdfs, and bin to prevent jobs from being submitted via those user accounts. The default setting
for the min.user.id property is 1000 to prevent jobs from being submitted with a user ID less
than 1000, which are conventionally Unix super users. Note that some operating systems such
as CentOS 5 use a default value of 500 and above for user IDs, not 1000. If this is the case on your
system, change the default setting for the min.user.id property to 500. If there are user accounts
on your cluster that have a user ID less than the value specified for the min.user.id property,
the TaskTracker returns an error code of 255.
3. The path to the taskcontroller.cfg file is determined relative to the location of the task-controller
binary. Specifically, the path is <path of task-controller binary>/../../conf/taskcontroller.cfg.
If you installed the CDH 5 package, this path will always correspond to
/etc/hadoop/conf/taskcontroller.cfg.
Note:
For more information about the task-controller program, see Appendix B - Information about
Other Hadoop Security Programs.

Important:
The same mapred-site.xml file and the same hdfs-site.xml file must both be installed on every
host machine in the cluster so that the NameNode, Secondary NameNode, DataNode, Job Tracker and
Task Tracker can all connect securely with each other.

Step 2: Start up the JobTracker


You are now ready to start the JobTracker.
If you're using the /etc/init.d/hadoop-0.20-mapreduce-jobtracker script, then you can use the service
command to run it now:
$ sudo service hadoop-0.20-mapreduce-jobtracker start

You can verify that the JobTracker is working properly by opening a web browser to http://machine:50030/
where machine is the name of the machine where the JobTracker is running.

Step 3: Start up a TaskTracker


You are now ready to start a TaskTracker.



If you're using the /etc/init.d/hadoop-0.20-mapreduce-tasktracker script, then you can use the service
command to run it now:
$ sudo service hadoop-0.20-mapreduce-tasktracker start

Step 4: Try Running a Map/Reduce Job


You should now be able to run Map/Reduce jobs. To confirm, try launching a sleep or a pi job from the provided
Hadoop examples (/usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar). Note that you will need
Kerberos credentials to do so.
Important:
Remember that the user who launches the job must exist on every node.
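For example, a sketch of submitting the pi example as a regular user (the principal shown is hypothetical; the
user must have a Kerberos principal in your realm and a Unix account on every node):
$ kinit [email protected]
$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar pi 10 10000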

Configuring YARN Security


If you are using MRv1, skip this section and see Configuring MRv1 Security.
If you are using YARN, do the following steps to configure, start, and test secure YARN.
1. Configure Secure YARN.
2. Start up the ResourceManager.
3. Start up the NodeManager.
4. Start up the MapReduce Job History Server.
5. Try Running a Map/Reduce YARN Job.

Step 1: Configure Secure YARN


Before you start:
The Kerberos principals for the ResourceManager and NodeManager are configured in the yarn-site.xml
file. The same yarn-site.xml file must be installed on every host machine in the cluster.
Make sure that each user who will be running YARN jobs exists on all cluster nodes (that is, on every node
that hosts any YARN daemon).
To configure secure YARN:
1. Add the following properties to the yarn-site.xml file on every machine in the cluster:
<!-- ResourceManager security configs -->
<property>
<name>yarn.resourcemanager.keytab</name>
<value>/etc/hadoop/conf/yarn.keytab</value> <!-- path to the YARN keytab -->
</property>
<property>
<name>yarn.resourcemanager.principal</name>
<value>yarn/[email protected]</value>
</property>
<!-- NodeManager security configs -->
<property>
<name>yarn.nodemanager.keytab</name>
<value>/etc/hadoop/conf/yarn.keytab</value> <!-- path to the YARN keytab -->
</property>
<property>
<name>yarn.nodemanager.principal</name>
<value>yarn/[email protected]</value>
</property>
<property>
<name>yarn.nodemanager.container-executor.class</name>

<value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
<name>yarn.nodemanager.linux-container-executor.group</name>
<value>yarn</value>
</property>
<!-- To enable SSL -->
<property>
<name>yarn.http.policy</name>
<value>HTTPS_ONLY</value>
</property>

2. Add the following properties to the mapred-site.xml file on every machine in the cluster:
<!-- MapReduce Job History Server security configs -->
<property>
<name>mapreduce.jobhistory.address</name>
<value>host:port</value> <!-- Host and port of the MapReduce Job History Server;
default port is 10020 -->
</property>
<property>
<name>mapreduce.jobhistory.keytab</name>
<value>/etc/hadoop/conf/mapred.keytab</value> <!-- path to the MAPRED keytab for
the Job History Server -->
</property>
<property>
<name>mapreduce.jobhistory.principal</name>
<value>mapred/[email protected]</value>
</property>
<!-- To enable SSL -->
<property>
<name>mapreduce.jobhistory.http.policy</name>
<value>HTTPS_ONLY</value>
</property>

3. Create a file called container-executor.cfg for the Linux Container Executor program that contains the
following information:
yarn.nodemanager.local-dirs=<comma-separated list of paths to local NodeManager
directories. Should be same values specified in yarn-site.xml. Required to validate
paths passed to container-executor in order.>
yarn.nodemanager.linux-container-executor.group=yarn
yarn.nodemanager.log-dirs=<comma-separated list of paths to local NodeManager log
directories. Should be same values specified in yarn-site.xml. Required to set
proper permissions on the log files so that they can be written to by the user's
containers and read by the NodeManager for log aggregation.>
banned.users=hdfs,yarn,mapred,bin
min.user.id=1000

Note:
In the container-executor.cfg file, the default setting for the banned.users property is hdfs,
yarn, mapred, and bin to prevent jobs from being submitted via those user accounts. The default
setting for the min.user.id property is 1000 to prevent jobs from being submitted with a user
ID less than 1000, which are conventionally Unix super users. Note that some operating systems
such as CentOS 5 use a default value of 500 and above for user IDs, not 1000. If this is the case
on your system, change the default setting for the min.user.id property to 500. If there are user
accounts on your cluster that have a user ID less than the value specified for the min.user.id
property, the NodeManager returns an error code of 255.
4. The path to the container-executor.cfg file is determined relative to the location of the container-executor
binary. Specifically, the path is <dirname of container-executor
binary>/../etc/hadoop/container-executor.cfg. If you installed the CDH 5 package, this path will
always correspond to /etc/hadoop/conf/container-executor.cfg.

Note:
The container-executor program requires that the paths to, and including, the directories specified in
yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs be set to 755 permissions, as shown in
the table of directory permissions above.
5. Verify that the ownership and permissions of the container-executor program correspond to:
---Sr-s--- 1 root yarn 36264 May 20 15:30 container-executor
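If the ownership or mode does not match (for example, after copying files by hand), you can correct it as follows;
the path shown assumes a package installation and may differ on your system (mode 6050 corresponds to the
---Sr-s--- listing above):
$ sudo chown root:yarn /usr/lib/hadoop-yarn/bin/container-executor
$ sudo chmod 6050 /usr/lib/hadoop-yarn/bin/container-executor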

Note:
For more information about the Linux Container Executor program, see Appendix B - Information
about Other Hadoop Security Programs.

Step 2: Start up the ResourceManager


You are now ready to start the ResourceManager.
Note:
Make sure you always start ResourceManager before starting NodeManager.
If you're using the /etc/init.d/hadoop-yarn-resourcemanager script, then you can use the service
command to run it now:
$ sudo service hadoop-yarn-resourcemanager start

You can verify that the ResourceManager is working properly by opening a web browser to http://host:8088/
where host is the name of the machine where the ResourceManager is running.

Step 3: Start up the NodeManager


You are now ready to start the NodeManager.
If you're using the /etc/init.d/hadoop-yarn-nodemanager script, then you can use the service command
to run it now:
$ sudo service hadoop-yarn-nodemanager start

You can verify that the NodeManager is working properly by opening a web browser to http://host:8042/ where
host is the name of the machine where the NodeManager is running.

Step 4: Start up the MapReduce Job History Server


You are now ready to start the MapReduce Job History Server.
If you're using the /etc/init.d/hadoop-mapreduce-historyserver script, then you can use the service
command to run it now:
$ sudo service hadoop-mapreduce-historyserver start



You can verify that the MapReduce JobHistory Server is working properly by opening a web browser to
http://host:19888/ where host is the name of the machine where the MapReduce JobHistory Server is running.

Step 5: Try Running a Map/Reduce YARN Job


You should now be able to run Map/Reduce jobs. To confirm, try launching a sleep or a pi job from the provided
Hadoop examples (/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar). Note that you will
need Kerberos credentials to do so.
Important:
Remember that the user who launches the job must exist on every node.
To try running a MapReduce job using YARN, set the HADOOP_MAPRED_HOME environment variable and then
submit the job. For example:
$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
$ /usr/bin/hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10
10000

Enabling HDFS Extended ACLs


As of CDH 5.1, HDFS supports POSIX Access Control Lists (ACLs) in addition to the traditional POSIX permissions
model. ACLs provide fine-grained control of permissions for HDFS files by providing a way to
set different permissions for specific named users or named groups.

Enabling ACLs
By default, ACLs are disabled on a cluster. To enable them, set the dfs.namenode.acls.enabled property to
true in the NameNode's hdfs-site.xml.
<property>
<name>dfs.namenode.acls.enabled</name>
<value>true</value>
</property>

Commands
You can use the File System Shell commands, setfacl and getfacl, to modify and retrieve files' ACLs.
getfacl
hdfs dfs -getfacl [-R] <path>
<!-- COMMAND OPTIONS
<path>: Path to the file or directory for which ACLs should be listed.
-R: Use this option to recursively list ACLs for all files and directories.
-->

Examples:
<!-- To list all ACLs for the file located at /user/hdfs/file -->
hdfs dfs -getfacl /user/hdfs/file
<!-- To recursively list ACLs for /user/hdfs/file -->
hdfs dfs -getfacl -R /user/hdfs/file



setfacl
hdfs dfs -setfacl [-R] [-b|-k -m|-x <acl_spec> <path>]|[--set <acl_spec> <path>]
<!-- COMMAND OPTIONS
<path>: Path to the file or directory for which ACLs should be set.
-R: Use this option to recursively apply the ACL operation to all files and directories.
-b: Revoke all permissions except the base ACLs for user, groups and others.
-k: Remove the default ACL.
-m: Add new permissions to the ACL with this option. Does not affect existing
permissions.
-x: Remove only the ACL specified.
<acl_spec>: Comma-separated list of ACL permissions.
--set: Use this option to completely replace the existing ACL for the path specified.
Previous ACL entries will no longer apply.
-->

Examples:
<!-- To give user ben read & write permission over /user/hdfs/file -->
hdfs dfs -setfacl -m user:ben:rw- /user/hdfs/file
<!-- To remove user alice's ACL entry for /user/hdfs/file -->
hdfs dfs -setfacl -x user:alice /user/hdfs/file
<!-- To give user hadoop read & write access, and group or others read-only access -->
hdfs dfs -setfacl --set user:hadoop:rw-,group::r--,other::r-- /user/hdfs/file

More details about using this feature can be found here.

Sentry Policy File Configuration


Important: This is the documentation for configuring Sentry using the policy file approach. Cloudera
recommends you use the database-backed Sentry service introduced in CDH 5.1 to secure your data.
See Sentry Service Configuration on page 57 for more information.
Sentry enables role-based, fine-grained authorization for HiveServer2 and Cloudera Impala. It provides classic
database-style authorization for Hive and Impala. Follow the instructions below to install and configure Sentry
manually under the current CDH release.

Prerequisites
Sentry depends on an underlying authentication framework to reliably identify the requesting user. It requires:
CDH 5
HiveServer2 with strong authentication (Kerberos or LDAP)
A secure Hadoop cluster
These requirements prevent a user from bypassing the authorization layer and gaining direct access to the underlying data.
In addition to the above, make sure that the following are true:
The Hive warehouse directory (/user/hive/warehouse or any path you specify as
hive.metastore.warehouse.dir in your hive-site.xml) must be owned by the Hive user and group.
Permissions on the warehouse directory must be set as follows (see following Note for caveats):
771 on the directory itself (for example, /user/hive/warehouse)
771 on all subdirectories (for example, /user/hive/warehouse/mysubdir)
All files and subdirectories should be owned by hive:hive
For example:
$ sudo -u hdfs hdfs dfs -chmod -R 771 /user/hive/warehouse
$ sudo -u hdfs hdfs dfs -chown -R hive:hive /user/hive/warehouse

Note:
If you set hive.warehouse.subdir.inherit.perms to true in hive-site.xml, the
permissions on the subdirectories will be set when you set permissions on the warehouse
directory itself.
If a user has access to any object in the warehouse, that user will be able to execute use
default. This ensures that use default commands issued by legacy applications work
when Sentry is enabled. Note that you can protect objects in the default database (or any
other database) by means of a policy file.

Important: These instructions override the recommendations in the Hive section of the CDH
5 Installation Guide.
HiveServer2 impersonation must be turned off.
The Hive user must be able to submit MapReduce jobs. You can ensure that this is true by setting the
minimum user ID for job submission to 0. Edit the taskcontroller.cfg file and set min.user.id=0.



To enable the Hive user to submit YARN jobs, add the user hive to the allowed.system.users configuration
property. Edit the container-executor.cfg file and add hive to the allowed.system.users property.
For example,
allowed.system.users=nobody,impala,hive

Important:
You must restart the cluster and HiveServer2 after changing this value, whether you use
Cloudera Manager or not.
These instructions override the instructions under Configuring MRv1 Security on page 36
These instructions override the instructions under Configuring YARN Security on page 38

Roles and Privileges


Sentry uses a role-based privilege model. A role is a collection of rules for accessing a given Hive object. The
objects supported in the current release are server, database, table, and URI. Access to each object is governed
by privileges: Select, Insert, or All.
Note: All is not supported explicitly in the table scope; you have to specify Select and Insert
explicitly.
For example, a rule for the Select privilege on table customers from database sales would be formulated as
follows:
server=server1->db=sales->table=customer->action=Select

Each object must be specified as a hierarchy of the containing objects, from server to table, followed by the
privilege granted for that object. A role can contain multiple such rules, separated by commas. For example a
role might contain the Select privilege for the customer and items tables in the sales database, and the
Insert privilege for the sales_insights table in the reports database. You would specify this as follows:
sales_reporting = \
server=server1->db=sales->table=customer->action=Select, \
server=server1->db=sales->table=items->action=Select, \
server=server1->db=reports->table=sales_insights->action=Insert

Privilege Model
With CDH 5.1, the privilege model has undergone changes to accommodate the new grant/revoke syntax that is
used with the Sentry service. These changes are common to both the new database-backed Sentry service, as
well as the previous policy file approach.
The Sentry privilege model has the following characteristics:
Allows any user to execute show function, desc function, and show locks.
Allows the user to see only those tables and databases for which this user has privileges.
Requires a user to have the necessary privileges on the URI to execute HiveQL operations that take in a
location. Examples of such operations include LOAD, IMPORT, and EXPORT.
Important: When Sentry is enabled, a user with no privileges on a database will not be allowed to
connect to HiveServer2. This is because the use <database> command is now executed as part of
the connection to HiveServer2, which is why the connection fails. See HIVE-4256.



For more information, see Appendix: Authorization Privilege Model for Hive and Impala on page 53.

Users and Groups


A user is an entity that is permitted by the authentication subsystem to access the Hive service. This entity
can be a Kerberos principal, an LDAP userid, or an artifact of some other pluggable authentication system
supported by HiveServer2.
A group connects the authentication system with the authorization system. It is a collection of one or more
users who have been granted one or more authorization roles. Sentry allows a set of roles to be configured
for a group.
A configured group provider determines a user's affiliation with a group. The current release supports
HDFS-backed groups and locally configured groups.
For example,
analyst = sales_reporting, data_export, audit_report

Here the group analyst is granted the roles sales_reporting, data_export, and audit_report. The members
of this group can run the HiveQL statements that are allowed by these roles. If this is an HDFS-backed group,
then all the users belonging to the HDFS group analyst can run such queries.
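Putting the pieces together, a minimal policy file fragment for this example might look like the following (a
sketch; the data_export and audit_report rules are illustrative only and use the rule syntax described above):
[groups]
analyst = sales_reporting, data_export, audit_report

[roles]
sales_reporting = server=server1->db=sales->table=customer->action=Select
data_export = server=server1->uri=hdfs://ha-nn-uri/data/export
audit_report = server=server1->db=reports->table=audit_events->action=Select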

User to Group Mapping


You can configure Sentry to use either Hadoop groups or groups defined in the policy file. By default, Sentry
looks up groups locally, but it can be configured to look up Hadoop groups using LDAP (for Active Directory).
Local groups will be looked up on the host Sentry runs on. For Hive, this will be the host running HiveServer2.
Group mappings in Sentry can be summarized as in the figure below:

Important: You can use either Hadoop groups or local groups, but not both at the same time. Use
local groups if you want to do a quick proof-of-concept. For production, use Hadoop groups. Refer to
Appendix I - Configuring LDAP Group Mappings on page 211 for details on configuring LDAP group
mappings in Hadoop.



Configuring Hadoop Groups
Set the hive.sentry.provider property in sentry-site.xml.
<property>
<name>hive.sentry.provider</name>
<value>org.apache.sentry.provider.file.HadoopGroupResourceAuthorizationProvider</value>
</property>

Configuring Local Groups


1. Define local groups in a [users] section of the Policy file on page 47. For example:
[users]
user1 = group1, group2, group3
user2 = group2, group3

2. In sentry-site.xml, set hive.sentry.provider as follows:


<property>
<name>hive.sentry.provider</name>
<value>org.apache.sentry.provider.file.LocalGroupResourceAuthorizationProvider</value>
</property>

Setup and Configuration


Sentry stores the configuration as well as privilege policies in files. The sentry-site.xml file contains
configuration options such as group association provider, privilege policy file location, and so
on. The Policy file on page 47 contains the privileges and groups. It has a .ini file format and can be stored on
a local file system or HDFS.
Sentry is plugged into Hive as session hooks which you configure in hive-site.xml. The sentry package must
be installed; it contains the required JAR files. You must also configure properties in the Sentry Configuration
File.

Installing and Upgrading Sentry


Important:
If you have not already done so, install Cloudera's yum, zypper/YaST or apt repository before using
the following commands. For instructions, see CDH 5 Installation.

Upgrading Sentry from CDH 4 to CDH 5


To upgrade Sentry from CDH 4 to CDH 5, you must uninstall the old version and install the new version. If you
have already performed the steps to uninstall CDH 4 and all components, as described under Upgrading from
CDH 4 to CDH 5, you can skip Step 1 below and proceed with installing the new CDH 5 version of Sentry.
1. Remove the CDH 4 Version of Sentry
Remove Sentry as follows, depending on your operating system:
OS                  Command
RHEL                $ sudo yum remove sentry
SLES                $ sudo zypper remove sentry
Ubuntu or Debian    $ sudo apt-get remove sentry



2. Install the New Version of Sentry
Follow instructions in the next section to install the CDH 5 version of Sentry.
Important: Configuration files
If you install a newer version of a package that is already on the system, configuration files
that you have modified will remain intact.
If you uninstall a package, the package manager renames any configuration files you have
modified from <file> to <file>.rpmsave. If you then re-install the package (probably to
install a new version) the package manager creates a new <file> with applicable defaults.
You are responsible for applying any changes captured in the original configuration file to the
new configuration file. In the case of Ubuntu and Debian upgrades, you will be prompted if you
have made changes to a file for which there is a new version; for details, see Automatic handling
of configuration files by dpkg.
The upgrade is now complete.
Installing Sentry
Install Sentry as follows, depending on your operating system:
OS                  Command
RHEL                $ sudo yum install sentry
SLES                $ sudo zypper install sentry
Ubuntu or Debian    $ sudo apt-get update; sudo apt-get install sentry

Policy file
The sections that follow contain notes on creating and maintaining the policy file, and using URIs to load external
data and JARs.
Warning: An invalid policy file is ignored and an exception is logged. This causes users to lose access
to all Sentry-protected data, because the default Sentry behavior is to deny access unless a user has
been explicitly granted it. (Note that if only the per-DB policy file is invalid, it invalidates only the
policies in that file.)
Storing the Policy File
Considerations for storing the policy file(s) in HDFS include:
1. Replication count - Because the file is read for each query in Hive and read once every five minutes by all
Impala daemons, you should increase this value; since it is a small file, setting the replication count equal to
the number of slave nodes in the cluster is reasonable.
2. Updating the file - Updates to the file are reflected immediately, so you should write them to a temporary
copy of the file first, and then replace the existing file with the temporary one after all the updates are
complete. This avoids race conditions caused by reads on an incomplete file.
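For example, assuming the global policy file is stored at /etc/sentry/sentry-provider.ini in HDFS and the cluster has ten slave nodes (both hypothetical values), the update flow might look like this sketch:
$ hadoop fs -setrep 10 /etc/sentry/sentry-provider.ini
$ hadoop fs -put updated-sentry-provider.ini /etc/sentry/sentry-provider.ini.tmp
$ hadoop fs -rm /etc/sentry/sentry-provider.ini
$ hadoop fs -mv /etc/sentry/sentry-provider.ini.tmp /etc/sentry/sentry-provider.ini
The temporary copy is written completely before it replaces the live file, so readers never see a partially written policy file.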
Defining Roles
Keep in mind that role definitions are not cumulative; a definition that appears further down in the file replaces any earlier definition with the same name. For example, the following results in role1 having privilege2, not privilege1 and privilege2.
role1 = privilege1
role1 = privilege2
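To give role1 both privileges, list them in a single definition, separated by commas:
role1 = privilege1, privilege2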

Role names are scoped to a specific file. For example, if you give role1 the ALL privilege on db1 in the global
policy file and give role1 ALL on db2 in the per-db db2 policy file, the user will be given both privileges.
URIs
Any command which references a URI, such as CREATE EXTERNAL TABLE, LOAD, IMPORT, EXPORT, and so on, in addition to CREATE TEMPORARY FUNCTION, requires the URI privilege. This is an important security control; without it, users could simply create an external table over an existing table they do not have access
to and bypass Sentry.
URIs must start with either hdfs:// or file://. If a URI starts with anything else, it will cause an exception
and the policy file will be invalid.
When defining URIs for HDFS, you must also specify the NameNode. For example:
data_read = server=server1->uri=file:///path/to/dir,\
server=server1->uri=hdfs://namenode:port/path/to/dir

Important: Because the NameNode host and port must be specified, Cloudera strongly recommends
you use High Availability (HA). This ensures that the URI will remain constant even if the NameNode
changes.
To enable URIs in per-DB policy files, add the following string to the Java configuration options for HiveServer2
during startup.
-Dsentry.allow.uri.db.policyfile=true

Important: Enabling URIs in per-DB policy files introduces a security risk by allowing the owner of
the db-level policy file to grant himself/herself load privileges to anything the hive user has read
permissions for in HDFS (including data in other databases controlled by different db-level policy
files).
Loading Data
Data can be loaded using a landing skid, either in HDFS or via a local/NFS directory where HiveServer2/Impala
run. The following privileges can be used to grant a role access to a loading skid:
Load data from a local/NFS directory:
server=server1->uri=file:///path/to/nfs/local/to/nfs

Load data from HDFS (MapReduce, Pig, and so on):


server=server1->uri=hdfs://ha-nn-uri/data/landing-skid

In addition to the privilege in Sentry, the hive or impala user will require the appropriate file permissions to
access the data being loaded. Groups can be used for this purpose. For example, create a group hive-users,
and add the hive and impala users along with the users who will be loading data, to this group.
The example usermod and groupadd commands below are only applicable to locally defined groups on the
NameNode, JobTracker, and ResourceManager. If you use another system for group management, equivalent
changes should be made in your group management system.
$ groupadd hive-users
$ usermod -G someuser,hive-users someuser
$ usermod -G hive,hive-users hive

External Tables
External tables require the ALL@database privilege in addition to the URI privilege. When data is being inserted
through the EXTERNAL TABLE statement, or is referenced from an HDFS location outside the normal Hive
database directories, the user needs appropriate permissions on the URIs corresponding to those HDFS locations.
This means that the URI location must either be owned by the hive:hive user OR the hive/impala users must
be members of the group that owns the directory.
You can configure access to the directory using a URI as follows:
[roles]
someuser_home_dir_role = server=server1->uri=hdfs://ha-nn-uri/user/someuser

You should now be able to create an external table:


CREATE EXTERNAL TABLE ...
LOCATION 'hdfs://ha-nn-uri/user/someuser/mytable';

JARs and User-Defined Functions (UDFs)


The ADD JAR command does not work with HiveServer2 and the Beeline client when Beeline runs on a different host. As an alternative to ADD JAR, use Hive's auxiliary paths functionality, as described in the following steps.
Note: If you have a cluster managed by Cloudera Manager, please see Using User-Defined Functions
(UDFs) with HiveServer2.
1. On the Beeline client machine, in /etc/hive/conf/hive-site.xml, set the hive.aux.jars.path property
to a comma-separated list of the fully-qualified paths to the JAR file and any dependent libraries.
hive.aux.jars.path=file:/opt/local/hive/lib/my.jar

2. Copy the JAR file (and its dependent libraries) to the host running HiveServer2/Impala.
3. On the HiveServer2/Impala host, open /etc/default/hive-server2 and set the AUX_CLASSPATH variable
to a comma-separated list of the fully-qualified paths to the JAR file and any dependent libraries.
AUX_CLASSPATH=/opt/local/hive/lib/my.jar

4. To access the UDF, you must have URI privilege to the jar where the UDF resides. This privilege prevents
users from creating functions such as the reflect function which is disallowed because it allows users to
execute arbitrary Java code.
udf_r = server=server1->uri=file:///opt/local/hive/lib

5. Restart HiveServer2.
You should now be able to use the UDF:
CREATE TEMPORARY FUNCTION my_udf AS 'MyUDF';

Sample Configuration
This section provides a sample configuration.
Policy Files
The following is an example of a policy file with a per-DB policy file. In this example, the first policy file,
sentry-provider.ini would exist in HDFS; hdfs://ha-nn-uri/etc/sentry/sentry-provider.ini might
be an appropriate location. The per-DB policy file is for the customer's database. It is located at
hdfs://ha-nn-uri/etc/sentry/customers.ini.
sentry-provider.ini
[databases]
# Defines the location of the per DB policy file for the customers DB/schema
customers = hdfs://ha-nn-uri/etc/sentry/customers.ini
[groups]
# Assigns each Hadoop group to its set of roles
manager = analyst_role, junior_analyst_role
analyst = analyst_role
jranalyst = junior_analyst_role
customers_admin = customers_admin_role
admin = admin_role
[roles]
# The URIs below define a landing skid which
# the user can use to import or export data from the system.
# Since the server runs as the user "hive", files in that directory
# must either have the group hive and read/write set or
# be world read/write.
analyst_role = server=server1->db=analyst1, \
server=server1->db=jranalyst1->table=*->action=select, \
server=server1->uri=hdfs://ha-nn-uri/landing/analyst1
junior_analyst_role = server=server1->db=jranalyst1, \
server=server1->uri=hdfs://ha-nn-uri/landing/jranalyst1
# Implies everything on server1 -> customers. Privileges for
# customers can be defined in the global policy file even though
# customers has its own policy file. Note that the privileges from
# both the global policy file and the per-DB policy file
# are merged. There is no overriding.
customers_admin_role = server=server1->db=customers
# Implies everything on server1.
admin_role = server=server1

customers.ini
[groups]
manager = customers_insert_role, customers_select_role
analyst = customers_select_role
[roles]
customers_insert_role = server=server1->db=customers->table=*->action=insert
customers_select_role = server=server1->db=customers->table=*->action=select

Important: Sentry does not support using the view keyword in policy files. If you want to define a
role against a view, use the keyword table instead. For example, to define the role analyst_role
against the view col_test_view:
[roles]
analyst_role = server=server1->db=default->table=col_test_view->action=select

Sentry Configuration File


The following is an example of a sentry-site.xml file.

Important: If you are using Cloudera Manager 4.6 (or earlier), make sure you do not store
sentry-site.xml in /etc/hive/conf; that directory is regenerated whenever the Hive client
configurations are redeployed. Instead, use a directory such as /etc/sentry to store the sentry
file.
If you are using Cloudera Manager 4.7 (or later), Cloudera Manager will create and deploy
sentry-site.xml for you. See Sentry for Hive Authorization for more details on configuring Sentry
with Cloudera Manager.
sentry-site.xml
<configuration>
<property>
<name>hive.sentry.provider</name>
<value>org.apache.sentry.provider.file.HadoopGroupResourceAuthorizationProvider</value>
</property>
<property>
<name>hive.sentry.provider.resource</name>
<value>/path/to/authz-provider.ini</value>
<!-- If the hdfs-site.xml points to HDFS, the path will be in HDFS;
alternatively you could specify a full path, e.g.:
hdfs://namenode:port/path/to/authz-provider.ini
file:///path/to/authz-provider.ini
-->
</property>
<property>
<name>sentry.hive.server</name>
<value>server1</value>
</property>
</configuration>

Enabling Sentry in HiveServer2


To enable Sentry, add the following properties to hive-site.xml:
<property>
<name>hive.server2.session.hook</name>
<value>org.apache.sentry.binding.hive.HiveAuthzBindingSessionHook</value>
</property>
<property>
<name>hive.sentry.conf.url</name>
<value></value>
<description>sentry-site.xml file location</description>
</property>

Securing the Hive Metastore


It's important that the Hive metastore be secured. Do this by turning on Hive metastore security, using the
instructions in the CDH 5 Security Guide:
Secure the Hive metastore; see Hive MetaStoreServer Security Configuration.
In addition, allow access to the metastore only from the HiveServer2 server (see "Securing the Hive Metastore"
under HiveServer2 Security Configuration) and then disable local access to the HiveServer2 server.

Accessing Sentry-Secured Data Outside Hive/Impala


When Sentry is enabled, the hive user owns all data within the Hive warehouse. However, unlike traditional
database systems the enterprise data hub allows for multiple engines to execute over the same dataset.
Note: Cloudera strongly recommends you use Hive/Impala SQL queries to access data secured by
Sentry, as opposed to accessing the data files directly.
However, there are scenarios where fully vetted and reviewed jobs will also need to access the data stored in
the Hive warehouse. A typical scenario would be a secured MapReduce transformation job that is executed
automatically as an application user. In such cases it's important to know that the user executing this job will
also have full access to the data in the Hive warehouse.

Scenario One: Authorizing Jobs


Problem
A reviewed, vetted, and automated job requires access to the Hive warehouse and cannot use Hive/Impala to
access the data.
Solution
Create a group which contains hive, impala, and the user executing the automated job.
For example, if the etl user is executing the automated job, you can create a group called hive-users which
contains the hive, impala, and etl users.
The example usermod and groupadd commands below are only applicable to locally defined groups on the
NameNode, JobTracker, and ResourceManager. If you use another system for group management, equivalent
changes should be made in your group management system.
$ groupadd hive-users
$ usermod -G hive,impala,hive-users hive
$ usermod -G hive,impala,hive-users impala
$ usermod -G etl,hive-users etl

Once you have added users to the hive-users group, change the ownership and permissions of the warehouse directory in HDFS:
$ hadoop fs -chown -R hive:hive-users /user/hive/warehouse
$ hadoop fs -chmod -R 770 /user/hive/warehouse

Scenario Two: Authorizing Group Access to Databases


Problem
One group of users, grp1, should have full access to the database db1, outside of Sentry. The database db1 should not be accessible to any other groups outside of Sentry. Sentry should be used for all other authorization
needs.
Solution
Place the hive and impala users in grp1.
$ usermod -G hive,impala,grp1 hive
$ usermod -G hive,impala,grp1 impala

Then change the ownership of all directories and files in db1 to hive:grp1, and modify directory permissions in HDFS. This example is only applicable to local groups on a single host.
$ hadoop fs -chown -R hive:grp1 /user/hive/warehouse/db1.db
$ hadoop fs -chmod -R 770 /user/hive/warehouse/db1.db

Debugging Failed Sentry Authorization Requests


Sentry logs all facts that lead up to authorization decisions at the debug level. If you do not understand why
Sentry is denying access, the best way to debug is to temporarily turn on debug logging:
In Cloudera Manager, add log4j.logger.org.apache.sentry=DEBUG to the logging settings for your service
through the corresponding Logging Safety Valve field for the Impala, Hive Server 2, or Solr Server services.
On systems not managed by Cloudera Manager, add log4j.logger.org.apache.sentry=DEBUG to the
log4j.properties file on each host in the cluster, in the appropriate configuration directory for each service.
Specifically, look for exceptions and messages such as:
FilePermission server..., RequestPermission server...., result [true|false]

which indicate each evaluation Sentry makes. The FilePermission is from the policy file, while
RequestPermission is the privilege required for the query. A RequestPermission will iterate over all appropriate
FilePermission settings until a match is found. If no matching privilege is found, Sentry returns false indicating
Access Denied.

Appendix: Authorization Privilege Model for Hive and Impala


Privileges can be granted on different objects in the Hive warehouse. Any privilege that can be granted is
associated with a level in the object hierarchy. If a privilege is granted on a container object in the hierarchy, the
base object automatically inherits it. For instance, if a user has ALL privileges on the database scope, then (s)he
has ALL privileges on all of the base objects contained within that scope.

Object Hierarchy in Hive


Server
Database
Table
Partition
Columns
View
Index
Function/Routine
Lock

Table 3: Valid privilege types and objects they apply to

Privilege | Object
INSERT | DB, TABLE
SELECT | DB, TABLE
ALL | SERVER, TABLE, DB, URI

Table 4: Privilege hierarchy

Base Object | Granular privileges on object | Container object that contains the base object | Privileges on container object that implies privileges on the base object
DATABASE | ALL | SERVER | ALL
TABLE | INSERT | DATABASE | ALL
TABLE | SELECT | DATABASE | ALL
VIEW | SELECT | DATABASE | ALL

Table 5: Privilege table for Hive & Impala operations

Operation | Scope | Privileges | URI | Others
CREATE DATABASE | SERVER | ALL
DROP DATABASE | DATABASE | ALL
CREATE TABLE | DATABASE | ALL
DROP TABLE | TABLE | ALL
CREATE VIEW | DATABASE | ALL | | SELECT on TABLE
DROP VIEW | VIEW/TABLE | ALL
CREATE INDEX | TABLE | ALL
DROP INDEX | TABLE | ALL
ALTER TABLE .. ADD COLUMNS | TABLE | ALL
ALTER TABLE .. REPLACE COLUMNS | TABLE | ALL
ALTER TABLE .. CHANGE column | TABLE | ALL
ALTER TABLE .. RENAME | TABLE | ALL
ALTER TABLE .. SET TBLPROPERTIES | TABLE | ALL
ALTER TABLE .. SET FILEFORMAT | TABLE | ALL
ALTER TABLE .. SET LOCATION | TABLE | ALL | URI
ALTER TABLE .. ADD PARTITION | TABLE | ALL
ALTER TABLE .. ADD PARTITION location | TABLE | ALL | URI
ALTER TABLE .. DROP PARTITION | TABLE | ALL
ALTER TABLE .. PARTITION SET FILEFORMAT | TABLE | ALL
SHOW TBLPROPERTIES | TABLE | SELECT/INSERT
SHOW CREATE TABLE | TABLE | SELECT/INSERT
SHOW PARTITIONS | TABLE | SELECT/INSERT
DESCRIBE TABLE | TABLE | SELECT/INSERT
DESCRIBE TABLE .. PARTITION | TABLE | SELECT/INSERT
LOAD DATA | TABLE | INSERT | URI
SELECT | TABLE | SELECT
INSERT OVERWRITE TABLE | TABLE | INSERT
CREATE TABLE .. AS SELECT | DATABASE | ALL | | SELECT on TABLE
USE <dbName> | Any
ALTER TABLE .. SET SERDEPROPERTIES | TABLE | ALL
ALTER TABLE .. PARTITION SET SERDEPROPERTIES | TABLE | ALL

Hive-Only Operations
INSERT OVERWRITE DIRECTORY | TABLE | INSERT | URI
Analyze TABLE | TABLE | SELECT + INSERT
IMPORT TABLE | DATABASE | ALL | URI
EXPORT TABLE | TABLE | SELECT | URI
ALTER TABLE TOUCH | TABLE | ALL
ALTER TABLE TOUCH PARTITION | TABLE | ALL
ALTER TABLE .. CLUSTERED BY SORTED BY | TABLE | ALL
ALTER TABLE .. ENABLE/DISABLE | TABLE | ALL
ALTER TABLE .. PARTITION ENABLE/DISABLE | TABLE | ALL
ALTER TABLE .. PARTITION .. RENAME TO PARTITION | TABLE | ALL
ALTER DATABASE | DATABASE | ALL
DESCRIBE DATABASE | DATABASE | SELECT/INSERT
SHOW COLUMNS | TABLE | SELECT/INSERT
SHOW INDEXES | TABLE | SELECT/INSERT
GRANT PRIVILEGE | Allowed only for Sentry admin users
REVOKE PRIVILEGE | Allowed only for Sentry admin users
SHOW GRANTS | Allowed only for Sentry admin users
ADD JAR | Not Allowed
ADD FILE | Not Allowed
DFS | Not Allowed

Impala-Only Operations
EXPLAIN | TABLE | SELECT
INVALIDATE METADATA | SERVER | ALL
INVALIDATE METADATA <table name> | TABLE | SELECT/INSERT
REFRESH <table name> | TABLE | SELECT/INSERT
CREATE FUNCTION | SERVER | ALL
DROP FUNCTION | SERVER | ALL
COMPUTE STATS | TABLE | ALL

Sentry Service Configuration


Important: This is the documentation for the Sentry service introduced in CDH 5.1. If you want to use
Sentry's previous policy file approach to secure your data, see Sentry Policy File Configuration on page
43 for more information.
The Sentry service is an RPC server that stores the authorization metadata in an underlying relational database
and provides RPC interfaces to retrieve and manipulate privileges. It supports secure access to services using
Kerberos. The Hive and Impala services are clients of this service. The service provides authorization metadata
from the database-backed storage; it does not handle actual privilege validation.
The motivation behind introducing a new Sentry service is to make it easier to handle user privileges than with the existing policy file approach. Providing a database-backed service instead allows you to use the more traditional GRANT/REVOKE statements to modify privileges.

Prerequisites
Sentry depends on an underlying authentication framework to reliably identify the requesting user. It requires:
CDH 5.1.x
HiveServer2 with strong authentication (Kerberos or LDAP)
A secure Hadoop cluster
This is to prevent a user bypassing the authorization and gaining direct access to the underlying data.
In addition to the above, make sure that the following are true:
The Hive warehouse directory (/user/hive/warehouse or any path you specify as
hive.metastore.warehouse.dir in your hive-site.xml) must be owned by the Hive user and group.
Permissions on the warehouse directory must be set as follows (see following Note for caveats):
771 on the directory itself (for example, /user/hive/warehouse)
771 on all subdirectories (for example, /user/hive/warehouse/mysubdir)
All files and subdirectories should be owned by hive:hive
For example:
$ sudo -u hdfs hdfs dfs -chmod -R 771 /user/hive/warehouse
$ sudo -u hdfs hdfs dfs -chown -R hive:hive /user/hive/warehouse

Note:
If you set hive.warehouse.subdir.inherit.perms to true in hive-site.xml, the
permissions on the subdirectories will be set when you set permissions on the warehouse
directory itself.
If a user has access to any object in the warehouse, that user will be able to execute use
default. This ensures that use default commands issued by legacy applications work
when Sentry is enabled. Note that you can protect objects in the default database (or any
other database) by means of a policy file.

Important: These instructions override the recommendations in the Hive section of the CDH
5 Installation Guide.

HiveServer2 impersonation must be turned off.
The Hive user must be able to submit MapReduce jobs. You can ensure that this is true by setting the
minimum user ID for job submission to 0. Edit the taskcontroller.cfg file and set min.user.id=0.
To enable the Hive user to submit YARN jobs, add the user hive to the allowed.system.users configuration
property. Edit the container-executor.cfg file and add hive to the allowed.system.users property.
For example,
allowed.system.users=nobody,impala,hive

Important:
You must restart the cluster and HiveServer2 after changing this value, whether you use
Cloudera Manager or not.
These instructions override the instructions under Configuring MRv1 Security on page 36
These instructions override the instructions under Configuring YARN Security on page 38

Privilege Model
With CDH 5.1, the privilege model has undergone changes to accommodate the new grant/revoke syntax that is used with the Sentry service. These changes are common to both the new database-backed Sentry service and the previous policy file approach.
The Sentry privilege model has the following characteristics:
Allows any user to execute show function, desc function, and show locks.
Allows the user to see only those tables and databases for which this user has privileges.
Requires a user to have the necessary privileges on the URI to execute HiveQL operations that take in a
location. Examples of such operations include LOAD, IMPORT, and EXPORT.
Important: When Sentry is enabled, a user with no privileges on a database will not be allowed to
connect to HiveServer2. This is because the use <database> command is now executed as part of
the connection to HiveServer2, which is why the connection fails. See HIVE-4256.
For more information, see Appendix: Authorization Privilege Model for Hive and Impala on page 66.

Users and Groups


A user is an entity that is permitted by the authentication subsystem to access the Hive service. This entity
can be a Kerberos principal, an LDAP userid, or an artifact of some other pluggable authentication system
supported by HiveServer2.
A group connects the authentication system with the authorization system. It is a collection of one or more
users who have been granted one or more authorization roles. Sentry allows a set of roles to be configured
for a group.
A configured group provider determines a user's affiliation with a group. The current release supports HDFS-backed groups and locally configured groups.

User to Group Mapping


You can configure Sentry to use either Hadoop groups or groups defined in the policy file. By default, Sentry
looks up groups locally, but it can be configured to look up Hadoop groups using LDAP (for Active Directory).
Local groups will be looked up on the host Sentry runs on. For Hive, this will be the host running HiveServer2.
Group mappings in Sentry can be summarized as follows: a user's groups are resolved either through the Hadoop group mapping service (OS- or LDAP-based) or through local groups defined in the policy file.
Important: You can use either Hadoop groups or local groups, but not both at the same time. Use local groups if you want to do a quick proof-of-concept. For production, use Hadoop groups. Refer to Appendix I - Configuring LDAP Group Mappings on page 211 for details on configuring LDAP group mappings in Hadoop.
Configuring Hadoop Groups
Set the hive.sentry.provider property in sentry-site.xml.
<property>
<name>hive.sentry.provider</name>
<value>org.apache.sentry.provider.file.HadoopGroupResourceAuthorizationProvider</value>
</property>

Configuring Local Groups


1. Define local groups in a [users] section of the Policy file on page 47. For example:
[users]
user1 = group1, group2, group3
user2 = group2, group3

2. In sentry-site.xml, set hive.sentry.provider as follows:


<property>
<name>hive.sentry.provider</name>
<value>org.apache.sentry.provider.file.LocalGroupResourceAuthorizationProvider</value>
</property>

Setup and Configuration


Installing and Upgrading Sentry
Upgrading Sentry from CDH 4 to CDH 5
To upgrade Sentry from CDH 4 to CDH 5, you must uninstall the old version and install the new version. If you
have already performed the steps to uninstall CDH 4 and all components, as described under Upgrading from
CDH 4 to CDH 5, you can skip Step 1 below and proceed with installing the new CDH 5 version of Sentry.
1. Remove the CDH 4 Version of Sentry
Remove Sentry as follows, depending on your operating system:
RHEL:
$ sudo yum remove sentry

SLES:
$ sudo zypper remove sentry

Ubuntu or Debian:
$ sudo apt-get remove sentry

2. Install the New Version of Sentry


Follow instructions in the next section to install the CDH 5 version of Sentry.
Important: Configuration files
If you install a newer version of a package that is already on the system, configuration files
that you have modified will remain intact.
If you uninstall a package, the package manager renames any configuration files you have
modified from <file> to <file>.rpmsave. If you then re-install the package (probably to
install a new version) the package manager creates a new <file> with applicable defaults.
You are responsible for applying any changes captured in the original configuration file to the
new configuration file. In the case of Ubuntu and Debian upgrades, you will be prompted if you
have made changes to a file for which there is a new version; for details, see Automatic handling
of configuration files by dpkg.
The upgrade is now complete.
Installing Sentry
Install Sentry as follows, depending on your operating system:
RHEL:
$ sudo yum install sentry

SLES:
$ sudo zypper install sentry

Ubuntu or Debian:
$ sudo apt-get update; sudo apt-get install sentry

Starting the Sentry Service


Perform the following steps to start the Sentry service on your cluster.
1. Set the SENTRY_HOME and HADOOP_HOME parameters.

2. Create the Sentry database schema using the Sentry schematool. Sentry, by default, does not initialize the
schema. The schematool is a built-in way for you to deploy the backend schema required by the Sentry
service. For example, the following command uses the schematool to initialize the schema for a MySQL
database.
bin/sentry --command schema-tool --conffile <sentry-site.xml> --dbType mysql
--initSchema

Alternatively, you can set the sentry.verify.schema.version configuration property to false. However,
this is not recommended.
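If you do take that route, a minimal sketch of how that property might be set in sentry-site.xml (assuming it belongs in the Sentry service's configuration file) is:
<property>
<name>sentry.verify.schema.version</name>
<value>false</value>
</property>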
3. Start the Sentry service.
bin/sentry --command service --conffile <sentry-site.xml>

Hive SQL Syntax


Permissions stored in the Sentry service are configured through Grant and Revoke statements issued either
interactively or programmatically through the HiveServer2 SQL command line interface, Beeline. The syntax
described below is very similar to the GRANT/REVOKE commands available in well-established relational database
systems.
CREATE ROLE Statement
The CREATE ROLE statement creates a role to which privileges can be granted. Privileges can be granted to roles,
which can then be assigned to users. A user that has been assigned a role will only be able to exercise the
privileges of that role.
Only users that have administrative privileges can create/drop roles. By default, the hive, impala and hue users
have admin privileges in Sentry.
CREATE ROLE [role_name];

DROP ROLE Statement


The DROP ROLE statement can be used to remove a role from the database. Once dropped, the role will be revoked
for all users to whom it was previously assigned. Queries that are already executing will not be affected. However,
since Hive checks user privileges before executing each query, active user sessions in which the role has already
been enabled will be affected.
DROP ROLE [role_name];

GRANT ROLE Statement


The GRANT ROLE statement can be used to grant roles to groups. Only sentry admin users can grant the role to
a group.
GRANT ROLE role_name [, role_name]
TO GROUP <groupName> [,GROUP <groupName>]

REVOKE ROLE Statement


The REVOKE ROLE statement can be used to revoke roles from groups. Only sentry admin users can revoke the
role from a group.
REVOKE ROLE role_name [, role_name]
FROM GROUP <groupName> [,GROUP <groupName>]
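For example, to grant analyst_role to the analyst group and later revoke it (group and role names taken from the sample policy file shown later in this section):
GRANT ROLE analyst_role TO GROUP analyst;
REVOKE ROLE analyst_role FROM GROUP analyst;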

GRANT <PRIVILEGE> Statement
In order to grant privileges on an object to a role, the user must be a sentry admin user.
GRANT
<PRIVILEGE> [, <PRIVILEGE> ]
ON <OBJECT> <object_name>
TO ROLE <roleName> [,ROLE <roleName>]

REVOKE <PRIVILEGE> Statement


Since only authorized admin users can create roles and grant privileges, only Sentry admin users can revoke privileges from a role.
REVOKE
<PRIVILEGE> [, <PRIVILEGE> ]
ON <OBJECT> <object_name>
FROM ROLE <roleName> [,ROLE <roleName>]
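For example, to grant a role the SELECT privilege on a database and later revoke it (using the jranalyst1 database and analyst_role from the sample that follows):
GRANT SELECT ON DATABASE jranalyst1 TO ROLE analyst_role;
REVOKE SELECT ON DATABASE jranalyst1 FROM ROLE analyst_role;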

SET ROLE Statement


The SET ROLE statement can be used to specify a role to be enabled for the current session. A user can only
enable a role that has been granted to them. Any roles not listed and not already enabled are disabled for the
current session. If no roles are enabled, the user will have the privileges granted by any of the roles that (s)he
belongs to.
To enable a specific role:
SET ROLE <roleName>;

To enable all roles:


SET ROLE ALL;

No roles enabled:
SET ROLE NONE;

SHOW Statement
To list all the roles in the system (only for sentry admin users):
SHOW ROLES;

To list all the roles in effect for the current user session:
SHOW CURRENT ROLES;

To list all the roles assigned to the given <groupName> (only for sentry admin users):
SHOW ROLE GRANT GROUP <groupName>;

The SHOW statement can also be used to list the privileges that have been granted to a role or all the grants
given to a role for a particular object.
To list all the grants for the given <roleName> (only for sentry admin users):
SHOW GRANT ROLE <roleName>;

To list all the grants for a role on the given <objectName> (only for sentry admin users):
SHOW GRANT ROLE <roleName> on OBJECT <objectName>;

Example: Using Grant/Revoke Statements to Match an Existing Policy File


Here is a sample policy file:
[groups]
# Assigns each Hadoop group to its set of roles
manager = analyst_role, junior_analyst_role
analyst = analyst_role
jranalyst = junior_analyst_role
customers_admin = customers_admin_role
admin = admin_role
[roles]
# The URIs below define a landing skid which
# the user can use to import or export data from the system.
# Since the server runs as the user "hive", files in that directory
# must either have the group hive and read/write set or
# be world read/write.
analyst_role = server=server1->db=analyst1, \
server=server1->db=jranalyst1->table=*->action=select, \
server=server1->uri=hdfs://ha-nn-uri/landing/analyst1
junior_analyst_role = server=server1->db=jranalyst1, \
server=server1->uri=hdfs://ha-nn-uri/landing/jranalyst1
# Implies everything on server1.
admin_role = server=server1

The following sections show how you can use the new GRANT statements to assign privileges to roles (and assign
roles to groups) to match the sample policy file above.
Grant privileges to analyst_role:
CREATE ROLE analyst_role;
GRANT ALL ON DATABASE analyst1 TO ROLE analyst_role;
GRANT SELECT ON DATABASE jranalyst1 TO ROLE analyst_role;
GRANT ALL ON URI 'hdfs://ha-nn-uri/landing/analyst1' \
TO ROLE analyst_role;

Grant privileges to junior_analyst_role:


CREATE ROLE junior_analyst_role;
GRANT ALL ON DATABASE jranalyst1 TO ROLE junior_analyst_role;
GRANT ALL ON URI 'hdfs://ha-nn-uri/landing/jranalyst1' \
TO ROLE junior_analyst_role;

Grant privileges to admin_role:


CREATE ROLE admin_role;
GRANT ALL ON SERVER server1 TO ROLE admin_role;

Grant roles to groups:


GRANT ROLE admin_role TO GROUP admin;
GRANT ROLE analyst_role TO GROUP analyst;
GRANT ROLE junior_analyst_role TO GROUP jranalyst;

Configuring HiveServer2 for the Sentry Service


Add the following property to hive-site.xml to allow the Hive service to communicate with the Sentry policy
store.
<property>
<name>hive.security.authorization.task.factory</name>
<value>org.apache.sentry.binding.hive.SentryHiveAuthorizationTaskFactoryImpl</value>
</property>
<property>
<name>hive.server2.session.hook</name>
<value>org.apache.sentry.binding.hive.HiveAuthzBindingSessionHook</value>
</property>
<property>
<name>hive.sentry.conf.url</name>
<value>file:///{{CMF_CONF_DIR}}/sentry-site.xml</value>
</property>

Configuring the Hive Metastore for the Sentry Service


Configuring Pig and HCatalog for the Sentry Service
Once you have the Sentry service up and running, and Hive has been configured to use the Sentry service, there
are some configuration changes you must make to your cluster to allow Pig, MapReduce (using HCatLoader,
HCatStorer) and WebHCat queries to access Sentry-secured data stored in Hive.
With HDFS extended ACLs enabled, Cloudera recommends you set the permissions for the Hive warehouse
directory, /user/hive/warehouse, to 771 so users other than the owner and group only have execute
permissions. Since by default, the /user/hive/warehouse directory is owned by hive:hive, this also restricts
requests from any other users at the HDFS level.
With these permissions, other user requests may fail, such as commands coming through Pig jobs, WebHCat
queries, and MapReduce jobs. In order to give these users access, perform the following configuration changes:
Use HDFS ACLs to define permissions on a specific directory or file of HDFS. This directory/file is generally
mapped to a database, table, partition, or a data file.
Users running these jobs should have the required permissions in Sentry to add new metadata or read
metadata from the Hive Metastore Server. For instructions on how to set up the required permissions, see
Hive SQL Syntax on page 61. You can use HiveServer2's command line interface, Beeline to update the Sentry
database with the user privileges.
Examples:
A user who is using Pig HCatLoader will require read permissions on a specific table or partition. In such a case, you can GRANT read access to the user in Sentry and set the ACL to read and execute on the files being accessed.
A user who is using Pig HCatStorer will require ALL permissions on a specific table. In this case, you GRANT ALL access to the user in Sentry and set the ACL to write and execute on the table being used.
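A minimal sketch of the HCatLoader case, assuming a hypothetical database sales read by users in a hypothetical group etl: first grant read access through Beeline as a Sentry admin user,
CREATE ROLE etl_read_role;
GRANT SELECT ON DATABASE sales TO ROLE etl_read_role;
GRANT ROLE etl_read_role TO GROUP etl;
and then add a matching HDFS ACL on the data being read:
$ sudo -u hdfs hdfs dfs -setfacl -R -m group:etl:r-x /user/hive/warehouse/sales.db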

Configuring the Hive Metastore to Communicate with Sentry
Add the following properties to hive-site.xml to allow the Hive Metastore to communicate with the Sentry
policy store.
<property>
<name>hive.metastore.pre.event.listeners</name>
<value>org.apache.sentry.binding.metastore.MetastoreAuthzBinding</value>
<description>list of comma separated listeners for metastore events.</description>
</property>
<property>
<name>hive.metastore.event.listeners</name>
<value>org.apache.sentry.binding.metastore.SentryMetastorePostEventListener</value>
<description>list of comma separated listeners for metastore, post events.</description>
</property>

Configuring Impala for the Sentry Service


To configure Impala as a client of the Sentry service, set the following configuration properties in
sentry-site.xml.
<property>
<name>sentry.service.client.server.rpc-port</name>
<value>3893</value>
</property>
<property>
<name>sentry.service.client.server.rpc-address</name>
<value>hostname</value>
</property>
<property>
<name>sentry.service.client.server.rpc-connection-timeout</name>
<value>200000</value>
</property>
<property>
<name>sentry.service.security.mode</name>
<value>none</value>
</property>

Other configuration changes required include:


To enable the Sentry policy service, the following flag should be set on the catalogd and the impalad.
--sentry_config=<absolute path to sentry service configuration file>

To enable authorization based on policy server metadata, set the following flag on the impalad.
--server_name=<server name>

To enable authorization based on a file-based policy, set the following flags on the impalad.
--server_name=<server name>
--authorization_policy_file=<path to policy file>

If the --authorization_policy_file flag is set, Impala will use the policy file-based approach. Otherwise,
the policy server metadata approach will be used to implement authorization.
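For example, an impalad configured against the Sentry service (no policy file) might be started with flags along these lines; the paths and server name here are placeholders, not required values:
--sentry_config=/etc/sentry/conf/sentry-site.xml --server_name=server1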
The impala user also needs to be added to the list of administrative users of the Sentry Policy Server. For more details, see SENTRY-191.

Appendix: Authorization Privilege Model for Hive and Impala


Privileges can be granted on different objects in the Hive warehouse. Any privilege that can be granted is
associated with a level in the object hierarchy. If a privilege is granted on a container object in the hierarchy, the
base object automatically inherits it. For instance, if a user has ALL privileges on the database scope, then (s)he
has ALL privileges on all of the base objects contained within that scope.

Object Hierarchy in Hive


Server
Database
Table
Partition
Columns
View
Index
Function/Routine
Lock

Table 6: Valid privilege types and objects they apply to

Privilege | Object
INSERT | DB, TABLE
SELECT | DB, TABLE
ALL | SERVER, TABLE, DB, URI

Table 7: Privilege hierarchy

Base Object | Granular privileges on object | Container object that contains the base object | Privileges on container object that implies privileges on the base object
DATABASE | ALL | SERVER | ALL
TABLE | INSERT | DATABASE | ALL
TABLE | SELECT | DATABASE | ALL
VIEW | SELECT | DATABASE | ALL

Table 8: Privilege table for Hive & Impala operations

Operation | Scope | Privileges | URI | Others
CREATE DATABASE | SERVER | ALL
DROP DATABASE | DATABASE | ALL
CREATE TABLE | DATABASE | ALL
DROP TABLE | TABLE | ALL
CREATE VIEW | DATABASE | ALL | | SELECT on TABLE
DROP VIEW | VIEW/TABLE | ALL
CREATE INDEX | TABLE | ALL
DROP INDEX | TABLE | ALL
ALTER TABLE .. ADD COLUMNS | TABLE | ALL
ALTER TABLE .. REPLACE COLUMNS | TABLE | ALL
ALTER TABLE .. CHANGE column | TABLE | ALL
ALTER TABLE .. RENAME | TABLE | ALL
ALTER TABLE .. SET TBLPROPERTIES | TABLE | ALL
ALTER TABLE .. SET FILEFORMAT | TABLE | ALL
ALTER TABLE .. SET LOCATION | TABLE | ALL | URI
ALTER TABLE .. ADD PARTITION | TABLE | ALL
ALTER TABLE .. ADD PARTITION location | TABLE | ALL | URI
ALTER TABLE .. DROP PARTITION | TABLE | ALL
ALTER TABLE .. PARTITION SET FILEFORMAT | TABLE | ALL
SHOW TBLPROPERTIES | TABLE | SELECT/INSERT
SHOW CREATE TABLE | TABLE | SELECT/INSERT
SHOW PARTITIONS | TABLE | SELECT/INSERT
DESCRIBE TABLE | TABLE | SELECT/INSERT
DESCRIBE TABLE .. PARTITION | TABLE | SELECT/INSERT
LOAD DATA | TABLE | INSERT | URI
SELECT | TABLE | SELECT
INSERT OVERWRITE TABLE | TABLE | INSERT
CREATE TABLE .. AS SELECT | DATABASE | ALL | | SELECT on TABLE
USE <dbName> | Any
ALTER TABLE .. SET SERDEPROPERTIES | TABLE | ALL
ALTER TABLE .. PARTITION SET SERDEPROPERTIES | TABLE | ALL

Hive-Only Operations
INSERT OVERWRITE DIRECTORY | TABLE | INSERT | URI
Analyze TABLE | TABLE | SELECT + INSERT
IMPORT TABLE | DATABASE | ALL | URI
EXPORT TABLE | TABLE | SELECT | URI
ALTER TABLE TOUCH | TABLE | ALL
ALTER TABLE TOUCH PARTITION | TABLE | ALL
ALTER TABLE .. CLUSTERED BY SORTED BY | TABLE | ALL
ALTER TABLE .. ENABLE/DISABLE | TABLE | ALL
ALTER TABLE .. PARTITION ENABLE/DISABLE | TABLE | ALL
ALTER TABLE .. PARTITION .. RENAME TO PARTITION | TABLE | ALL
ALTER DATABASE | DATABASE | ALL
DESCRIBE DATABASE | DATABASE | SELECT/INSERT
SHOW COLUMNS | TABLE | SELECT/INSERT
SHOW INDEXES | TABLE | SELECT/INSERT
GRANT PRIVILEGE | Allowed only for Sentry admin users
REVOKE PRIVILEGE | Allowed only for Sentry admin users
SHOW GRANTS | Allowed only for Sentry admin users
ADD JAR | Not Allowed
ADD FILE | Not Allowed
DFS | Not Allowed

Impala-Only Operations
EXPLAIN | TABLE | SELECT
INVALIDATE METADATA | SERVER | ALL
INVALIDATE METADATA <table name> | TABLE | SELECT/INSERT
REFRESH <table name> | TABLE | SELECT/INSERT
CREATE FUNCTION | SERVER | ALL
DROP FUNCTION | SERVER | ALL
COMPUTE STATS | TABLE | ALL

Flume Security Configuration


Flume agents have the ability to store data on an HDFS filesystem configured with Hadoop security. The Kerberos
system and protocols authenticate communications between clients and services. Hadoop clients include users
and MapReduce jobs on behalf of users, and the services include HDFS and MapReduce. Flume acts as a Kerberos
principal (user) and needs Kerberos credentials to interact with the Kerberos security-enabled service.
Authenticating a user or a service can be done using a Kerberos keytab file. This file contains a key that is used
to obtain a ticket-granting ticket (TGT). The TGT is used to mutually authenticate the client and the service via
the Kerberos KDC.
The following sections describe how to use Flume 1.3.x and CDH 5 with Kerberos security on your Hadoop cluster:

Configuring Flume's Security Properties on page 71


Flume Account Requirements on page 73
Testing the Flume HDFS Sink Configuration on page 73
Writing to a Secure HBase cluster on page 73
Important:
To enable Flume to work with Kerberos security on your Hadoop cluster, make sure you perform the
installation and configuration steps in Configuring Hadoop Security in CDH 5.

Note:
These instructions have been tested with CDH 5 and MIT Kerberos 5 only. The following instructions
describe an example of how to configure a Flume agent to be a client as the user flume to a secure
HDFS service. This section does not describe how to secure the communications between Flume
agents, which is not currently implemented.

Configuring Flume's Security Properties


Writing as a single user for all HDFS sinks in a given Flume agent
The Hadoop services require a three-part principal that has the form of
username/[email protected]. Cloudera recommends using flume as the first
component and the fully qualified domain name of the host machine as the second. Assuming that Kerberos
and security-enabled Hadoop have been properly configured on the Hadoop cluster itself, you must add the
following parameters to the Flume agent's flume.conf configuration file, which is typically located at
/etc/flume-ng/conf/flume.conf:
agentName.sinks.sinkName.hdfs.kerberosPrincipal =
flume/[email protected]
agentName.sinks.sinkName.hdfs.kerberosKeytab = /etc/flume-ng/conf/flume.keytab

where:
agentName is the name of the Flume agent being configured, which in this release defaults to the value "agent".
sinkName is the name of the HDFS sink that is being configured. The respective sink's type must be HDFS.

In the previous example, flume is the first component of the principal name, fully.qualified.domain.name
is the second, and YOUR-REALM.COM is the name of the Kerberos realm your Hadoop cluster is in. The
/etc/flume-ng/conf/flume.keytab file contains the keys necessary for
flume/[email protected] to authenticate with other services.

Flume and Hadoop also provide a simple keyword, _HOST, that gets expanded to be the fully qualified domain
name of the host machine where the service is running. This allows you to have one flume.conf file with the
same hdfs.kerberosPrincipal value on all of your agent host machines.
agentName.sinks.sinkName.hdfs.kerberosPrincipal = flume/[email protected]

Writing as different users across multiple HDFS sinks in a single Flume agent
In this release, support has been added for secure impersonation of Hadoop users (similar to "sudo" in UNIX).
This is implemented in a way similar to how Oozie implements secure user impersonation.
The following steps to set up secure impersonation from Flume to HDFS assume your cluster is configured using
Kerberos. (However, impersonation also works on non-Kerberos secured clusters, and Kerberos-specific aspects
should be omitted in that case.)
1. Configure Hadoop to allow impersonation. Add the following configuration properties to your core-site.xml.
<property>
<name>hadoop.proxyuser.flume.groups</name>
<value>group1,group2</value>
<description>Allow the flume user to impersonate any members of group1 and
group2</description>
</property>
<property>
<name>hadoop.proxyuser.flume.hosts</name>
<value>host1,host2</value>
<description>Allow the flume user to connect only from host1 and host2 to
impersonate a user</description>
</property>

You can use the wildcard character * to enable impersonation of any user from any host. For more information, see Secure Impersonation.
2. Set up a Kerberos keytab for the Kerberos principal and host Flume is connecting to HDFS from. This user must match the Hadoop configuration in the preceding step. For instructions, see Configuring Hadoop Security in CDH 5.
3. Configure the HDFS sink with the following configuration options:
hdfs.kerberosPrincipal - fully-qualified principal. Note: _HOST will be replaced by the hostname of the local machine (only in-between the / and @ characters)
hdfs.kerberosKeytab - location on the local machine of the keytab containing the user and host keys for the above principal
hdfs.proxyUser - the proxy user to impersonate

Example snippet (the majority of the HDFS sink configuration options have been omitted):
agent.sinks.sink-1.type = HDFS
agent.sinks.sink-1.hdfs.kerberosPrincipal = flume/[email protected]
agent.sinks.sink-1.hdfs.kerberosKeytab = /etc/flume-ng/conf/flume.keytab
agent.sinks.sink-1.hdfs.proxyUser = weblogs
agent.sinks.sink-2.type = HDFS
agent.sinks.sink-2.hdfs.kerberosPrincipal = flume/[email protected]
agent.sinks.sink-2.hdfs.kerberosKeytab = /etc/flume-ng/conf/flume.keytab
agent.sinks.sink-2.hdfs.proxyUser = applogs

In the above example, the flume Kerberos principal impersonates the user weblogs in sink-1 and the user applogs in sink-2. This will only be allowed if the Kerberos KDC authenticates the specified principal (flume in this case), and if the NameNode authorizes impersonation of the specified proxy user by the specified principal.

Limitations
At this time, Flume does not support using multiple Kerberos principals or keytabs in the same agent. Therefore,
if you want to create files as multiple users on HDFS, then impersonation must be configured, and exactly one
principal must be configured in Hadoop to allow impersonation of all desired accounts. In addition, the same
keytab path must be used across all HDFS sinks in the same agent. If you attempt to configure multiple principals
or keytabs in the same agent, Flume will emit the following error message:
Cannot use multiple kerberos principals in the same agent. Must restart agent to use
new principal or keytab.

Flume Account Requirements


This section provides an overview of the account and credential requirements for Flume to write to a Kerberized
HDFS. Note the distinctions between the Flume agent machine, DataNode machine, and NameNode machine,
as well as the flume Unix user account versus the flume Hadoop/Kerberos user account.
Each Flume agent machine that writes to HDFS (via a configured HDFS sink) needs a Kerberos principal of
the form:
flume/[email protected]

where fully.qualified.domain.name is the fully qualified domain name of the given Flume agent host
machine, and YOUR-REALM.COM is the Kerberos realm.
Each Flume agent machine that writes to HDFS does not need to have a flume Unix user account to write
files owned by the flume Hadoop/Kerberos user. Only the keytab for the flume Hadoop/Kerberos user is
required on the Flume agent machine.
DataNode machines do not need Flume Kerberos keytabs and also do not need the flume Unix user account.
TaskTracker (MRv1) or NodeManager (YARN) machines need a flume Unix user account if and only if
MapReduce jobs are being run as the flume Hadoop/Kerberos user.
The NameNode machine needs to be able to resolve the groups of the flume user. The groups of the flume
user on the NameNode machine are mapped to the Hadoop groups used for authorizing access.
The NameNode machine does not need a Flume Kerberos keytab.

Testing the Flume HDFS Sink Configuration


To test whether your Flume HDFS sink is properly configured to connect to your secure HDFS cluster, you must
run data through Flume. An easy way to do this is to configure a Netcat source, a Memory channel, and an HDFS
sink. Start Flume with that configuration, and use the nc command (available freely online and with many UNIX
distributions) to send events to the Netcat source port. The resulting events should appear on HDFS in the
configured location. If the events do not appear, check the Flume log at /var/log/flume-ng/flume.log for
any error messages related to Kerberos.
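A minimal sketch of such a test configuration, using hypothetical port, path, and host names (the agent name agent matches the default mentioned earlier):
agent.sources = netcat-src
agent.channels = mem-ch
agent.sinks = hdfs-sink
agent.sources.netcat-src.type = netcat
agent.sources.netcat-src.bind = localhost
agent.sources.netcat-src.port = 44444
agent.sources.netcat-src.channels = mem-ch
agent.channels.mem-ch.type = memory
agent.sinks.hdfs-sink.type = HDFS
agent.sinks.hdfs-sink.channel = mem-ch
agent.sinks.hdfs-sink.hdfs.path = hdfs://ha-nn-uri/user/flume/test-events
agent.sinks.hdfs-sink.hdfs.kerberosPrincipal = flume/[email protected]
agent.sinks.hdfs-sink.hdfs.kerberosKeytab = /etc/flume-ng/conf/flume.keytab
After starting Flume with this configuration, send a test event and look for the resulting file in HDFS:
$ echo "test event" | nc localhost 44444
$ hadoop fs -ls /user/flume/test-events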

Writing to a Secure HBase cluster


If you want to write to a secure HBase cluster, be aware of the following:
Flume must be configured to use Kerberos security as documented above, and HBase must be configured
to use Kerberos security as documented in HBase Security Configuration.
The hbase-site.xml file, which must be configured to use Kerberos security, must be in Flume's classpath
or HBASE_HOME/conf.
HBaseSink org.apache.flume.sink.hbase.HBaseSink supports secure HBase, but AsyncHBaseSink
org.apache.flume.sink.hbase.AsyncHBaseSink does not.
The Flume HBase Sink takes these two parameters:

kerberosPrincipal specifies the Kerberos principal to be used
kerberosKeytab specifies the path to the Kerberos keytab
These are defined as:
agent.sinks.hbaseSink.kerberosPrincipal =
flume/[email protected]
agent.sinks.hbaseSink.kerberosKeytab = /etc/flume-ng/conf/flume.keytab

If HBase is running with the AccessController coprocessor, the flume user (or whichever user the agent is
running as) must have permissions to write to the same table and the column family that the sink is configured
to write to. You can grant permissions using the grant command from HBase shell as explained in HBase
Security Configuration.
The Flume HBase Sink does not currently support impersonation; it will write to HBase as the user the agent
is being run as.
If you want to use HDFS Sink and HBase Sink to write to HDFS and HBase from the same agent respectively,
both sinks have to use the same principal and keytab. If you want to use different credentials, the sinks have
to be on different agents.
Each Flume agent machine that writes to HBase (via a configured HBase sink) needs a Kerberos principal of
the form:
flume/[email protected]

where fully.qualified.domain.name is the fully qualified domain name of the given Flume agent host
machine, and YOUR-REALM.COM is the Kerberos realm.

Hue Security Configuration


The following sections describe how to configure Hue in CDH 5 with Kerberos security, enable single sign-on with SAML, and encrypt communication between Hue and other services, among other available configuration settings.

Hue Security Enhancements on page 75


Configuring Hue to Support Hadoop Security using Kerberos on page 76
Integrating Hue with LDAP on page 79
Configuring Hue for SAML on page 83
Important:
To enable Hue to work with Kerberos security on your Hadoop cluster, make sure you perform the
installation and configuration steps in Configuring Hadoop Security in CDH 5.

Hue Security Enhancements


Enabling SSL Communication with HiveServer2
By providing a CA certificate, private key, and public certificate, Hue can communicate with HiveServer2 over
SSL. You can now configure the following properties in the [beeswax] section under [[ssl]] in the Hue
configuration file, hue.ini.
enabled

Choose to enable/disable SSL communication for this server.


Default: false

cacerts

Path to Certificate Authority certificates.


Default: /etc/hue/cacerts.pem

key

Path to the private key file.


Default: /etc/hue/key.pem

cert

Path to the public certificate file.


Default: /etc/hue/cert.pem

validate

Choose whether Hue should validate certificates received from the server.
Default: true
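Pulling these together, a hypothetical hue.ini fragment that enables SSL using the default paths listed above might look like the following sketch:
[beeswax]
[[ssl]]
enabled=true
cacerts=/etc/hue/cacerts.pem
key=/etc/hue/key.pem
cert=/etc/hue/cert.pem
validate=true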

Secure Database Connection


Connections vary depending on the database. Hue uses different clients to communicate with each database
internally. They all specify a common interface known as the DBAPI version 2 interface. Client specific options,
such as secure connectivity, can be passed through the interface. For example, for MySQL you can enable SSL
communication by specifying the options configuration property under the [desktop] > [[databases]] section in hue.ini.
[desktop]
[[databases]]

options={"ssl":{"ca":"/tmp/ca-cert.pem"}}

Session Timeout
Session timeouts can be set by specifying the ttl configuration property under the [desktop]>[[session]]
section in hue.ini.
ttl

The cookie containing the users' session ID will expire after this amount
of time in seconds.
Default: 60*60*24*14

Secure Cookies
Secure session cookies can be enabled by specifying the secure configuration property under the [desktop]>[[session]] section in hue.ini. Additionally, you can set the http-only flag for cookies containing users' session IDs.
secure

The cookie containing the users' session ID will be secure. Should only be
enabled with HTTPS.
Default: false

http-only

The cookie containing the users' session ID will use the HTTP only flag.
Default: false

Allowed HTTP Methods


You can specify the HTTP request methods that the server should respond to using the http_allowed_methods
property under the [desktop] section in hue.ini.
http_allowed_methods

Default: options,get,head,post,put,delete,connect
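For example, to narrow the allowed methods, a hue.ini fragment might look like the following sketch; the exact list to allow depends on your deployment:
[desktop]
http_allowed_methods=get,head,post,put,delete,options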

Restricting the Cipher List


Cipher list support with HTTPS can be restricted by specifying the ssl_cipher_list configuration property
under the [desktop] section in hue.ini.
ssl_cipher_list

Default: !aNULL:!eNULL:!LOW:!EXPORT:!SSLv2

URL Redirect Whitelist


Restrict the domains or pages to which Hue can redirect users. The redirect_whitelist property can be found
under the [desktop] section in hue.ini.
redirect_whitelist

For example, to restrict users to your local domain and FQDN, the following
value can be used:
^\/.*$,^https:\/\/fanyv88.com:443\/http\/www.mydomain.com\/.*$

Configuring Hue to Support Hadoop Security using Kerberos


You can configure Hue in CDH 5 to support Hadoop security on a cluster using Kerberos.

To configure the Hue server to support Hadoop security using Kerberos:
1. Create a Hue user principal in the same realm as the Hadoop cluster of the form:
kadmin: addprinc -randkey hue/[email protected]

where: hue is the principal the Hue server is running as; hue.server.fully.qualified.domain.name is the fully-qualified domain name (FQDN) of your Hue server; and YOUR-REALM.COM is the name of the Kerberos realm your Hadoop cluster is in.
2. Create a keytab file for the Hue principal using the same procedure that you used to create the keytab for
the hdfs or mapred principal for a specific host. You should name this file hue.keytab and put this keytab
file in the directory /etc/hue on the machine running the Hue server. Like all keytab files, this file should
have the most limited set of permissions possible. It should be owned by the user running the hue server
(usually hue) and should have the permission 400.
3. To test that the keytab file was created properly, try to obtain Kerberos credentials as the Hue principal using
only the keytab file. Substitute your FQDN and realm in the following command:
$ kinit -k -t /etc/hue/hue.keytab
hue/[email protected]

4. In the /etc/hue/hue.ini configuration file, add the following lines in the sections shown. Replace the
kinit_path value, /usr/kerberos/bin/kinit, shown below with the correct path on the user's system.
[desktop]
[[kerberos]]
# Path to Hue's Kerberos keytab file
hue_keytab=/etc/hue/hue.keytab
# Kerberos principal name for Hue
hue_principal=hue/FQDN@REALM
# add kinit path for non root users
kinit_path=/usr/kerberos/bin/kinit
[beeswax]
# If Kerberos security is enabled, use fully-qualified domain name (FQDN)
## hive_server_host=<FQDN of Hive Server>
# Hive configuration directory, where hive-site.xml is located
## hive_conf_dir=/etc/hive/conf
[impala]
## server_host=localhost
## impala_principal=impala/impalad.hostname.domainname.com
[search]
# URL of the Solr Server
## solr_url=http://localhost:8983/solr/
# Requires FQDN in solr_url if enabled
## security_enabled=false
[hadoop]
[[hdfs_clusters]]
[[[default]]]
# Enter the host and port on which you are running the Hadoop NameNode
namenode_host=FQDN
hdfs_port=8020
http_port=50070
security_enabled=true
# Thrift plugin port for the name node
## thrift_port=10090
# Configuration for YARN (MR2)
# ------------------------------------------------------------------------
[[yarn_clusters]]
[[[default]]]

# Enter the host on which you are running the ResourceManager
## resourcemanager_host=localhost
# Change this if your YARN cluster is Kerberos-secured
## security_enabled=false
# Thrift plug-in port for the JobTracker
## thrift_port=9290
[liboozie]
# The URL where the Oozie service runs on. This is required in order for users
# to submit jobs.
## oozie_url=http://localhost:11000/oozie
# Requires FQDN in oozie_url if enabled
## security_enabled=false

Important:
In the /etc/hue/hue.ini file, verify the following:
Make sure the jobtracker_host property is set to the fully-qualified domain name of the
host running the JobTracker. The JobTracker host name must be fully-qualified in a secured
environment.
Make sure the fs_defaultfs property under each [[hdfs_clusters]] section contains the
fully-qualified domain name of the file system access point, which is typically the NameNode.
Make sure the hive_conf_dir property under the [beeswax] section points to a directory
containing a valid hive-site.xml (either the original or a synced copy).
Make sure the FQDN specified for HiveServer2 is the same as the FQDN specified for the
hue_principal configuration property. Without this, HiveServer2 will not work with security
enabled.
5. In the /etc/hadoop/conf/core-site.xml configuration file on all of your cluster nodes, add the following
lines:
<!-- Hue security configuration -->
<property>
<name>hue.kerberos.principal.shortname</name>
<value>hue</value>
</property>
<property>
<name>hadoop.proxyuser.hue.groups</name>
<value>*</value> <!-- A group which all users of Hue belong to, or the wildcard
value "*" -->
</property>
<property>
<name>hadoop.proxyuser.hue.hosts</name>
<value>hue.server.fully.qualified.domain.name</value>
</property>

Important:
Make sure you change the /etc/hadoop/conf/core-site.xml configuration file on all of your
cluster nodes.
6. If Hue is configured to communicate to Hadoop via HttpFS, then you must add the following properties to
httpfs-site.xml:
<property>
<name>httpfs.proxyuser.hue.hosts</name>
<value>fully.qualified.domain.name</value>
</property>
<property>

<name>httpfs.proxyuser.hue.groups</name>
<value>*</value>
</property>

7. Add the following properties to the Oozie server oozie-site.xml configuration file in the Oozie configuration
directory:
<property>
<name>oozie.service.ProxyUserService.proxyuser.hue.hosts</name>
<value>*</value>
</property>
<property>
<name>oozie.service.ProxyUserService.proxyuser.hue.groups</name>
<value>*</value>
</property>

8. Restart the JobTracker to load the changes from the core-site.xml file.
$ sudo service hadoop-0.20-mapreduce-jobtracker restart

9. Restart Oozie to load the changes from the oozie-site.xml file.


$ sudo service oozie restart

10. Restart the NameNode, JobTracker, and all DataNodes to load the changes from the core-site.xml file.
$ sudo service hadoop-0.20-(namenode|jobtracker|datanode) restart

Integrating Hue with LDAP


When Hue is integrated with LDAP, users can use their existing credentials to authenticate and inherit their
existing groups transparently. There is no need to save or duplicate any employee password in Hue. There are
several other ways to authenticate with Hue such as PAM, SPNEGO, OpenID, OAuth, SAML2 and so on. This topic
details how you can configure Hue to authenticate against an LDAP directory server.
When authenticating via LDAP, Hue validates login credentials against an LDAP directory service if configured
with the LDAP authentication backend:
[desktop]
[[auth]]
backend=desktop.auth.backend.LdapBackend

The LDAP authentication backend will automatically create users that don't exist in Hue by default. Hue needs
to import users in order to properly perform the authentication. Passwords are never imported when importing
users. If you want to disable automatic import, set the create_users_on_login property under the [desktop]
> [[ldap]] section of hue.ini to false.
[desktop]
[[ldap]]
create_users_on_login=false

The purpose of disabling the automatic import is to allow only a predefined list of manually imported users to
log in.
There are two ways to authenticate with a directory service through Hue:
Search Bind
Direct Bind

You can specify the authentication mechanism using the search_bind_authentication property under the
[desktop] > [[ldap]] section of hue.ini.
search_bind_authentication

Uses search bind authentication by default. Set this property to false to
use direct bind authentication.
Default: true

Search Bind
The search bind mechanism for authenticating will perform an ldapsearch against the directory service and
bind using the found distinguished name (DN) and password provided. This is the default method of authentication
used by Hue with LDAP.
The following configuration properties under the [desktop] > [[ldap]] > [[[users]]] section in hue.ini
can be set to restrict the search process.
user_filter

General LDAP filter to restrict the search.


Default: "objectclass=*"

user_name_attr

The attribute that will be considered the username to be searched against.


Typical attributes to search for include: uid, sAMAccountName.
Default: sAMAccountName

With the above configuration, the LDAP search filter will take on the form:
(&(objectClass=*)(sAMAccountName=<user entered username>))

Important: Setting search_bind_authentication=true in hue.ini tells Hue to perform an LDAP


search using the bind credentials specified for the bind_dn and bind_password configuration
properties. Hue will start searching the subtree starting from the base DN specified for the base_dn
property. It will then search the base DN for an entry whose attribute, specified in user_name_attr,
has the same value as the short name provided on login. The search filter, defined in user_filter
will also be used to limit the search.
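As an illustration, a search-bind configuration in hue.ini might look like the following sketch; the ldap_url,
base_dn, bind_dn, and bind_password values are placeholders to be replaced with your own directory settings:
[desktop]
[[ldap]]
# Placeholder connection settings for search bind
ldap_url=ldap://ldap.example.com
search_bind_authentication=true
base_dn="dc=example,dc=com"
bind_dn="uid=hue-service,ou=People,dc=example,dc=com"
bind_password=secret
[[[users]]]
# Restrict which entries are searched and which attribute holds the username
user_filter="objectclass=*"
user_name_attr=sAMAccountName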
Direct Bind
The direct bind mechanism for authenticating will bind to the LDAP server using the username and password
provided at login.
The following configuration properties can be used to determine how Hue binds to the LDAP server. These can
be set under the [desktop] > [[ldap]] section of hue.ini.
nt_domain

The NT domain to connect to (only for use with Active Directory). This
AD-specific property allows Hue to authenticate with AD without having
to follow LDAP references to other partitions. This typically maps to the
email address of the user or the user's ID in conjunction with the domain.
If provided, Hue will use User Principal Names (UPNs) to bind to the LDAP
service.
Default: mycompany.com

ldap_username_pattern

Provides a template for the DN that will ultimately be sent to the directory
service when authenticating. The <username> parameter will be replaced
with the username provided at login.
Default: "uid=<username>,ou=People,dc=mycompany,dc=com"

Important: Setting search_bind_authentication=false in hue.ini tells Hue to perform a direct
bind to LDAP using the credentials provided (not bind_dn and bind_password specified in hue.ini).
There are two ways direct bind works depending on whether the nt_domain property is specified in
hue.ini:
nt_domain is specified: This is used to connect to an Active Directory service. In this case, the User
Principal Name (UPN) is used to perform a direct bind. Hue forms the UPN by concatenating the
short name provided at login with the nt_domain. For example, <short name>@<nt_domain>.
The ldap_username_pattern property is ignored.
nt_domain is not specified: This is used to connect to all other directory services (can handle Active
Directory, but nt_domain is the preferred way for AD). In this case, ldap_username_pattern is
used and it should take on the form cn=<username>,dc=example,dc=com where <username>
will be replaced with the username provided at login.
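A direct-bind configuration against a generic (non-Active Directory) LDAP service might therefore look like the
following sketch in hue.ini; the values are placeholders:
[desktop]
[[ldap]]
# Placeholder settings for direct bind
ldap_url=ldap://ldap.example.com
search_bind_authentication=false
# <username> is replaced with the short name typed at login
ldap_username_pattern="uid=<username>,ou=People,dc=example,dc=com"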

Importing LDAP Users and Groups


If an LDAP user needs to be part of a certain group and be given a particular set of permissions, you can import
this user with the User Admin interface in Hue.

Groups can also be imported using the User Admin interface, and users can be added to this group. As in the
image below, not only can groups be discovered via DN and rDN search, but users that are members of the group
or members of its subordinate groups can be imported as well.

You have the following options available when importing a user/group:


Distinguished name: If checked, the username provided must be a full distinguished name (for example,
uid=hue,ou=People,dc=gethue,dc=com). Otherwise, the Username provided should be a fragment of a
Relative Distinguished Name (rDN) (for example, the username hue maps to the rDN uid=hue). Hue will
perform an LDAP search using the same methods and configurations as described above. That is, Hue will
take the provided username and create a search filter using the user_filter and user_name_attr
configurations.
Create home directory: If checked, when the user is imported, their home directory in HDFS will automatically
be created if it doesn't already exist.
Important: When managing LDAP entries, the User Admin app will always perform an LDAP search
and will always use bind_dn, bind_password, base_dn, as defined in hue.ini.

Synchronizing LDAP Users and Groups


Users and groups can be synchronized with the directory service via the User Admin interface or via a command
line utility. The image from the Importing LDAP Users and Groups section uses the words Add/Sync to indicate
that when a user or group that already exists in Hue is being added, it will in fact be synchronized instead. In
the case of importing users for a particular group, new users will be imported and existing users will be
synchronized.
Note: Users that have been deleted from the directory service will not be deleted from Hue. Those
users can be manually deactivated from Hue via the User Admin interface.
Attributes Synchronized
Currently, only the first name, last name, and email address are synchronized. Hue looks for the LDAP attributes
givenName, sn, and mail when synchronizing. The user_name_attr configuration property is used to
appropriately choose the username in Hue. For instance, if user_name_attr is set to uid, then the "uid"
returned by the directory service will be used as the username of the user in Hue.

User Admin interface
The Sync LDAP users/groups button in the User Admin interface will automatically synchronize all users and
groups.
Synchronize Using a Command-Line Interface
For example, to synchronize users and groups using a command-line interface:
<hue root>/build/env/bin/hue sync_ldap_users_and_groups

LDAPS/StartTLS support
Secure communication with LDAP is provided using the SSL/TLS and StartTLS protocols. They allow Hue to
validate the directory service it is going to converse with. Hence, if a Certificate Authority certificate file is provided,
Hue will communicate using LDAPS. You can specify the path to the CA certificate as follows:
[desktop]
[[ldap]]
ldap_cert=/etc/hue/ca.crt

The StartTLS protocol can be used as well:


[desktop]
[[ldap]]
use_start_tls=true

Configuring Hue for SAML


This section describes the configuration changes required to use Hue with SAML 2.0 (Security Assertion Markup
Language) to enable single sign-on (SSO) authentication.
The SAML 2.0 Web Browser SSO profile has three components: a Service Provider, a User Agent and an Identity
Provider. In this case, Hue is the Service Provider (SP), you can use an Identity Provider (IdP) of your choice, and
you are the user acting through your browser (User Agent). When a user requests access to an application, Hue
uses your browser to send an authentication request to the Identity Provider, which then authenticates the user
and redirects them back to Hue.
This blog post guides users through setting up SSO with Hue, using the SAML backend and Shibboleth as the
Identity Provider.
Note: The following instructions assume you already have an Identity Provider set up and running.

Step 1: Install swig and openssl packages


Install swig and openssl. For example, on RHEL systems, use the following commands:
yum install swig
yum install openssl

Step 2: Install libraries to support SAML in Hue


Install the djangosaml2 and pysaml2 libraries to support SAML in Hue. These libraries depend on the
xmlsec1 package being installed and available on the machine for Hue to use. Follow these instructions to install
the xmlsec1 package on your system.
RHEL, CentOS and SLES:
For RHEL, CentOS and SLES systems, the xmlsec1 package is available for download from the EPEL repository.
In order to install packages from the EPEL repository, first download the appropriate rpm package to your
machine, substituting the version in the package URL with the one required for your system. For example, use
the following commands for CentOS 5 or RHEL 5:
rpm -Uvh http://download.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm
yum install xmlsec1

Oracle Linux:
For Oracle Linux systems, download the xmlsec1 package from http://www.aleksey.com/xmlsec/ and execute
the following commands:
tar -xvzf xmlsec1-<version>.tar.gz
cd xmlsec1-<version>
./configure && make
sudo make install

Important: The xmlsec1 package must be executable by the user running Hue.
You should now be able to install djangosaml2 and pysaml2 on your machines.
build/env/bin/pip install -e git+https://github.com/abec/pysaml2@HEAD#egg=pysaml2
build/env/bin/pip install -e git+https://github.com/abec/djangosaml2@HEAD#egg=djangosaml2

Step 3: Update the Hue configuration file


Several configuration parameters need to be updated in Hue's configuration file, hue.ini, to enable support for
SAML. The table below describes the available parameters for SAML in hue.ini under the [libsaml]
section.
Parameter               Description

xmlsec_binary           Path to the xmlsec_binary, an executable used to sign, verify, encrypt
                        and decrypt SAML requests and assertions. This program should be
                        executable by the user running Hue.

create_users_on_login   Create Hue users received in assertion response upon successful login.
                        The value for this parameter can be either "true" or "false".

required_attributes     Attributes Hue asks for from the IdP. This is a comma-separated list
                        of attributes. For example, uid, email and so on.

optional_attributes     Optional attributes Hue can ask for from the IdP. Also a
                        comma-separated list of attributes.

metadata_file           Path to the IdP metadata copied to a local file. This file should be
                        readable.

key_file                Path to the private key used to encrypt the metadata. File format: .PEM

cert_file               Path to the X.509 certificate to be sent along with the encrypted
                        metadata. File format: .PEM

user_attribute_mapping  Mapping from attributes received from the IdP to Hue's django user
                        attributes. For example, {'uid':'username', 'email':'email'}.

logout_requests_signed  Have Hue-initiated logout requests be signed and provide a certificate.

Step 3a: Update the SAML metadata file
Update the metadata file pointed to by your Hue configuration file, hue.ini. Check your IdP documentation for
details on how to procure the XML metadata and paste it into the <metadata_file_name>.xml file at the
location specified by the configuration parameter metadata_file.
For example, if you were using the Shibboleth IdP, you would visit https://<IdPHOST>:8443/idp/shibboleth,
copy the metadata content available there and paste it into the Hue metadata file.
Note:
You may have to edit the content copied over from your IdP's metadata file in case of missing fields,
such as port numbers (8443), in URLs that point to the IdP.

Step 3b: Private key and certificate files


To enable Hue to communicate with the IdP, you will need to specify the location of a private key, in the key_file
property, that can be used to sign requests sent to the IdP. You will also need to specify the location of the
certificate file, in the cert_file property, which you will use to verify and decrypt messages from the IdP.
Note: The key and certificate files specified by the key_file and cert_file parameters must be
.PEM files.

Step 3c: Configure Hue to use SAML Backend


To enable SAML to allow user logins and create users, update the backend configuration property in hue.ini
to use the SAML authentication backend. You will find the backend property in the [[auth]] sub-section under
[desktop].
backend=libsaml.backend.SAML2Backend

Here is an example configuration of the [libsaml] section from hue.ini.

[libsaml]
xmlsec_binary=/usr/local/bin/xmlsec1
create_users_on_login=true
metadata_file=/etc/hue/saml/metadata.xml
key_file=/etc/hue/saml/key.pem
cert_file=/etc/hue/saml/cert.pem
logout_requests_signed=true

Step 4: Restart the Hue server


Use the following command to restart the Hue server.
sudo service hue restart

Oozie Security Configuration


This section describes how to configure Oozie in CDH 5 with Kerberos security on a Hadoop cluster:

Configuring the Oozie Server to Support Kerberos Security on page 87


Configuring Oozie HA with Kerberos on page 88
Appendix H - Using a Web Browser to Access an URL Protected by Kerberos HTTP SPNEGO on page 207
Configuring Oozie to use SSL (HTTPS) on page 89
Important:
To enable Oozie to work with Kerberos security on your Hadoop cluster, make sure you perform the
installation and configuration steps in Configuring Hadoop Security in CDH 5. Also note that when
Kerberos security is enabled in Oozie, a web browser that supports Kerberos HTTP SPNEGO is required
to access the Oozie web-console (for example, Firefox, Internet Explorer or Chrome).

Important:
If the NameNode, Secondary NameNode, DataNode, JobTracker, TaskTrackers, ResourceManager,
NodeManagers, HttpFS, or Oozie services are configured to use Kerberos HTTP SPNEGO authentication,
and two or more of these services are running on the same host, then all of the running services must
use the same HTTP principal and keytab file used for their HTTP endpoints.

Configuring the Oozie Server to Support Kerberos Security


1. Create an Oozie service user principal using the syntax:
oozie/<fully.qualified.domain.name>@<YOUR-REALM>. This principal is used to authenticate with the
Hadoop cluster, where fully.qualified.domain.name is the host where the Oozie server is running and
YOUR-REALM is the name of your Kerberos realm.
kadmin: addprinc -randkey oozie/[email protected]

2. Create an HTTP service user principal using the syntax:
HTTP/<fully.qualified.domain.name>@<YOUR-REALM>. This principal is used to authenticate user
requests coming to the Oozie web-services, where fully.qualified.domain.name is the host where the
Oozie server is running and YOUR-REALM is the name of your Kerberos realm.
kadmin: addprinc -randkey HTTP/[email protected]

Important:
The HTTP/ component of the HTTP service user principal must be upper case as shown in the
syntax and example above.
3. Create keytab files with both principals.
$ kadmin
kadmin: xst -k oozie.keytab oozie/fully.qualified.domain.name
kadmin: xst -k http.keytab HTTP/fully.qualified.domain.name

4. Merge the two keytab files into a single keytab file:
$ ktutil
ktutil: rkt oozie.keytab
ktutil: rkt http.keytab
ktutil: wkt oozie-http.keytab

5. Test that credentials in the merged keytab file work. For example:
$ klist -e -k -t oozie-http.keytab

6. Copy the oozie-http.keytab file to the Oozie configuration directory. The owner of the oozie-http.keytab
file should be the oozie user and the file should have owner-only read permissions.
7. Edit the Oozie server oozie-site.xml configuration file in the Oozie configuration directory by setting the
following properties:
Important: You must restart the Oozie server to have the configuration changes take effect.
oozie.service.HadoopAccessorService.kerberos.enabled
    true

local.realm
    <REALM>

oozie.service.HadoopAccessorService.keytab.file
    /etc/oozie/conf/oozie-http.keytab for a package installation, or
    <EXPANDED_DIR>/conf/oozie-http.keytab for a tarball installation

oozie.service.HadoopAccessorService.kerberos.principal
    oozie/<fully.qualified.domain.name>@<YOUR-REALM.COM>

oozie.authentication.type
    kerberos

oozie.authentication.kerberos.principal
    HTTP/<fully.qualified.domain.name>@<YOUR-REALM.COM>

oozie.authentication.kerberos.name.rules
    Use the value configured for hadoop.security.auth_to_local in core-site.xml
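As an illustrative sketch, these entries might appear in oozie-site.xml as follows; oozie.server.example.com and
YOUR-REALM.COM are placeholders, and the name.rules property is omitted here because its value must match
your existing hadoop.security.auth_to_local setting:
<property>
<name>oozie.service.HadoopAccessorService.kerberos.enabled</name>
<value>true</value>
</property>
<property>
<name>local.realm</name>
<value>YOUR-REALM.COM</value>
</property>
<property>
<name>oozie.service.HadoopAccessorService.keytab.file</name>
<value>/etc/oozie/conf/oozie-http.keytab</value>
</property>
<property>
<name>oozie.service.HadoopAccessorService.kerberos.principal</name>
<value>oozie/[email protected]</value>
</property>
<property>
<name>oozie.authentication.type</name>
<value>kerberos</value>
</property>
<property>
<name>oozie.authentication.kerberos.principal</name>
<value>HTTP/[email protected]</value>
</property>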

Configuring Oozie HA with Kerberos


In CDH 5, you can configure multiple active Oozie servers against the same database, providing high availability
for Oozie. For instructions on setting up Oozie HA, see About Oozie High Availability.
Let's assume you have three hosts running Oozie servers (host1.example.com, host2.example.com, and
host3.example.com) and the Load Balancer running on oozie.example.com. The Load Balancer directs traffic
to the Oozie servers: host1, host2 and host3. For such a configuration, assuming your Kerberos realm is
EXAMPLE.COM, create the following Kerberos principals:

oozie/[email protected]
oozie/[email protected]
oozie/[email protected]
HTTP/[email protected]
HTTP/[email protected]
HTTP/[email protected]
HTTP/[email protected]

On each of the hosts, host1, host2 and host3, create a keytab file with its corresponding oozie and HTTP principals
from the list above. All keytab files should also have the load balancer's HTTP principal. Hence, each keytab file
should have 3 principals in all.
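For example, on host1.example.com the keytab might be built as in the following sketch; the file name
oozie.keytab is an assumption, and repeated xst commands are used because xst appends entries to an existing
keytab (you could equally extract separate keytabs and merge them with ktutil as shown earlier):
$ kadmin
kadmin: xst -k oozie.keytab oozie/host1.example.com
kadmin: xst -k oozie.keytab HTTP/host1.example.com
kadmin: xst -k oozie.keytab HTTP/oozie.example.com
$ klist -e -k -t oozie.keytab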
Edit the following property in the Oozie server configuration file, oozie-site.xml:
<property>
<name>oozie.authentication.kerberos.principal</name>
<value>*</value>
</property>

Configuring Oozie to use SSL (HTTPS)


Important:
The default HTTPS configuration will cause all Oozie URLs to use HTTPS (except for the JobTracker
callback URLs, but this is okay because Oozie doesn't inherently trust the callbacks anyway; they are
used as hints). This is to simplify configuration (no changes are needed outside of Oozie).
You can use either a certificate from a Certificate Authority or a Self-Signed Certificate. Please follow the first
or second section below accordingly; afterwards, all steps are the same.
To use a Self-Signed Certificate
There are many ways to create a Self-Signed Certificate; this is just one way. We will be using the keytool
program, which is included with your JRE. If it's not on your path, you should be able to find it in
$JAVA_HOME/bin.
1. Run the following command to create a keystore file:
sudo -u oozie keytool -genkey -alias tomcat -keyalg RSA

The keystore file will be named .keystore and located in the oozie user's home directory.
2. You will now be asked a series of questions in an interactive prompt. Below is a sample of what this looks
like, along with some responses:
$ sudo -u oozie keytool -genkey -alias tomcat -keyalg RSA
Enter keystore password: password
Re-enter new password: password
What is your first and last name?
[Unknown]: oozie.server.hostname
What is the name of your organizational unit?
[Unknown]: Engineering
What is the name of your organization?
[Unknown]: A Great Company
What is the name of your City or Locality?
[Unknown]: Anywhere
What is the name of your State or Province?
[Unknown]: CA
What is the two-letter country code for this unit?
[Unknown]: US
Is CN=oozie.server.hostname, OU=Engineering, O=A Great Company, L=Anywhere, ST=CA,
C=US correct?
[no]: yes
Enter key password for <tomcat>
(RETURN if same as keystore password):

Important:
The password you enter for "keystore password" and "key password for <tomcat>" must be the
same. If you want to use a password other than "password", you will need to make an additional
change later when configuring the Oozie Server.
Important:
The answer to "What is your first and last name?" (i.e. "CN") must be the hostname of the machine
where the Oozie Server will be running.

3. Run the following command to export a certificate file from the keystore file:
sudo -u oozie keytool -exportcert -alias tomcat -file
path/to/where/I/want/my/certificate.cert

To use a Certificate from a Certificate Authority


1. Make a request to a Certificate Authority in order to obtain a proper Certificate; please consult a Certificate
Authority on this procedure.
2. Once you have your .cert file, run the following command to create a keystore file from your certificate:
sudo -u oozie keytool -import -alias tomcat -file path/to/certificate.cert

The keystore file will be named .keystore and located in the oozie user's home directory.
Configure the Oozie Server to use SSL (HTTPS)
1. Stop Oozie by running
sudo /sbin/service oozie stop

2. To enable SSL, set the MapReduce version that the Oozie server should work with using the alternatives
command.
Note: The alternatives command is only available on RHEL systems. For SLES, Ubuntu and
Debian systems, the command is update-alternatives.
For RHEL systems, to use YARN with SSL:
alternatives --set oozie-tomcat-conf /etc/oozie/tomcat-conf.https

For RHEL systems, to use MapReduce (MRv1) with SSL:


alternatives --set oozie-tomcat-conf /etc/oozie/tomcat-conf.https.mr1

Important:
The OOZIE_HTTPS_KEYSTORE_PASS variable must be the same as the password used when creating
the keystore file. If you used a password other than password, you'll have to change the value of
the OOZIE_HTTPS_KEYSTORE_PASS variable in this file.
3. Start Oozie by running
sudo /sbin/service oozie start

Configure the Oozie Client to connect using SSL (HTTPS)


This section only applies if you are using a Self-Signed Certificate.

Important:
The following steps must be done on every machine where you intend to use the Oozie Client. This
is not necessary if you only want to use the Web UI from a browser.
The first two steps are only necessary if you used a Self-Signed Certificate.
1. Copy or download the .cert file onto the client machine
2. Run the following command to import the certificate into the JRE's keystore. This will allow any Java program,
including the Oozie client, to connect to the Oozie Server using your certificate.
sudo keytool -import -alias tomcat -file path/to/certificate.cert -keystore
${JRE_cacerts}

Where ${JRE_cacerts} is the path to the JRE's certs file. Its location may differ depending on the operating
system, but it's typically called cacerts and located at ${JAVA_HOME}/lib/security/cacerts, though it may be
under a different directory in ${JAVA_HOME} (you may want to create a backup copy of this file first). The
default password is changeit.
3. When using the Oozie Client, you will need to use https://oozie.server.hostname:11443/oozie instead of
http://oozie.server.hostname:11000/oozie; Java will not automatically redirect from the http address to
the https address.
Connect to the Oozie Web UI using SSL (HTTPS)
Use https://oozie.server.hostname:11443/oozie, though most browsers should automatically redirect you if
you use http://oozie.server.hostname:11000/oozie.
Important:
If using a Self-Signed Certificate, your browser will warn you that it can't verify the certificate or
something similar. You will probably have to add your certificate as an exception.

HttpFS Security Configuration


This section describes how to configure HttpFS in CDH 5 with Kerberos security on a Hadoop cluster:
Configuring the HttpFS Server to Support Kerberos Security on page 93
Using curl to access an URL Protected by Kerberos HTTP SPNEGO on page 94
For more information about HttpFS, see
http://archive.cloudera.com/cdh5/cdh/5/hadoop/hadoop-hdfs-httpfs/index.html.
Important:
To enable HttpFS to work with Kerberos security on your Hadoop cluster, make sure you perform the
installation and configuration steps in Configuring Hadoop Security in CDH 5.

Important:
If the NameNode, Secondary NameNode, DataNode, JobTracker, TaskTrackers, ResourceManager,
NodeManagers, HttpFS, or Oozie services are configured to use Kerberos HTTP SPNEGO authentication,
and two or more of these services are running on the same host, then all of the running services must
use the same HTTP principal and keytab file used for their HTTP endpoints.

Configuring the HttpFS Server to Support Kerberos Security


1. Create an HttpFS service user principal that is used to authenticate with the Hadoop cluster. The syntax of
the principal is: httpfs/<fully.qualified.domain.name>@<YOUR-REALM>, where
fully.qualified.domain.name is the host where the HttpFS server is running and YOUR-REALM is the name
of your Kerberos realm.
kadmin: addprinc -randkey httpfs/[email protected]

2. Create an HTTP service user principal that is used to authenticate user requests coming to the HttpFS HTTP
web-services. The syntax of the principal is: HTTP/<fully.qualified.domain.name>@<YOUR-REALM>,
where fully.qualified.domain.name is the host where the HttpFS server is running and YOUR-REALM is
the name of your Kerberos realm.
kadmin: addprinc -randkey HTTP/[email protected]

Important:
The HTTP/ component of the HTTP service user principal must be upper case as shown in the
syntax and example above.
3. Create keytab files with both principals.
$ kadmin
kadmin: xst -k httpfs.keytab httpfs/fully.qualified.domain.name
kadmin: xst -k http.keytab HTTP/fully.qualified.domain.name

4. Merge the two keytab files into a single keytab file:
$ ktutil
ktutil: rkt httpfs.keytab
ktutil: rkt http.keytab
ktutil: wkt httpfs-http.keytab

5. Test that credentials in the merged keytab file work. For example:
$ klist -e -k -t httpfs-http.keytab

6. Copy the httpfs-http.keytab file to the HttpFS configuration directory. The owner of the
httpfs-http.keytab file should be the httpfs user and the file should have owner-only read permissions.
7. Edit the HttpFS server httpfs-site.xml configuration file in the HttpFS configuration directory by setting
the following properties:
httpfs.authentication.type
    kerberos

httpfs.hadoop.authentication.type
    kerberos

httpfs.authentication.kerberos.principal
    HTTP/<HTTPFS-HOSTNAME>@<YOUR-REALM.COM>

httpfs.authentication.kerberos.keytab
    /etc/hadoop-httpfs/conf/httpfs-http.keytab

httpfs.hadoop.authentication.kerberos.principal
    httpfs/<HTTPFS-HOSTNAME>@<YOUR-REALM.COM>

httpfs.hadoop.authentication.kerberos.keytab
    /etc/hadoop-httpfs/conf/httpfs-http.keytab

httpfs.authentication.kerberos.name.rules
    Use the value configured for hadoop.security.auth_to_local in core-site.xml
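A sketch of the corresponding httpfs-site.xml entries follows; httpfs.server.example.com and YOUR-REALM.COM
are placeholders, and the name.rules property is omitted here because its value must match your existing
hadoop.security.auth_to_local setting:
<property>
<name>httpfs.authentication.type</name>
<value>kerberos</value>
</property>
<property>
<name>httpfs.hadoop.authentication.type</name>
<value>kerberos</value>
</property>
<property>
<name>httpfs.authentication.kerberos.principal</name>
<value>HTTP/[email protected]</value>
</property>
<property>
<name>httpfs.authentication.kerberos.keytab</name>
<value>/etc/hadoop-httpfs/conf/httpfs-http.keytab</value>
</property>
<property>
<name>httpfs.hadoop.authentication.kerberos.principal</name>
<value>httpfs/[email protected]</value>
</property>
<property>
<name>httpfs.hadoop.authentication.kerberos.keytab</name>
<value>/etc/hadoop-httpfs/conf/httpfs-http.keytab</value>
</property>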

Important:
You must restart the HttpFS server to have the configuration changes take effect.

Using curl to access an URL Protected by Kerberos HTTP SPNEGO


Important:
Your version of curl must support GSS and be capable of running curl -V.
To configure curl to access an URL protected by Kerberos HTTP SPNEGO:
1. Run curl -V:
$ curl -V
curl 7.19.7 (universal-apple-darwin10.0) libcurl/7.19.7 OpenSSL/0.9.8l
zlib/1.2.3
Protocols: tftp ftp telnet dict ldap http file https ftps
Features: GSS-Negotiate IPv6 Largefile NTLM SSL libz

2. Login to the KDC using kinit.


$ kinit
Please enter the password for tucu@LOCALHOST:

3. Use curl to fetch the protected URL:
$ curl --negotiate -u : -b ~/cookiejar.txt -c ~/cookiejar.txt
http://localhost:14000/webhdfs/v1/?op=liststatus

where: The --negotiate option enables SPNEGO in curl. The -u : option is required but the user name is
ignored (the principal that has been specified for kinit is used). The -b and -c options are used to store
and send HTTP cookies.

Configuring HttpFS to use SSL (HTTPS)


You can use either a certificate from a Certificate Authority or a Self-Signed Certificate. Please follow the first
or second section below accordingly; afterwards, all steps are the same.
To use a Self-Signed Certificate
There are many ways to create a Self-Signed Certificate; this is just one way. We will be using the keytool
program, which is included with your JRE. If it's not on your path, you should be able to find it in
$JAVA_HOME/bin.
1. Run the following command to create a keystore file:
sudo -u httpfs keytool -genkey -alias tomcat -keyalg RSA

The keystore file will be named .keystore and located in the httpfs user's home directory.
2. You will now be asked a series of questions in an interactive prompt. Below is a sample of what this looks
like, along with some responses:
$ sudo -u httpfs keytool -genkey -alias tomcat -keyalg RSA
Enter keystore password: password
Re-enter new password: password
What is your first and last name?
[Unknown]: httpfs.server.hostname
What is the name of your organizational unit?
[Unknown]: Engineering
What is the name of your organization?
[Unknown]: A Great Company
What is the name of your City or Locality?
[Unknown]: Anywhere
What is the name of your State or Province?
[Unknown]: CA
What is the two-letter country code for this unit?
[Unknown]: US
Is CN=httpfs.server.hostname, OU=Engineering, O=A Great Company, L=Anywhere, ST=CA,
C=US correct?
[no]: yes
Enter key password for <tomcat>
(RETURN if same as keystore password):

Important:
The password you enter for "keystore password" and "key password for <tomcat>" must be the
same. If you want to use a password other than "password", you will need to make an additional
change later when configuring the HttpFS Server.

Important:
The answer to "What is your first and last name?" (i.e. "CN") must be the hostname of the machine
where the HttpFS Server will be running.

3. Run the following command to export a certificate file from the keystore file:
sudo -u httpfs keytool -exportcert -alias tomcat -file
path/to/where/I/want/my/certificate.cert

To use a Certificate from a Certificate Authority


1. Make a request to a Certificate Authority in order to obtain a proper Certificate; please consult a Certificate
Authority on this procedure.
2. Once you have your .cert file, run the following command to create a keystore file from your certificate:
sudo -u httpfs keytool -import -alias tomcat -file path/to/certificate.cert

The keystore file will be named .keystore and located in the httpfs user's home directory.
Configure the HttpFS Server to use SSL (HTTPS)
1. Stop HttpFS by running
sudo /sbin/service hadoop-httpfs stop

2. To enable SSL, change which configuration the HttpFS server should work with using the alternatives
command.
Note: The alternatives command is only available on RHEL systems. For SLES, Ubuntu and
Debian systems, the command is update-alternatives.
For RHEL systems, to use SSL:
alternatives --set hadoop-httpfs-tomcat-conf /etc/hadoop-httpfs/tomcat-conf.https

Important:
The HTTPFS_SSL_KEYSTORE_PASS variable must be the same as the password used when creating
the keystore file. If you used a password other than password, you'll have to change the value of
the HTTPFS_SSL_KEYSTORE_PASS variable in /etc/hadoop-httpfs/conf/httpfs-env.sh.
3. Start HttpFS by running
sudo /sbin/service hadoop-httpfs start

Configure the HttpFS Client to connect using SSL (HTTPS)


This section only applies if you are using a Self-Signed Certificate.
Important:
The following steps must be done on every machine where you intend to use the HttpFS Client. This
is not necessary if you only want to use the Web UI from a browser.
The first two steps are only necessary if you used a Self-Signed Certificate.
1. Copy or download the .cert file onto the client machine

2. Run the following command to import the certificate into the JRE's keystore. This will allow any Java program,
including the HttpFS client, to connect to the HttpFS Server using your certificate.
sudo keytool -import -alias tomcat -file path/to/certificate.cert -keystore
${JRE_cacerts}

Where ${JRE_cacerts} is the path to the JRE's certs file. Its location may differ depending on the operating
system, but it's typically called cacerts and located at ${JAVA_HOME}/lib/security/cacerts, though it may be
under a different directory in ${JAVA_HOME} (you may want to create a backup copy of this file first). The
default password is changeit.
3. When using the HttpFS Client, you will need to use
https://<httpfs_server_hostname>:14000/webhdfs/v1/ instead of
http://<httpfs_server_hostname>:14000/webhdfs/v1/; Java will not automatically redirect from the
http address to the https address.

Connect to the HttpFS Web UI using SSL (HTTPS)


Use https://<httpfs_server_hostname>:14000/webhdfs/v1/, though most browsers should automatically
redirect you if you use http://<httpfs_server_hostname>:14000/webhdfs/v1/.
Important:
If using a Self-Signed Certificate, your browser will warn you that it can't verify the certificate or
something similar. You will probably have to add your certificate as an exception.

HBase Security Configuration


There are two major parts in the process of configuring HBase security:
1. Configure HBase Authentication: You must establish a mechanism for HBase servers and clients to securely
identify themselves with HDFS, ZooKeeper, and each other (called authentication). This ensures that, for
example, a host claiming to be an HBase Region Server or a particular HBase client is in fact who it claims
to be.
2. Configure HBase Authorization: You must establish rules for the resources that clients are allowed to access
(called authorization).
For more background information, see this blog post.
The following sections describe how to use Apache HBase and CDH 5 with Kerberos security on your Hadoop
cluster:

Configuring HBase Authentication on page 99


Configuring HBase Authorization on page 102
Configuring Secure HBase Replication on page 105
Configuring the HBase Client TGT Renewal Period on page 106
Important:
To enable HBase to work with Kerberos security on your Hadoop cluster, make sure you perform the
installation and configuration steps in Configuring Hadoop Security in CDH 5 and ZooKeeper Security
Configuration.

Note:
These instructions have been tested with CDH and MIT Kerberos 5 only.

Important:
Although an HBase Thrift server can connect to a secured Hadoop cluster, access is not secured from
clients to the HBase Thrift server.

Configuring HBase Authentication


Here are the two high-level steps for configuring HBase authentication:
Step 1: Configure HBase Servers to Authenticate with a Secure HDFS Cluster on page 99
Step 2: Configure HBase Servers and Clients to Authenticate with a Secure ZooKeeper on page 101.

Step 1: Configure HBase Servers to Authenticate with a Secure HDFS Cluster


To configure HBase servers to authenticate with a secure HDFS cluster, you must do the following tasks:
Enable HBase Authentication
Configure HBase's Kerberos Principals
Enabling HBase Authentication
To enable HBase Authentication, you must do the following two steps:
1. On every HBase server host (Master or Region Server), add the following properties to the hbase-site.xml
configuration file:
<property>
<name>hbase.security.authentication</name>
<value>kerberos</value>
</property>
<property>
<name>hbase.rpc.engine</name>
<value>org.apache.hadoop.hbase.ipc.SecureRpcEngine</value>
</property>

2. On every HBase client host, add the same properties to the hbase-site.xml configuration file:
<property>
<name>hbase.security.authentication</name>
<value>kerberos</value>
</property>
<property>
<name>hbase.rpc.engine</name>
<value>org.apache.hadoop.hbase.ipc.SecureRpcEngine</value>
</property>

Configuring HBase's Kerberos Principals


In order to run HBase on a secure HDFS cluster, HBase must authenticate itself to the HDFS services. HBase
acts as a Kerberos principal and needs Kerberos credentials to interact with the Kerberos-enabled HDFS daemons.
Authenticating a service can be done using a keytab file. This file contains a key which allows the service to
authenticate to the Kerberos Key Distribution Center (KDC).
To configure HBase's Kerberos principals:
1. Create a service principal for the HBase server using the syntax:
hbase/<fully.qualified.domain.name>@<YOUR-REALM>. This principal is used to authenticate the HBase
server with the HDFS services. Cloudera recommends using hbase as the username portion of this principal.
kadmin: addprinc -randkey hbase/[email protected]

where fully.qualified.domain.name is the host where the HBase server is running and YOUR-REALM is the
name of your Kerberos realm.
2. Create a keytab file for the HBase server.
$ kadmin
kadmin: xst -k hbase.keytab hbase/fully.qualified.domain.name

3. Copy the hbase.keytab file to the /etc/hbase/conf directory on the HBase server host. The owner of the
hbase.keytab file should be the hbase user and the file should have owner-only read permissions. That is,
assign the file 0400 permissions and make it owned by hbase:hbase.
-r--------  1 hbase  hbase  1343 2012-01-09 10:39 hbase.keytab

4. To test that the keytab file was created properly, try to obtain Kerberos credentials as the HBase principal
using only the keytab file. Substitute your fully.qualified.domain.name and realm in the following
command:
$ kinit -k -t /etc/hbase/conf/hbase.keytab hbase/[email protected]

5. In the /etc/hbase/conf/hbase-site.xml configuration file on all of your cluster hosts running the HBase
daemon, add the following lines:
<property>
<name>hbase.regionserver.kerberos.principal</name>
<value>hbase/[email protected]</value>
</property>
<property>
<name>hbase.regionserver.keytab.file</name>
<value>/etc/hbase/conf/hbase.keytab</value>
</property>
<property>
<name>hbase.master.kerberos.principal</name>
<value>hbase/[email protected]</value>
</property>
<property>
<name>hbase.master.keytab.file</name>
<value>/etc/hbase/conf/hbase.keytab</value>
</property>

Important:
Make sure you change the /etc/hbase/conf/hbase-site.xml configuration file on all of your
cluster hosts that are running the HBase daemon.

Step 2: Configure HBase Servers and Clients to Authenticate with a Secure ZooKeeper
In order to run a secure HBase, you must also use a secure ZooKeeper. To use your secure ZooKeeper, each
HBase host machine (Master, Region Server, and client) must have a principal that allows it to authenticate with
your secure ZooKeeper ensemble. Note, this HBase section assumes that your secure ZooKeeper is already
configured according to the instructions in the ZooKeeper Security Configuration section and not managed by
HBase.
This HBase section also assumes that you have successfully completed the previous steps, and already have a
principal and keytab file created and in place for every HBase server and client.
Configure HBase JVMs (all Masters, Region Servers, and clients) to use JAAS
1. On each host, set up a Java Authentication and Authorization Service (JAAS) by creating a
/etc/hbase/conf/zk-jaas.conf file that contains the following:
Client {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
useTicketCache=false
keyTab="/etc/hbase/conf/hbase.keytab"
principal="hbase/fully.qualified.domain.name@<YOUR-REALM>";
};

2. Modify the hbase-env.sh file on HBase server and client hosts to include the following:
export HBASE_OPTS="$HBASE_OPTS
-Djava.security.auth.login.config=/etc/hbase/conf/zk-jaas.conf"
export HBASE_MANAGES_ZK=false

Configure the HBase Servers (Masters and Region Servers) to use Authentication to connect to ZooKeeper
1. Update your hbase-site.xml on each HBase server host with the following properties:
<configuration>
<property>
<name>hbase.zookeeper.quorum</name>
<value>$ZK_NODES</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
</configuration>

where $ZK_NODES is the comma-separated list of hostnames of the ZooKeeper Quorum hosts that you
configured according to the instructions in ZooKeeper Security Configuration.
2. Add the following lines to the ZooKeeper configuration file zoo.cfg:
kerberos.removeHostFromPrincipal=true
kerberos.removeRealmFromPrincipal=true

Start HBase
If the configuration worked, you should see something similar to the following in the HBase Master and Region
Server logs when you start the cluster:
INFO zookeeper.ZooKeeper: Initiating client connection,
connectString=ZK_QUORUM_SERVER:2181 sessionTimeout=180000 watcher=master:60000
INFO zookeeper.ClientCnxn: Opening socket connection to server /ZK_QUORUM_SERVER:2181
INFO zookeeper.RecoverableZooKeeper: The identifier of this process is
PID@ZK_QUORUM_SERVER
INFO zookeeper.Login: successfully logged in.
INFO client.ZooKeeperSaslClient: Client will use GSSAPI as SASL mechanism.
INFO zookeeper.Login: TGT refresh thread started.
INFO zookeeper.ClientCnxn: Socket connection established to ZK_QUORUM_SERVER:2181,
initiating session
INFO zookeeper.Login: TGT valid starting at:
Sun Apr 08 22:43:59 UTC 2012
INFO zookeeper.Login: TGT expires:
Mon Apr 09 22:43:59 UTC 2012
INFO zookeeper.Login: TGT refresh sleeping until: Mon Apr 09 18:30:37 UTC 2012
INFO zookeeper.ClientCnxn: Session establishment complete on server
ZK_QUORUM_SERVER:2181, sessionid = 0x134106594320000, negotiated timeout = 180000

Configuring HBase Authorization


After you have configured HBase authentication as described here, you must establish authorization rules for
the resources that a client is allowed to access. In this release, HBase only allows you to establish authorization
rules on a column, table, or namespace level. Cell-level authorization is an experimental feature.

Understanding HBase Access Levels


HBase access levels are granted independently of each other and allow for different types of operations at a
given scope.

Read (R) - can read data at the given scope


Write (W) - can write data at the given scope
Execute (X) - can execute coprocessor endpoints at the given scope
Create (C) - can create tables or drop tables (even those they did not create) at the given scope
Admin (A) - can perform cluster operations such as balancing the cluster or assigning regions at the given
scope

The possible scopes are:
Superuser - superusers can perform any operation available in HBase, to any resource. The user who runs
HBase on your cluster is a superuser, as are any principals assigned to the configuration property
hbase.superuser in hbase-site.xml on the HMaster.
Global - permissions granted at global scope allow the admin to operate on all tables of the cluster.
Namespace - permissions granted at namespace scope apply to all tables within a given namespace.
Table - permissions granted at table scope apply to data or metadata within a given table.
ColumnFamily - permissions granted at ColumnFamily scope apply to cells within that ColumnFamily.
Cell - permissions granted at Cell scope apply to that exact cell coordinate. This allows for policy evolution
along with data. To change an ACL on a specific cell, write an updated cell with new ACL to the precise
coordinates of the original. If you have a multi-versioned schema and want to update ACLs on all visible
versions, you'll need to write new cells for all visible versions. The application has complete control over policy
evolution. The exception is append and increment processing. Appends and increments can carry an ACL
in the operation. If one is included in the operation, then it will be applied to the result of the append or
increment. Otherwise, the ACL of the existing cell being appended to or incremented is preserved.
The combination of access levels and scopes creates a matrix of possible access levels that can be granted to
a user. In a production environment, it is useful to think of access levels in terms of what is needed to do a
specific job. The following list describes appropriate access levels for some common types of HBase users. It is
important not to grant more access than is required for a given user to perform their required tasks.
Superusers - In a production system, only the HBase user should have superuser access. In a development
environment, an administrator may need superuser access in order to quickly control and manage the cluster.
However, this type of administrator should usually be a Global Admin rather than a superuser.
Global Admins - A global admin can perform tasks and access every table in HBase. In a typical production
environment, an admin should not have Read or Write permissions to data within tables.
A global admin with Admin permissions can perform cluster-wide operations on the cluster, such as
balancing, assigning or unassigning regions, or calling an explicit major compaction. This is an operations
role.
A global admin with Create permissions can create or drop any table within HBase. This is more of a
DBA-type role.
In a production environment, it is likely that different users will have only one of Admin and Create
permissions.
Warning:
In the current implementation, a Global Admin with Admin permission can grant himself Read
and Write permissions on a table and gain access to that table's data. For this reason, only grant
Global Admin permissions to trusted users who actually need them.
Also be aware that a Global Admin with Create permission can perform a Put operation on the
ACL table, simulating a grant or revoke and circumventing the authorization check for Global
Admin permissions. This issue (but not the first one) is fixed in CDH 5.3 and newer, as well as CDH
5.2.1. It is not fixed in CDH 4.x or CDH 5.1.x.
Due to these issues, be cautious with granting Global Admin privileges.
Namespace Admin - a namespace admin with Create permissions can create or drop tables within that
namespace, and take and restore snapshots. A namespace admin with Admin permissions can perform
operations such as splits or major compactions on tables within that namespace.
Table Admins - A table admin can perform administrative operations only on that table. A table admin with
Create permissions can create snapshots from that table or restore that table from a snapshot. A table
admin with Admin permissions can perform operations such as splits or major compactions on that table.

Users - Users can read or write data, or both. Users can also execute coprocessor endpoints, if given
Executable permissions.
Table 9: Real-World Example of Access Levels
This table shows some typical job descriptions at a hypothetical company and the permissions they might
require in order to get their jobs done using HBase.
Job Title              Scope     Permissions      Description

Senior Administrator   Global    Access, Create   Manages the cluster and gives access to
                                                  Junior Administrators.

Junior Administrator   Global    Create           Creates tables and gives access to Table
                                                  Administrators.

Table Administrator    Table     Access           Maintains a table from an operations
                                                  point of view.

Data Analyst           Table     Read             Creates reports from HBase data.

Web Application        Table     Read, Write      Puts data into HBase and uses HBase data
                                                  to perform operations.

Further Reading
Access Control Matrix
Security - Apache HBase Reference Guide

Enable HBase Authorization


HBase Authorization is built on top of the Coprocessors framework, specifically the AccessController coprocessor.
To enable HBase authorization, add the following properties to the hbase-site.xml file on every HBase server
host (Master or Region Server):
<property>
<name>hbase.security.authorization</name>
<value>true</value>
</property>
<property>
<name>hbase.coprocessor.master.classes</name>
<value>org.apache.hadoop.hbase.security.access.AccessController</value>
</property>
<property>
<name>hbase.coprocessor.region.classes</name>
<value>org.apache.hadoop.hbase.security.token.TokenProvider,org.apache.hadoop.hbase.security.access.AccessController</value>
</property>

Note:
Once the Access Controller coprocessor is enabled, any user who uses the HBase shell will be subject
to access control. Access control will also be in effect for native (Java API) client access to HBase.

Configure Access Control Lists for Authorization
Now that HBase has the security coprocessor enabled, you can set ACLs via the HBase shell. Start the HBase
shell as usual.
Important:
The host running the shell must be configured with a keytab file as described here.
The commands that control ACLs are of the form of:
grant <user> <permissions>[ <table>[ <column family>[ <column qualifier> ] ] ]
#grants permissions
revoke <user> <permissions> [ <table> [ <column family> [ <column qualifier> ] ] ]
# revokes permissions
user_permission <table> # displays existing permissions

In the above commands, fields encased in <> are variables, and fields in [] are optional. The permissions
variable must consist of zero or more characters from the set "RWXCA".
R denotes read permissions, which is required to perform Get, Scan, or Exists calls in a given scope.
W denotes write permissions, which is required to perform Put, Delete, LockRow, UnlockRow,
IncrementColumnValue, CheckAndDelete, CheckAndPut, Flush, or Compact in a given scope.
X denotes execute permissions, which is required to execute coprocessor endpoints.
C denotes create permissions, which is required to perform Create, Alter, or Drop in a given scope.
A denotes admin permissions, which is required to perform Enable, Disable, Snapshot, Restore, Clone,
Split, MajorCompact, Grant, Revoke, and Shutdown in a given scope.
For example:
grant 'user1', 'RWC'
grant 'user2', 'RW', 'tableA'

Be sure to review the information in Understanding HBase Access Levels on page 102 to understand the
implications of the different access levels.

Configuring Secure HBase Replication


If you are using HBase Replication and you want to make it secure, read this section for instructions. Before
proceeding, you should already have configured HBase Replication by following the instructions in the HBase
Replication section of the CDH 5 Installation Guide.
To configure secure HBase replication, you must configure cross realm support for Kerberos, ZooKeeper, and
Hadoop.
To configure secure HBase replication:
1. Create krbtgt principals for the two realms. For example, if you have two realms called ONE.COM and TWO.COM,
you need to add the following principals: krbtgt/[email protected] and krbtgt/[email protected]. Add
these two principals at both realms. Note that there must be at least one common encryption mode between
these two realms.
kadmin: addprinc -e "<enc_type_list>" krbtgt/[email protected]
kadmin: addprinc -e "<enc_type_list>" krbtgt/[email protected]

2. Add rules for creating short names in ZooKeeper. To do this, add a system level property in java.env, defined
in the conf directory. Here is an example rule that illustrates how to add support for the realm called ONE.COM,
and have two members in the principal (such as service/[email protected]):
-Dzookeeper.security.auth_to_local=RULE:[2:\$1@\$0](.*@\\QONE.COM\\E$)s/@\\QONE.COM\\E$//DEFAULT

The above example adds support for the ONE.COM realm in a different realm. In the case of replication, this
means you must add a rule for the master cluster's realm in the slave cluster's realm. DEFAULT specifies the
default rule.
3. Add rules for creating short names in the Hadoop processes. To do this, add the
hadoop.security.auth_to_local property in the core-site.xml file in the slave cluster. For example,
to add support for the ONE.COM realm:
<property>
<name>hadoop.security.auth_to_local</name>
<value>
RULE:[2:$1@$0](.*@\QONE.COM\E$)s/@\QONE.COM\E$//
DEFAULT
</value>
</property>

For more information about adding rules, see Appendix C - Configuring the Mapping from Kerberos Principals
to Short Names.

Configuring the HBase Client TGT Renewal Period


An HBase client user must also have a Kerberos principal which typically has a password that only the user
knows. You should configure the maxrenewlife setting for the client's principal to a value that allows the user
enough time to finish HBase client processes before the ticket granting ticket (TGT) expires. For example, if the
HBase client processes require up to four days to complete, you should create the user's principal and configure
the maxrenewlife setting by using a command of this form (substitute the principal name of the HBase client user):
kadmin: addprinc -maxrenewlife 4days <user_principal>
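You can verify the resulting setting afterwards with kadmin's getprinc subcommand; look for the Maximum
renewable life field in its output:
kadmin: getprinc <user_principal>
...
Maximum renewable life: 4 days 00:00:00
...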


Impala Security Configuration


Impala 1.1 and higher include a fine-grained authorization framework for Hadoop, by integrating the Sentry
open source project. Together with the existing Kerberos authentication framework, Sentry takes Hadoop security
to a new level needed for the requirements of highly regulated industries such as healthcare, financial services,
and government. Impala 1.1.1 and higher fill in the security feature set even more by adding an auditing capability;
Impala generates the audit data, the Cloudera Navigator product consolidates the audit data from all nodes in
the cluster, and Cloudera Manager lets you filter, visualize, and produce reports.
The security features of Cloudera Impala have several objectives. At the most basic level, security prevents
accidents or mistakes that could disrupt application processing, delete or corrupt data, or reveal data to
unauthorized users. More advanced security features and practices can harden the system against malicious
users trying to gain unauthorized access or perform other disallowed operations. The auditing feature provides
a way to confirm that no unauthorized access occurred, and detect whether any such attempts were made. This
is a critical set of features for production deployments in large organizations that handle important or sensitive
data. It sets the stage for multi-tenancy, where multiple applications run concurrently and are prevented from
interfering with each other.
The material in this section presumes that you are already familiar with administering secure Linux systems.
That is, you should know the general security practices for Linux and Hadoop, and their associated commands
and configuration files. For example, you should know how to create Linux users and groups, manage Linux
group membership, set Linux and HDFS file permissions and ownership, and designate the default permissions
and ownership for new files. You should be familiar with the configuration of the nodes in your Hadoop cluster,
and know how to apply configuration changes or run a set of commands across all the nodes.
The security features are divided into these broad categories:
authorization
Which users are allowed to access which resources, and what operations are they allowed to perform?
Impala relies on the open source Sentry project for authorization. By default (when authorization is not
enabled), Impala does all read and write operations with the privileges of the impala user, which is
suitable for a development/test environment but not for a secure production environment. When
authorization is enabled, Impala uses the OS user ID of the user who runs impala-shell or other client
program, and associates various privileges with each user. See Enabling Sentry Authorization for Impala
on page 110 for details about setting up and managing authorization.
authentication
How does Impala verify the identity of the user to confirm that they really are allowed to exercise the
privileges assigned to that user? Impala relies on the Kerberos subsystem for authentication. See Enabling
Kerberos Authentication for Impala on page 122 for details about setting up and managing authentication.
auditing
What operations were attempted, and did they succeed or not? This feature provides a way to look back
and diagnose whether attempts were made to perform unauthorized operations. You use this information
to track down suspicious activity, and to see where changes are needed in authorization policies. The
audit data produced by this feature is collected by the Cloudera Manager product and then presented in
a user-friendly form by the Cloudera Manager product. See Auditing Impala Operations on page 126 for
details about setting up and managing auditing.
The following sections lead you through the various security-related features of Impala.

Security Guidelines for Impala


The following are the major steps to harden a cluster running Impala against accidents and mistakes, or malicious
attackers trying to access sensitive data:



Secure the root account. The root user can tamper with the impalad daemon, read and write the data files
in HDFS, log into other user accounts, and access other system services that are beyond the control of Impala.
Restrict membership in the sudoers list (in the /etc/sudoers file). The users who can run the sudo command
can do many of the same things as the root user.
Ensure the Hadoop ownership and permissions for Impala data files are restricted.
Ensure the Hadoop ownership and permissions for Impala log files are restricted.
Ensure that the Impala web UI (available by default on port 25000 on each Impala node) is password-protected.
See Securing the Impala Web User Interface on page 109 for details.
Create a policy file that specifies which Impala privileges are available to users in particular Hadoop groups
(which by default map to Linux OS groups). Create the associated Linux groups using the groupadd command
if necessary.
The Impala authorization feature makes use of the HDFS file ownership and permissions mechanism; for
background information, see the CDH HDFS Permissions Guide. Set up users and assign them to groups at
the OS level, corresponding to the different categories of users with different access levels for various
databases, tables, and HDFS locations (URIs). Create the associated Linux users using the useradd command
if necessary, and add them to the appropriate groups with the usermod command.
Design your databases, tables, and views with database and table structure to allow policy rules to specify
simple, consistent rules. For example, if all tables related to an application are inside a single database, you
can assign privileges for that database and use the * wildcard for the table name. If you are creating views
with different privileges than the underlying base tables, you might put the views in a separate database so
that you can use the * wildcard for the database containing the base tables, while specifying the precise
names of the individual views. (For specifying table or database names, you either specify the exact name
or * to mean all the databases on a server, or all the tables and views in a database.)
Enable authorization by running the impalad daemons with the -server_name and
-authorization_policy_file options on all nodes. (The authorization feature does not apply to the
statestored daemon, which has no access to schema objects or data files.)
Set up authentication using Kerberos, to make sure users really are who they say they are.
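As a concrete sketch of the user and group setup steps listed above (the group names analysts and dbadmins
and the user names are placeholders for your own categories of users):
$ sudo groupadd analysts
$ sudo groupadd dbadmins
$ sudo useradd -G analysts alice
$ sudo useradd -G dbadmins bob
$ sudo usermod -a -G analysts bob        # add an existing user to an additional group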

Securing Impala Data and Log Files


One aspect of security is to protect files from unauthorized access at the filesystem level. For example, if you
store sensitive data in HDFS, you specify permissions on the associated files and directories in HDFS to restrict
read and write permissions to the appropriate users and groups.
If you issue queries containing sensitive values in the WHERE clause, such as financial account numbers, those
values are stored in Impala log files in the Linux filesystem and you must secure those files also. For the locations
of Impala log files, see Using Impala Logging.
All Impala read and write operations are performed under the filesystem privileges of the impala user. The
impala user must be able to read all directories and data files that you query, and write into all the directories
and data files for INSERT and LOAD DATA statements. At a minimum, make sure the impala user is in the hive
group so that it can access files and directories shared between Impala and Hive. See User Account Requirements
for more details.
Setting file permissions is necessary for Impala to function correctly, but is not an effective security practice by
itself:
The way to ensure that only authorized users can submit requests for databases and tables they are allowed
to access is to set up Sentry authorization, as explained in Enabling Sentry Authorization for Impala on page
110. With authorization enabled, the checking of the user ID and group is done by Impala, and unauthorized
access is blocked by Impala itself. The actual low-level read and write requests are still done by the impala
user, so you must have appropriate file and directory permissions for that user ID.
You must also set up Kerberos authentication, as described in Enabling Kerberos Authentication for Impala
on page 122, so that users can only connect from trusted hosts. With Kerberos enabled, if someone connects
a new host to the network and creates user IDs that match your privileged IDs, they will be blocked from
connecting to Impala at all from that host.
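For example, a sketch of the kind of filesystem lockdown described above might look like the following; the
paths shown are common defaults and may differ in your deployment, and the exact ownership scheme depends on
how you share data between Impala and Hive (remember that the impala user must be in the hive group):
$ sudo -u hdfs hdfs dfs -chown -R hive:hive /user/hive/warehouse
$ sudo -u hdfs hdfs dfs -chmod -R 770 /user/hive/warehouse
$ sudo chown -R impala:impala /var/log/impalad
$ sudo chmod -R 750 /var/log/impalad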

Installation Considerations for Impala Security


Impala 1.1 comes set up with all the software and settings needed to enable security when you run the impalad
daemon with the new security-related options (-server_name and -authorization_policy_file). You do
not need to change any environment variables or install any additional JAR files. In a cluster managed by Cloudera
Manager, you do not need to change any settings in Cloudera Manager.

Securing the Hive Metastore Database


It is important to secure the Hive metastore, so that users cannot access the names or other information about
databases and tables through the Hive client or by querying the metastore database. Do this by turning on
Hive metastore security, using the instructions in the CDH 4 Security Guide or the CDH 5 Security Guide for
securing different Hive components:
Secure the Hive Metastore.
In addition, allow access to the metastore only from the HiveServer2 server, and then disable local access to
the HiveServer2 server.

Securing the Impala Web User Interface


The instructions in this section presume you are familiar with the .htpasswd mechanism commonly used to
password-protect pages on web servers.
Password-protect the Impala web UI that listens on port 25000 by default. Set up a .htpasswd file in the
$IMPALA_HOME directory, or start both the impalad and statestored daemons with the
--webserver_password_file option to specify a different location (including the filename).
This file should only be readable by the Impala process and machine administrators, because it contains (hashed)
versions of passwords. The username / password pairs are not derived from Unix usernames, Kerberos users,
or any other system. The domain field in the password file must match the domain supplied to Impala by the
new command-line option --webserver_authentication_domain. The default is mydomain.com.
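For example, one way to produce a password file in the required user:domain:hash digest format is with the
htdigest utility from the Apache HTTP server tools; this is only a sketch, and the file path, domain, and user
name are placeholders:
$ htdigest -c /etc/impala/.htpasswd mydomain.com webadmin
$ sudo chown impala:impala /etc/impala/.htpasswd
$ sudo chmod 400 /etc/impala/.htpasswd
Then start both the impalad and statestored daemons with --webserver_password_file=/etc/impala/.htpasswd and
--webserver_authentication_domain=mydomain.com.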
Impala also supports using HTTPS for secure web traffic. To do so, set --webserver_certificate_file to
refer to a valid .pem SSL certificate file. Impala will automatically start using HTTPS once the SSL certificate has
been read and validated. A .pem file is basically a private key, followed by a signed SSL certificate; make sure to
concatenate both parts when constructing the .pem file.
If Impala cannot find or parse the .pem file, it prints an error message and quits.
Note:
If the private key is encrypted using a passphrase, Impala will ask for that passphrase on startup,
which is not useful for a large cluster. In that case, remove the passphrase and make the .pem file
readable only by Impala and administrators.
When you turn on SSL for the Impala web UI, the associated URLs change from http:// prefixes to
https://. Adjust any bookmarks or application code that refers to those URLs.

Enabling SSL for Impala


Impala supports SSL network encryption, between Impala and client programs, and between the Impala-related
daemons running on different nodes in the cluster. This feature is important when you also use other features
such as Kerberos authentication or Sentry authorization, where credentials are being transmitted back and
forth.


To enable SSL for Impala network communication, add both of the following flags to the impalad startup options:
--ssl_server_certificate: the full path to the server certificate, on the local filesystem.
--ssl_private_key : the full path to the server private key, on the local filesystem.
If either of these flags is set, both must be set. In that case, Impala starts listening for Beeswax and HiveServer2
requests on SSL-secured ports only. (The port numbers stay the same; see Appendix A - Ports Used by Impala
for details.)
Typically, a client program has corresponding options to verify that it is connecting to the right server. For
example, with SSL enabled for Impala, you use the following options when starting the impala-shell interpreter:
--ssl: enables SSL for impala-shell.
--ca_cert: the local pathname pointing to the third-party CA certificate, or to a copy of the server certificate
for self-signed server certificates.
If --ca_cert is not set, impala-shell enables SSL, but does not validate the server certificate. This is useful
for connecting to a known-good Impala that is only running over SSL, when a copy of the certificate is not
available (such as when debugging customer installations).
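For example, a sketch of the server-side and client-side settings (the certificate and key paths and the
hostname are placeholders):
IMPALA_SERVER_ARGS=" \
    --ssl_server_certificate=/etc/impala/ssl/impala-cert.pem \
    --ssl_private_key=/etc/impala/ssl/impala-key.pem \
    ..."

$ impala-shell --ssl --ca_cert=/etc/impala/ssl/impala-cert.pem -i impala-host.example.com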

Enabling Sentry Authorization for Impala


Authorization determines which users are allowed to access which resources, and what operations they are
allowed to perform. In Impala 1.1 and higher, you use the Sentry open source project for authorization. Sentry
adds a fine-grained authorization framework for Hadoop. By default (when authorization is not enabled), Impala
does all read and write operations with the privileges of the impala user, which is suitable for a development/test
environment but not for a secure production environment. When authorization is enabled, Impala uses the OS
user ID of the user who runs impala-shell or other client program, and associates various privileges with each
user.
Note: Sentry is typically used in conjunction with Kerberos authentication, which defines which hosts
are allowed to connect to each server. Using the combination of Sentry and Kerberos prevents malicious
users from being able to connect by creating a named account on an untrusted machine. See Enabling
Kerberos Authentication for Impala on page 122 for details about Kerberos authentication.

The Sentry Privilege Model


Privileges can be granted on different objects in the schema. Any privilege that can be granted is associated
with a level in the object hierarchy. If a privilege is granted on a container object in the hierarchy, the child object
automatically inherits it. This is the same privilege model as Hive and other database systems such as MySQL.
The object hierarchy covers Server, URI, Database, and Table. (The Table privileges apply to views as well; anywhere
you specify a table name, you can specify a view name instead.) Currently, you cannot assign privileges at the
partition or column level. The way you implement column-level or partition-level privileges is to create a view
that queries just the relevant columns or partitions, and assign privileges to the view rather than the underlying
table or tables.
A restricted set of privileges determines what you can do with each object:
SELECT privilege
Lets you read data from a table or view, for example with the SELECT statement, the INSERT...SELECT
syntax, or CREATE TABLE...LIKE. Also required to issue the DESCRIBE statement or the EXPLAIN
statement for a query against a particular table. Only objects for which a user has this privilege are shown
in the output for SHOW DATABASES and SHOW TABLES statements. The REFRESH statement and
INVALIDATE METADATA statements only access metadata for tables for which the user has this privilege.
INSERT privilege
Lets you write data to a table. Applies to the INSERT and LOAD DATA statements.



ALL privilege
Lets you create or modify the object. Required to run DDL statements such as CREATE TABLE, ALTER
TABLE, or DROP TABLE for a table, CREATE DATABASE or DROP DATABASE for a database, or CREATE VIEW,
ALTER VIEW, or DROP VIEW for a view. Also required for the URI of the location parameter for the CREATE
EXTERNAL TABLE and LOAD DATA statements.
Privileges can be specified for a table or view before that object actually exists. If you do not have sufficient
privilege to perform an operation, the error message does not disclose if the object exists or not.
Originally, privileges were encoded in a policy file, stored in HDFS. This mode of operation is still an option, but
the emphasis of privilege management is moving towards being SQL-based. Although currently Impala does
not have GRANT or REVOKE statements, Impala can make use of privileges assigned through GRANT and REVOKE
statements done through Hive. The mode of operation with GRANT and REVOKE statements instead of the policy
file requires that a special Sentry service be enabled; this service stores, retrieves, and manipulates privilege
information stored inside the metastore database.

Starting the impalad Daemon with Sentry Authorization Enabled


To run the impalad daemon with authorization enabled, you add one or more options to the IMPALA_SERVER_ARGS
declaration in the /etc/default/impala configuration file:
The -server_name option turns on Sentry authorization for Impala. The authorization rules refer to a symbolic
server name, and you specify the name to use as the argument to the -server_name option.
If you specify just -server_name, Impala uses the Sentry service for authorization, relying on the results of
GRANT and REVOKE statements issued through Hive. (This mode of operation is available in Impala 1.4.0 and
higher.) Prior to Impala 1.4.0, or if you want to continue storing privilege rules in the policy file, also specify
the -authorization_policy_file option as in the following item.
Specifying the -authorization_policy_file option in addition to -server_name makes Impala read
privilege information from a policy file, rather than from the metastore database. The argument to the
-authorization_policy_file option specifies the HDFS path to the policy file that defines the privileges
on different schema objects.
For example, you might adapt your /etc/default/impala configuration to contain lines like the following. To
use the Sentry service rather than the policy file:
IMPALA_SERVER_ARGS=" \
-server_name=server1 \
...

Or to use the policy file, as in releases prior to Impala 1.4:


IMPALA_SERVER_ARGS=" \
-authorization_policy_file=/user/hive/warehouse/auth-policy.ini \
-server_name=server1 \
...

The preceding examples set up a symbolic name of server1 to refer to the current instance of Impala. This
symbolic name is used in the following ways:
In an environment managed by Cloudera Manager, the server name is specified through Impala > Service-Wide
> Advanced > Server Name for Sentry Authorization and Hive > Service-Wide > Advanced > Server Name for
Sentry Authorization. The values must be the same for both, so that Impala and Hive can share the privilege
rules. Restart the Impala and Hive services after setting or changing this value.
In an environment not managed by Cloudera Manager, you specify this value for the sentry.hive.server
property in the sentry-site.xml configuration file for Hive, as well as in the -server_name option for
impalad.
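For example, the corresponding entry in the Hive sentry-site.xml file might look like this sketch, reusing the
server1 name from the examples above:
<property>
  <name>sentry.hive.server</name>
  <value>server1</value>
</property>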



If the impalad daemon is not already running, start it as described in Starting Impala. If it is already running,
restart it with the command sudo /etc/init.d/impala-server restart. Run the appropriate commands
on all the nodes where impalad normally runs.
If you use the mode of operation using the policy file, the rules in the [roles] section of the policy file refer
to this same server1 name. For example, the following rule sets up a role report_generator that lets
users with that role query any table in a database named reporting_db on a node where the impalad
daemon was started up with the -server_name=server1 option:
[roles]
report_generator = server=server1->db=reporting_db->table=*->action=SELECT

When impalad is started with one or both of the -server_name=server1 and -authorization_policy_file
options, Impala authorization is enabled. If Impala detects any errors or inconsistencies in the authorization
settings or the policy file, the daemon refuses to start.

Using Impala with the Sentry Service (CDH 5.1 or higher only)
When you use the Sentry service rather than the policy file, you set up privileges through GRANT and REVOKE
statement in Hive, then Impala inherits those same privileges automatically. (Currently, Impala does not implement
the GRANT and REVOKE statements.)
Hive already had GRANT and REVOKE statements prior to CDH 5.1, but those statements were not production-ready.
CDH 5.1 is the first release where those statements use the Sentry framework and are considered GA level. If
you used the Hive GRANT and REVOKE statements prior to CDH 5.1, you must set up these privileges with the
CDH 5.1 versions of GRANT and REVOKE to take advantage of Sentry authorization.
For information about using the updated Hive GRANT and REVOKE statements, see Sentry service topic in the
CDH 5 Security Guide.

Using Impala with the Sentry Policy File


The policy file is a file that you put in a designated location in HDFS, and is read during the startup of the impalad
daemon when you specify both the -server_name and -authorization_policy_file startup options. It
controls which objects (databases, tables, and HDFS directory paths) can be accessed by the user who connects
to impalad, and what operations that user can perform on the objects.
Note: This mode of operation works on both CDH 4 and CDH 5, but in CDH 5 the emphasis is shifting
towards managing privileges through SQL statements, as described in Using Impala with the Sentry
Service (CDH 5.1 or higher only) on page 112. If you are still using policy files, plan to migrate to the
new approach some time in the future.
The location of the policy file is listed in the auth-site.xml configuration file. To minimize overhead, the security
information from this file is cached by each impalad daemon and refreshed automatically, with a default interval
of 5 minutes. After making a substantial change to security policies, restart all Impala daemons to pick up the
changes immediately.

Policy File Location and Format


The policy file uses the familiar .ini format, divided into the major sections [groups] and [roles]. There is
also an optional [databases] section, which allows you to specify a specific policy file for a particular database,
as explained in Using Multiple Policy Files for Different Databases on page 117. Another optional section, [users],
allows you to override the OS-level mapping of users to groups; that is an advanced technique primarily for
testing and debugging, and is beyond the scope of this document.
In the [groups] section, you define various categories of users and select which roles are associated with each
category. The group and user names correspond to Linux groups and users on the server where the impalad
daemon runs.



When you access Impala through the impala-shell interpreter, for purposes of authorization, the user is the
logged-in Linux user and the groups are the Linux groups that user is a member of. When you access Impala
through the ODBC or JDBC interfaces, the user and password specified through the connection string are used
as login credentials for the Linux server, and authorization is based on that user name and the associated
Linux group membership.
In the [roles] section, you define a set of roles. For each role, you specify precisely the set of privileges
that is available: that is, which objects users with that role can access, and what operations they can perform
on those objects. This is the lowest-level category of security information; the other sections in the policy
file map the privileges to higher-level divisions of groups and users. In the [groups] section, you specify
which roles are associated with which groups. The privileges are specified using patterns like:
server=server_name->db=database_name->table=table_name->action=SELECT
server=server_name->db=database_name->table=table_name->action=CREATE
server=server_name->db=database_name->table=table_name->action=ALL

For the server_name value, substitute the same symbolic name you specify with the impalad -server_name
option. You can use * wildcard characters at each level of the privilege specification to allow access to all such
objects. For example:
server=impala-host.example.com->db=default->table=t1->action=SELECT
server=impala-host.example.com->db=*->table=*->action=CREATE
server=impala-host.example.com->db=*->table=audit_log->action=SELECT
server=impala-host.example.com->db=default->table=t1->action=*

When authorization is enabled, Impala uses the policy file as a whitelist, representing every privilege available
to any user on any object. That is, only operations specified for the appropriate combination of object, role, group,
and user are allowed; all other operations are not allowed. If a group or role is defined multiple times in the
policy file, the last definition takes precedence.
To understand the notion of whitelisting, set up a minimal policy file that does not provide any privileges for any
object. When you connect to an Impala node where this policy file is in effect, you get no results for SHOW
DATABASES, and an error when you issue any SHOW TABLES, USE database_name, DESCRIBE table_name,
SELECT, or other statements that expect to access databases or tables, even if the corresponding databases
and tables exist.
The contents of the policy file are cached, to avoid a performance penalty for each query. The policy file is
re-checked by each impalad node every 5 minutes. When you make a non-time-sensitive change such as adding
new privileges or new users, you can let the change take effect automatically a few minutes later. If you remove
or reduce privileges, and want the change to take effect immediately, restart the impalad daemon on all nodes,
again specifying the -server_name and -authorization_policy_file options so that the rules from the
updated policy file are applied.

Examples of Policy File Rules for Security Scenarios


The following examples show rules that might go in the policy file to deal with various authorization-related
scenarios. For illustration purposes, this section shows several very small policy files with only a few rules each.
In your environment, typically you would define many roles to cover all the scenarios involving your own databases,
tables, and applications, and a smaller number of groups, whose members are given the privileges from one or
more roles.
A User with No Privileges
If a user has no privileges at all, that user cannot access any schema objects in the system. The error messages
do not disclose the names or existence of objects that the user is not authorized to read.
This is the experience you want a user to have if they somehow log into a system where they are not an authorized
Impala user. In a real deployment with a filled-in policy file, a user might have no privileges because they are
not a member of any of the relevant groups mentioned in the policy file.



Examples of Privileges for Administrative Users
When an administrative user has broad access to tables or databases, the associated rules in the [roles]
section typically use wildcards and/or inheritance. For example, in the following sample policy file, db=* refers
to all databases and db=*->table=* refers to all tables in all databases.
Omitting the rightmost portion of a rule means that the privileges apply to all the objects that could be specified
there. For example, in the following sample policy file, the all_databases role has all privileges for all tables
in all databases, while the one_database role has all privileges for all tables in one specific database. The
all_databases role does not grant privileges on URIs, so a group with that role could not issue a CREATE TABLE
statement with a LOCATION clause. The entire_server role has all privileges on both databases and URIs
within the server.
[groups]
supergroup = all_databases
[roles]
read_all_tables = server=server1->db=*->table=*->action=SELECT
all_tables = server=server1->db=*->table=*
all_databases = server=server1->db=*
one_database = server=server1->db=test_db
entire_server = server=server1

A User with Privileges for Specific Databases and Tables


If a user has privileges for specific tables in specific databases, the user can access those things but nothing
else. They can see the tables and their parent databases in the output of SHOW TABLES and SHOW DATABASES,
USE the appropriate databases, and perform the relevant actions (SELECT and/or INSERT) based on the table
privileges. To actually create a table requires the ALL privilege at the database level, so you might define separate
roles for the user that sets up a schema and other users or applications that perform day-to-day operations
on the tables.
The following sample policy file shows some of the syntax that is appropriate as the policy file grows, such as
the # comment syntax, \ continuation syntax, and comma separation for roles assigned to groups or privileges
assigned to roles.
[groups]
cloudera = training_sysadmin, instructor
visitor = student
[roles]
training_sysadmin = server=server1->db=training, \
server=server1->db=instructor_private, \
server=server1->db=lesson_development
instructor = server=server1->db=training->table=*->action=*, \
server=server1->db=instructor_private->table=*->action=*, \
server=server1->db=lesson_development->table=lesson*
# This particular course is all about queries, so the students can SELECT but not INSERT
or CREATE/DROP.
student = server=server1->db=training->table=lesson_*->action=SELECT

Privileges for Working with External Data Files


When data is being inserted through the LOAD DATA statement, or is referenced from an HDFS location outside
the normal Impala database directories, the user also needs appropriate permissions on the URIs corresponding
to those HDFS locations.
In this sample policy file:
The external_table role lets us insert into and query the Impala table, external_table.sample.
The staging_dir role lets us specify the HDFS path /user/cloudera/external_data with the LOAD DATA
statement. Remember, when Impala queries or loads data files, it operates on all the files in that directory,
not just a single file, so any Impala LOCATION parameters refer to a directory rather than an individual file.



We included the IP address and port of the Hadoop name node in the HDFS URI of the staging_dir rule.
We found those details in /etc/hadoop/conf/core-site.xml, under the fs.default.name element. That
is what we use in any roles that specify URIs (that is, the locations of directories in HDFS).
We start this example after the table external_table.sample is already created. In the policy file for the
example, we have already taken away the external_table_admin role from the cloudera group, and
replaced it with the lesser-privileged external_table role.
We assign privileges to a subdirectory underneath /user/cloudera in HDFS, because such privileges also
apply to any subdirectories underneath. If we had assigned privileges to the parent directory /user/cloudera,
it would be too likely to mess up other files by specifying a wrong location by mistake.
The cloudera under the [groups] section refers to the cloudera group. (In the demo VM used for this
example, there is a cloudera user that is a member of a cloudera group.)
Policy file:
[groups]
cloudera = external_table, staging_dir
[roles]
external_table_admin = server=server1->db=external_table
external_table = server=server1->db=external_table->table=sample->action=*
staging_dir = server=server1->uri=hdfs://127.0.0.1:8020/user/cloudera/external_data->action=*
impala-shell session:
[localhost:21000] > use external_table;
Query: use external_table
[localhost:21000] > show tables;
Query: show tables
Query finished, fetching results ...
+--------+
| name   |
+--------+
| sample |
+--------+
Returned 1 row(s) in 0.02s
[localhost:21000] > select * from sample;
Query: select * from sample
Query finished, fetching results ...
+-----+
| x   |
+-----+
| 1   |
| 5   |
| 150 |
+-----+
Returned 3 row(s) in 1.04s
[localhost:21000] > load data inpath '/user/cloudera/external_data' into table sample;
Query: load data inpath '/user/cloudera/external_data' into table sample
Query finished, fetching results ...
+-----------------------------------------------------------+
| summary                                                   |
+-----------------------------------------------------------+
| Loaded 1 file(s). Total files in destination location: 2 |
+-----------------------------------------------------------+
Returned 1 row(s) in 0.26s
[localhost:21000] > select * from sample;
Query: select * from sample
Query finished, fetching results ...
+-------+
| x     |
+-------+
| 2     |
| 4     |
| 6     |
| 8     |
| 64738 |
| 49152 |
| 1     |
| 5     |
| 150   |
+-------+
Returned 9 row(s) in 0.22s
[localhost:21000] > load data inpath '/user/cloudera/unauthorized_data' into table sample;
Query: load data inpath '/user/cloudera/unauthorized_data' into table sample
ERROR: AuthorizationException: User 'cloudera' does not have privileges to access:
hdfs://127.0.0.1:8020/user/cloudera/unauthorized_data

Controlling Access at the Column Level through Views


If a user has SELECT privilege for a view, they can query the view, even if they do not have any privileges on the
underlying table. To see the details about the underlying table through EXPLAIN or DESCRIBE FORMATTED
statements on the view, the user must also have SELECT privilege for the underlying table.
Important:
The types of data that are considered sensitive and confidential differ depending on the jurisdiction,
the type of industry, or both. For fine-grained access controls, set up appropriate privileges based on
all applicable laws and regulations.
Be careful using the ALTER VIEW statement to point an existing view at a different base table or a
new set of columns that includes sensitive or restricted data. Make sure that any users who have
SELECT privilege on the view do not gain access to any additional information they are not authorized
to see.
The following example shows how a system administrator could set up a table containing some columns with
sensitive information, then create a view that only exposes the non-confidential columns.
[localhost:21000] > create table sensitive_info
                  > (
                  >   name string,
                  >   address string,
                  >   credit_card string,
                  >   taxpayer_id string
                  > );
[localhost:21000] > create view name_address_view as select name, address from sensitive_info;

Then the following policy file specifies read-only privilege for that view, without authorizing access to the
underlying table:
[groups]
cloudera = view_only_privs
[roles]
view_only_privs = server=server1->db=reports->table=name_address_view->action=SELECT

Thus, a user with the view_only_privs role could access through Impala queries the basic information but not
the sensitive information, even if both kinds of information were part of the same data file.
You might define other views to allow users from different groups to query different sets of columns.
Separating Administrator Responsibility from Read and Write Privileges
Remember that to create a database requires full privilege on that database, while day-to-day operations on
tables within that database can be performed with lower levels of privilege on specific table. Thus, you might
set up separate roles for each database or application: an administrative one that could create or drop the
database, and a user-level one that can access only the relevant tables.



For example, this policy file divides responsibilities between users in 3 different groups:
Members of the supergroup group have the training_sysadmin role and so can set up a database named
training.
Members of the cloudera group have the instructor role and so can create, insert into, and query any
tables in the training database, but cannot create or drop the database itself.
Members of the visitor group have the student role and so can query those tables in the training
database.
[groups]
supergroup = training_sysadmin
cloudera = instructor
visitor = student
[roles]
training_sysadmin = server=server1->db=training
instructor = server=server1->db=training->table=*->action=*
student = server=server1->db=training->table=*->action=SELECT

Using Multiple Policy Files for Different Databases


For an Impala cluster with many databases being accessed by many users and applications, it might be
cumbersome to update the security policy file for each privilege change or each new database, table, or view.
You can allow security to be managed separately for individual databases, by setting up a separate policy file
for each database:
Add the optional [databases] section to the main policy file.
Add entries in the [databases] section for each database that has its own policy file.
For each listed database, specify the HDFS path of the appropriate policy file.
For example:
[databases]
# Defines the location of the per-DB policy files for the 'customers' and 'sales'
databases.
customers = hdfs://ha-nn-uri/etc/access/customers.ini
sales = hdfs://ha-nn-uri/etc/access/sales.ini

To enable URIs in per-DB policy files, add the following string in the Cloudera Manager field Impala Service
Environment Advanced Configuration Snippet (Safety Valve):
JAVA_TOOL_OPTIONS="-Dsentry.allow.uri.db.policyfile=true"

Important: Enabling URIs in per-DB policy files introduces a security risk by allowing the owner of
the db-level policy file to grant himself/herself load privileges to anything the impala user has read
permissions for in HDFS (including data in other databases controlled by different db-level policy
files).

Setting Up Schema Objects for a Secure Impala Deployment


Remember that in your role definitions, you specify privileges at the level of individual databases and tables, or
all databases or all tables within a database. To simplify the structure of these rules, plan ahead of time how
to name your schema objects so that data with different authorization requirements is divided into separate
databases.
If you are adding security on top of an existing Impala deployment, remember that you can rename tables or
even move them between databases using the ALTER TABLE statement. In Impala, creating new databases is
a relatively inexpensive operation, basically just creating a new directory in HDFS.
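For example, to move an existing table out of the default database into a database reserved for restricted
data (the database and table names here are hypothetical):
CREATE DATABASE restricted_db;
ALTER TABLE default.customer_accounts RENAME TO restricted_db.customer_accounts;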



You can also plan the security scheme and set up the policy file before the actual schema objects named in the
policy file exist. Because the authorization capability is based on whitelisting, a user can only create a new
database or table if the required privilege is already in the policy file: either by listing the exact name of the object
being created, or a * wildcard to match all the applicable objects within the appropriate container.

Privilege Model and Object Hierarchy


Privileges can be granted on different objects in the schema. Any privilege that can be granted is associated
with a level in the object hierarchy. If a privilege is granted on a container object in the hierarchy, the child object
automatically inherits it. This is the same privilege model as Hive and other database systems such as MySQL.
The kinds of objects in the schema hierarchy are:
Server
URI
Database
Table

The server name is specified by the -server_name option when impalad starts. Specify the same name for all
impalad nodes in the cluster.
URIs represent the HDFS paths you specify as part of statements such as CREATE EXTERNAL TABLE and LOAD
DATA. Typically, you specify what look like UNIX paths, but these locations can also be prefixed with hdfs:// to
make clear that they are really URIs. To set privileges for a URI, specify the name of a directory, and the privilege
applies to all the files in that directory and any directories underneath it.
There are not separate privileges for individual table partitions or columns. To specify read privileges at this
level, you create a view that queries specific columns and/or partitions from a base table, and give SELECT
privilege on the view but not the underlying table. See Views for details about views in Impala.
URIs must start with either hdfs:// or file://. If a URI starts with anything else, it will cause an exception
and the policy file will be invalid. When defining URIs for HDFS, you must also specify the NameNode. For example:
data_read = server=server1->uri=file:///path/to/dir, \
server=server1->uri=hdfs://namenode:port/path/to/dir

Warning:
Because the NameNode host and port must be specified, Cloudera strongly recommends you use
High Availability (HA). This ensures that the URI will remain constant even if the namenode changes.
data_read = server=server1->uri=file:///path/to/dir,\
server=server1->uri=hdfs://ha-nn-uri/path/to/dir

Table 10: Valid privilege types and objects they apply to

Privilege    Object
INSERT       DB, TABLE
SELECT       DB, TABLE
ALL          SERVER, TABLE, DB, URI



Note:
Although this document refers to the ALL privilege, currently if you use the policy file mode, you do
not use the actual keyword ALL in the policy file. When you code role entries in the policy file:
To specify the ALL privilege for a server, use a role like server=server_name.
To specify the ALL privilege for a database, use a role like
server=server_name->db=database_name.
To specify the ALL privilege for a table, use a role like
server=server_name->db=database_name->table=table_name->action=*.

Operation                                   Scope                          Privileges        URI
EXPLAIN                                     TABLE                          SELECT
LOAD DATA                                   TABLE                          INSERT            URI
CREATE DATABASE                             SERVER                         ALL
DROP DATABASE                               DATABASE                       ALL
CREATE TABLE                                DATABASE                       ALL
DROP TABLE                                  TABLE                          ALL
DESCRIBE TABLE                              TABLE                          SELECT/INSERT
ALTER TABLE .. ADD COLUMNS                  TABLE                          ALL
ALTER TABLE .. REPLACE COLUMNS              TABLE                          ALL
ALTER TABLE .. CHANGE column                TABLE                          ALL
ALTER TABLE .. RENAME                       TABLE                          ALL
ALTER TABLE .. SET TBLPROPERTIES            TABLE                          ALL
ALTER TABLE .. SET FILEFORMAT               TABLE                          ALL
ALTER TABLE .. SET LOCATION                 TABLE                          ALL               URI
ALTER TABLE .. ADD PARTITION                TABLE                          ALL
ALTER TABLE .. ADD PARTITION location       TABLE                          ALL               URI
ALTER TABLE .. DROP PARTITION               TABLE                          ALL
ALTER TABLE .. PARTITION SET FILEFORMAT     TABLE                          ALL
ALTER TABLE .. SET SERDEPROPERTIES          TABLE                          ALL
CREATE VIEW                                 DATABASE; SELECT on TABLE      ALL, SELECT
DROP VIEW                                   VIEW/TABLE                     ALL
ALTER VIEW                                  (see note below)               ALL, SELECT
CREATE EXTERNAL TABLE                       Database (ALL), URI (SELECT)   ALL, SELECT       URI
SELECT                                      TABLE                          SELECT
USE <dbName>                                Any
CREATE FUNCTION                             SERVER                         ALL
DROP FUNCTION                               SERVER                         ALL
REFRESH <table name>                        TABLE                          SELECT/INSERT
INVALIDATE METADATA                         SERVER                         ALL
INVALIDATE METADATA <table name>            TABLE                          SELECT/INSERT
COMPUTE STATS                               TABLE                          ALL
SHOW TABLE STATS, SHOW PARTITIONS           TABLE                          SELECT/INSERT
SHOW COLUMN STATS                           TABLE                          SELECT/INSERT
SHOW FUNCTIONS                              DATABASE                       SELECT
SHOW TABLES                                                                No special privileges needed to issue the
                                                                           statement, but only shows objects you are
                                                                           authorized for
SHOW DATABASES, SHOW SCHEMAS                                               No special privileges needed to issue the
                                                                           statement, but only shows objects you are
                                                                           authorized for

Note on ALTER VIEW: you need the ALL privilege on the named view and the parent database, plus SELECT privilege
for any tables or views referenced by the view query. Once the view is created or altered by a high-privileged
system administrator, it can be queried by a lower-privileged user who does not have full query privileges for
the base tables. (This is how you implement column-level security.)

Debugging Failed Sentry Authorization Requests


Sentry logs all facts that lead up to authorization decisions at the debug level. If you do not understand why
Sentry is denying access, the best way to debug is to temporarily turn on debug logging:
In Cloudera Manager, add log4j.logger.org.apache.sentry=DEBUG to the logging settings for your service
through the corresponding Logging Safety Valve field for the Impala, Hive Server 2, or Solr Server services.
On systems not managed by Cloudera Manager, add log4j.logger.org.apache.sentry=DEBUG to the
log4j.properties file on each host in the cluster, in the appropriate configuration directory for each service.
Specifically, look for exceptions and messages such as:
FilePermission server..., RequestPermission server...., result [true|false]

which indicate each evaluation Sentry makes. The FilePermission is from the policy file, while
RequestPermission is the privilege required for the query. A RequestPermission will iterate over all appropriate
FilePermission settings until a match is found. If no matching privilege is found, Sentry returns false indicating
Access Denied.
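For example, on a host not managed by Cloudera Manager, you might append the following line to the appropriate
log4j.properties file and then search the service log for the permission evaluations; the log path shown is a
placeholder for your own log location:
log4j.logger.org.apache.sentry=DEBUG

$ grep -E 'FilePermission|RequestPermission' /var/log/impalad/impalad.INFO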

Configuring Per-User Access for Hue


When users connect to Impala directly through the impala-shell interpreter, the Impala authorization feature
determines what actions they can take and what data they can see. When users submit Impala queries through
a separate application, such as Hue, typically all requests are treated as coming from the same user. In Impala
1.2 and higher, authorization is extended by a new feature that allows applications to pass along credentials
for the users that connect to them, and issue Impala queries with the privileges for those users. This feature is
known as delegation. Currently, the delegation feature is available only for Impala queries submitted through
the Hue interface; for example, Impala cannot run queries using the privileges of the HDFS user.
Impala 1.2 adds a new startup option for impalad, --authorized_proxy_user_config. When you specify this
option, users whose names you specify (such as hue) can impersonate another user. The name of the user
whose privileges are used is passed using the HiveServer2 configuration property impala.doas.user.
You can specify a list of users that the application user can impersonate, or * to allow a superuser to impersonate
any other user. For example:
impalad --authorized_proxy_user_config 'hue=user1,user2;admin=*' ...

Note: Make sure to use single quotes or escape characters to ensure that any * characters do not
undergo wildcard expansion when specified in command-line arguments.
See Modifying Impala Startup Options for details about adding or changing impalad startup options. See this
Cloudera blog post for background information about the impersonation capability in HiveServer2.

Managing Sentry for Impala through Cloudera Manager


To enable the Sentry service for Impala and Hive, set the Hive/Impala > Service-Wide > Sentry Service parameter
to the Sentry service, then restart Impala and Hive. Simply adding the Sentry service as a dependency and restarting
enables Impala and Hive to use it.
To set the server name to use when granting server level privileges, set the Hive > Service-Wide > Advanced >
Server Name for Sentry Authorization parameter. When using Sentry with the Hive Metastore, you can specify


the list of users that are allowed to bypass Sentry Authorization in Hive Metastore using Hive > Service-Wide
> Security > Bypass Sentry Authorization Users. These are usually service users that already ensure all activity
has been authorized.
Note: The Hive/Impala > Service-Wide > Policy File Based Sentry tab contains parameters only
relevant to configuring Sentry using policy files. In particular, make sure that Enable Sentry
Authorization using Policy Files parameter is unchecked when using the Sentry service. Cloudera
Manager throws a validation error if you attempt to configure the Sentry service and policy file at the
same time.

The DEFAULT Database in a Secure Deployment


Because of the extra emphasis on granular access controls in a secure deployment, you should move any
important or sensitive information out of the DEFAULT database into a named database whose privileges are
specified in the policy file. Sometimes you might need to give privileges on the DEFAULT database for
administrative reasons; for example, as a place you can reliably specify with a USE statement when preparing
to drop a database.

Enabling Kerberos Authentication for Impala


Impala supports Kerberos authentication. For more information on enabling Kerberos authentication, see the
topic on Configuring Hadoop Security in the CDH4 Security Guide or the CDH 5 Security Guide.
Impala currently does not support application data wire encryption.
When using Impala in a managed environment, Cloudera Manager automatically completes Kerberos configuration.
In an unmanaged environment, create a Kerberos principal for each host running impalad or statestored.
Cloudera recommends using a consistent format, such as impala/_HOST@Your-Realm, but you can use any
three-part Kerberos server principal.
Note: Regardless of the authentication mechanism used, Impala always creates HDFS directories
and data files owned by the same user (typically impala). To implement user-level access to different
databases, tables, columns, partitions, and so on, use the Sentry authorization feature, as explained
in Enabling Sentry Authorization for Impala on page 110.

Requirements for Using Impala with Kerberos


On version 5 of Red Hat Enterprise Linux and comparable distributions, some additional setup is needed for the
impala-shell interpreter to connect to a Kerberos-enabled Impala cluster:
sudo yum install python-devel openssl-devel python-pip
sudo pip-python install ssl

Important:
If you plan to use Impala in your cluster, you must configure your KDC to allow tickets to be renewed,
and you must configure krb5.conf to request renewable tickets. Typically, you can do this by adding
the max_renewable_life setting to your realm in kdc.conf, and by adding the renew_lifetime
parameter to the libdefaults section of krb5.conf. For more information about renewable tickets,
see the Kerberos documentation.
Currently, you cannot use the resource management feature in CDH 5 on a cluster that has Kerberos
authentication enabled.
Start all impalad and statestored daemons with the --principal and --keytab-file flags set to the
principal and full path name of the keytab file containing the credentials for the principal.
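As an illustration of the renewable-ticket settings mentioned in the note above, the relevant entries look
something like the following sketch; the realm name and the lifetimes shown are examples only, not recommendations:
# kdc.conf
[realms]
  EXAMPLE.COM = {
    max_renewable_life = 7d
    ...
  }

# krb5.conf
[libdefaults]
  renew_lifetime = 7d
  ...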


Impala supports the Cloudera ODBC driver and the Kerberos interface it provides. To use Kerberos through the
ODBC driver, the host type must be set depending on the version of the ODBC driver:
SecImpala for the ODBC 1.0 driver.
SecBeeswax for the ODBC 1.2 driver.
Blank for the ODBC 2.0 driver or higher, when connecting to a secure cluster.
HS2NoSasl for the ODBC 2.0 driver or higher, when connecting to a non-secure cluster.

To enable Kerberos in the Impala shell, start the impala-shell command using the -k flag.
To enable Impala to work with Kerberos security on your Hadoop cluster, make sure you perform the installation
and configuration steps in the topic on Configuring Hadoop Security in the CDH4 Security Guide or the CDH 5
Security Guide. Also note that when Kerberos security is enabled in Impala, a web browser that supports Kerberos
HTTP SPNEGO is required to access the Impala web console (for example, Firefox, Internet Explorer, or Chrome).
If the NameNode, Secondary NameNode, DataNode, JobTracker, TaskTrackers, ResourceManager, NodeManagers,
HttpFS, Oozie, Impala, or Impala statestore services are configured to use Kerberos HTTP SPNEGO authentication,
and two or more of these services are running on the same host, then all of the running services must use the
same HTTP principal and keytab file used for their HTTP endpoints.

Configuring Impala to Support Kerberos Security


Enabling Kerberos authentication for Impala involves steps that can be summarized as follows:
Creating service principals for Impala and the HTTP service. Principal names take the form:
serviceName/[email protected]

Creating, merging, and distributing key tab files for these principals.
Editing /etc/default/impala (in cluster not managed by Cloudera Manager), or editing the Security settings
in the Cloudera Manager interface, to accommodate Kerberos authentication.

Enabling Kerberos for Impala


1. Create an Impala service principal, specifying the name of the OS user that the Impala daemons run under,
the fully qualified domain name of each node running impalad, and the realm name. For example:
$ kadmin
kadmin: addprinc -requires_preauth -randkey
impala/[email protected]

2. Create an HTTP service principal. For example:


kadmin: addprinc -randkey HTTP/[email protected]

Note: The HTTP component of the service principal must be uppercase as shown in the preceding
example.
3. Create keytab files with both principals. For example:
kadmin: xst -k impala.keytab impala/impala_host.example.com
kadmin: xst -k http.keytab HTTP/impala_host.example.com
kadmin: quit

4. Use ktutil to read the contents of the two keytab files and then write those contents to a new file. For
example:
$ ktutil
ktutil: rkt impala.keytab
ktutil: rkt http.keytab
ktutil: wkt impala-http.keytab
ktutil: quit



5. (Optional) Test that credentials in the merged keytab file are valid, and that the renew until date is in the
future. For example:
$ klist -e -k -t impala-http.keytab

6. Copy the impala-http.keytab file to the Impala configuration directory. Change the permissions to be only
read for the file owner and change the file owner to the impala user. By default, the Impala user and group
are both named impala. For example:
$ cp impala-http.keytab /etc/impala/conf
$ cd /etc/impala/conf
$ chmod 400 impala-http.keytab
$ chown impala:impala impala-http.keytab

7. Add Kerberos options to the Impala defaults file, /etc/default/impala. Add the options for both the
impalad and statestored daemons, using the IMPALA_SERVER_ARGS and IMPALA_STATE_STORE_ARGS
variables. For example, you might add:
-kerberos_reinit_interval=60
-principal=impala_1/[email protected]
-keytab_file=/var/run/cloudera-scm-agent/process/3212-impala-IMPALAD/impala.keytab

For more information on changing the Impala defaults specified in /etc/default/impala, see Modifying
Impala Startup Options.
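Putting these options together, the resulting entries in /etc/default/impala might look like the following
sketch; the principal, realm, and keytab path are placeholders for the values created in the earlier steps:
IMPALA_SERVER_ARGS=" \
    -kerberos_reinit_interval=60 \
    -principal=impala/[email protected] \
    -keytab_file=/etc/impala/conf/impala-http.keytab \
    ..."
IMPALA_STATE_STORE_ARGS=" \
    -kerberos_reinit_interval=60 \
    -principal=impala/[email protected] \
    -keytab_file=/etc/impala/conf/impala-http.keytab \
    ..."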
Note: Restart impalad and statestored for these configuration changes to take effect.

Enabling Kerberos for Impala with a Proxy Server


A common configuration for Impala with High Availability is to use a proxy server to submit requests to the
actual impalad daemons on different hosts in the cluster. This configuration avoids connection problems in
case of machine failure, because the proxy server can route new requests through one of the remaining hosts
in the cluster. This configuration also helps with load balancing, because the additional overhead of being the
coordinator node for each query is spread across multiple hosts.
Although you can set up a proxy server with or without Kerberos authentication, typically users set up a secure
Kerberized configuration. For information about setting up a proxy server for Impala, including Kerberos-specific
steps, see Using Impala through a Proxy for High Availability.

Using a Web Browser to Access a URL Protected by Kerberos HTTP SPNEGO


Your web browser must support Kerberos HTTP SPNEGO (for example, Chrome, Firefox, or Internet Explorer).
To configure Firefox to access a URL protected by Kerberos HTTP SPNEGO:
1. Open the advanced settings Firefox configuration page by loading the about:config page.
2. Use the Filter text box to find network.negotiate-auth.trusted-uris.
3. Double-click the network.negotiate-auth.trusted-uris preference and enter the hostname or the
domain of the web server that is protected by Kerberos HTTP SPNEGO. Separate multiple domains and
hostnames with a comma.
4. Click OK.

Enabling LDAP Authentication for Impala


Authentication is the process of allowing only specified named users to access the server (in this case, the Impala
server). This feature is crucial for any production deployment, to prevent misuse, tampering, or excessive load
on the server. Impala uses LDAP for authentication, verifying the credentials of each user who connects through
impala-shell, Hue, a Business Intelligence tool, a JDBC or ODBC application, and so on.
Note: Regardless of the authentication mechanism used, Impala always creates HDFS directories
and data files owned by the same user (typically impala). To implement user-level access to different
databases, tables, columns, partitions, and so on, use the Sentry authorization feature, as explained
in Enabling Sentry Authorization for Impala on page 110.
Versions:
Authentication against LDAP servers is available in Impala 1.2.2 and higher. Impala 1.4.0 adds support for secure
LDAP authentication through SSL and TLS.
Other aspects of authentication:
Only client->Impala connections can be authenticated by LDAP. Kerberos is the only authentication mechanism
for connections between internal components, such as between the Impala, statestore, and catalog daemons.
See Enabling Kerberos Authentication for Impala on page 122 for how to set up Kerberos for Impala.
Server-side LDAP setup:
These requirements apply on the server side when configuring and starting Impala:
To enable LDAP authentication, set the following startup options for impalad:
--enable_ldap_auth enables LDAP-based authentication between the client and Impala.
--ldap_uri sets the URI of the LDAP server to use. Typically, the URI is prefixed with ldap://. In Impala
1.4.0 and higher, you can specify secure SSL-based LDAP transport by using the prefix ldaps://. The URI
can optionally specify the port, for example: ldap://ldap_server.cloudera.com:389 or
ldaps://ldap_server.cloudera.com:636. (389 and 636 are the default ports for non-SSL and SSL LDAP
connections, respectively.)
For ldaps:// connections secured by SSL, --ldap_ca_certificate="/path/to/certificate/pem"
specifies the location of the certificate in standard .PEM format. Store this certificate on the local filesystem,
in a location that only the impala user and other trusted users can read.
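
Putting these flags together, the following is a hedged sketch of how they might be appended to
IMPALA_SERVER_ARGS in /etc/default/impala on an unmanaged cluster; the LDAP URI and certificate path are
placeholders:

# Hedged sketch: enable LDAP authentication over SSL for client connections.
IMPALA_SERVER_ARGS="${IMPALA_SERVER_ARGS} \
    --enable_ldap_auth=true \
    --ldap_uri=ldaps://ldap_server.cloudera.com:636 \
    --ldap_ca_certificate=/etc/impala/ldap_ca.pem"
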
Support for custom bind strings:
When Impala connects to LDAP it issues a bind call to the LDAP server to authenticate as the connected user.
Impala clients, including the Impala shell, provide the short name of the user to Impala. This is necessary so
that Impala can use Sentry for role-based access, which uses short names.
However, LDAP servers often require more complex, structured usernames for authentication. Impala supports
three ways of transforming the short name (for example, 'henry') to a more complicated string. If necessary,
specify one of the following configuration options when starting the impalad daemon on each data node:
--ldap_domain: Replaces the username with a string username@ldap_domain.
--ldap_baseDN: Replaces the username with a distinguished name (DN) of the form:
uid=userid,ldap_baseDN. (This is equivalent to a Hive option).
--ldap_bind_pattern: This is the most general option, and replaces the username with the string
ldap_bind_pattern where all instances of the string #UID are replaced with userid. For example, an
ldap_bind_pattern of "user=#UID,OU=foo,CN=bar" with a username of henry will construct a bind
name of "user=henry,OU=foo,CN=bar".
These options are mutually exclusive; Impala does not start if more than one of these options is specified.
Secure LDAP connections:
To avoid sending credentials over the wire in cleartext, you must configure a secure connection between both
the client and Impala, and between Impala and the LDAP server. The secure connection could use SSL or TLS.
Secure LDAP connections through SSL:
For SSL-enabled LDAP connections, specify a prefix of ldaps:// instead of ldap://. Also, the default port for
SSL-enabled LDAP connections is 636 instead of 389.
Secure LDAP connections through TLS:
TLS, the successor to the SSL protocol, is supported by most modern LDAP servers. Unlike SSL connections, TLS
connections can be made on the same server port as non-TLS connections. To secure all connections using TLS,
specify the following flags as startup options to the impalad daemon:
--ldap_tls tells Impala to start a TLS connection to the LDAP server, and to fail authentication if it cannot
be done.
--ldap_ca_certificate="/path/to/certificate/pem" specifies the location of the certificate in standard
.PEM format. Store this certificate on the local filesystem, in a location that only the impala user and other
trusted users can read.
LDAP authentication for impala-shell interpreter:
To connect to Impala using LDAP authentication, you specify command-line options to the impala-shell
command interpreter and enter the password when prompted:
-l enables LDAP authentication.
-u sets the user. Per Active Directory, the user is the short user name, not the full LDAP distinguished name.
If your LDAP settings include a search base, use the --ldap_bind_pattern on the impalad daemon to
translate the short user name from impala-shell automatically to the fully qualified name.
impala-shell automatically prompts for the password.
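
For example, a hedged invocation might look like the following; the host name is a placeholder and 21000 is the
default impalad port used by impala-shell:

# Connect with LDAP authentication as the short user name "henry"; a password prompt follows.
$ impala-shell -l -u henry -i impala_host.example.com:21000
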
For the full list of available impala-shell options, see impala-shell Command-Line Options.
LDAP authentication for JDBC applications: See Configuring Impala to Work with JDBC for the format to use with
the JDBC connection string for servers using LDAP authentication.
Restrictions:
The LDAP support is preliminary. It currently has only been tested against Active Directory.

Using Multiple Authentication Methods with Impala


If some nodes in your cluster authenticate through Kerberos while others use LDAP authentication, you can
configure your network load balancer to forward both kinds of requests to a data node that is set up
with the appropriate authentication type. Once the initial request is made using either Kerberos or LDAP
authentication, Impala automatically handles the process of coordinating the work across multiple nodes and
transmitting intermediate results back to the coordinator node.
This technique is most suitable for larger clusters, where you are already using load balancing software for high
availability. You configure Impala to run on a different port on the nodes configured for LDAP. Then you configure
the load balancing software to forward Kerberos connection requests to nodes using the default port, and LDAP
connection requests to nodes using an alternative port for LDAP. Consult the documentation for your load
balancing software for how to configure that type of forwarding.

Auditing Impala Operations


To monitor how Impala data is being used within your organization, ensure that your Impala authorization and
authentication policies are effective, and detect attempts at intrusion or unauthorized access to Impala data,
you can use the auditing feature in Impala 1.1.1 and higher:
Enable auditing by including the option -audit_event_log_dir=directory_path in your impalad startup
options. The path refers to a local directory on the server, not an HDFS directory.
Decide how many queries will be represented in each log file. By default, Impala starts a new log file every
5000 queries. To specify a different number, include the option
-max_audit_event_log_file_size=number_of_queries in the impalad startup options, as shown in the
sketch after this list. Limiting the size lets you manage disk space by archiving older logs, and reduce the
amount of text to process when analyzing activity for a particular period.
Configure the Cloudera Navigator product to collect and consolidate the audit logs from all the nodes in the
cluster.
Use the Cloudera Manager product to filter, visualize, and produce reports based on the audit data. (The
Impala auditing feature works with Cloudera Manager 4.7 or higher.) Check the audit data to ensure that all
activity is authorized and/or detect attempts at unauthorized access.
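
Putting the startup flags from the first two items together, the following is a hedged sketch of how they might
be added to IMPALA_SERVER_ARGS in /etc/default/impala; the audit directory path is a placeholder:

# Hedged sketch: enable audit logging and roll the log every 5000 queries.
IMPALA_SERVER_ARGS="${IMPALA_SERVER_ARGS} \
    -audit_event_log_dir=/var/log/impala/audit \
    -max_audit_event_log_file_size=5000"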

Durability and Performance Considerations for Impala Auditing


The auditing feature only imposes performance overhead while auditing is enabled.
Because any Impala node can process a query, enable auditing on all nodes where the impalad daemon runs.
Each node stores its own log files, in a directory in the local filesystem. The log data is periodically flushed to
disk (through an fsync() system call) to avoid loss of audit data in case of a crash.
The runtime overhead of auditing applies to whichever node serves as the coordinator for the query, that is, the
node you connect to when you issue the query. This might be the same node for all queries, or different
applications or users might connect to and issue queries through different nodes.
To avoid excessive I/O overhead on busy coordinator nodes, Impala syncs the audit log data (using the fsync()
system call) periodically rather than after every query. Currently, the fsync() calls are issued at a fixed interval,
every 5 seconds.
By default, Impala avoids losing any audit log data in the case of an error during a logging operation (such as a
disk full error), by immediately shutting down the impalad daemon on the node where the auditing problem
occurred. You can override this setting by specifying the option -abort_on_failed_audit_event=false in
the impalad startup options.

Format of the Audit Log Files


The audit log files represent the query information in JSON format, one query per line. Typically, rather than
looking at the log files themselves, you use the Cloudera Navigator product to consolidate the log data from all
Impala nodes and filter and visualize the results in useful ways. (If you do examine the raw log data, you might
run the files through a JSON pretty-printer first.)
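
For example, because each line is a separate JSON record, one hedged way to inspect a single entry is to
pretty-print one line at a time; the audit log file name below is hypothetical:

# Pretty-print the first audit record in a (hypothetical) audit log file.
$ head -1 /var/log/impala/audit/impala_audit_event_log_1.0-1234567890 | python -m json.tool
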
All the information about schema objects accessed by the query is encoded in a single nested record on the
same line. For example, the audit log for an INSERT ... SELECT statement records that a select operation
occurs on the source table and an insert operation occurs on the destination table. The audit log for a query
against a view records the base table accessed by the view, or multiple base tables in the case of a view that
includes a join query. Every Impala operation that corresponds to a SQL statement is recorded in the audit logs,
whether the operation succeeds or fails. Impala records more information for a successful operation than for a
failed one, because an unauthorized query is stopped immediately, before all the query planning is completed.
The information logged for each query includes:
Client session state:
Session ID
User name
Network address of the client connection
SQL statement details:

Query ID
Statement Type - DML, DDL, and so on
SQL statement text
Execution start time, in local time
Execution Status - Details on any errors that were encountered
Target Catalog Objects:
Object Type - Table, View, or Database
Fully qualified object name
Privilege - How the object is being used (SELECT, INSERT, CREATE, and so on)

Which Operations Are Audited


The kinds of SQL queries represented in the audit log are:
Queries that are prevented due to lack of authorization.
Queries that Impala can analyze and parse to determine that they are authorized. The audit data is recorded
immediately after Impala finishes its analysis, before the query is actually executed.
The audit log does not contain entries for queries that could not be parsed and analyzed. For example, a query
that fails due to a syntax error is not recorded in the audit log. The audit log also does not contain queries that
fail due to a reference to a table that does not exist, if you would be authorized to access the table if it did exist.
Certain statements in the impala-shell interpreter, such as CONNECT, SUMMARY, PROFILE, SET, and QUIT, do
not correspond to actual SQL queries, and these statements are not reflected in the audit log.

Reviewing the Audit Logs


You typically do not review the audit logs in raw form. The Cloudera Manager agent periodically transfers the
log information into a back-end database where it can be examined in consolidated form. See the Cloudera
Navigator documentation for details.

Hive Security Configuration


Here is a summary of the status of Hive security in CDH 5:
Sentry enables role-based, fine-grained authorization for HiveServer2. See Sentry Policy File Configuration
on page 43.
HiveServer2 supports authentication of the Thrift client using Kerberos or user/password validation backed
by LDAP. For configuration instructions, see HiveServer2 Security Configuration.
Earlier versions of HiveServer do not support Kerberos authentication for clients. However, the Hive
MetaStoreServer does support Kerberos authentication for Thrift clients. For configuration instructions, see
Hive MetaStoreServer Security Configuration.
See also: Using Hive to Run Queries on a Secure HBase Server on page 137

HiveServer2 Security Configuration


HiveServer2 supports authentication of the Thrift client using either of these methods:
Kerberos authentication
LDAP authentication
If Kerberos authentication is used, authentication is supported between the Thrift client and HiveServer2, and
between HiveServer2 and secure HDFS. If LDAP authentication is used, authentication is supported only between
the Thrift client and HiveServer2.

Enabling Kerberos Authentication for HiveServer2


If you configure HiveServer2 to use Kerberos authentication, HiveServer2 acquires a Kerberos ticket during
start-up. HiveServer2 requires a principal and keytab file specified in the configuration. The client applications
(for example JDBC or Beeline) must get a valid Kerberos ticket before initiating a connection to HiveServer2.
Configuring HiveServer2 for Kerberos-Secured Clusters
To enable Kerberos Authentication for HiveServer2, add the following properties in the
/etc/hive/conf/hive-site.xml file:
<property>
<name>hive.server2.authentication</name>
<value>KERBEROS</value>
</property>
<property>
<name>hive.server2.authentication.kerberos.principal</name>
<value>hive/[email protected]</value>
</property>
<property>
<name>hive.server2.authentication.kerberos.keytab</name>
<value>/etc/hive/conf/hive.keytab</value>
</property>

where:
hive.server2.authentication, in particular, is a client-facing property that controls the type of
authentication HiveServer2 uses for connections to clients. In this case, HiveServer2 uses Kerberos to
authenticate incoming clients.
The [email protected] value in the example above is the Kerberos principal for the host where
HiveServer2 is running. The special string _HOST in the properties is replaced at run-time by the fully-qualified
domain name of the host machine where the daemon is running. This requires that reverse DNS is properly
working on all the hosts configured this way. Replace YOUR-REALM.COM with the name of the Kerberos realm
your Hadoop cluster is in.
The /etc/hive/conf/hive.keytab value in the example above is a keytab file for that principal.
If you configure HiveServer2 to use both Kerberos authentication and secure impersonation, JDBC clients and
Beeline can specify an alternate session user. If these clients have proxy user privileges, HiveServer2 will
impersonate the alternate user instead of the one connecting. The alternate user can be specified by the JDBC
connection string proxyUser=userName
Configuring JDBC Clients for Kerberos Authentication with HiveServer2
JDBC-based clients must include principal=<hive.server2.authentication.principal> in the JDBC
connection string. For example:
String url =
"jdbc:hive2://node1:10000/default;principal=hive/[email protected]"
Connection con = DriverManager.getConnection(url);

where hive is the principal configured in hive-site.xml and HiveServer2Host is the host where HiveServer2
is running.
For ODBC clients, refer to the Cloudera ODBC Driver for Apache Hive documentation.
Using Beeline to Connect to a Secure HiveServer2
Use the following command to start beeline and connect to a secure running HiveServer2 process. In this
example, the HiveServer2 process is running on localhost at port 10000:
$ /usr/lib/hive/bin/beeline
beeline> !connect
jdbc:hive2://localhost:10000/default;principal=hive/[email protected]
0: jdbc:hive2://localhost:10000/default>

For more information about the Beeline CLI, see Using the Beeline CLI.

Encrypted Communication with Client Drivers


With Kerberos or LDAP authentication enabled, traffic between the Hive JDBC or ODBC drivers and HiveServer2
can be encrypted which allows you to preserve data integrity (using checksums to validate message integrity)
and confidentiality (by encrypting messages). This can be enabled by setting the
hive.server2.thrift.sasl.qop property in hive-site.xml. For example,
<property>
<name>hive.server2.thrift.sasl.qop</name>
<value>auth</value>
<description>Sasl QOP value; one of 'auth', 'auth-int' and 'auth-conf'</description>
</property>

Valid settings for the value field are:


auth: Authentication only (default)
auth-int: Authentication with integrity protection
auth-conf: Authentication with confidentiality protection
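
As a hedged illustration, with the server configured for auth-conf, a Beeline client could request the matching
QOP level by appending sasl.qop to the connection URL; the host, port, and principal below are placeholders:

beeline> !connect jdbc:hive2://hs2host.example.com:10000/default;principal=hive/[email protected];sasl.qop=auth-conf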

Configuring Encrypted Client/Server Communication for non-Kerberos HiveServer2 Connections


For non-Kerberos connections, you can configure Secure Socket Layer (SSL) communication between HiveServer2
and clients.

To enable server side support, add the following configuration parameters to hive-site.xml:
<property>
<name>hive.server2.use.SSL</name>
<value>true</value>
<description>enable/disable SSL </description>
</property>
<property>
<name>hive.server2.keystore.path</name>
<value>keystore-file-path</value>
<description>path to keystore file</description>
</property>
<property>
<name>hive.server2.keystore.password</name>
<value>keystore-file-password</value>
<description>keystore password</description>
</property>

The keystore must contain the server's certificate.


The JDBC client must add the following properties in the connection URL when connecting to a HiveServer2
using SSL:
;ssl=true[;sslTrustStore=<Trust-Store-Path>;trustStorePassword=<Trust-Store-password>]

Make sure one of the following is true:


Either: sslTrustStore points to the trust store file containing the server's certificate; for example:
jdbc:hive2://localhost:10000/default;ssl=true;\
sslTrustStore=/home/usr1/ssl/trust_store.jks;trustStorePassword=xyz

or: the Trust Store arguments are set using the Java system properties javax.net.ssl.trustStore
and javax.net.ssl.trustStorePassword; for example:
java -Djavax.net.ssl.trustStore=/home/usr1/ssl/trust_store.jks
-Djavax.net.ssl.trustStorePassword=xyz \
MyClass jdbc:hive2://localhost:10000/default;ssl=true

For more information on using self-signed certificates and the Trust Store, see the Oracle Java SE keytool page.

Using LDAP Username/Password Authentication with HiveServer2


As an alternative to Kerberos authentication, you can configure HiveServer2 to use user and password validation
backed by LDAP. In this case, the client sends a user name and password during the connection initiation.
HiveServer2 validates these credentials using an external LDAP service.
You can enable LDAP Authentication with HiveServer2 using Active Directory or OpenLDAP.
Important: When using LDAP username/password authentication with HiveServer2, make sure you
have enabled encrypted communication between HiveServer2 and its client drivers to avoid sending
cleartext passwords. For instructions, see Encrypted Communication with Client Drivers on page 130.
Also see Configuring LDAPS Authentication with HiveServer2 on page 132.

Enabling LDAP Authentication with HiveServer2 using Active Directory
To enable the LDAP mode of authentication using Active Directory, include the following properties in the
hive-site.xml file:
<property>
<name>hive.server2.authentication</name>
<value>LDAP</value>
</property>
<property>
<name>hive.server2.authentication.ldap.url</name>
<value>LDAP_URL</value>
</property>
<property>
<name>hive.server2.authentication.ldap.Domain</name>
<value>DOMAIN</value>
</property>

where:
The LDAP_URL value is the access URL for your LDAP server. For example, ldap://[email protected].
Enabling LDAP Authentication with HiveServer2 using OpenLDAP
To enable the LDAP mode of authentication using OpenLDAP, include the following properties in the
hive-site.xml file:
<property>
<name>hive.server2.authentication</name>
<value>LDAP</value>
</property>
<property>
<name>hive.server2.authentication.ldap.url</name>
<value>LDAP_URL</value>
</property>
<property>
<name>hive.server2.authentication.ldap.baseDN</name>
<value>LDAP_BaseDN</value>
</property>

where:
The LDAP_URL value is the access URL for your LDAP server.
The LDAP_BaseDN value is the base LDAP DN for your LDAP server. For example,
ou=People,dc=example,dc=com.
Configuring JDBC Clients for LDAP Authentication with HiveServer2
The JDBC client needs to use a connection URL as shown below. JDBC-based clients must include user=LDAP_Userid;password=LDAP_Password in the JDBC connection string.
For example:
String url = "jdbc:hive2://node1:10000/default;user=LDAP_Userid;password=LDAP_Password"
Connection con = DriverManager.getConnection(url);

where the LDAP_Userid value is the user id and LDAP_Password is the password of the client user.
For ODBC clients, refer to the Cloudera ODBC Driver for Apache Hive documentation.

Configuring LDAPS Authentication with HiveServer2


HiveServer2 supports LDAP username/password authentication for clients. Clients send LDAP credentials to
HiveServer2 which in turn verifies them with the configured LDAP provider such as OpenLDAP or Microsoft's
Active Directory. Most vendors now support LDAPS (LDAP over SSL), an authentication protocol that uses SSL
to encrypt communication between the LDAP service and its client (in this case, HiveServer2) to avoid sending
LDAP credentials in cleartext.
Perform the following steps to configure the LDAPS service with HiveServer2:
Import either the SSL certificate of the Certificate Authority that issued the LDAP server's certificate into a
local truststore, or import the LDAP server's SSL certificate for a specific trust. If you import the CA
certificate, HiveServer2 will trust any server with a certificate issued by the LDAP server's CA. If you import
only the SSL certificate for a specific trust, HiveServer2 will trust only that server. In both cases, the
certificate must be imported on the same host as HiveServer2. Refer to the keytool documentation for more details.
Make sure the truststore file is readable by the hive user.
Set the hive.server2.authentication.ldap.url configuration property in hive-site.xml to the LDAPS
URL. For example, ldaps://sample.myhost.com.
Note: The URL scheme should be ldaps and not ldap.
Set the environment variable HADOOP_OPTS as follows:
HADOOP_OPTS="-Djavax.net.ssl.trustStore=<trustStore-file-path>
-Djavax.net.ssl.trustStorePassword=<trustStore-password>"

For clusters managed by Cloudera Manager, go to the Hive service and select Configuration > View and Edit.
Under the HiveServer2 category, go to the Advanced section and set the HiveServer2 Environment Safety
Valve property.
Restart HiveServer2.

Pluggable Authentication
Pluggable authentication allows you to provide a custom authentication provider for HiveServer2.
To enable pluggable authentication:
1. Set the following properties in /etc/hive/conf/hive-site.xml:
<property>
<name>hive.server2.authentication</name>
<value>CUSTOM</value>
<description>Client authentication types.
NONE: no authentication check
LDAP: LDAP/AD based authentication
KERBEROS: Kerberos/GSSAPI authentication
CUSTOM: Custom authentication provider
(Use with property hive.server2.custom.authentication.class)
</description>
</property>
<property>
<name>hive.server2.custom.authentication.class</name>
<value>pluggable-auth-class-name</value>
<description>
Custom authentication class. Used when property
'hive.server2.authentication' is set to 'CUSTOM'. Provided class
must be a proper implementation of the interface
org.apache.hive.service.auth.PasswdAuthenticationProvider. HiveServer2
will call its Authenticate(user, password) method to authenticate requests.
The implementation may optionally extend the Hadoop's
org.apache.hadoop.conf.Configured class to grab Hive's Configuration object.
</description>
</property>

2. Make the class available in the CLASSPATH of HiveServer2.

Trusted Delegation with HiveServer2
HiveServer2 determines the identity of the connecting user from the underlying authentication subsystem
(Kerberos or LDAP). Any new session started for this connection runs on behalf of this connecting user. If the
server is configured to proxy the user at the Hadoop level, then all MapReduce jobs and HDFS accesses will be
performed with the identity of the connecting user. If Apache Sentry is configured, then this connecting userid
can also be used to verify access rights to underlying tables, views and so on.
In CDH 4.5, a connecting user (for example, hue) with Hadoop-level superuser privileges, can request an alternate
user for the given session. HiveServer2 will check if the connecting user has Hadoop-level privileges to proxy
the requested userid (for example, bob). If it does, then the new session will be run on behalf of the alternate
user, bob, requested by connecting user, hue.
To specify an alternate user for new connections, the JDBC client needs to add the
hive.server2.proxy.user=<alternate_user_id> property to the JDBC connection URL. Note that the
connecting user needs to have Hadoop-level proxy privileges over the alternate user. For example, if user hue
requests access to run a session as user bob, the JDBC connection string should be as follows:
# Login as super user Hue
kinit hue -k -t hue.keytab [email protected]
# Connect using following JDBC connection string
#
jdbc:hive2://myHost.myOrg.com:10000/default;principal=hive/[email protected];hive.server2.proxy.user=bob

HiveServer2 Impersonation
Note: This is not the recommended method to implement HiveServer2 impersonation. Cloudera
recommends you use Sentry to implement this instead.
Impersonation support in HiveServer2 allows users to execute queries and access HDFS files as the connected
user rather than the super user who started the HiveServer2 daemon. Impersonation allows admins to enforce
an access policy at the file level using HDFS file and directory permissions.
To enable impersonation in HiveServer2:
1. Add the following property to the /etc/hive/conf/hive-site.xml file and set the value to true. (The
default value is false.)
<property>
<name>hive.server2.enable.impersonation</name>
<description>Enable user impersonation for HiveServer2</description>
<value>true</value>
</property>

2. In HDFS or MapReduce configurations, add the following property to the core-site.xml file:
<property>
<name>hadoop.proxyuser.hive.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hive.groups</name>
<value>*</value>
</property>

See also File System Permissions.

Securing the Hive Metastore
Note: This is not the recommended method to protect the Hive Metastore. Cloudera recommends
you use Sentry to implement this instead.
To prevent users from accessing the Hive metastore and the Hive metastore database using any method other
than through HiveServer2, the following actions are recommended:
Add a firewall rule on the metastore service host to allow access to the metastore port only from the
HiveServer2 host. You can do this using iptables; see the sketch after this list.
Grant access to the metastore database only from the metastore service host. This is specified for MySQL
as:
GRANT ALL PRIVILEGES ON metastore.* TO 'hive'@'metastorehost';

where metastorehost is the host where the metastore service is running.


Make sure users who are not admins cannot log on to the host on which HiveServer2 runs.
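
The following is a hedged sketch of iptables rules for the firewall recommendation above, run on the metastore
host; 9083 is the default metastore port and the HiveServer2 host name is a placeholder:

# Allow metastore traffic only from the HiveServer2 host, then drop everything else to that port.
iptables -A INPUT -p tcp -s hiveserver2.example.com --dport 9083 -j ACCEPT
iptables -A INPUT -p tcp --dport 9083 -j DROP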

Disabling the Hive Security Configuration


Hive's security related metadata is stored in the configuration file hive-site.xml. The following sections
describe how to disable security for the Hive service.
Disable Client/Server Authentication
To disable client/server authentication, set hive.server2.authentication to NONE. For example,
<property>
<name>hive.server2.authentication</name>
<value>NONE</value>
<description>
Client authentication types.
NONE: no authentication check
LDAP: LDAP/AD based authentication
KERBEROS: Kerberos/GSSAPI authentication
CUSTOM: Custom authentication provider
(Use with property hive.server2.custom.authentication.class)
</description>
</property>

Disable Hive Metastore security


To disable Hive Metastore security, perform the following steps:
Set the hive.metastore.sasl.enabled property to false in all configurations, on the metastore service side
as well as for all clients of the metastore. For example, these clients might include HiveServer2, Impala, Pig,
and so on.
Remove or comment out the following parameters in hive-site.xml for the metastore service. Note that this
is a server-only change.
hive.metastore.kerberos.keytab.file
hive.metastore.kerberos.principal
Disable Underlying Hadoop Security
If you also want to disable the underlying Hadoop security, remove or comment out the following parameters
in hive-site.xml.
hive.server2.authentication.kerberos.keytab
hive.server2.authentication.kerberos.principal

Hive Metastore Server Security Configuration


Important:
This section describes how to configure security for the Hive metastore server. If you are using
HiveServer2, see HiveServer2 Security Configuration.
Here is a summary of Hive metastore server security in CDH 5:
No additional configuration is required to run Hive on top of a security-enabled Hadoop cluster in standalone
mode using a local or embedded metastore.
HiveServer does not support Kerberos authentication for clients. While it is possible to run HiveServer with
a secured Hadoop cluster, doing so creates a security hole because HiveServer does not authenticate the Thrift
clients that connect to it. Instead, use HiveServer2; see HiveServer2 Security Configuration.
The Hive metastore server supports Kerberos authentication for Thrift clients. For example, you can configure
a standalone Hive metastore server instance to force clients to authenticate with Kerberos by setting the
following properties in the hive-site.xml configuration file used by the metastore server:
<property>
<name>hive.metastore.sasl.enabled</name>
<value>true</value>
<description>If true, the metastore thrift interface will be secured with SASL.
Clients must authenticate with Kerberos.</description>
</property>
<property>
<name>hive.metastore.kerberos.keytab.file</name>
<value>/etc/hive/conf/hive.keytab</value>
<description>The path to the Kerberos Keytab file containing the metastore thrift
server's service principal.</description>
</property>
<property>
<name>hive.metastore.kerberos.principal</name>
<value>hive/[email protected]</value>
<description>The service principal for the metastore thrift server. The special
string _HOST will be replaced automatically with the correct host
name.</description>
</property>

Note:
The values shown above for the hive.metastore.kerberos.keytab.file and
hive.metastore.kerberos.principal properties are examples which you will need to replace
with the appropriate values for your cluster. Also note that the Hive keytab file should have its
access permissions set to 600 and be owned by the same account that is used to run the Metastore
server, which is the hive user by default.
Requests to access the metadata are fulfilled by the Hive metastore impersonating the requesting user. This
includes read access to the list of databases, tables, properties of each table such as their HDFS location, file
type and so on. You can restrict access to the Hive metastore service by allowing it to impersonate only a
subset of Kerberos users. This can be done by setting the hadoop.proxyuser.hive.groups property in
core-site.xml on the Hive metastore host.
For example, if you want to give the hive user permission to impersonate members of groups hive and
user1:
<property>
<name>hadoop.proxyuser.hive.groups</name>

<value>hive,user1</value>
</property>

In this example, the Hive metastore can impersonate users belonging to only the hive and user1 groups.
Connection requests from users not belonging to these groups will be rejected.

Using Hive to Run Queries on a Secure HBase Server


To use Hive to run queries on a secure HBase Server, you must set the following HIVE_OPTS environment variable:
env HIVE_OPTS="-hiveconf hbase.security.authentication=kerberos -hiveconf
hbase.rpc.engine=org.apache.hadoop.hbase.ipc.SecureRpcEngine -hiveconf
hbase.master.kerberos.principal=hbase/[email protected] -hiveconf
hbase.regionserver.kerberos.principal=hbase/[email protected] -hiveconf
hbase.zookeeper.quorum=zookeeper1,zookeeper2,zookeeper3" hive

where:
You replace YOUR-REALM with the name of your Kerberos realm
You replace zookeeper1,zookeeper2,zookeeper3 with the names of your ZooKeeper servers. The
hbase.zookeeper.quorum property is configured in the hbase-site.xml file.
The special string _HOST is replaced at run-time by the fully-qualified domain name of the host machine
where the HBase Master or Region Server is running. This requires that reverse DNS is properly working on
all the hosts configured this way.
In the following, _HOST is the name of the host where the HBase Master is running:
-hiveconf hbase.master.kerberos.principal=hbase/[email protected]

In the following, _HOST is the host name of the HBase Region Server that the application is connecting to:
-hiveconf hbase.regionserver.kerberos.principal=hbase/[email protected]

Tip:
You can also set the HIVE_OPTS environment variable in your shell profile.

HCatalog Security Configuration


This section describes how to configure HCatalog in CDH 5 with Kerberos security in a Hadoop cluster:

Before You Start on page 139


Step 1: Create the HTTP keytab file on page 139
Step 2: Configure WebHCat to Use Security on page 139
Step 3: Create Proxy Users on page 140
Step 4: Verify the Configuration on page 140

For more information about HCatalog see Installing and Using HCatalog.

Before You Start


Secure Web HCatalog requires a running remote Hive metastore service configured in secure mode. See Hive
MetaStoreServer Security Configuration for instructions. Running secure WebHCat with an embedded repository
is not supported.

Step 1: Create the HTTP keytab file


You need to create a keytab file for WebHCat. Follow these steps:
1. Create the file:
kadmin: addprinc -randkey HTTP/[email protected]
kadmin: xst -k HTTP.keytab HTTP/fully.qualified.domain.name

2. Move the file into the WebHCat configuration directory and restrict its access exclusively to the hcatalog
user:
$ mv HTTP.keytab /etc/webhcat/conf/
$ chown hcatalog /etc/webhcat/conf/HTTP.keytab
$ chmod 400 /etc/webhcat/conf/HTTP.keytab

Step 2: Configure WebHCat to Use Security


Create or edit the WebHCat configuration file webhcat-site.xml in the configuration directory and set the
following properties:
Property                        Value
templeton.kerberos.secret       Any random value
templeton.kerberos.keytab       /etc/webhcat/conf/HTTP.keytab
templeton.kerberos.principal    HTTP/[email protected]

Example configuration:
<property>
<name>templeton.kerberos.secret</name>
<value>SuPerS3c3tV@lue!</value>

</property>
<property>
<name>templeton.kerberos.keytab</name>
<value>/etc/webhcat/conf/HTTP.keytab</value>
</property>
<property>
<name>templeton.kerberos.principal</name>
<value>HTTP/[email protected]</value>
</property>

Step 3: Create Proxy Users


WebHCat needs access to your NameNode in order to work properly, and so you must configure Hadoop to allow
impersonation from the hcatalog user. To do this, edit your core-site.xml configuration file and set the
hadoop.proxyuser.HTTP.hosts and hadoop.proxyuser.HTTP.groups properties to specify the hosts from
which HCatalog can do the impersonation and what users can be impersonated. You can use the value * for
"any".
Example configuration:
<property>
<name>hadoop.proxyuser.HTTP.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.HTTP.groups</name>
<value>*</value>
</property>

Step 4: Verify the Configuration


After restarting WebHCat you can verify that it is working by using curl (you may need to run kinit first):
$ curl --negotiate -i -u :
'http://fully.qualified.domain.name:50111/templeton/v1/ddl/database'

Llama Security Configuration


This section describes how to configure Llama in CDH 5 with Kerberos security in a Hadoop cluster.
Note: At this point Llama has been tested only in a Cloudera Manager deployment. For information
on using Cloudera Manager to configure Llama and Impala, see Installing Impala with Cloudera
Manager.

Configuring Llama to Support Kerberos Security


1. Create a Llama service user principal using the syntax:
llama/<fully.qualified.domain.name>@<YOUR-REALM>. This principal is used to authenticate with the
Hadoop cluster, where fully.qualified.domain.name is the host where Llama is running and YOUR-REALM
is the name of your Kerberos realm:

$ kadmin
kadmin: addprinc -randkey llama/fully.qualified.domain.name@<YOUR-REALM>

2. Create a keytab file with the Llama principal:


$ kadmin
kadmin: xst -k llama.keytab llama/fully.qualified.domain.name

3. Test that the credentials in the keytab file work. For example:
$ klist -e -k -t llama.keytab

4. Copy the llama.keytab file to the Llama configuration directory. The owner of the llama.keytab file should
be the llama user and the file should have owner-only read permissions.
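
For example (a hedged sketch; the Llama configuration directory path may differ in your environment):

$ cp llama.keytab /etc/llama/conf/
$ chown llama:llama /etc/llama/conf/llama.keytab
$ chmod 400 /etc/llama/conf/llama.keytab
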
5. Edit the Llama llama-site.xml configuration file in the Llama configuration directory by setting the following
properties:
Property                                                      Value
llama.am.server.thrift.security                               true
llama.am.server.thrift.kerberos.keytab.file                   llama/conf.keytab
llama.am.server.thrift.kerberos.server.principal.name         llama/<fully.qualified.domain.name>
llama.am.server.thrift.kerberos.notification.principal.name   impala

Important:
You must restart Llama to make the configuration changes take effect.

ZooKeeper Security Configuration


This section describes how to configure ZooKeeper in CDH 5 to enable Kerberos security:
Configuring the ZooKeeper Server to Support Kerberos Security on page 143
Configuring the ZooKeeper Client Shell to Support Kerberos Security on page 144
Verifying the Configuration on page 144
Important:
Prior to enabling ZooKeeper to work with Kerberos security on your cluster, make sure you first review
the requirements in Configuring Hadoop Security in CDH 5.

Configuring the ZooKeeper Server to Support Kerberos Security


Note:
It is strongly recommended that you ensure a properly functioning ZooKeeper ensemble prior to
enabling security. See ZooKeeper Installation.
1. Create a service principal for the ZooKeeper server using the syntax:
zookeeper/<fully.qualified.domain.name>@<YOUR-REALM>. This principal is used to authenticate the
ZooKeeper server with the Hadoop cluster, where fully.qualified.domain.name is the host where the
ZooKeeper server is running and YOUR-REALM is the name of your Kerberos realm.
kadmin: addprinc -randkey zookeeper/[email protected]

2. Create a keytab file for the ZooKeeper server.


$ kadmin
kadmin: xst -k zookeeper.keytab zookeeper/fully.qualified.domain.name

3. Copy the zookeeper.keytab file to the ZooKeeper configuration directory on the ZooKeeper server host.
For a package installation, the ZooKeeper configuration directory is /etc/zookeeper/conf/. For a tar ball
installation, the ZooKeeper configuration directory is <EXPANDED_DIR>/conf. The owner of the
zookeeper.keytab file should be the zookeeper user and the file should have owner-only read permissions.
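
For example, for a package installation (a hedged sketch):

$ cp zookeeper.keytab /etc/zookeeper/conf/
$ chown zookeeper:zookeeper /etc/zookeeper/conf/zookeeper.keytab
$ chmod 400 /etc/zookeeper/conf/zookeeper.keytab
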
4. Add the following lines to the ZooKeeper configuration file zoo.cfg:
authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
jaasLoginRenew=3600000

5. Set up the Java Authentication and Authorization Service (JAAS) by creating a jaas.conf file in the ZooKeeper
configuration directory containing the following settings. Make sure that you substitute
fully.qualified.domain.name as appropriate.
Server {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
keyTab="/etc/zookeeper/conf/zookeeper.keytab"
storeKey=true
useTicketCache=false
principal="zookeeper/fully.qualified.domain.name@<YOUR-REALM>";
};

6. Add the following setting to the java.env file located in the ZooKeeper configuration directory. (Create the
file if it does not already exist.)
export JVMFLAGS="-Djava.security.auth.login.config=/etc/zookeeper/conf/jaas.conf"

7. If you have multiple ZooKeeper servers in the ensemble, repeat steps 1 through 6 above for each ZooKeeper
server. When you create each new Zookeeper Server keytab file in step 2, you can overwrite the previous
keytab file and use the same name (zookeeper.keytab) to maintain consistency across the ZooKeeper
servers in the ensemble. The difference in the keytab files will be the hostname where each server is running.
8. Restart the ZooKeeper server to have the configuration changes take effect. For instructions, see ZooKeeper
Installation.

Configuring the ZooKeeper Client Shell to Support Kerberos Security


1. If you want to use the ZooKeeper client shell zookeeper-client with Kerberos authentication, create a
principal using the syntax: zkcli@<YOUR-REALM>. This principal is used to authenticate the ZooKeeper client
shell to the ZooKeeper service, where YOUR-REALM is the name of your Kerberos realm.
kadmin: addprinc -randkey [email protected]

2. Create a keytab file for the ZooKeeper client shell.


$ kadmin
kadmin: xst -norandkey -k zkcli.keytab [email protected]

Note:
Some versions of kadmin do not support the -norandkey option in the command above. If your
version does not, you can omit it from the command. Note that doing so will result in a new
password being generated every time you export a keytab, which will invalidate previously-exported
keytabs.
3. Set up JAAS in the configuration directory on the host where the ZooKeeper client shell is running. For a
package installation, the configuration directory is /etc/zookeeper/conf/. For a tar ball installation, the
configuration directory is <EXPANDED_DIR>/conf. Create a jaas.conf file containing the following settings:
Client {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
keyTab="/path/to/zkcli.keytab"
storeKey=true
useTicketCache=false
principal="zkcli@<YOUR-REALM>";
};

4. Add the following setting to the java.env file located in the configuration directory. (Create the file if it does
not already exist.)
export JVMFLAGS="-Djava.security.auth.login.config=/etc/zookeeper/conf/jaas.conf"

Verifying the Configuration


1. Make sure that you have restarted the ZooKeeper cluster with Kerberos enabled, as described above.

2. Start the client (where the hostname is the name of a ZooKeeper server):
zookeeper-client -server hostname:port

3. Create a protected znode from within the ZooKeeper CLI. Make sure that you substitute YOUR-REALM as
appropriate.
create /znode1 znode1data sasl:zkcli@{{YOUR-REALM}}:cdwra

4. Verify the znode is created and the ACL is set correctly:


getAcl /znode1

The results from getAcl should show that the proper scheme and permissions were applied to the znode.

Search Security Configuration


This section describes how to configure Search in CDH 5 to enable Kerberos security and Sentry.

Configuring Search to Use Kerberos


Cloudera Search supports Kerberos authentication. All necessary packages are installed when you install Search.
To enable Kerberos, create principals and keytabs and then modify default configurations.
The following instructions only apply to configuring Kerberos in an unmanaged environment. Kerberos
configuration is automatically handled by Cloudera Manager if you are using Search in a Cloudera managed
environment.
To create principals and keytabs
Repeat this process on all Solr server nodes.
1. Create a Solr service user principal using the syntax: solr/<fully.qualified.domain.name>@<YOUR-REALM>.
This principal is used to authenticate with the Hadoop cluster, where fully.qualified.domain.name is the
host where the Solr server is running and YOUR-REALM is the name of your Kerberos realm.
$ kadmin
kadmin: addprinc -randkey solr/[email protected]

2. Create an HTTP service user principal using the syntax:
HTTP/<fully.qualified.domain.name>@<YOUR-REALM>. This principal is used to authenticate user
requests coming to the Solr web services, where fully.qualified.domain.name is the host where the
Solr server is running and YOUR-REALM is the name of your Kerberos realm.
kadmin: addprinc -randkey HTTP/[email protected]

Note:
The HTTP/ component of the HTTP service user principal must be upper case as shown in the
syntax and example above.
3. Create keytab files with both principals.
kadmin: xst -norandkey -k solr.keytab solr/fully.qualified.domain.name \
HTTP/fully.qualified.domain.name

4. Test that credentials in the merged keytab file work. For example:
$ klist -e -k -t solr.keytab

5. Copy the solr.keytab file to the Solr configuration directory. The owner of the solr.keytab file should be
the solr user and the file should have owner-only read permissions.
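
For example (a hedged sketch, assuming /etc/solr/conf is the Solr configuration directory):

$ cp solr.keytab /etc/solr/conf/
$ chown solr:solr /etc/solr/conf/solr.keytab
$ chmod 400 /etc/solr/conf/solr.keytab
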
To modify default configurations
Repeat this process on all Solr server nodes.
1. Ensure that the following properties appear in /etc/default/solr and that they are uncommented. Modify
these properties to match your environment. The relevant properties to be uncommented and modified are:
SOLR_AUTHENTICATION_TYPE=kerberos
SOLR_AUTHENTICATION_SIMPLE_ALLOW_ANON=true
SOLR_AUTHENTICATION_KERBEROS_KEYTAB=/etc/solr/conf/solr.keytab
SOLR_AUTHENTICATION_KERBEROS_PRINCIPAL=HTTP/localhost@LOCALHOST
SOLR_AUTHENTICATION_KERBEROS_NAME_RULES=DEFAULT
SOLR_AUTHENTICATION_JAAS_CONF=/etc/solr/conf/jaas.conf

Note: Modify the values for these properties to match your environment. For example, the
SOLR_AUTHENTICATION_KERBEROS_PRINCIPAL=HTTP/localhost@LOCALHOST must include the
principal instance and Kerberos realm for your environment. That is often different from
localhost@LOCALHOST.
2. If using applications that use the solrj library, set up the Java Authentication and Authorization Service
(JAAS).
a. Create a jaas.conf file in the Solr configuration directory containing the following settings. This file and
its location must match the SOLR_AUTHENTICATION_JAAS_CONF value. Make sure that you substitute a
value for principal that matches your particular environment.
Client {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
useTicketCache=false
keyTab="/etc/solr/conf/solr.keytab"
principal="solr/fully.qualified.domain.name@<YOUR-REALM>";
};

3. To use short principal names, see Appendix C - Configuring the Mapping from Kerberos Principals to Short
Names in the CDH 5 Security Guide.

Using Kerberos
The process of enabling Solr clients to authenticate with a secure Solr is specific to the client. This section
demonstrates:
Using Kerberos and curl
Using solrctl
Configuring SolrJ Library Usage
This enables technologies including:
Command-line solutions
Java applications
The MapReduceIndexerTool
Configuring Flume Morphline Solr Sink Usage
Secure Solr requires that the CDH components that it interacts with are also secure. Secure Solr interacts with
HDFS, ZooKeeper and optionally HBase, MapReduce, and Flume. See the CDH 5 Security Guide or the CDH 4
Security Guide for more information.
Using Kerberos and curl
You can use Kerberos authentication with clients such as curl. To use curl, begin by acquiring valid Kerberos
credentials and then execute the desired command. For example, you might use commands similar to the
following:
$ kinit -kt username.keytab username
$ curl --negotiate -u foo:bar http://solrserver:8983/solr/

Note: Depending on the tool used to connect, additional arguments may be required. For example,
with curl, --negotiate and -u are required. The username and password specified with -u is not
actually checked because Kerberos is used. As a result, any value such as foo:bar or even just : is
acceptable. While any value can be provided for -u, note that the option is required. Omitting -u
results in a 401 Unauthorized error, even though the -u value is not actually used.
Using solrctl
If you are using solrctl to manage your deployment in an environment that requires Kerberos authentication,
you must have valid Kerberos credentials, which you can get using kinit. For more information on solrctl,
see Solrctl Reference
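
For example, a hedged sequence might look like the following; the keytab and principal are hypothetical:

# Acquire credentials for a (hypothetical) administrative principal, then run a solrctl command.
$ kinit -kt solradmin.keytab [email protected]
$ solrctl collection --list
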
Configuring SolrJ Library Usage
If using applications that use the solrj library, begin by establishing a Java Authentication and Authorization
Service (JAAS) configuration file.
Create a JAAS file:
If you have already used kinit to get credentials, you can have the client use those credentials. In such a
case, modify your jaas-client.conf file to appear as follows:
Client {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=false
useTicketCache=true
principal="user/fully.qualified.domain.name@<YOUR-REALM>";
};

where user/fully.qualified.domain.name@<YOUR-REALM> is replaced with your credentials.


If instead you want the client application to authenticate using a keytab you specify, modify jaas-client.conf
as follows:
Client {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
keyTab="/path/to/keytab/user.keytab"
storeKey=true
useTicketCache=false
principal="user/fully.qualified.domain.name@<YOUR-REALM>";
};

where /path/to/keytab/user.keytab is the keytab file you wish to use and


user/fully.qualified.domain.name@<YOUR-REALM> is the principal in that keytab you wish to use.
Use the JAAS file to enable solutions:
Command line solutions
Set the property when invoking the program. For example, if you were using a jar, you might use:
java -Djava.security.auth.login.config=/home/user/jaas-client.conf -jar app.jar

Java applications
Set the Java system property java.security.auth.login.config. For example, if the JAAS configuration
file is located on the filesystem as /home/user/jaas-client.conf, the Java system property
java.security.auth.login.config must be set to point to this file. Setting a Java system property can
be done programmatically, for example using a call such as:
System.setProperty("java.security.auth.login.config",
"/home/user/jaas-client.conf");

The MapReduceIndexerTool
The MapReduceIndexerTool uses SolrJ to pass the JAAS configuration file. Using the MapReduceIndexerTool
in a secure environment requires the use of the HADOOP_OPTS variable to specify the JAAS configuration file.
For example, you might issue a command such as the following:
HADOOP_OPTS="-Djava.security.auth.login.config=/home/user/jaas.conf" \
hadoop jar MapReduceIndexerTool

Configuring the hbase-indexer CLI


Certain hbase-indexer CLI commands such as replication-status attempt to read ZooKeeper nodes
owned by HBase. To successfully use these commands in Search for CDH 5 in a secure environment, specify
a JAAS configuration file with the HBase principal in the HBASE_INDEXER_OPTS environment variable. For
example, you might issue a command such as the following:
HBASE_INDEXER_OPTS="-Djava.security.auth.login.config=/home/user/hbase-jaas.conf"
\
hbase-indexer replication-status

Configuring Flume Morphline Solr Sink Usage


Repeat this process on all Flume nodes:
1. If you have not created a keytab file, do so now at /etc/flume-ng/conf/flume.keytab. This file should
contain the service principal flume/<fully.qualified.domain.name>@<YOUR-REALM>. See the CDH 5
Security Guide for more information.
2. Create a JAAS configuration file for flume at /etc/flume-ng/conf/jaas-client.conf. The file should
appear as follows:
Client {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
useTicketCache=false
keyTab="/etc/flume-ng/conf/flume.keytab"
principal="flume/<fully.qualified.domain.name>@<YOUR-REALM>";
};

3. Add the flume JAAS configuration to the JAVA_OPTS in /etc/flume-ng/conf/flume-env.sh. For example,
you might change:
JAVA_OPTS="-Xmx500m"

to:
JAVA_OPTS="-Xmx500m
-Djava.security.auth.login.config=/etc/flume-ng/conf/jaas-client.conf"

Configuring Sentry for Search


Sentry enables role-based, fine-grained authorization for Cloudera Search. Follow the instructions below to
configure Sentry under CDH 4.5 or later or CDH 5. Sentry is included in the Search installation.
Note: Sentry for Search depends on Kerberos authentication. For additional information on using
Kerberos with Search, see the Configuring Search to Use Kerberos and Using Kerberos sections earlier
on this page.

Note that this document is for configuring Sentry for Cloudera Search. For information about alternate ways to
configure Sentry or for information about installing Sentry for other services, see:
Setting Up Search Authorization with Sentry for instructions for using Cloudera Manager 4 to install and
configure Search Authorization with Sentry.
Impala Security for instructions on using Impala with Sentry.
Sentry Installation to install the version of Sentry that was provided with CDH 4.
Sentry Installation to install the version of Sentry that was provided with CDH 5.

Roles and Collection-Level Privileges


Sentry uses a role-based privilege model. A role is a set of rules for accessing a given Solr collection. Access to
each collection is governed by privileges: Query, Update, or All (*).
For example, a rule for the Query privilege on collection logs would be formulated as follows:
collection=logs->action=Query

A role can contain multiple such rules, separated by commas. For example the engineer_role might contain
the Query privilege for hive_logs and hbase_logs collections, and the Update privilege for the current_bugs
collection. You would specify this as follows:
engineer_role = collection=hive_logs->action=Query, collection=hbase_logs->action=Query,
collection=current_bugs->action=Update

Users and Groups


A user is an entity that is permitted by the Kerberos authentication system to access the Search service.
A group connects the authentication system with the authorization system. It is a set of one or more users
who have been granted one or more authorization roles. Sentry allows a set of roles to be configured for a
group.
A configured group provider determines a user's affiliation with a group. The current release supports
HDFS-backed groups and locally configured groups. For example,
dev_ops = dev_role, ops_role

Here the group dev_ops is granted the roles dev_role and ops_role. The members of this group can complete
searches that are allowed by these roles.

User to Group Mapping


You can configure Sentry to use either Hadoop groups or groups defined in the policy file.
Important: You can use either Hadoop groups or local groups, but not both at the same time. Use
local groups if you want to do a quick proof-of-concept. For production, use Hadoop groups.
To configure Hadoop groups:
Set the sentry.provider property in sentry-site.xml to
org.apache.sentry.provider.file.HadoopGroupResourceAuthorizationProvider.

Note: Note that, by default, this uses local shell groups. See the Group Mapping section of the HDFS
Permissions Guide for more information.
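For example, the corresponding property element in sentry-site.xml might look like the following (a minimal
sketch that mirrors the local-group example later in this section):
<property>
<name>sentry.provider</name>
<value>org.apache.sentry.provider.file.HadoopGroupResourceAuthorizationProvider</value>
</property>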

OR
To configure local groups:



1. Define local groups in a [users] section of the Sentry policy file, such as sentry-provider.ini. For example:
[users]
user1 = group1, group2, group3
user2 = group2, group3

2. In sentry-site.xml, set sentry.provider as follows:


<property>
<name>sentry.provider</name>
<value>org.apache.sentry.provider.file.LocalGroupResourceAuthorizationProvider</value>
</property>

Setup and Configuration


This release of Sentry stores the configuration as well as privilege policies in files. The sentry-site.xml file
contains configuration options such as privilege policy file location. The Policy File contains the privileges and
groups. It has a .ini file format and should be stored on HDFS.
Sentry is automatically installed when you install Cloudera Search for CDH 5 or Cloudera Search 1.1.0 or later.

Policy File
The sections that follow contain notes on creating and maintaining the policy file.
Warning: An invalid configuration disables all authorization while logging an exception.

Storing the Policy File


Considerations for storing the policy file(s) include:
1. Replication count - Because the file is read for each query, you should increase this; 10 is a reasonable value (see the example after this list).
2. Updating the file - Updates to the file are only reflected when the Solr process is restarted.
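For example, assuming the policy file is stored at the HDFS location used later in this section, you might raise
its replication factor with a command such as the following (a sketch; adjust the path and count for your
deployment):
hdfs dfs -setrep -w 10 /user/solr/sentry/sentry-provider.ini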

Defining Roles
Keep in mind that role definitions are not cumulative; the newer definition replaces the older one. For example,
the following results in role1 having privilege2, not privilege1 and privilege2.
role1 = privilege1
role1 = privilege2

Sample Configuration
This section provides a sample configuration.
Note: Sentry with CDH Search does not support multiple policy files. Other implementations of Sentry
such as Sentry for Hive do support different policy files for different databases, but Sentry for CDH
Search has no such support for multiple policies.

Policy File
The following is an example of a CDH Search policy file. The sentry-provider.ini would exist in an HDFS
location such as hdfs://ha-nn-uri/user/solr/sentry/sentry-provider.ini.



Note: Use separate policy files for each Sentry-enabled service. Using one file for multiple services
results in each service failing on the other services' entries. For example, with a combined Hive and
Search file, Search would fail on Hive entries and Hive would fail on Search entries.
sentry-provider.ini
[groups]
# Assigns each Hadoop group to its set of roles
engineer = engineer_role
ops = ops_role
dev_ops = engineer_role, ops_role
hbase_admin = hbase_admin_role
[roles]
# The following grants all access to source_code.
# "collection = source_code" can also be used as syntactic
# sugar for "collection = source_code->action=*"
engineer_role = collection = source_code->action=*
# The following imply more restricted access.
ops_role = collection = hive_logs->action=Query
dev_ops_role = collection = hbase_logs->action=Query
#give hbase_admin_role the ability to create/delete/modify the hbase_logs collection
hbase_admin_role = collection=admin->action=*, collection=hbase_logs->action=*

Sentry Configuration File


The following is an example of a sentry-site.xml file.
sentry-site.xml
<configuration>
<property>
<name>sentry.provider</name>
<value>org.apache.sentry.provider.file.HadoopGroupResourceAuthorizationProvider</value>
</property>
<property>
<name>sentry.solr.provider.resource</name>
<value>/path/to/authz-provider.ini</value>
<!-- If the HDFS configuration files (core-site.xml, hdfs-site.xml)
pointed to by SOLR_HDFS_CONFIG in /etc/default/solr
point to HDFS, the path will be in HDFS;
alternatively you could specify a full path,
e.g.:hdfs://namenode:port/path/to/authz-provider.ini
-->
</property>
</configuration>

Enabling Sentry in Cloudera Search for CDH 5


Enabling Sentry is achieved by adding two properties to /etc/default/solr. If your Search installation is
managed by Cloudera Manager, then these properties are added automatically. If your Search installation is not
managed by Cloudera Manager, you must make these changes yourself. The variable
SOLR_AUTHORIZATION_SENTRY_SITE specifies the path to sentry-site.xml. The variable
SOLR_AUTHORIZATION_SUPERUSER specifies the first part of SOLR_KERBEROS_PRINCIPAL. This is solr for the
majority of users, as solr is the default. Settings are of the form:
SOLR_AUTHORIZATION_SENTRY_SITE=/location/to/sentry-site.xml
SOLR_AUTHORIZATION_SUPERUSER=solr



To enable sentry collection-level authorization checking on a new collection, the instancedir for the collection
must use a modified version of solrconfig.xml with Sentry integration. The command solrctl instancedir
--generate generates two versions of solrconfig.xml: the standard solrconfig.xml without sentry
integration, and the sentry-integrated version called solrconfig.xml.secure. To use the sentry-integrated
version, replace solrconfig.xml with solrconfig.xml.secure before creating the instancedir.
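For a new collection, the sequence might look like the following (a sketch; the collection name "foo" and the
shard count are placeholders):
# generate a fresh instancedir containing both versions of solrconfig.xml
solrctl instancedir --generate foo
# switch to the sentry-integrated configuration
cp foo/conf/solrconfig.xml.secure foo/conf/solrconfig.xml
# upload the instancedir and create the collection
solrctl instancedir --create foo foo
solrctl collection --create foo -s 1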
If you have an existing collection using the standard solrconfig.xml called "foo" and an instancedir of the
same name, perform the following steps:
# generate a fresh instancedir
solrctl instancedir --generate foosecure
# download the existing instancedir from ZK into subdirectory "foo"
solrctl instancedir --get foo foo
# replace the existing solrconfig.xml with the sentry-enabled one
cp foosecure/conf/solrconfig.xml.secure foo/conf/solrconfig.xml
# update the instancedir in ZK
solrctl instancedir --update foo foo
# reload the collection
solrctl collection --reload foo

If you have an existing collection using a version of solrconfig.xml that you have modified, contact Support
for assistance.

Providing Document-Level Security Using Sentry


For role-based access control of a collection, an administrator modifies a Sentry role so it has query, update, or
administrative access, as described above.
Collection-level authorization is useful when the access control requirements for the documents in a collection
are the same. In many cases, however, you may want to restrict access to a subset of documents in a collection.
This finer-grained restriction could be achieved by defining separate collections for each subset, but this is
difficult to manage, requires duplicate documents for each collection, and requires that these documents be
kept in sync.
Document-level access control solves this issue by associating authorization tokens with each document in the
collection. This enables granting Sentry roles access to sets of documents in a collection.
Document-Level Security Model
Document-level security depends on a chain of relationships between users, groups, roles, and documents.
Users are assigned to groups.
Groups are assigned to roles.
Roles are stored as "authorization tokens" in a specified field in the documents.
Document-level security supports restricting which documents can be viewed by which users. Access is provided
by adding roles as "authorization tokens" to a specified document field. Conversely, access is implicitly denied
by omitting roles from the specified field. In other words, in a document-level security enabled environment, a
user might submit a query that matches a document; if the user is not part of a group that has a role that has
been granted access to the document, the result is not returned.
For example, Alice might belong to the administrators group. The administrators group may belong to the
doc-mgmt role. A document could be ingested and the doc-mgmt role could be added at ingest time. In such a
case, if Alice submitted a query that matched the document, Search would return the document, since Alice is
then allowed to see any document with the "doc-mgmt" authorization token.
Similarly, Bob might belong to the guests group. The guests group may belong to the public-browser role. If Bob
tried the same query as Alice, but the document did not have the public-browser role, Search would not return
the result because Bob does not belong to a group that is associated with a role that has access.
Note that collection-level authorization rules still apply, if enabled. Even if Alice is able to view a document given
document-level authorization rules, if she is not allowed to query the collection, the query will fail.
Roles are typically added to documents when those documents are ingested, either via the standard Solr APIs
or, if using morphlines, the setValues morphline command.
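For example, a morphline used at ingest time might add a role token to each document with a command such
as the following (a sketch; the field name matches the default sentryAuthField and the role name is
hypothetical):
{
  setValues {
    # store the Sentry role that may view this document
    sentry_auth : ["doc-mgmt"]
  }
}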


Enabling Document-Level Security
Cloudera Search supports document-level security in Search for CDH 5.1 and later. Document-level security is
disabled by default, so the first step in using document-level security is to enable the feature by modifying the
solrconfig.xml.secure file. Remember to replace solrconfig.xml with this file, as described in Enabling
Sentry in Cloudera Search for CDH 5 above.
To enable document-level security, change solrconfig.xml.secure. The default file contents are as follows:
<searchComponent name="queryDocAuthorization">
  <!-- Set to true to enable document-level authorization -->
  <bool name="enabled">false</bool>
  <!-- Field where the auth tokens are stored in the document -->
  <str name="sentryAuthField">sentry_auth</str>
  <!-- Auth token defined to allow any role to access the document.
       Uncomment to enable. -->
  <!--<str name="allRolesToken">*</str>-->
</searchComponent>

The enabled Boolean determines whether document-level authorization is enabled. To enable document-level
security, change this setting to true.
The sentryAuthField string specifies the name of the field that is used for storing authorization information.
You can use the default setting of sentry_auth or you can specify some other string that you will use for
assigning values on ingest.
Note: This field must exist as an explicit or dynamic field in the schema. sentry_auth exists in
the default schema.xml.
The allRolesToken string represents a special token defined to allow any role access to the document. By
default, this feature is disabled. To enable it, uncomment the specification and specify the token; the
commented-out default token is "*". The token should be different from the name of any Sentry role to avoid
collisions. This feature is useful when first configuring document-level security, or for granting all roles access
to a document when the set of roles may change. See the following Best Practices section for additional
information.
Best Practices
Using the allRolesToken
You may want to grant every user that belongs to a role access to certain documents. One way to accomplish
this is to specify all known roles in the document, but this requires updating or reindexing the document if you
add a new role. Alternatively, an "allUser" role, specified in the Sentry .ini file, could contain all valid groups, but
this role would need to be updated every time a new group was added to the system. Instead, specifying the
allRolesToken allows any user that belongs to a valid role to access the document. This access requires no
updating as the system evolves.
In addition, the allRolesToken may be useful for transitioning a deployment to use document-level security.
Instead of having to define all the roles upfront, all the documents can be specified with the allRolesToken
and later modified as the roles are defined.
Consequences of Document-Level Authorization Only Affecting Queries
Document-level security does not prevent users from modifying documents or performing other update operations
on the collection. Update operations are only governed by collection-level authorization.



Document-level security can be used to prevent documents from being returned in query results. If users are
not granted access to a document, that document is not returned even if the user submits a query that matches
it. This does not affect attempted updates.
Consequently, it is possible for a user to not have access to a set of documents based on document-level security,
but to still be able to modify the documents via their collection-level authorization update rights. This means
that a user can delete all documents in the collection. Similarly, a user might modify all documents, adding their
authorization token to each one. After such a modification, the user could access any document via querying.
Therefore, if you are restricting access using document-level security, consider granting collection-level update
rights only to those users you trust and assume they will be able to access every document in the collection.
Limitations on Query Size
By default, queries support up to 1024 Boolean clauses. As a result, queries containing more than 1024 clauses
may cause errors. Because authorization information is added by Sentry as part of a query, using document-level
security can increase the number of clauses. In the case where users belong to many roles, even simple queries
can become quite large. If a query is too large, an error of the following form occurs:
org.apache.lucene.search.BooleanQuery$TooManyClauses: maxClauseCount is set to 1024

To change the supported number of clauses, edit the maxBooleanClauses setting in solrconfig.xml. For
example, to allow 2048 clauses, you would edit the setting so it appears as follows:
<maxBooleanClauses>2048</maxBooleanClauses>

For maxBooleanClauses to be applied as expected, make any change to this value to all collections and then
restart the service. You must make this change to all collections because this option modifies a global Lucene
property, affecting all SolrCores. If different solrconfig.xml files have different values for this property, the
effective value is determined per node, based on the first SolrCore to be initialized.

Enabling Secure Impersonation


Secure Impersonation is a feature that allows a user to make requests as another user in a secure way. For
example, to allow the following impersonations:
User "hue" can make requests as any user from any host.
User "foo" can make requests as any member of group "bar", from "host1" or "host2".
Configure the following properties in /etc/default/solr:
SOLR_SECURITY_ALLOWED_PROXYUSERS=hue,foo
SOLR_SECURITY_PROXYUSER_hue_HOSTS=*
SOLR_SECURITY_PROXYUSER_hue_GROUPS=*
SOLR_SECURITY_PROXYUSER_foo_HOSTS=host1,host2
SOLR_SECURITY_PROXYUSER_foo_GROUPS=bar
SOLR_SECURITY_ALLOWED_PROXYUSERS lists all of the users allowed to impersonate. For a user x in
SOLR_SECURITY_ALLOWED_PROXYUSERS, SOLR_SECURITY_PROXYUSER_x_HOSTS lists the hosts x is allowed to
connect from in order to impersonate, and SOLR_SECURITY_PROXYUSER_x_GROUPS lists the groups whose
members the user x is allowed to impersonate. Both GROUPS and HOSTS support the wildcard * and both GROUPS
and HOSTS must be defined for a specific user.

Note: Cloudera Manager has its own management of secure impersonation for Hue. To add additional
users for secure impersonation, use the environment variable safety valve for Solr to set the
environment variables as above. Be sure to include "hue" in SOLR_SECURITY_ALLOWED_PROXYUSERS
if you want to use secure impersonation for Hue.



Debugging Failed Sentry Authorization Requests
Sentry logs all facts that lead up to authorization decisions at the debug level. If you do not understand why
Sentry is denying access, the best way to debug is to temporarily turn on debug logging:
In Cloudera Manager, add log4j.logger.org.apache.sentry=DEBUG to the logging settings for your service
through the corresponding Logging Safety Valve field for the Impala, Hive Server 2, or Solr Server services.
On systems not managed by Cloudera Manager, add log4j.logger.org.apache.sentry=DEBUG to the
log4j.properties file on each host in the cluster, in the appropriate configuration directory for each service.
Specifically, look for exceptions and messages such as:
FilePermission server..., RequestPermission server...., result [true|false]

which indicate each evaluation Sentry makes. The FilePermission is from the policy file, while
RequestPermission is the privilege required for the query. A RequestPermission will iterate over all appropriate
FilePermission settings until a match is found. If no matching privilege is found, Sentry returns false indicating
Access Denied.

Appendix: Authorization Privilege Model for Search


The tables below refer to the request handlers defined in the generated solrconfig.xml.secure. If you are
not using this configuration file, the information below may not apply.
admin is a special collection in Sentry used to represent administrative actions. A non-administrative request
may require privileges only on the collection on which the request is being performed; this is called collection1
in this appendix. An administrative request may require privileges on both the admin collection and collection1.
This is denoted as admin, collection1 in the tables below.

Table 11: Privilege table for non-administrative request handlers

Request Handler        Required Privilege    Collections that Require Privilege
select                 QUERY                 collection1
query                  QUERY                 collection1
get                    QUERY                 collection1
browse                 QUERY                 collection1
tvrh                   QUERY                 collection1
clustering             QUERY                 collection1
terms                  QUERY                 collection1
elevate                QUERY                 collection1
analysis/field         QUERY                 collection1
analysis/document      QUERY                 collection1
update                 UPDATE                collection1
update/json            UPDATE                collection1
update/csv             UPDATE                collection1

Table 12: Privilege table for collections admin actions

Collection Action      Required Privilege    Collections that Require Privilege
create                 UPDATE                admin, collection1
delete                 UPDATE                admin, collection1
reload                 UPDATE                admin, collection1
createAlias            UPDATE                admin, collection1
                                             Note: "collection1" here refers to the name of the alias, not
                                             the underlying collection(s). For example:
                                             http://YOUR-HOST:8983/solr/admin/collections?action=CREATEALIAS&name=collection1&collections=underlyingCollection
deleteAlias            UPDATE                admin, collection1
                                             Note: "collection1" here refers to the name of the alias, not
                                             the underlying collection(s). For example:
                                             http://YOUR-HOST:8983/solr/admin/collections?action=DELETEALIAS&name=collection1
syncShard              UPDATE                admin, collection1
splitShard             UPDATE                admin, collection1
deleteShard            UPDATE                admin, collection1

Table 13: Privilege table for core admin actions

Core Admin Action      Required Privilege    Collections that Require Privilege
create                 UPDATE                admin, collection1
rename                 UPDATE                admin, collection1
load                   UPDATE                admin, collection1
unload                 UPDATE                admin, collection1
status                 UPDATE                admin, collection1
persist                UPDATE                admin
reload                 UPDATE                admin, collection1
swap                   UPDATE                admin, collection1
mergeIndexes           UPDATE                admin, collection1
split                  UPDATE                admin, collection1
prepRecover            UPDATE                admin, collection1
requestRecover         UPDATE                admin, collection1
requestSyncShard       UPDATE                admin, collection1
requestApplyUpdates    UPDATE                admin, collection1


Table 14: Privilege table for Info and AdminHandlers

Request Handler             Required Privilege      Collections that Require Privilege
LukeRequestHandler          QUERY                   admin
SystemInfoHandler           QUERY                   admin
SolrInfoMBeanHandler        QUERY                   admin
PluginInfoHandler           QUERY                   admin
ThreadDumpHandler           QUERY                   admin
PropertiesRequestHandler    QUERY                   admin
LoggingHandler              QUERY, UPDATE (or *)    admin
ShowFileRequestHandler      QUERY                   admin


FUSE - Mountable HDFS Security Configuration


This section describes how to use FUSE (Filesystem in Userspace) and CDH with Kerberos security on your
Hadoop cluster. FUSE enables you to mount HDFS, which makes HDFS files accessible just as if they were UNIX
files.
To use FUSE and CDH with Kerberos security, follow these guidelines:
For each HDFS user, make sure that there is a UNIX user with the same name. If there isn't, some files in the
FUSE mount point will appear to be owned by a non-existent user. Although this is harmless, it can cause
confusion.
When using Kerberos authentication, users must run kinit before accessing the FUSE mount point. Failure
to do this will result in I/O errors when the user attempts to access the mount point. For security reasons,
it is not possible to list the files in the mount point without first running kinit.
When a user runs kinit, all processes that run as that user can use the Kerberos credentials. It is not
necessary to run kinit in the same shell as the process accessing the FUSE mount point.
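For example, a session on a client host might look like the following (a sketch; the NameNode host, port, and
mount point are placeholders, and the hadoop-fuse-dfs package must be installed):
# obtain Kerberos credentials as the user who will access the mount point
kinit [email protected]
# mount HDFS
sudo hadoop-fuse-dfs dfs://nn-host.example.com:8020 /mnt/hdfs
# HDFS files are now accessible as UNIX files
ls /mnt/hdfs/user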


Sqoop, Pig, and Whirr Security Support Status


Here is a summary of the status of security in the other CDH 5 components:
Sqoop 1 and Pig support security with no configuration required.
Sqoop 2 and Whirr do not support security in CDH 5.


Configuring Encrypted Shuffle, Encrypted Web UIs, and Encrypted HDFS Transport
This section describes how to configure encrypted shuffle, encrypted Web UIs, and encrypted HDFS transports:
Encrypted Shuffle and Encrypted Web UIs on page 165
HDFS Encrypted Transport on page 171

Encrypted Shuffle and Encrypted Web UIs


Now that you have enabled Kerberos, which provides strong authentication, you can optionally enable network
encryption. CDH 5 supports the Encrypted Shuffle and Encrypted Web UIs features, which allow encryption of
the MapReduce shuffle and web server ports using HTTPS with optional client authentication (also known as
bi-directional HTTPS, or HTTPS with client certificates). This support includes:
Hadoop configuration setting for toggling the shuffle between HTTP and HTTPS.
Hadoop configuration setting for toggling the Web UIs to use either HTTP or HTTPS.
Hadoop configuration settings for specifying the keystore and truststore properties (location, type, passwords)
that are used by the shuffle service, web server UIs and the reducers tasks that fetch shuffle data.
A way to re-read truststores across the cluster (when a node is added or removed).
CDH 5 supports Encrypted Shuffle for both MRv1 and MRv2 (YARN), with common configuration properties used
for both versions. The only configuration difference is in the parameters used to enable the features:
For MRv1, setting the hadoop.ssl.enabled parameter in the core-site.xml file enables both the Encrypted
Shuffle and the Encrypted Web UIs. In other words, the encryption toggling is coupled for the two features.
For MRv2, setting the hadoop.ssl.enabled parameter enables the Encrypted Web UI feature; setting the
mapreduce.shuffle.ssl.enabled parameter in the mapred-site.xml file enables the Encrypted Shuffle
feature.
All other configuration properties apply to both the Encrypted Shuffle and Encrypted Web UI functionality.
When the Encrypted Web UI feature is enabled, all Web UIs for Hadoop components are served over HTTPS. If
you configure the systems to require client certificates, browsers must be configured with the appropriate client
certificates in order to access the Web UIs.
Important:
When the Web UIs are served over HTTPS, you must specify https:// as the protocol; there is no
redirection from http://. If you attempt to access an HTTPS resource over HTTP, your browser will
probably show an empty screen with no warning.
Most components that run on top of MapReduce automatically use Encrypted Shuffle when it is configured.

Configuring Encrypted Shuffle and Encrypted Web UIs


To configure Encrypted Shuffle and Encrypted Web UIs, set the appropriate property/value pairs in the following:

core-site.xml enables these features and defines the implementation


mapred-site.xml enables Encrypted Shuffle for MRv2
ssl-server.xml stores keystore and truststore settings for the server
ssl-client.xml stores keystore and truststore settings for the client

core-site.xml Properties
To configure encrypted shuffle, set the following properties in the core-site.xml files of all nodes in the cluster:
Property: hadoop.ssl.enabled
Default Value: false
Explanation: For MRv1, setting this value to true enables both the Encrypted Shuffle and the Encrypted Web
UI features. For MRv2, this property only enables the Encrypted Web UI; Encrypted Shuffle is enabled with a
property in the mapred-site.xml file as described below.

Property: hadoop.ssl.require.client.cert
Default Value: false
Explanation: When this property is set to true, client certificates are required for all shuffle operations and
all browsers used to access Web UIs. Cloudera recommends that this be set to false. See Client Certificates
on page 170.

Property: hadoop.ssl.hostname.verifier
Default Value: DEFAULT
Explanation: The hostname verifier to provide for HttpsURLConnections. Valid values are: DEFAULT, STRICT,
STRICT_IE6, DEFAULT_AND_LOCALHOST and ALLOW_ALL.

Property: hadoop.ssl.keystores.factory.class
Default Value: org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory
Explanation: The KeyStoresFactory implementation to use.

Property: hadoop.ssl.server.conf
Default Value: ssl-server.xml
Explanation: Resource file from which SSL server keystore information is extracted. This file is looked up in
the classpath; typically it should be in the /etc/hadoop/conf/ directory.

Property: hadoop.ssl.client.conf
Default Value: ssl-client.xml
Explanation: Resource file from which SSL client keystore information is extracted. This file is looked up in
the classpath; typically it should be in the /etc/hadoop/conf/ directory.

Note:
All these properties should be marked as final in the cluster configuration files.
Example
<configuration>
...
<property>
<name>hadoop.ssl.require.client.cert</name>
<value>false</value>

<final>true</final>
</property>
<property>
<name>hadoop.ssl.hostname.verifier</name>
<value>DEFAULT</value>
<final>true</final>
</property>
<property>
<name>hadoop.ssl.keystores.factory.class</name>
<value>org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory</value>
<final>true</final>
</property>
<property>
<name>hadoop.ssl.server.conf</name>
<value>ssl-server.xml</value>
<final>true</final>
</property>
<property>
<name>hadoop.ssl.client.conf</name>
<value>ssl-client.xml</value>
<final>true</final>
</property>
<property>
<name>hadoop.ssl.enabled</name>
<value>true</value>
</property>
...
</configuration>

The cluster should be configured to use the Linux Task Controller in MRv1 and Linux container executor in MRv2
to run job tasks so that they are prevented from reading the server keystore information and gaining access to
the shuffle server certificates. Refer to Appendix B - Information about Other Hadoop Security Programs for
more information.
mapred-site.xml Property (MRv2 only)
To enable Encrypted Shuffle for MRv2, set the following property in the mapred-site.xml file on every node in
the cluster:
Property: mapreduce.shuffle.ssl.enabled
Default Value: false
Explanation: If this property is set to true, encrypted shuffle is enabled. If this property is not specified, it
defaults to the value of hadoop.ssl.enabled. This value can be false when hadoop.ssl.enabled is true,
but cannot be true when hadoop.ssl.enabled is false.

This property should be marked as final in the cluster configuration files.


Example:
<configuration>
...
<property>
<name>mapreduce.shuffle.ssl.enabled</name>
<value>true</value>

<final>true</final>
</property>
...
</configuration>

Keystore and Truststore Settings


FileBasedKeyStoresFactory is the only KeyStoresFactory that is currently implemented. It uses properties
in the ssl-server.xml and ssl-client.xml files to configure the keystores and truststores.

ssl-server.xml (Shuffle server and Web UI) Configuration


Use the following settings to configure the keystores and truststores in the ssl-server.xml file.
Note:
The ssl-server.xml should be owned by the hdfs or mapred Hadoop system user, belong to the
hadoop group, and it should have 440 permissions. Regular users should not belong to the hadoop
group.
Property: ssl.server.keystore.type
Default Value: jks
Description: Keystore file type

Property: ssl.server.keystore.location
Default Value: NONE
Description: Keystore file location. The mapred user should own this file and have exclusive read access to it.

Property: ssl.server.keystore.password
Default Value: NONE
Description: Keystore file password

Property: ssl.server.keystore.keypassword
Default Value: NONE
Description: Key password

Property: ssl.server.truststore.type
Default Value: jks
Description: Truststore file type

Property: ssl.server.truststore.location
Default Value: NONE
Description: Truststore file location. The mapred user should own this file and have exclusive read access to it.

Property: ssl.server.truststore.password
Default Value: NONE
Description: Truststore file password

Property: ssl.server.truststore.reload.interval
Default Value: 10000
Description: Truststore reload interval, in milliseconds

Example
<configuration>
<!-- Server Certificate Store -->
<property>
<name>ssl.server.keystore.type</name>
<value>jks</value>
</property>
<property>
<name>ssl.server.keystore.location</name>
<value>${user.home}/keystores/server-keystore.jks</value>
</property>
<property>
<name>ssl.server.keystore.password</name>
<value>serverfoo</value>
</property>
<property>
<name>ssl.server.keystore.keypassword</name>
<value>serverfoo</value>
</property>
<!-- Server Trust Store -->

<property>
<name>ssl.server.truststore.type</name>
<value>jks</value>
</property>
<property>
<name>ssl.server.truststore.location</name>
<value>${user.home}/keystores/truststore.jks</value>
</property>
<property>
<name>ssl.server.truststore.password</name>
<value>clientserverbar</value>
</property>
<property>
<name>ssl.server.truststore.reload.interval</name>
<value>10000</value>
</property>
</configuration>

ssl-client.xml (Reducer/Fetcher) Configuration


Use the following settings to configure the keystores and truststores in the ssl-client.xml file. This file should
be owned by the mapred user for MRv1 and by the yarn user for MRv2; the file permissions should be 444 (read
access for all users).
Property: ssl.client.keystore.type
Default Value: jks
Description: Keystore file type

Property: ssl.client.keystore.location
Default Value: NONE
Description: Keystore file location. The mapred user should own this file and it should have default permissions.

Property: ssl.client.keystore.password
Default Value: NONE
Description: Keystore file password

Property: ssl.client.keystore.keypassword
Default Value: NONE
Description: Key password

Property: ssl.client.truststore.type
Default Value: jks
Description: Truststore file type

Property: ssl.client.truststore.location
Default Value: NONE
Description: Truststore file location. The mapred user should own this file and it should have default permissions.

Property: ssl.client.truststore.password
Default Value: NONE
Description: Truststore file password

Property: ssl.client.truststore.reload.interval
Default Value: 10000
Description: Truststore reload interval, in milliseconds

Example
<configuration>
<!-- Client certificate Store -->
<property>
<name>ssl.client.keystore.type</name>
<value>jks</value>
</property>
<property>
<name>ssl.client.keystore.location</name>
<value>${user.home}/keystores/client-keystore.jks</value>
</property>
<property>
<name>ssl.client.keystore.password</name>
<value>clientfoo</value>
</property>
<property>
<name>ssl.client.keystore.keypassword</name>

<value>clientfoo</value>
</property>
<!-- Client Trust Store -->
<property>
<name>ssl.client.truststore.type</name>
<value>jks</value>
</property>
<property>
<name>ssl.client.truststore.location</name>
<value>${user.home}/keystores/truststore.jks</value>
</property>
<property>
<name>ssl.client.truststore.password</name>
<value>clientserverbar</value>
</property>
<property>
<name>ssl.client.truststore.reload.interval</name>
<value>10000</value>
</property>
</configuration>

Activating Encrypted Shuffle


When you have made the above configuration changes, activate Encrypted Shuffle by restarting all TaskTrackers
in MRv1 and all NodeManagers in YARN.
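On a package-based installation, this restart might look like the following on each node (a sketch; the service
names assume the standard CDH 5 packages and may differ in your environment):
# MRv1: restart the TaskTracker
sudo service hadoop-0.20-mapreduce-tasktracker restart
# YARN: restart the NodeManager
sudo service hadoop-yarn-nodemanager restart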
Important:
Encrypted shuffle has a significant performance impact. You should benchmark this before
implementing it in production. In many cases, one or more additional cores are needed to maintain
performance.

Client Certificates
Client Certificates are supported but they do not guarantee that the client is a reducer task for the job. The Client
Certificate keystore file that contains the private key must be readable by all users who submit jobs to the
cluster, which means that a rogue job could read those keystore files and use the client certificates in them to
establish a secure connection with a Shuffle server. The JobToken mechanism that the Hadoop environment
provides is a better protector of the data; each job uses its own JobToken to retrieve only the shuffle data that
belongs to it. Unless the rogue job has a proper JobToken, it cannot retrieve Shuffle data from the Shuffle server.
Important:
If your certificates are signed by a certificate authority (CA), you must include the complete chain of
CA certificates in the keystore that has the server's key.

Reloading Truststores
By default, each truststore reloads its configuration every 10 seconds. If a new truststore file is copied over the
old one, it is re-read, and its certificates replace the old ones. This mechanism is useful for adding or removing
nodes from the cluster, or for adding or removing trusted clients. In these cases, the client, TaskTracker or
NodeManager certificate is added to (or removed from) all the truststore files in the system, and the new
configuration is picked up without requiring that the TaskTracker in MRv1 and NodeManager in YARN daemons
are restarted.
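For example, to trust a newly added node you might import its certificate into an updated truststore and then
copy that file over the existing truststore on every host (a sketch; the alias, certificate file, and password are
placeholders chosen to match the example ssl-server.xml above):
# add the new node's certificate to the truststore
keytool -importcert -noprompt -alias new-node -file new-node.pem \
-keystore truststore.jks -storepass clientserverbar
# distribute the updated truststore to every host; it is re-read at the next reload interval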

Note:
The keystores are not automatically reloaded. To change a keystore for a TaskTracker in MRv1 or a
NodeManager in YARN, you must restart the TaskTracker or NodeManager daemon.
The reload interval is controlled by the ssl.client.truststore.reload.interval and
ssl.server.truststore.reload.interval configuration properties in the ssl-client.xml and
ssl-server.xml files described above.

Debugging
Important:
Enable debugging only for troubleshooting, and then only for jobs running on small amounts of data.
Debugging is very verbose and slows jobs down significantly.
To enable SSL debugging in the reducers, set -Djavax.net.debug=all in the mapred.reduce.child.java.opts
property; for example:
<configuration>
...
<property>
<name>mapred.reduce.child.java.opts</name>
<value>-Xmx200m -Djavax.net.debug=all</value>
</property>
...
</configuration>

You can do this on a per-job basis, or by means of a cluster-wide setting in mapred-site.xml.


To set this property in TaskTrackers for MRv1, set it in hadoop-env.sh:
HADOOP_TASKTRACKER_OPTS="-Djavax.net.debug=all $HADOOP_TASKTRACKER_OPTS"

To set this property in NodeManagers for YARN, set it in hadoop-env.sh:


YARN_OPTS="-Djavax.net.debug=all $YARN_OPTS"

HDFS Encrypted Transport


HDFS Encrypted Transport allows encryption of all HDFS data sent over the network.
To enable encryption, proceed as follows:
1. Enable Hadoop security using Kerberos, following these instructions.
2. Set the optional RPC encryption by setting hadoop.rpc.protection to "privacy" in the core-site.xml
file in both client and server configurations.
Note:
If RPC encryption is not enabled, transmission of other HDFS data is also insecure.
3. Set dfs.encrypt.data.transfer to true in the hdfs-site.xml file on all server systems, as shown in the example after these steps.
4. Restart all daemons.
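For reference, the properties from steps 2 and 3 might look like the following in their respective files (a minimal
sketch):
<!-- core-site.xml, on clients and servers -->
<property>
<name>hadoop.rpc.protection</name>
<value>privacy</value>
</property>

<!-- hdfs-site.xml, on all server systems -->
<property>
<name>dfs.encrypt.data.transfer</name>
<value>true</value>
</property>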


Integrating Hadoop Security with Active Directory


One of the ramifications of enabling security on a Hadoop cluster is that every user who interacts with the
cluster must have a Kerberos principal configured. For organizations that use Active Directory to manage user
accounts, it can be onerous to create corresponding user accounts for each user in an MIT Kerberos realm.
Fortunately, it is possible to integrate Active Directory with Hadoop's security features.
Important: With CDH 5.1, clusters managed by Cloudera Manager 5.1 do not require a local MIT KDC
and are able to integrate directly with an Active Directory KDC. Cloudera recommends you use a
direct-to-AD setup. For instructions, refer to Configuring Hadoop Security with Cloudera Manager.
If direct integration with AD is not currently possible, use the following instructions to configure a local MIT KDC
to trust your AD server:
1. Run an MIT Kerberos KDC and realm local to the cluster and create all service principals in this realm.
2. Set up one-way cross-realm trust from this realm to the Active Directory realm. Using this method, there is
no need to create service principals in Active Directory, but Active Directory principals (users) can be
authenticated to Hadoop. See Configuring a Local MIT Kerberos Realm to Trust Active Directory on page 173.

Configuring a Local MIT Kerberos Realm to Trust Active Directory


On the Active Directory Server
Note: Run these commands on every domain controller that is advertising itself for the Active Directory
domain (and every domain controller that is promoted subsequently).
1. Type the following command to specify the local MIT KDC host name (for example,
kdc-server-hostname.cluster.corp.company.com) and local realm (for example,
YOUR-LOCAL-REALM.COMPANY.COM):
ksetup /addkdc YOUR-LOCAL-REALM.COMPANY.COM
kdc-server-hostname.cluster.corp.company.com

2. Type the following command to add the local realm trust to Active Directory:
netdom trust YOUR-LOCAL-REALM.COMPANY.COM /Domain:AD-REALM.COMPANY.COM /add /realm
/passwordt:<TrustPassword>

3. Type the following command to set the proper encryption type:


On Windows 2003 R2:
ktpass /MITRealmName YOUR-LOCAL-REALM.COMPANY.COM /TrustEncryp <enc_type>

On Windows 2008:
ksetup /SetEncTypeAttr YOUR-LOCAL-REALM.COMPANY.COM <enc_type>

where the <enc_type> parameter specifies AES, DES, or RC4 encryption. Refer to the documentation for your
version of Windows Active Directory to find the <enc_type> parameter string to use.
Important: Make sure the encryption type you specify is supported on both your version of Windows
Active Directory and your version of MIT Kerberos.



On the MIT KDC server
Type the following command in the kadmin.local or kadmin shell to add the cross-realm krbtgt principal. Use
the same password you used in the netdom command on the Active Directory Server.
kadmin: addprinc -e "<enc_type_list>"
krbtgt/[email protected]

where the <enc_type_list> parameter specifies the types of encryption this cross-realm krbtgt principal will
support: either AES, DES, or RC4 encryption. You can specify multiple encryption types using the parameter in
the command above; what is important is that at least one of the encryption types corresponds to the encryption
type found in the tickets granted by the KDC in the remote realm. For example:
kadmin: addprinc -e "rc4-hmac:normal des3-hmac-sha1:normal"
krbtgt/[email protected]

The cross-realm krbtgt principal that you add in this step must have at least one entry that uses the same
encryption type as the tickets that are issued by the remote KDC. If no entries have the same encryption
type, then the problem you will see is that authenticating as a principal in the local realm will allow you to
successfully run Hadoop commands, but authenticating as a principal in the remote realm will not allow you
to run Hadoop commands.

On all of the cluster machines


1. Verify that both Kerberos realms are configured on all of the cluster boxes. Note that the default realm and
the domain realm should remain set as the MIT Kerberos realm which is local to the cluster.
[realms]
AD-REALM.CORP.FOO.COM = {
kdc = ad.corp.foo.com:88
admin_server = ad.corp.foo.com:749
default_domain = foo.com
}
CLUSTER-REALM.CORP.FOO.COM = {
kdc = cluster01.corp.foo.com:88
admin_server = cluster01.corp.foo.com:749
default_domain = foo.com
}

2. To properly translate principal names from the Active Directory realm into local names within Hadoop, you
must configure the hadoop.security.auth_to_local setting in the core-site.xml file on all of the
cluster machines. The following example translates all principal names with the realm
AD-REALM.CORP.FOO.COM into the first component of the principal name only. It also preserves the standard
translation for the default realm (the cluster realm).
<property>
<name>hadoop.security.auth_to_local</name>
<value>
RULE:[1:$1@$0](^.*@AD-REALM\.CORP\.FOO\.COM$)s/^(.*)@AD-REALM\.CORP\.FOO\.COM$/$1/g
RULE:[2:$1@$0](^.*@AD-REALM\.CORP\.FOO\.COM$)s/^(.*)@AD-REALM\.CORP\.FOO\.COM$/$1/g
DEFAULT
</value>
</property>

For more information about name mapping rules, see: Configuring the Mapping from Kerberos Principals to
Short Names


Integrating Hadoop Security with Alternate Authentication


One of the ramifications of enabling security on a Hadoop cluster is that every user who interacts with the
cluster must have a Kerberos principal configured. For some of the services, specifically Oozie and Hadoop (for
example, JobTracker and TaskTracker), it can be convenient to run a mixed form of authentication where Kerberos
authentication is used for API or command line access while some other form of authentication (for example,
SSO and LDAP) is used for accessing Web UIs. Using an alternate authentication deployment is considered an
advanced topic because only a partial implementation is provided in this release: you will have to implement
some of the code yourself.
Note:
The following instructions assume you have already performed the installation and configuration
steps in Configuring Hadoop Security in CDH 5.
Proceed as follows:
Step 1: Configure the AuthenticationFilter to use Kerberos on page 175
Step 2: Creating an AltKerberosAuthenticationHandler Subclass on page 175
Step 3: Enabling Your AltKerberosAuthenticationHandler Subclass on page 176
See also the Example Implementation for Oozie on page 177.

Step 1: Configure the AuthenticationFilter to use Kerberos


First, you must do all of the steps in the Server Side Configuration section of the Hadoop Auth, Java HTTP SPNEGO
Documentation to configure AuthenticationFilter to use Kerberos. You must configure
AuthenticationFilter to use Kerberos before doing the steps below.

Step 2: Creating an AltKerberosAuthenticationHandler Subclass


An AuthenticationHandler is installed on the server-side to handle authenticating clients and creating an
AuthenticationToken.
1. Subclass the
org.apache.hadoop.security.authentication.server.AltKerberosAuthenticationHandler class
(in the hadoop-auth package).

2. When a client sends a request, the authenticate method will be called. For browsers,
AltKerberosAuthenticationHandler will call the alternateAuthenticate method, which is what you
need to implement to interact with the desired authentication mechanism. For non-browsers,
AltKerberosAuthenticationHandler will follow the Kerberos SPNEGO sequence (this is provided for you).
3. The alternateAuthenticate(HttpServletRequest request, HttpServletResponse response)
method in your subclass should follow these rules:
Return null if the authentication is still in progress; the response object can be used to interact with the
client.
Throw an AuthenticationException if the authentication failed.
Return an AuthenticationToken if the authentication completed successfully.


Step 3: Enabling Your AltKerberosAuthenticationHandler Subclass


You can enable the alternate authentication on Hadoop Web UIs, Oozie Web UIs, or both. You will need to include
a JAR containing your subclass on the classpath of Hadoop and/or Oozie. All Kerberos-related configuration
properties will still apply.

Step 3a: Enabling Your AltKerberosAuthenticationHandler Subclass on Hadoop Web UIs


1. Stop Hadoop by running the following command on every node in your cluster (as root):
$ for x in `cd /etc/init.d ; ls hadoop-*` ; do sudo service $x stop ; done

2. Set the following property in core-site.xml, where


org.my.subclass.of.AltKerberosAuthenticationHandler is the classname of your subclass:
<property>
<name>hadoop.http.authentication.type</name>
<value>org.my.subclass.of.AltKerberosAuthenticationHandler</value>
</property>

3. (Optional) You can also specify which user-agents you do not want to be considered as browsers by setting
the following property as required (default value is shown). Note that all Java-based programs (such as
Hadoop client) will use java as their user-agent.
<property>
<name>hadoop.http.authentication.alt-kerberos.non-browser.user-agents</name>
<value>java,curl,wget,perl</value>
</property>

4. Copy the JAR containing your subclass into /usr/lib/hadoop/lib/.


5. Start Hadoop by running the following command:
$ for x in `cd /etc/init.d ; ls hadoop-*` ; do sudo service $x start ; done

Step 3b: Enabling Your AltKerberosAuthenticationHandler Subclass on Oozie Web UI


Note:
These instructions assume you have already performed the installation and configuration steps in
Oozie Security Configuration.
1. Stop the Oozie Server:
sudo /sbin/service oozie stop

2. Set the following property in oozie-site.xml, where


org.my.subclass.of.AltKerberosAuthenticationHandler is the classname of your subclass:
<property>
<name>oozie.authentication.type</name>
<value>org.my.subclass.of.AltKerberosAuthenticationHandler</value>
</property>



3. (Optional) You can also specify which user-agents you do not want to be considered as browsers by setting
the following property as required (default value is shown). Note that all Java-based programs (such as
Hadoop client) will use java as their user-agent.
<property>
<name>oozie.authentication.alt-kerberos.non-browser.user-agents</name>
<value>java,curl,wget,perl</value>
</property>

4. Copy the JAR containing your subclass into /var/lib/oozie.


5. Start the Oozie Server:
sudo /sbin/service oozie start

Example Implementation for Oozie


Warning:
The example implementation is NOT SECURE. Its purpose is to be as simple as possible, as an example
of how to write your own AltKerberosAuthenticationHandler subclass.
It should NOT be used in a production environment
An example implementation of AltKerberosAuthenticationHandler is included (though not built by default)
with Oozie. Also included is a simple Login Server with two implementations. The first one will authenticate any
user who is using a username and password that are identical, such as foo:foo. The second one can be configured
against an LDAP server to use LDAP for authentication.
You can read comprehensive documentation on the example at Creating Custom Authentication.
Important:
If you installed Oozie from the CDH packages and are deploying oozie-login.war alongside
oozie.war, you will also need to run the following commands after you copy the oozie-login.war
file to /usr/lib/oozie/oozie-server (if using YARN or /usr/lib/oozie/oozie-server-0.20
if using MRv1) because it won't automatically be expanded:
jar xvf oozie-login.war
mkdir oozie-login
mv META-INF oozie-login/
mv WEB-INF oozie-login/


Appendix A Troubleshooting
This Troubleshooting appendix contains sample Kerberos configuration files, krb5.conf and kdc.conf for your
reference. It also has solutions to potential problems you might face when configuring a secure cluster:
Sample Kerberos Configuration files: krb5.conf, kdc.conf, kadm5.acl
Problem 1: Running any Hadoop command fails after enabling security.
Problem 2: Java is unable to read the Kerberos credentials cache created by versions of MIT Kerberos 1.8.1
or higher.
Problem 3: java.io.IOException: Incorrect permission
Problem 4: A cluster fails to run jobs after security is enabled.
Problem 5: The NameNode does not start and KrbException Messages (906) and (31) are displayed.
Problem 6: The NameNode starts but clients cannot connect to it and error message contains enctype code
18.
(MRv1 Only) Problem 7: Jobs won't run and TaskTracker is unable to create a local mapred directory.
(MRv1 Only) Problem 8: Jobs won't run and TaskTracker is unable to create a Hadoop logs directory.
Problem 9: After you enable cross-realm trust, you can run Hadoop commands in the local realm but not in
the remote realm.
(MRv1 Only) Problem 10: Jobs won't run and can't access files in mapred.local.dir.
Problem 11: Users are unable to obtain credentials when running Hadoop jobs or commands.
Problem 12: Request is a replay exceptions in the logs. on page 188

Sample Kerberos Configuration files: krb5.conf, kdc.conf, kadm5.acl


Sample kdc.conf:
[kdcdefaults]
kdc_ports = 88
kdc_tcp_ports = 88
[realms]
EXAMPLE.COM = {
#master_key_type = aes256-cts
max_renewable_life = 7d 0h 0m 0s
acl_file = /var/kerberos/krb5kdc/kadm5.acl
dict_file = /usr/share/dict/words
admin_keytab = /var/kerberos/krb5kdc/kadm5.keytab
# note that aes256 is ONLY supported in Active Directory in a domain / forest operating
# at a 2008 or greater functional level.
# aes256 requires that you download and deploy the JCE Policy files for your JDK release
# level to provide strong java encryption extension levels like AES256. Make sure to match
# based on the encryption configured within AD for cross realm auth; note that RC4 = arcfour
# when comparing windows and linux enctypes
supported_enctypes = aes256-cts:normal aes128-cts:normal arcfour-hmac:normal
default_principal_flags = +renewable, +forwardable
}

Sample krb5.conf:
[logging]
default = FILE:/var/log/krb5libs.log
kdc = FILE:/var/log/krb5kdc.log
admin_server = FILE:/var/log/kadmind.log
[libdefaults]
default_realm = EXAMPLE.COM
dns_lookup_realm = false

dns_lookup_kdc = false
ticket_lifetime = 24h
renew_lifetime = 7d
forwardable = true
# udp_preference_limit = 1
# set udp_preference_limit = 1 when TCP only should be
# used. Consider using in complex network environments when
# troubleshooting or when dealing with inconsistent
# client behavior or GSS (63) messages.

# uncomment the following if AD cross realm auth is ONLY providing DES encrypted tickets
# allow-weak-crypto = true
[realms]
AD-REALM.EXAMPLE.COM = {
kdc = AD1.ad-realm.example.com:88
kdc = AD2.ad-realm.example.com:88
admin_server = AD1.ad-realm.example.com:749
admin_server = AD2.ad-realm.example.com:749
default_domain = ad-realm.example.com
}
EXAMPLE.COM = {
kdc = kdc1.example.com:88
admin_server = kdc1.example.com:749
default_domain = example.com
}
# The domain_realm is critical for mapping your host domain names to the kerberos realms
# that are servicing them. Make sure the lowercase left hand portion indicates any
# domains or subdomains that will be related to the kerberos REALM on the right hand side
# of the expression. REALMs will always be UPPERCASE. For example, if your actual DNS
# domain was test.com but your kerberos REALM is EXAMPLE.COM then you would have,
[domain_realm]
test.com = EXAMPLE.COM
#AD domains and realms are usually the same
ad-domain.example.com = AD-REALM.EXAMPLE.COM
ad-realm.example.com = AD-REALM.EXAMPLE.COM

Sample kadm5.acl:
*/[email protected]          *
[email protected]    *    flume/*@HADOOP.COM
[email protected]           *    hbase/*@HADOOP.COM
[email protected]           *    hdfs/*@HADOOP.COM
[email protected]            *    hive/*@HADOOP.COM
[email protected]            *    httpfs/*@HADOOP.COM
[email protected]          *    HTTP/*@HADOOP.COM
[email protected]             *    hue/*@HADOOP.COM
[email protected]          *    impala/*@HADOOP.COM
[email protected]          *    mapred/*@HADOOP.COM
[email protected]           *    oozie/*@HADOOP.COM
[email protected]            *    solr/*@HADOOP.COM
[email protected]           *    sqoop/*@HADOOP.COM
[email protected]            *    yarn/*@HADOOP.COM
[email protected]       *    zookeeper/*@HADOOP.COM


Problem 1: Running any Hadoop command fails after enabling security.


Description:
A user must have a valid Kerberos ticket in order to interact with a secure Hadoop cluster. Running any Hadoop
command (such as hadoop fs -ls) will fail if you do not have a valid Kerberos ticket in your credentials cache.
If you do not have a valid ticket, you will receive an error such as:
11/01/04 12:08:12 WARN ipc.Client: Exception encountered while connecting to the server
: javax.security.sasl.SaslException:
GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism
level: Failed to find any Kerberos tgt)]
Bad connection to FS. command aborted. exception: Call to nn-host/10.0.0.2:8020 failed
on local exception: java.io.IOException:
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No
valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]

Solution:
You can examine the Kerberos tickets currently in your credentials cache by running the klist command. You
can obtain a ticket by running the kinit command and either specifying a keytab file containing credentials, or
entering the password for your principal.
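For example, you might obtain a ticket from a keytab and then verify it (a sketch; the keytab path, principal,
and realm are placeholders):
$ kinit -kt /path/to/user.keytab [email protected]
$ klist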

Problem 2: Java is unable to read the Kerberos credentials cache created


by versions of MIT Kerberos 1.8.1 or higher.
Description:
If you are running MIT Kerberos 1.8.1 or higher, the following error will occur when you attempt to interact with
the Hadoop cluster, even after successfully obtaining a Kerberos ticket using kinit:
11/01/04 12:08:12 WARN ipc.Client: Exception encountered while connecting to the server
: javax.security.sasl.SaslException:
GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism
level: Failed to find any Kerberos tgt)]
Bad connection to FS. command aborted. exception: Call to nn-host/10.0.0.2:8020 failed
on local exception: java.io.IOException:
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No
valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]

Because of a change [1] in the format in which MIT Kerberos writes its credentials cache, there is a bug [2] in
the Oracle JDK 6 Update 26 and earlier that causes Java to be unable to read the Kerberos credentials cache
created by versions of MIT Kerberos 1.8.1 or higher. Kerberos 1.8.1 is the default in Ubuntu Lucid and later
releases and Debian Squeeze and later releases. (On RHEL and CentOS, an older version of MIT Kerberos which
does not have this issue, is the default.)
Footnotes:
[1] MIT Kerberos change: http://krbdev.mit.edu/rt/Ticket/Display.html?id=6206
[2] Report of bug in Oracle JDK 6 Update 26 and earlier:

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6979329

Solution:
If you encounter this problem, you can work around it by running kinit -R after running kinit initially to obtain
credentials. Doing so will cause the ticket to be renewed, and the credentials cache rewritten in a format which
Java can read. To illustrate this:
$ klist
klist: No credentials cache found (ticket cache FILE:/tmp/krb5cc_1000)
$ hadoop fs -ls
11/01/04 13:15:51 WARN ipc.Client: Exception encountered while connecting to the server
: javax.security.sasl.SaslException:
GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism
level: Failed to find any Kerberos tgt)]
Bad connection to FS. command aborted. exception: Call to nn-host/10.0.0.2:8020 failed
on local exception: java.io.IOException:
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No
valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
$ kinit
Password for [email protected]:
$ klist
Ticket cache: FILE:/tmp/krb5cc_1000
Default principal: [email protected]
Valid starting     Expires            Service principal
01/04/11 13:19:31  01/04/11 23:19:31  krbtgt/[email protected]
        renew until 01/05/11 13:19:30
$ hadoop fs -ls
11/01/04 13:15:59 WARN ipc.Client: Exception encountered while connecting to the server
: javax.security.sasl.SaslException:
GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism
level: Failed to find any Kerberos tgt)]
Bad connection to FS. command aborted. exception: Call to nn-host/10.0.0.2:8020 failed
on local exception: java.io.IOException:
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No
valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
$ kinit -R
$ hadoop fs -ls
Found 6 items
drwx------   atm atm          0 2011-01-02 16:16 /user/atm/.staging

Note:
This workaround for Problem 2 requires the initial ticket to be renewable. Note that whether or not
you can obtain renewable tickets is dependent upon a KDC-wide setting, as well as a per-principal
setting for both the principal in question and the Ticket Granting Ticket (TGT) service principal for the
realm. A non-renewable ticket will have the same values for its "valid starting" and "renew until"
times. If the initial ticket is not renewable, the following error message is displayed when attempting
to renew the ticket:
kinit: Ticket expired while renewing credentials
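One way to check whether renewable tickets are even possible in your environment is to inspect the maximum renewable life of the principals involved with kadmin; the admin and user principal names below are examples only:
$ kadmin -p admin/admin
kadmin: getprinc krbtgt/[email protected]    # look at the "Maximum renewable life" attribute
kadmin: getprinc [email protected]                 # and at the same attribute for your own principal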

Problem 3: java.io.IOException: Incorrect permission


Description:
An error such as the following example is displayed if the user running one of the Hadoop daemons has a umask
of 0002, instead of 0022:
java.io.IOException: Incorrect permission for
/var/folders/B3/B3d2vCm4F+mmWzVPB89W6E+++TI/-Tmp-/tmpYTil84/dfs/data/data1,
expected: rwxr-xr-x, while actual: rwxrwxr-x
at org.apache.hadoop.util.DiskChecker.checkPermission(DiskChecker.java:107)
at
org.apache.hadoop.util.DiskChecker.mkdirsWithExistsAndPermissionCheck(DiskChecker.java:144)
at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:160)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1484)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1432)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1408)
at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:418)
at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:279)
at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:203)
at
org.apache.hadoop.test.MiniHadoopClusterManager.start(MiniHadoopClusterManager.java:152)
at
org.apache.hadoop.test.MiniHadoopClusterManager.run(MiniHadoopClusterManager.java:129)
at
org.apache.hadoop.test.MiniHadoopClusterManager.main(MiniHadoopClusterManager.java:308)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.hadoop.test.AllTestDriver.main(AllTestDriver.java:83)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:186)

Solution:
Make sure that the umask for hdfs and mapred is 0022.
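A quick way to sanity-check this is to print the umask in a login shell for each daemon account; this is only a sketch, and the daemons may also pick up a umask from their init scripts or environment files, so check those as well:
$ sudo su -s /bin/bash -c "umask" - hdfs      # should print 0022
$ sudo su -s /bin/bash -c "umask" - mapred    # should print 0022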

Problem 4: A cluster fails to run jobs after security is enabled.


Description:
A cluster that was previously configured to not use security may fail to run jobs for certain users on certain
TaskTrackers (MRv1) or NodeManagers (YARN) after security is enabled due to the following sequence of events:
1. A cluster is at some point in time configured without security enabled.
2. A user X runs some jobs on the cluster, which creates a local user directory on each TaskTracker or
NodeManager.
3. Security is enabled on the cluster.
4. User X tries to run jobs on the cluster, and the local user directory on (potentially a subset of) the TaskTrackers
or NodeManagers is owned by the wrong user or has overly-permissive permissions.
The bug is that after step 2, the local user directory on the TaskTracker or NodeManager should be cleaned up,
but isn't.

If you're encountering this problem, you may see errors in the TaskTracker or NodeManager logs. The following
example is for a TaskTracker on MRv1:
10/11/03 01:29:55 INFO mapred.JobClient: Task Id : attempt_201011021321_0004_m_000011_0,
Status : FAILED
Error initializing attempt_201011021321_0004_m_000011_0:
java.io.IOException: org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.mapred.LinuxTaskController.runCommand(LinuxTaskController.java:212)
at
org.apache.hadoop.mapred.LinuxTaskController.initializeUser(LinuxTaskController.java:442)
at
org.apache.hadoop.mapreduce.server.tasktracker.Localizer.initializeUserDirs(Localizer.java:272)
at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:963)
at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2209)
at org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2174)
Caused by: org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:250)
at org.apache.hadoop.util.Shell.run(Shell.java:177)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:370)
at org.apache.hadoop.mapred.LinuxTaskController.runCommand(LinuxTaskController.java:203)
... 5 more

Solution:
Delete the mapred.local.dir or yarn.nodemanager.local-dirs directories for that user across the cluster.
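For example, a cleanup might look like the sketch below; the configuration path, local-directory paths, and user name are examples only, so check the values of mapred.local.dir (MRv1) or yarn.nodemanager.local-dirs (YARN) in your own configuration first:
# Find the configured local directories (example configuration path):
$ grep -A1 -E 'mapred.local.dir|yarn.nodemanager.local-dirs' /etc/hadoop/conf/*-site.xml
# On each affected TaskTracker host (MRv1), remove the stale per-user directory (example path and user):
$ sudo rm -rf /data/1/mapred/local/taskTracker/userX
# On each affected NodeManager host (YARN), do the same (example path and user):
$ sudo rm -rf /data/1/yarn/local/usercache/userX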

Problem 5: The NameNode does not start and KrbException messages (906) and (31) are displayed.

Description:
When you attempt to start the NameNode, a login failure occurs. This failure prevents the NameNode from
starting and the following KrbException messages are displayed:
Caused by: KrbException: Integrity check on decrypted field failed (31) PREAUTH_FAILED

and
Caused by: KrbException: Identifier doesn't match expected value (906)

Note:
These KrbException error messages are displayed only if you enable debugging output. See Appendix
D - Enabling Debugging Output for the Sun Kerberos Classes.

Solution:
Although there are several possible problems that can cause these two KrbException error messages to display,
here are some actions you can take to solve the most likely problems:
If you are using CentOS/Red Hat Enterprise Linux 5.6 or later, or Ubuntu, which use AES-256 encryption by
default for tickets, you must install the Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction
Policy File on all cluster and Hadoop user machines. For information about how to verify the type of encryption
used in your cluster, see Step 3: If you are Using AES-256 Encryption, install the JCE Policy File on page 24.
Alternatively, you can change your kdc.conf or krb5.conf to not use AES-256 by removing
aes256-cts:normal from the supported_enctypes field of the kdc.conf or krb5.conf file. Note that
after changing the kdc.conf file, you'll need to restart both the KDC and the kadmin server for those changes
to take effect. You may also need to recreate or change the password of the relevant principals, including
potentially the TGT principal (krbtgt/REALM@REALM).
In the [realms] section of your kdc.conf file, in the realm corresponding to HADOOP.LOCALDOMAIN, add (or
replace if it's already present) the following variable:
supported_enctypes = des3-hmac-sha1:normal arcfour-hmac:normal des-hmac-sha1:normal
des-cbc-md5:normal des-cbc-crc:normal des-cbc-crc:v4 des-cbc-crc:afs3

Recreate the hdfs keytab file and mapred keytab file using the -norandkey option in the xst command (for
details, see Step 4: Create and Deploy the Kerberos Principals and Keytab Files on page 25).
kadmin.local: xst -norandkey -k hdfs.keytab hdfs/fully.qualified.domain.name
HTTP/fully.qualified.domain.name
kadmin.local: xst -norandkey -k mapred.keytab mapred/fully.qualified.domain.name
HTTP/fully.qualified.domain.name

Problem 6: The NameNode starts but clients cannot connect to it and error
message contains enctype code 18.
Description:
The NameNode keytab file does not have an AES256 entry, but client tickets do contain an AES256 entry. The
NameNode starts but clients cannot connect to it. The error message doesn't refer to "AES256", but does contain
an enctype code "18".
Solution:
Make sure the "Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy File" is installed or
remove aes256-cts:normal from the supported_enctypes field of the kdc.conf or krb5.conf file. For more
information, see the first suggested solution above for Problem 5.
For more information about the Kerberos encryption types, see
http://www.iana.org/assignments/kerberos-parameters/kerberos-parameters.xml.
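To see which encryption types are actually in play, one quick check is to compare the client tickets with the keytab; the keytab path below is an example:
$ klist -e                                    # encryption types of the tickets in the client's credentials cache
$ klist -ekt /etc/hadoop/conf/hdfs.keytab     # encryption types present in the NameNode's keytab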

(MRv1 Only) Problem 7: Jobs won't run and TaskTracker is unable to create
a local mapred directory.
Description:
The TaskTracker log contains the following error message:
11/08/17 14:44:06 INFO mapred.TaskController: main : user is atm
11/08/17 14:44:06 INFO mapred.TaskController: Failed to create directory
/var/log/hadoop/cache/mapred/mapred/local1/taskTracker/atm - No such file or directory
11/08/17 14:44:06 WARN mapred.TaskTracker: Exception while localization
java.io.IOException: Job initialization failed (20)
at
org.apache.hadoop.mapred.LinuxTaskController.initializeJob(LinuxTaskController.java:191)
at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1199)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1174)
at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1089)
at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2257)
at org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2221)
Caused by: org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:255)
at org.apache.hadoop.util.Shell.run(Shell.java:182)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
at
org.apache.hadoop.mapred.LinuxTaskController.initializeJob(LinuxTaskController.java:184)
... 8 more

Solution:
Make sure the value specified for mapred.local.dir is identical in mapred-site.xml and taskcontroller.cfg.
If the values are different, the error message above is returned.
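Assuming the default CDH configuration directory (adjust the paths if your configuration lives elsewhere), a quick comparison looks like this:
$ grep -A1 mapred.local.dir /etc/hadoop/conf/mapred-site.xml
$ grep mapred.local.dir /etc/hadoop/conf/taskcontroller.cfg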

(MRv1 Only) Problem 8: Jobs won't run and TaskTracker is unable to create
a Hadoop logs directory.
Description:
The TaskTracker log contains an error message similar to the following:
11/08/17 14:48:23 INFO mapred.TaskController: Failed to create directory
/home/atm/src/cloudera/hadoop/build/hadoop-0.23.2-cdh3u1-SNAPSHOT/logs1/userlogs/job_201108171441_0004
- No such file or directory
11/08/17 14:48:23 WARN mapred.TaskTracker: Exception while localization
java.io.IOException: Job initialization failed (255)
at
org.apache.hadoop.mapred.LinuxTaskController.initializeJob(LinuxTaskController.java:191)
at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1199)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1174)
at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1089)
at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2257)
at org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2221)
Caused by: org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:255)
at org.apache.hadoop.util.Shell.run(Shell.java:182)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
at
org.apache.hadoop.mapred.LinuxTaskController.initializeJob(LinuxTaskController.java:184)
... 8 more

Solution:
In MRv1, the default value specified for hadoop.log.dir in mapred-site.xml is
/var/log/hadoop-0.20-mapreduce. The path must be owned by and writable by the mapred user. If you
change the default value specified for hadoop.log.dir, make sure the value is identical in mapred-site.xml
and taskcontroller.cfg. If the values are different, the error message above is returned.


Problem 9: After you enable cross-realm trust, you can run Hadoop
commands in the local realm but not in the remote realm.
Description:
After you enable cross-realm trust, authenticating as a principal in the local realm will allow you to successfully
run Hadoop commands, but authenticating as a principal in the remote realm will not allow you to run Hadoop
commands. The most common cause of this problem is that the principals in the two realms either don't have
the same encryption type, or the cross-realm principals in the two realms don't have the same password. This
issue manifests itself because you are able to get Ticket Granting Tickets (TGTs) from both the local and remote
realms, but you are unable to get a service ticket to allow the principals in the local and remote realms to
communicate with each other.
Solution:
On the local MIT KDC server host, type the following command in the kadmin.local or kadmin shell to add the
cross-realm krbtgt principal:
kadmin: addprinc -e "<enc_type_list>"
krbtgt/[email protected]

where the <enc_type_list> parameter specifies the types of encryption this cross-realm krbtgt principal will
support: AES, DES, or RC4 encryption. You can specify multiple encryption types in the parameter; what is
important is that at least one of them corresponds to the encryption type found in the tickets granted by
the KDC in the remote realm. For example:
kadmin: addprinc -e "aes256-cts:normal rc4-hmac:normal des3-hmac-sha1:normal"
krbtgt/[email protected]
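One way to confirm the trust end to end, assuming a test principal exists in the remote realm and the NameNode principal shown below exists in the local realm, is to request a cross-realm service ticket:
$ kinit [email protected]                              # authenticate in the remote realm
$ kvno krbtgt/[email protected]   # the cross-realm TGT should be obtainable
$ kvno hdfs/[email protected]      # as should a service ticket in the local realm
$ klist                                                     # both tickets should now appear in the cache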

(MRv1 Only) Problem 10: Jobs won't run and can't access files in mapred.local.dir.
Description:
The TaskTracker log contains the following error message:
WARN org.apache.hadoop.mapred.TaskTracker: Exception while localization
java.io.IOException: Job initialization failed (1)

Solution:
1. Add the mapred user to the mapred and hadoop groups on all hosts (a command sketch follows this list).
2. Restart all TaskTrackers.
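A minimal sketch of these two steps follows; the group names come from the solution above, but the init-script name is an assumption and may differ in your installation:
$ sudo usermod -a -G mapred,hadoop mapred                   # add the mapred user to both groups
$ sudo service hadoop-0.20-mapreduce-tasktracker restart    # then restart the TaskTracker on each host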


Problem 11: Users are unable to obtain credentials when running Hadoop
jobs or commands.
Description:
This error occurs because the ticket message is too large for the default UDP protocol. An error message similar
to the following may be displayed:
13/01/15 17:44:48 DEBUG ipc.Client: Exception encountered while connecting to the server
: javax.security.sasl.SaslException:
GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism
level: Fail to create credential.
(63) - No service creds)]

Solution:
Force Kerberos to use TCP instead of UDP by adding the following parameter to libdefaults in the krb5.conf
file on the client(s) where the problem is occurring.
[libdefaults]
udp_preference_limit = 1

Note:
More Info About the udp_preference_limit Property
When sending a message to the KDC, the library will try using TCP before UDP if the size of the ticket
message is larger than the setting specified for the udp_preference_limit property. If the ticket
message is smaller than the udp_preference_limit setting, then UDP will be tried before TCP. Regardless
of the size, both protocols will be tried if the first attempt fails.

Problem 12: "Request is a replay" exceptions in the logs.

Description:
Symptom: The following exception shows up in the logs for one or more of the Hadoop daemons:
2013-02-28 22:49:03,152 INFO ipc.Server (Server.java:doRead(571)) - IPC Server listener
on 8020: readAndProcess threw exception javax.security.sasl.SaslException: GSS initiate
failed [Caused by GSSException: Failure unspecified at GSS-API level (Mechanism l
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: Failure
unspecified at GSS-API level (Mechanism level: Request is a replay (34))]
at
com.sun.security.sasl.gsskerb.GssKrb5Server.evaluateResponse(GssKrb5Server.java:159)
at org.apache.hadoop.ipc.Server$Connection.saslReadAndProcess(Server.java:1040)
at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1213)
at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:566)
at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:363)
Caused by: GSSException: Failure unspecified at GSS-API level (Mechanism level: Request
is a replay (34))
at sun.security.jgss.krb5.Krb5Context.acceptSecContext(Krb5Context.java:741)
at sun.security.jgss.GSSContextImpl.acceptSecContext(GSSContextImpl.java:323)
at sun.security.jgss.GSSContextImpl.acceptSecContext(GSSContextImpl.java:267)
at
com.sun.security.sasl.gsskerb.GssKrb5Server.evaluateResponse(GssKrb5Server.java:137)
... 4 more
Caused by: KrbException: Request is a replay (34)
at sun.security.krb5.KrbApReq.authenticate(KrbApReq.java:300)
at sun.security.krb5.KrbApReq.<init>(KrbApReq.java:134)
at sun.security.jgss.krb5.InitSecContextToken.<init>(InitSecContextToken.java:79)
at sun.security.jgss.krb5.Krb5Context.acceptSecContext(Krb5Context.java:724)
... 7 more

In addition, this problem can manifest itself as performance issues for all clients in the cluster, including dropped
connections, timeouts attempting to make RPC calls, and so on.
Likely causes (a quick check for each is sketched below):
Multiple services in the cluster are using the same Kerberos principal. All secure clients that run on
multiple machines should use unique Kerberos principals for each machine. For example, rather than
connecting as a service principal [email protected], services should have per-host principals
such as myservice/[email protected].
Clocks not in synch: All hosts should run NTP so that clocks are kept in synch between clients and servers.
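Two quick checks for these causes, run on each suspect host (the keytab path is an example):
$ klist -kt /etc/hadoop/conf/hdfs.keytab   # principals should be per-host, e.g. hdfs/[email protected]
$ ntpq -p                                  # confirm the host is synchronized against an NTP peer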


Appendix B - Information about Other Hadoop Security Programs


This section contains information about the following programs:
MRv1 and YARN: A binary called jsvc that is in the bigtop-jsvc package and installed in either
/usr/lib/bigtop-utils/jsvc or /usr/libexec/bigtop-utils/jsvc depending on the particular Linux
flavor. See MRv1 and YARN: The jsvc Program on page 191.
MRv1 Only: A setuid binary called task-controller that is in the hadoop-0.20-mapreduce package and
installed in either /usr/lib/hadoop-0.20-mapreduce/sbin/Linux-amd64-64/task-controller or
/usr/lib/hadoop-0.20-mapreduce/sbin/Linux-i386-32/task-controller. See MRv1 Only: The Linux
TaskController Program on page 191.
YARN only: A setuid binary called container-executor that is in the hadoop-yarn package and installed
in /usr/lib/hadoop-yarn/bin/container-executor. See YARN Only: The Linux Container Executor
Program on page 191.

MRv1 and YARN: The jsvc Program


The jsvc program is used to start the DataNode listening on low port numbers. Its entry point is
the SecureDataNodeStarter class, which implements the Daemon interface that jsvc expects. jsvc is run
as root, and calls the SecureDataNodeStarter.init(...) method while running as root. Once the
SecureDataNodeStarter class has finished initializing, jsvc sets the effective UID to be the hdfs user, and
then calls SecureDataNodeStarter.start(...). SecureDataNodeStarter then calls the regular DataNode
entry point, passing in a reference to the privileged resources it previously obtained.
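As an illustration of how jsvc is typically wired in for a secure DataNode, the environment file below is a sketch only: the file name (/etc/default/hadoop-hdfs-datanode) and values are assumptions that may differ in your installation, and the DataNode must also be configured in hdfs-site.xml to bind to privileged ports (below 1024) for dfs.datanode.address and dfs.datanode.http.address.
# /etc/default/hadoop-hdfs-datanode (sketch; file name and values are assumptions)
export HADOOP_SECURE_DN_USER=hdfs                    # drop privileges to this user after binding the low ports
export JSVC_HOME=/usr/lib/bigtop-utils               # directory containing the jsvc binary described above
export HADOOP_SECURE_DN_PID_DIR=/var/run/hadoop-hdfs
export HADOOP_SECURE_DN_LOG_DIR=/var/log/hadoop-hdfs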

MRv1 Only: The Linux TaskController Program


The task-controller program, which is used on MRv1 only, allows the TaskTracker to run tasks under the
Unix account of the user who submitted the job in the first place. It is a setuid binary that must have a very
specific set of permissions and ownership in order to function correctly. In particular, it must:
1. Be owned by root
2. Be owned by a group that contains only the user running the MapReduce daemons
3. Be setuid
4. Be group readable and executable

This corresponds to the ownership root:mapred and the permissions 4754.


Here is the output of ls on a correctly-configured Task-controller:
-rwsr-xr-- 1 root mapred 30888 Mar 18 13:03 task-controller

The TaskTracker will check for this configuration on start up, and fail to start if the Task-controller is not configured
correctly.
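For example, using the 64-bit path mentioned earlier in this appendix, the ownership and permissions can be set and then verified like this:
$ sudo chown root:mapred /usr/lib/hadoop-0.20-mapreduce/sbin/Linux-amd64-64/task-controller
$ sudo chmod 4754 /usr/lib/hadoop-0.20-mapreduce/sbin/Linux-amd64-64/task-controller
$ ls -l /usr/lib/hadoop-0.20-mapreduce/sbin/Linux-amd64-64/task-controller   # should show -rwsr-xr-- root mapred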

YARN Only: The Linux Container Executor Program


The container-executor program, which is used on YARN only and supported on GNU/Linux only, runs the
containers as the user who submitted the application. It requires all user accounts to be created on the cluster
nodes where the containers are launched. It uses a setuid executable that is included in the Hadoop distribution.
The NodeManager uses this executable to launch and kill containers. The setuid executable switches to the user
who has submitted the application and launches or kills the containers. For maximum security, this executor
sets up restricted permissions and user/group ownership of local files and directories used by the containers
such as the shared objects, jars, intermediate files, log files, and so on. As a result, only the application owner
and NodeManager can access any of the local files/directories including those localized as part of the distributed
cache.
The container-executor program must have a very specific set of permissions and ownership in order to
function correctly. In particular, it must:
1. Be owned by root
2. Be owned by a group that contains only the user running the YARN daemons
3. Be setuid
4. Be group readable and executable

This corresponds to the ownership root:yarn and the permissions 6050.


---Sr-s--- 1 root yarn 91886 2012-04-01 19:54 container-executor
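The corresponding commands for the container-executor path given at the start of this appendix are, as a sketch:
$ sudo chown root:yarn /usr/lib/hadoop-yarn/bin/container-executor
$ sudo chmod 6050 /usr/lib/hadoop-yarn/bin/container-executor
$ ls -l /usr/lib/hadoop-yarn/bin/container-executor   # should show ---Sr-s--- root yarn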

Appendix C - Configuring the Mapping from Kerberos Principals to Short Names

You configure the mapping from Kerberos principals to short names in the hadoop.security.auth_to_local
property setting in the core-site.xml file. Kerberos has this support natively, and Hadoop's implementation
reuses Kerberos's configuration language to specify the mapping.
A mapping consists of a set of rules that are evaluated in the order listed in the
hadoop.security.auth_to_local property. The first rule that matches a principal name is used to map that
principal name to a short name. Any later rules in the list that match the same principal name are ignored.
You specify the mapping rules on separate lines in the hadoop.security.auth_to_local property as follows:
<property>
<name>hadoop.security.auth_to_local</name>
<value>
RULE:[<principal translation>](<acceptance filter>)<short name substitution>
RULE:[<principal translation>](<acceptance filter>)<short name substitution>
DEFAULT
</value>
</property>

Mapping Rule Syntax


To specify a mapping rule, use the prefix string RULE: followed by three sections (principal translation,
acceptance filter, and short name substitution), described in more detail below. The syntax of a mapping rule is:
RULE:[<principal translation>](<acceptance filter>)<short name substitution>

Principal Translation
The first section of a rule, <principal translation>, performs the matching of the principal name to the rule.
If there is a match, the principal translation also does the initial translation of the principal name to a short
name. In the <principal translation> section, you specify the number of components in the principal name
and the pattern you want to use to translate those principal component(s) and realm into a short name. In
Kerberos terminology, a principal name is a set of components separated by slash ("/") characters.
The principal translation is composed of two parts that are both specified within "[ ]" using the following syntax:
[<number of components in principal name>:<initial specification of short name>]

where:
<number of components in principal name> This first part specifies the number of components in the principal
name (not including the realm) and must be 1 or 2. A value of 1 specifies principal names that have a single
component (for example, hdfs), and 2 specifies principal names that have two components (for example,
hdfs/fully.qualified.domain.name). A principal name that has only one component will only match
single-component rules, and a principal name that has two components will only match two-component rules.
<initial specification of short name> This second part specifies a pattern for translating the principal
component(s) and the realm into a short name. The variable $0 translates the realm, $1 translates the first
component, and $2 translates the second component.
Here are some examples of principal translation sections. These examples use [email protected] and
atm/[email protected] as principal name inputs:

This Principal Translation   Translates [email protected]      Translates atm/[email protected]
                             into this short name                into this short name

[1:$1@$0]                    [email protected]                   Rule does not match (1)
[1:$1]                       atm                                 Rule does not match (1)
[1:$1.foo]                   atm.foo                             Rule does not match (1)
[2:$1/$2@$0]                 Rule does not match (2)             atm/[email protected]
[2:$1/$2]                    Rule does not match (2)             atm/fully.qualified.domain.name
[2:$1@$0]                    Rule does not match (2)             [email protected]
[2:$1]                       Rule does not match (2)             atm

Footnotes:
(1) The rule does not match because there are two components in the principal name
    atm/[email protected].
(2) The rule does not match because there is one component in the principal name [email protected].

Acceptance Filter
The second section of a rule, (<acceptance filter>), matches the translated short name from the principal
translation (that is, the output from the first section). The acceptance filter is specified in "( )" characters and is
a standard regular expression. A rule matches only if the specified regular expression matches the entire
translated short name from the principal translation. That is, there's an implied ^ at the beginning of the pattern
and an implied $ at the end.

Short Name Substitution


The third and final section of a rule is the (<short name substitution>). If there is a match in the second
section, the acceptance filter, the (<short name substitution>) section does a final translation of the short
name from the first section. This translation is a sed replacement expression (s/.../.../g) that translates
the short name from the first section into the final short name string. The short name substitution section is
optional. In many cases, it is sufficient to use the first two sections only.

Converting Principal Names to Lowercase


In some organizations, naming conventions result in mixed-case usernames (for example, John.Doe) or even
uppercase usernames (for example, JDOE) in Active Directory or LDAP. This can cause a conflict when the Linux
username and HDFS home directory are lowercase.
To convert principal names to lowercase, append /L to the rule.

Example Rules
Suppose all of your service principals are either of the form
App.service-name/[email protected] or
[email protected], and you want to map these to the short name string service-name.

To do this, your rule set would be:


<property>
<name>hadoop.security.auth_to_local</name>
<value>
RULE:[1:$1](App\..*)s/App\.(.*)/$1/g
RULE:[2:$1](App\..*)s/App\.(.*)/$1/g
DEFAULT
</value>
</property>

The first $1 in each rule is a reference to the first component of the full principal name, and the second $1 is a
regular expression back-reference to text that is matched by (.*).
In the following example, suppose your company's naming scheme for user accounts in Active Directory is
FirstnameLastname (for example, JohnDoe), but user home directories in HDFS are /user/firstnamelastname.
The following rule set converts user accounts in the CORP.EXAMPLE.COM domain to lowercase.
<property>
<name>hadoop.security.auth_to_local</name>
<value>RULE:[1:$1@$0](.*@\QCORP.EXAMPLE.COM\E$)s/@\QCORP.EXAMPLE.COM\E$///L
RULE:[2:$1@$0](.*@\QCORP.EXAMPLE.COM\E$)s/@\QCORP.EXAMPLE.COM\E$///L
DEFAULT</value>
</property>

In this example, the [email protected] principal becomes the johndoe HDFS user.

Default Rule
You can specify an optional default rule called DEFAULT (see example above). The default rule reduces a principal
name down to its first component only. For example, the default rule reduces the principal names
[email protected] or atm/[email protected] down to atm, assuming that
the default domain is YOUR-REALM.COM.
The default rule applies only if the principal is in the default realm.
If a principal name does not match any of the specified rules, the mapping for that principal name will fail.

Testing Mapping Rules


You can test mapping rules for a long principal name by running:
$ hadoop org.apache.hadoop.security.HadoopKerberosName name1 name2 name3
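For example, with the lowercasing rules above in place, the output should look something like the following (the exact output format can vary between Hadoop versions):
$ hadoop org.apache.hadoop.security.HadoopKerberosName [email protected]
Name: [email protected] to johndoe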

Appendix D - Enabling Debugging Output for the Sun Kerberos Classes

Initially getting a secure Hadoop cluster configured properly can be tricky, especially for those who are not yet
familiar with Kerberos. To help with this, it can be useful to enable debugging output for the Sun Kerberos
classes. To do so, set the HADOOP_OPTS environment variable to the following:
HADOOP_OPTS="-Dsun.security.krb5.debug=true"
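For example, to capture the debugging output for a single command (note that hadoop-env.sh may also manipulate HADOOP_OPTS in your installation):
$ export HADOOP_OPTS="-Dsun.security.krb5.debug=true"
$ hadoop fs -ls /    # Kerberos debug messages are printed to the console alongside the normal command output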


Appendix E - Task-controller and Container-executor Error Codes


The task-controller and container-executor programs are setuid binaries that Hadoop uses internally to
run tasks under the Unix account of the user who submitted the job. For more information about these programs,
see Appendix B - Information about Other Hadoop Security Programs.
When you set up a secure cluster for the first time and debug problems with it, the task-controller or
container-executor may encounter errors. These programs communicate these errors to the TaskTracker or
NodeManager daemon via numeric error codes which will appear in the TaskTracker or NodeManager logs
respectively (/var/log/hadoop-mapreduce or /var/log/hadoop-yarn). The following sections list the
possible numeric error codes with descriptions of what they mean:
MRv1 ONLY: Task-controller Error Codes on page 199
YARN ONLY: Container-executor Error Codes on page 201

MRv1 ONLY: Task-controller Error Codes


The following error codes apply to the task-controller in MRv1.

1 - INVALID_ARGUMENT_NUMBER: Incorrect number of arguments provided for the given task-controller
    command, or failure to initialize the job localizer.
2 - INVALID_USER_NAME: The user passed to the task-controller does not exist.
3 - INVALID_COMMAND_PROVIDED: The task-controller does not recognize the command it was asked to execute.
4 - SUPER_USER_NOT_ALLOWED_TO_RUN_TASKS: The user passed to the task-controller was the super user.
5 - INVALID_TT_ROOT: The passed TaskTracker root does not match the configured TaskTracker root
    (mapred.local.dir), or does not exist.
6 - SETUID_OPER_FAILED: Either could not read the local groups database, or could not set UID or GID.
7 - UNABLE_TO_EXECUTE_TASK_SCRIPT: The task-controller could not execute the task launcher script.
8 - UNABLE_TO_KILL_TASK: The task-controller could not kill the task it was passed.
9 - INVALID_TASK_PID: The PID passed to the task-controller was negative or 0.
10 - ERROR_RESOLVING_FILE_PATH: The task-controller couldn't resolve the path of the task launcher
     script file.
11 - RELATIVE_PATH_COMPONENTS_IN_FILE_PATH: The path to the task launcher script file contains relative
     components (for example, "..").
12 - UNABLE_TO_STAT_FILE: The task-controller didn't have permission to stat a file it needed to check
     the ownership of.
13 - FILE_NOT_OWNED_BY_TASKTRACKER: A file whose ownership the task-controller must change has the
     wrong ownership.
14 - PREPARE_ATTEMPT_DIRECTORIES_FAILED: The mapred.local.dir is not configured, could not be read by
     the task-controller, or could not have its ownership secured.
15 - INITIALIZE_JOB_FAILED: The task-controller couldn't get, stat, or secure the job directory or job
     working directory.
16 - PREPARE_TASK_LOGS_FAILED: The task-controller could not find or could not change the ownership of
     the task log directory to the passed user.
17 - INVALID_TT_LOG_DIR: The hadoop.log.dir is not configured.
18 - OUT_OF_MEMORY: The task-controller couldn't determine the job directory path or the task launcher
     script path.
19 - INITIALIZE_DISTCACHEFILE_FAILED: Couldn't get a unique value for, stat, or the local distributed
     cache directory.
20 - INITIALIZE_USER_FAILED: Couldn't get, stat, or secure the per-user task tracker directory.
21 - UNABLE_TO_BUILD_PATH: The task-controller couldn't concatenate two paths, most likely because it
     ran out of memory.
22 - INVALID_TASKCONTROLLER_PERMISSIONS: The task-controller binary does not have the correct
     permissions set. See Appendix B - Information about Other Hadoop Security Programs.
23 - PREPARE_JOB_LOGS_FAILED: The task-controller could not find or could not change the ownership of
     the job log directory to the passed user.
24 - INVALID_CONFIG_FILE: The taskcontroller.cfg file is missing, malformed, or has incorrect permissions.
255 - Unknown Error: There are several causes for this error. Some common causes are:
      There are user accounts on your cluster that have a user ID less than the value specified for the
      min.user.id property in the taskcontroller.cfg file. The default value is 1000, which is appropriate
      on Ubuntu systems but may not be valid for your operating system. For information about setting
      min.user.id in the taskcontroller.cfg file, see this step.
      Jobs won't run and the TaskTracker is unable to create a Hadoop logs directory. For more information,
      see (MRv1 Only) Problem 8: Jobs won't run and TaskTracker is unable to create a Hadoop logs
      directory. on page 186.
      This error is often caused by previous errors; look earlier in the log file for possible causes.

YARN ONLY: Container-executor Error Codes


The following error codes apply to the container-executor in YARN.

1 - INVALID_ARGUMENT_NUMBER: Incorrect number of arguments provided for the given container-executor
    command, or failure to initialize the container localizer.
2 - INVALID_USER_NAME: The user passed to the container-executor does not exist.
3 - INVALID_COMMAND_PROVIDED: The container-executor does not recognize the command it was asked to
    execute.
5 - INVALID_NM_ROOT: The passed NodeManager root does not match the configured NodeManager root
    (yarn.nodemanager.local-dirs), or does not exist.
6 - SETUID_OPER_FAILED: Either could not read the local groups database, or could not set UID or GID.
7 - UNABLE_TO_EXECUTE_CONTAINER_SCRIPT: The container-executor could not execute the container launcher
    script.
8 - UNABLE_TO_SIGNAL_CONTAINER: The container-executor could not signal the container it was passed.
9 - INVALID_CONTAINER_PID: The PID passed to the container-executor was negative or 0.
18 - OUT_OF_MEMORY: The container-executor couldn't allocate enough memory while reading the
     container-executor.cfg file, or while getting the paths for the container launcher script or
     credentials files.
20 - INITIALIZE_USER_FAILED: Couldn't get, stat, or secure the per-user node manager directory.
21 - UNABLE_TO_BUILD_PATH: The container-executor couldn't concatenate two paths, most likely because
     it ran out of memory.
22 - INVALID_CONTAINER_EXEC_PERMISSIONS: The container-executor binary does not have the correct
     permissions set. See Appendix B - Information about Other Hadoop Security Programs.
24 - INVALID_CONFIG_FILE: The container-executor.cfg file is missing, malformed, or has incorrect
     permissions.
25 - SETSID_OPER_FAILED: Could not set the session ID of the forked container.
26 - WRITE_PIDFILE_FAILED: Failed to write the value of the PID of the launched container to the PID
     file of the container.
255 - Unknown Error: There are several causes for this error. Some common causes are:
      There are user accounts on your cluster that have a user ID less than the value specified for the
      min.user.id property in the container-executor.cfg file. The default value is 1000, which is
      appropriate on Ubuntu systems but may not be valid for your operating system. For information about
      setting min.user.id in the container-executor.cfg file, see this step.
      This error is often caused by previous errors; look earlier in the log file for possible causes.


Appendix F - Using kadmin to Create Kerberos Keytab Files


If your version of Kerberos does not support the Kerberos -norandkey option in the xst command, or if you
must use kadmin because you cannot use kadmin.local, then you can use the following procedure to create
Kerberos keytab files. Using the -norandkey option when creating keytabs is optional and a convenience, but
it is not required.
Important:
For both MRv1 and YARN deployments: On every machine in your cluster, there must be a keytab file
for the hdfs user and a keytab file for the mapred user. The hdfs keytab file must contain entries for
the hdfs principal and an HTTP principal, and the mapred keytab file must contain entries for the
mapred principal and an HTTP principal. On each respective machine, the HTTP principal will be the
same in both keytab files.
In addition, for YARN deployments only: On every machine in your cluster, there must be a keytab file
for the yarn user. The yarn keytab file must contain entries for the yarn principal and an HTTP
principal. On each respective machine, the HTTP principal in the yarn keytab file will be the same as
the HTTP principal in the hdfs and mapred keytab files.
For instructions, see To create the Kerberos keytab files on page 203.
Note:
These instructions illustrate an example of creating keytab files for MIT Kerberos. If you are using
another version of Kerberos, refer to your Kerberos documentation for instructions. You can use either
kadmin or kadmin.local to run these commands.

To create the Kerberos keytab files


Do the following steps for every host in your cluster, replacing the fully.qualified.domain.name in the
commands with the fully qualified domain name of each host:
1. Create the hdfs keytab file, which contains an entry for the hdfs principal. This keytab file is used for the
NameNode, Secondary NameNode, and DataNodes.
$ kadmin
kadmin: xst -k hdfs-unmerged.keytab hdfs/fully.qualified.domain.name

2. Create the mapred keytab file, which contains an entry for the mapred principal. If you are using MRv1, the
mapred keytab file is used for the JobTracker and TaskTrackers. If you are using YARN, the mapred keytab
file is used for the MapReduce Job History Server.
kadmin: xst -k mapred-unmerged.keytab mapred/fully.qualified.domain.name

3. YARN only: Create the yarn keytab file, which contains an entry for the yarn principal. This keytab file is
used for the ResourceManager and NodeManager.
kadmin: xst -k yarn-unmerged.keytab yarn/fully.qualified.domain.name

4. Create the http keytab file, which contains an entry for the HTTP principal.
kadmin: xst -k http.keytab HTTP/fully.qualified.domain.name


5. Use the ktutil command to merge the previously-created keytabs:
$ ktutil
ktutil: rkt hdfs-unmerged.keytab
ktutil: rkt http.keytab
ktutil: wkt hdfs.keytab
ktutil: clear
ktutil: rkt mapred-unmerged.keytab
ktutil: rkt http.keytab
ktutil: wkt mapred.keytab
ktutil: clear
ktutil: rkt yarn-unmerged.keytab
ktutil: rkt http.keytab
ktutil: wkt yarn.keytab

This procedure creates three new files: hdfs.keytab, mapred.keytab and yarn.keytab. These files contain
entries for the hdfs and HTTP principals, the mapred and HTTP principals, and the yarn and HTTP principals
respectively.
6. Use klist to display the keytab file entries. For example, a correctly-created hdfs keytab file should look
something like this:
$ klist -e -k -t hdfs.keytab
Keytab name: WRFILE:hdfs.keytab
slot KVNO Principal
---- ---- ---------------------------------------------------------------------
   1    7 HTTP/[email protected] (DES cbc mode with CRC-32)
   2    7 HTTP/[email protected] (Triple DES cbc mode with HMAC/sha1)
   3    7 hdfs/[email protected] (DES cbc mode with CRC-32)
   4    7 hdfs/[email protected] (Triple DES cbc mode with HMAC/sha1)

7. To verify that you have performed the merge procedure correctly, make sure you can obtain credentials as
both the hdfs and HTTP principals using the single merged keytab:
$ kinit -k -t hdfs.keytab hdfs/[email protected]
$ kinit -k -t hdfs.keytab HTTP/[email protected]

If either of these commands fails with an error message such as "kinit: Key table entry not found
while getting initial credentials", then something has gone wrong during the merge procedure. Go
back to step 1 of this document and verify that you performed all the steps correctly.
8. To continue the procedure of configuring Hadoop security in CDH 5, follow the instructions in the section To
deploy the Kerberos keytab files.


Appendix G - Setting Up a Gateway Node to Restrict Access


Use the instructions that follow to set up and use a Hadoop cluster that is entirely firewalled off from outside
access; the only exception will be one node which will act as a gateway. Client machines can access the cluster
through the gateway via the REST API.
HttpFS will be used to allow REST access to HDFS, and Oozie will allow REST access for submitting and monitoring
jobs.

Installing and Configuring the Firewall and Gateway


Follow these steps:
1. Choose a cluster node to be the gateway machine.
2. Install and configure the Oozie server by following the standard directions starting here: Installing Oozie.
3. Install HttpFS.
4. Start the Oozie server:
   $ sudo service oozie start

5. Start the HttpFS server:
   $ sudo service hadoop-httpfs start

6. Configure firewalls.
Block all access from outside the cluster.
The gateway node should have ports 11000 (oozie) and 14000 (hadoop-httpfs) open.
Optionally, to maintain access to the Web UIs for the cluster's JobTrackers, NameNodes, etc., open their
HTTP ports: see Ports Used by Components of CDH 5.
7. Optionally configure authentication in simple mode (default) or using Kerberos. See HttpFS Security
Configuration on page 93 to configure Kerberos for HttpFS and Oozie Security Configuration on page 87 to
configure Kerberos for Oozie.
8. Optionally encrypt communication via HTTPS for Oozie by following these directions.

Accessing HDFS
With the Hadoop client:
All of the standard hadoop fs commands will work; just make sure to specify -fs webhdfs://HOSTNAME:14000.
For example (where GATEWAYHOST is the hostname of the gateway machine):
$ hadoop fs -fs webhdfs://GATEWAYHOST:14000 -cat /user/me/myfile.txt
Hello World!

Without the Hadoop client:


You can run all of the standard hadoop fs commands by using the WebHDFS REST API and any program that
can do GET, PUT, POST, and DELETE requests; for example:
$ curl "http://GATEWAYHOST:14000/webhdfs/v1/user/me/myfile.txt?op=OPEN&user.name=me"
Hello World!



Important: The user.name parameter is valid only if security is disabled. In a secure cluster, you must
initiate a valid Kerberos session.
In general, the command will look like this:
$ curl "http://GATEWAYHOST/webhdfs/v1/PATH?[user.name=USER&]op="

You can find a full explanation of the commands in the WebHDFS REST API documentation.

Submitting and Monitoring Jobs


The Oozie REST API currently supports direct submission of MapReduce, Pig, and Hive jobs; Oozie will automatically
create a workflow with a single action. For any other action types, or to execute anything more complicated than
a single job, you will need to create an actual workflow. Any required files (e.g. JAR files, input data, etc.) must
already exist on HDFS; if they don't, you can use HttpFS to upload the files.
With the Oozie client:
All of the standard Oozie commands will work. You can find a full explanation of the commands in the
documentation for the command-line utilities.
Without the Oozie client:
You can run all of the standard Oozie commands by using the REST API and any program that can do GET, PUT,
and POST requests. You can find a full explanation of the commands in the Oozie Web Services API
documentation.
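As a quick sanity check that the gateway's Oozie REST endpoint is reachable, a request such as the following (using port 11000 opened above) can be issued from a client machine; a healthy server typically responds with JSON similar to the line shown:
$ curl "http://GATEWAYHOST:11000/oozie/v1/admin/status"
{"systemMode":"NORMAL"}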


Appendix H - Using a Web Browser to Access a URL Protected by Kerberos HTTP SPNEGO

To access a URL protected by Kerberos HTTP SPNEGO, use the following instructions for the browser you are
using.
To configure Mozilla Firefox:
1. Open the low level Firefox configuration page by loading the about:config page.
2. In the Search text box, enter: network.negotiate-auth.trusted-uris
3. Double-click the network.negotiate-auth.trusted-uris preference and enter the hostname or the
domain of the web server that is protected by Kerberos HTTP SPNEGO. Separate multiple domains and
hostnames with a comma.
4. Click OK.

To configure Internet Explorer:


Follow the instructions given below to configure Internet Explorer to access URLs protected by Kerberos
HTTP SPNEGO.
Configuring the Local Intranet Domain
1. Open Internet Explorer and click the Settings "gear" icon in the top-right corner. Select Internet options.
2. Select the Security tab.
3. Select the Local Intranet zone and click the Sites button.
4. Make sure that the first two options, Include all local (intranet) sites not listed in other zones and
   Include all sites that bypass the proxy server, are checked.
5. Click Advanced and add the names of the domains that are protected by Kerberos HTTP SPNEGO, one at a
time, to the list of websites. For example, myhost.example.com. Click Close.
6. Click OK to save your configuration changes.

Configuring Intranet Authentication


1. Click the Settings "gear" icon in the top-right corner. Select Internet options.
2. Select the Security tab.
3. Select the Local Intranet zone and click the Custom level... button to open the Security Settings - Local
Intranet Zone dialog box.
4. Scroll down to the User Authentication options and select Automatic logon only in Intranet zone.
5. Click OK to save these changes.

Verifying Proxy Settings

You need to perform the following steps only if you have a proxy server already enabled.
1. Click the Settings "gear" icon in the top-right corner. Select Internet options.
2. Select the Connections tab and click LAN Settings.
3. Verify that the proxy server Address and Port number settings are correct.
4. Click Advanced to open the Proxy Settings dialog box.
5. Add the Kerberos-protected domains to the Exceptions field.
6. Click OK to save any changes.

To configure Google Chrome:


If you are using Windows, no configuration changes are needed for Google Chrome.
On MacOS or Linux, add the --auth-server-whitelist parameter to the google-chrome command. For
example, to run Chrome from a Linux prompt, run the google-chrome command as follows:
> google-chrome --auth-server-whitelist="hostname/domain"


Appendix I - Configuring LDAP Group Mappings


To set up LDAP (AD) group mappings for Hadoop, add the following properties to the core-site.xml file on
the NameNode:
<property>
<name>hadoop.security.group.mapping</name>
<value>org.apache.hadoop.security.LdapGroupsMapping</value>
</property>
<property>
<name>hadoop.security.group.mapping.ldap.url</name>
<value>ldap://server</value>
</property>
<property>
<name>hadoop.security.group.mapping.ldap.bind.user</name>
<value>[email protected]</value>
</property>
<property>
<name>hadoop.security.group.mapping.ldap.bind.password</name>
<value>****</value>
</property>
<property>
<name>hadoop.security.group.mapping.ldap.base</name>
<value>dc=cloudera-ad,dc=local</value>
</property>
<property>
<name>hadoop.security.group.mapping.ldap.search.filter.user</name>
<value>(&amp;(objectClass=user)(sAMAccountName={0}))</value>
</property>
<property>
<name>hadoop.security.group.mapping.ldap.search.filter.group</name>
<value>(objectClass=group)</value>
</property>
<property>
<name>hadoop.security.group.mapping.ldap.search.attr.member</name>
<value>member</value>
</property>
<property>
<name>hadoop.security.group.mapping.ldap.search.attr.group.name</name>
<value>cn</value>
</property>

Ensure all your services are registered users in LDAP.
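To verify that the LDAP mapping is actually being used, you can ask Hadoop which groups it resolves for a user; the user and group names below are examples only:
$ hdfs groups jdoe
jdoe : domain-users hadoop-admins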


Note: In addition:
If you are using Sentry with Hive, you will also need to add these properties on the HiveServer2
node.
If you are using Sentry with Impala, add these properties on all hosts
See Users and Groups in Sentry for more information.


Appendix J - Before Logging a Support Case


Before you log a support case, ensure you have either part or all of the following information to help Support
investigate your case.
If possible, provide a diagnostic data bundle following the instructions here: Sending Diagnostic Data to
Cloudera.
Provide details about the issue such as what was observed and what the impact was.
Provide any error messages that were seen, using screen captures if necessary, and attach them to the case.
If you were running a command or performing a series of steps, provide the commands and the results,
captured to a file if possible.
Specify whether the issue took place in a new install or a previously working cluster.
Mention any configuration changes made in the lead-up to the issue being seen.
Specify the type of release environment the issue is taking place in, such as sandbox, development or
production.
Specify the severity of the impact and whether it is causing an outage.

Kerberos Issues
For Kerberos issues, your krb5.conf and kdc.conf files are valuable for support to be able to understand
your configuration.
If you are having trouble with client access to the cluster, provide the output for klist -ef after kiniting as
the user account on the client host in question. Additionally, confirm that your ticket is renewable by running
kinit -R after successfully kiniting.
Specify if you are authenticating (kiniting) with a user outside of the Hadoop cluster's realm (such as Active
Directory, or another MIT Kerberos realm).
If using AES-256 encryption, please ensure you have the Unlimited Strength JCE Policy Files deployed on all
cluster and client nodes.

SSL/TLS Issues
Specify whether you are using a private/commercial CA for your certificates, or if they are self-signed.
Clarify what services you are attempting to setup SSL/TLS for in your description.
When troubleshooting SSL/TLS trust issues, provide the output of the following openssl command:
openssl s_client -connect host.fqdn.name:port

LDAP Issues
Specify the LDAP service in use (Active Directory, OpenLDAP, one of the Oracle Directory Server offerings,
OpenDJ, and so on).
Provide a screenshot of the LDAP configuration screen you are working with if you are troubleshooting setup
issues.
Be prepared to troubleshoot using the ldapsearch command (requires the openldap-clients package)
on the host where LDAP authentication or authorization issues are being seen.


Appendix K - Authenticating Kerberos Principals in Java Code


This topic provides an example of how to authenticate a Kerberos principal in a Java application using the
org.apache.hadoop.security.UserGroupInformation class.
The following code snippet authenticates the cloudera principal using the cloudera.keytab file:
import org.apache.hadoop.security.UserGroupInformation;

// Authenticate the Kerberos principal from the keytab; subsequent Hadoop calls run as this user
System.out.println("Principal Authentication: ");
final String user = "[email protected]";
final String keyPath = "cloudera.keytab";
UserGroupInformation.loginUserFromKeytab(user, keyPath);
