SAS Hadoop Kerberos
1 Introduction
Note: The content of this paper refers exclusively to the second maintenance release (M2) of
SAS 9.4.
1.1 Purpose of the Paper
This paper describes how to ensure that the SAS software components interoperate with a secure Hadoop environment. The SAS software components covered are SAS/ACCESS to Hadoop and SAS Distributed In-Memory Processes. The paper focuses on Kerberos-based access to the Hadoop environment and does not cover using Kerberos authentication to access the SAS environment.
1.2 Deployment Considerations Overview
• SAS does not directly interact with Kerberos. SAS relies on the underlying operating
system and APIs to handle requesting tickets, managing ticket caches, and
authenticating users.
• The operating system of the hosts where either SAS Foundation or the SAS High-Performance Analytics root node runs must use Kerberos authentication. These hosts must either use the same Kerberos realm as the secure Hadoop environment or have a trust configured with that Kerberos realm.
• The Kerberos Ticket-Granting Ticket (TGT), which is generated at the initiation of the user’s session, is stored in the Kerberos ticket cache. The Kerberos ticket cache must be available to the SAS processes that connect to the secure Hadoop environment. Either the jproxy process started by SAS Foundation or the SAS High-Performance Analytics Environment root node needs to access the Kerberos ticket cache.
• On Linux and most UNIX platforms, the Kerberos ticket cache will be a file. On
Linux, by default, this will be /tmp/krb5cc_<uid>_<rand>. By default on Windows, the
Kerberos ticket cache that is created by standard authentication processing is in
memory. Windows can be configured to use MIT Kerberos and then use a file for the
Kerberos ticket cache.
• Microsoft locks access to the Kerberos Ticket-Granting Ticket session key when using
the memory Kerberos Ticket Cache. To use the Ticket-Granting Ticket for non-
Windows processes, you must add a Windows registry key in the Registry Editor.
• The SAS Workspace Server or other server started by the SAS Object Spawner might
not have the correct value set for the KRB5CCNAME environment variable. This
environment variable points to the location of the Kerberos ticket cache. Code can be
added to the WorkspaceServer_usermods.sh to correct the value of the
KRB5CCNAME environment variable.
• Kerberos attempts to use the highest available encryption strength for the Ticket-
Granting Ticket. (In most cases, this is 256-bit AES.) Java, by default, cannot process
256-bit AES encryption. To enable Java processes to use the Ticket-Granting Ticket,
you must download the Unlimited Strength Jurisdiction Policy Files and add them to
the Java Runtime Environment. Due to import regulations in some countries, you
should verify that the use of the Unlimited Strength Jurisdiction Policy Files is
permissible under local regulations.
• There can be three different Java Runtime Environments (JRE) in use in the complete
system. There is the JRE used by the Hadoop Distribution, the SAS Private JRE used
by SAS Foundation, and the JRE used by the SAS High-Performance Analytics
Environment. All of these JREs might require the Unlimited Strength Jurisdiction
Policy Files.
• You need to regenerate the Hadoop configuration file (an XML file that describes the
Hadoop environment) after Kerberos is enabled in Hadoop. The XML file used by
SAS merges several configuration files from the Hadoop environment. Which files are
merged depends on the version of MapReduce that is used in the Hadoop environment.
• The SAS LIBNAME statement and the PROC HADOOP statement use different syntax when connecting to a secure Hadoop environment than when connecting to a nonsecure one. In both cases, user names and passwords are not submitted.
Hadoop security is currently an evolving field. Most major Hadoop distributors are developing competing security projects; examples include Cloudera Sentry and the Apache Knox Gateway. A common feature of these security projects is that they rely on Kerberos being enabled for the Hadoop environment.
The non-secure configuration relies on client-side libraries. As part of the protocol, these libraries
send the client-side credentials as determined from the client-side operating system. While not secure,
this configuration is sufficient for many deployments that rely on physical security. Authorization
checks through ACLs and file permissions are still performed against the client-supplied user ID.
After Kerberos is configured, Kerberos authentication is used to validate the client-side credentials. This means that the client must request a Service Ticket that is valid for the Hadoop environment and submit this Service Ticket as part of the client connection.
Kerberos provides strong authentication. Tickets are exchanged between client and server, and validation is provided by a trusted third party in the form of the Kerberos Key Distribution Center. To create a new Kerberos Key Distribution Center specifically for the Hadoop environment, follow the standard instructions from Cloudera or Hortonworks.
This process is used to authenticate both users and server processes. For example, with Cloudera 4.5,
the management tools include all the required scripts to configure Cloudera to use Kerberos. Running
these scripts after you register an administrator principal causes Cloudera to use Kerberos. This
process can be completed in minutes after the Kerberos Key Distribution Center is installed and
configured.
After the user has a Ticket-Granting Ticket, the client application that provides access to Hadoop services initiates a request for a Service Ticket (ST). This ST request corresponds to the Hadoop service that the user is accessing. The ST is sent, as part of the connection, to the Hadoop service, and the Hadoop service then authenticates the user: the service decrypts the ST using the Service Key, which was previously exchanged with the Kerberos Key Distribution Center. If this decryption is successful, the end user is authenticated to the Hadoop Service.
Cloudera expects the customer to use MIT Kerberos, Release 5. Cloudera’s solution for customers who want to integrate into a wider Active Directory domain structure is to implement a separate MIT Kerberos KDC for the Cloudera cluster and then implement the required trusts to integrate that KDC with the Active Directory. Using an alternative Kerberos distribution, or even a locked-down version of the MIT distribution such as the one found in the Red Hat Identity Management product, is not supported. The Cloudera scripts issue MIT Kerberos-specific commands and fail if the MIT version of Kerberos is not present.
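As an illustration of the trust configuration, the following is a minimal sketch of the MIT KDC side of a one-way cross-realm trust. The realm names CLUSTER.EXAMPLE.COM (the cluster's MIT realm) and AD.EXAMPLE.COM (the Active Directory realm) are placeholders, and the password supplied must match the trust password configured on the Active Directory side.

# Hedged sketch: realm names are placeholders; run on the MIT KDC host.
# Create the cross-realm principal that lets Active Directory users obtain
# service tickets for services in the cluster realm.
kadmin.local -q "addprinc krbtgt/CLUSTER.EXAMPLE.COM@AD.EXAMPLE.COM"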
The Cloudera instructions tell the user to manually create a Kerberos administrative user for the
Cloudera Manager Server. Then subsequent commands that are issued to the KDC are driven by the
Cloudera scripts. The following principals are created by these scripts:
• HTTP/[email protected]
• hbase/[email protected]
• hdfs/[email protected]
• hive/[email protected]
• hue/[email protected]
• impala/[email protected]
• mapred/[email protected]
• oozie/[email protected]
• yarn/[email protected]
• zookeeper/[email protected]
Multiple principals are created for services that are running on multiple nodes, as shown in the list above by “fullyqualified.node.names." For example, with three HDFS nodes running on hosts cdh01.example.com, cdh02.example.com, and cdh03.example.com, there will be three principals: hdfs/[email protected], hdfs/[email protected], and hdfs/[email protected].
In addition, the automated scripts create Kerberos Keytab files for the services. Each Kerberos
Keytab file contains the resource principal’s authentication credentials. These Keytab files are then
distributed across the Cloudera installation on each node. For example, for most services, on a data
node, the following Kerberos Keytab files and locations are used:
• /var/run/cloudera-scm-agent/process/189-impala-IMPALAD/impala.keytab
• /var/run/cloudera-scm-agent/process/163-impala-IMPALAD/impala.keytab
• /var/run/cloudera-scm-agent/process/203-mapreduce-TASKTRACKER/mapred.keytab
• /var/run/cloudera-scm-agent/process/176-mapreduce-TASKTRACKER/mapred.keytab
• /var/run/cloudera-scm-agent/process/104-hdfs-DATANODE/hdfs.keytab
• /var/run/cloudera-scm-agent/process/109-hbase-REGIONSERVER/hbase.keytab
• /var/run/cloudera-scm-agent/process/121-impala-IMPALAD/impala.keytab
• /var/run/cloudera-scm-agent/process/192-impala-IMPALAD/impala.keytab
• /var/run/cloudera-scm-agent/process/200-hbase-REGIONSERVER/hbase.keytab
• /var/run/cloudera-scm-agent/process/216-impala-IMPALAD/impala.keytab
• /var/run/cloudera-scm-agent/process/173-hbase-REGIONSERVER/hbase.keytab
• /var/run/cloudera-scm-agent/process/182-yarn-NODEMANAGER/yarn.keytab
• /var/run/cloudera-scm-agent/process/168-hdfs-DATANODE/hdfs.keytab
• /var/run/cloudera-scm-agent/process/128-yarn-NODEMANAGER/yarn.keytab
• /var/run/cloudera-scm-agent/process/195-hdfs-DATANODE/hdfs.keytab
• /var/run/cloudera-scm-agent/process/209-yarn-NODEMANAGER/yarn.keytab
• /var/run/cloudera-scm-agent/process/142-hdfs-DATANODE/hdfs.keytab
• /var/run/cloudera-scm-agent/process/112-mapreduce-TASKTRACKER/mapred.keytab
• /var/run/cloudera-scm-agent/process/156-yarn-NODEMANAGER/yarn.keytab
• /var/run/cloudera-scm-agent/process/147-hbase-REGIONSERVER/hbase.keytab
• /var/run/cloudera-scm-agent/process/150-mapreduce-TASKTRACKER/mapred.keytab
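To confirm which principal a given keytab file contains, the MIT klist command can be used; the path below is taken from the list above.

# List the principals and key versions stored in one of the generated keytabs.
klist -kt /var/run/cloudera-scm-agent/process/104-hdfs-DATANODE/hdfs.keytab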
All of these tasks are well managed by the automated process. The only manual steps are as follows: the principal names must match the values that are provided in the table. In addition, four special principals are required for Ambari.
Hortonworks expects the keytab files to be located in the /etc/security/keytabs directory on each host
in the cluster. The user must manually copy the appropriate keytab file to each host. If a host runs
more than one component (for example, both NodeManager and DataNode), the user must copy
keytabs for both components. The Ambari Smoke Test User, the Ambari HDFS User, and the Ambari
HBase User keytabs should be copied to all hosts on the cluster. These steps are covered in the
Hortonworks documentation under the first step entitled “Preparing Kerberos.”
The second step from the Hortonworks documentation is “Setting up Hadoop Users." This step covers
creating or setting the principals for the users of the Hadoop environment. After all of the steps have
been accomplished, Kerberos Security can be enabled in the Ambari Web GUI. Enabling Kerberos
Security is the third and final step in the documentation.
This section deals with how SAS interoperates with the secure Hadoop environment. This document
does not cover using Kerberos to authenticate into the SAS environment. The document only covers
using Kerberos to authenticate from the SAS environment to the secure Hadoop environment.
With Windows systems, the authentication processing is tightly integrated with Active Directory and
Microsoft’s implementation of Kerberos. There is one difference to note between Windows and
Linux: The Kerberos Ticket Cache on Windows is memory-based, but on Linux, it is file-based.
Other operating systems such as AIX and Solaris have configuration options similar to those of Linux
and provide a file-based Kerberos Ticket Cache.
The SAS processes that access the secure Hadoop environment must access each user's Kerberos Ticket Cache. In Linux or UNIX environments, this typically means having access to the KRB5CCNAME environment variable, which points to a valid cache file. On Linux, the Kerberos Ticket Cache is typically /tmp/krb5cc_<uid>_<rand>.
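A quick way to confirm which cache a session is using, and that it holds a valid TGT, is shown in the following sketch (the cache location varies by system):

# Show the ticket cache the environment points to, then list its contents.
# If KRB5CCNAME is unset, klist inspects the default cache for the user.
echo "$KRB5CCNAME"
klist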
In Windows environments, Microsoft manages access to the Kerberos Ticket Cache in memory and does not allow non-Windows processes to access the session key of the Ticket-Granting Ticket (TGT). A Windows Registry update is therefore required for SAS to access the session key and hence use the TGT. The AllowTgtSessionKey registry key (type REG_DWORD) must be added to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Lsa\Kerberos\Parameters with a value of 1.
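Equivalently, the key can be added from an elevated command prompt rather than the Registry Editor; the following is a sketch of that command.

:: Requires an elevated (administrator) command prompt; a logoff or reboot
:: might be needed before the change takes effect.
reg add "HKLM\SYSTEM\CurrentControlSet\Control\Lsa\Kerberos\Parameters" /v AllowTgtSessionKey /t REG_DWORD /d 1 /f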
There is an alternative on Windows. Rather than using the integrated authentication processes tightly
linked with Active Directory, you can use a separate deployment of MIT Kerberos to authenticate the
user and then access the secure Hadoop environment. When using MIT Kerberos for Windows, the
Kerberos Ticket Cache can either reference the memory location managed by Windows or be stored
in a file as on Linux systems.
To configure MIT Kerberos to use the file system and for the file system Kerberos Ticket Cache to be
used, complete the following steps:
1. Define environment variables for the MIT Kerberos so that it knows where to find the
Kerberos configuration and where to put the Ticket Cache.
KRB5_CONFIG=C:\ProgramData\MIT\Kerberos5\krb5.ini
KRB5CCNAME=FILE:%USERPROFILE%\krb5cc_<username>
2. Run the Kerberos commands, such as kinit, from the bin subdirectory of the MIT Kerberos install directory to obtain a Ticket-Granting Ticket. (A consolidated sketch follows these steps.)
4. Tell Java where the JAAS configuration file is located by doing either of the following:
a. Set the java.security.auth.login.config JVM system property to the location of the JAAS configuration file.
b. Set the login.config.url.1 property in the java.security file for the JRE:
• Example: login.config.url.1=file:C:\ProgramData\MIT\Kerberos5\jaas.conf
5. Tell Java where the Kerberos configuration file is located by doing either of the following:
a. Set the java.security.krb5.conf JVM system property to the location of the Kerberos configuration file.
b. Place the Kerberos configuration file in a known location. Here are two examples:
• 'C:\Windows\krb5.ini'
• '<JRE_HOME>\krb5.conf'
6. For debug purposes, you can add the following JVM parameters:
• -Dsun.security.krb5.debug=true
• -Dsun.security.jgss.debug=true
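After completing these steps, the configuration can be verified from a command prompt. The following is a minimal sketch; the install path, realm, and principal are placeholders for your environment.

:: Use the MIT Kerberos tools (not the built-in Windows klist) to obtain a TGT
:: and confirm that it is written to the file-based ticket cache.
set KRB5_CONFIG=C:\ProgramData\MIT\Kerberos5\krb5.ini
set KRB5CCNAME=FILE:%USERPROFILE%\krb5cc_%USERNAME%
cd /d "C:\Program Files\MIT\Kerberos\bin"
kinit [email protected]
klist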
SAS Foundation can be configured either when you run the SAS Deployment Wizard to initially deploy the SAS server or as a manual step after the deployment is complete.
The SAS log contains the value of the KRB5CCNAME environment variable for the current user’s
SAS session. Here is an example of the output:
43 %let krb5env=%sysget(KRB5CCNAME);
44 %put &KRB5ENV;
FILE:/tmp/krb5cc_100001_ELca0y
This file can then be checked on the operating system to confirm that the correct Kerberos Ticket
Cache is identified. If the incorrect Kerberos Ticket Cache is being passed in the KRB5CCNAME
environment variable (or it is not being passed at all), code can be added to the start-up of the SAS
session to correctly set the environment variable. For example, adding the following to
<SAS_CONFIG>/SASApp/WorkspaceServer/WorkspaceServer_usermods.sh searches the
/tmp directory for a valid Kerberos Ticket Cache for the user and sets the environment variable.
workspace_user=$(whoami)
# Find the most recently modified Kerberos Ticket Cache in /tmp that is owned
# by the user running the workspace server.
workspace_user_ccaches=$(find /tmp -maxdepth 1 -user "${workspace_user}" \
  -type f -name "krb5cc_*" -printf '%T@ %p\n' | sort -k 1nr | \
  sed 's/^[^ ]* //' | head -n 1)
# Point the SAS session at that cache.
export KRB5CCNAME="FILE:${workspace_user_ccaches}"
There are two different types of SAS processes that require access to the Kerberos Ticket Cache. First, for LIBNAME statements, SAS Foundation launches a jproxy Java process. This process loads the
Hadoop JAR files that are specified by the SAS_HADOOP_JAR_PATH environment variable. The
jproxy process is then the client that connects to Hadoop. So it is this Java process that needs access
to the Kerberos Ticket Cache.
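Both the Hadoop JAR location and the Hadoop configuration file location can be supplied as environment variables before SAS starts or set from SAS code. The following is a sketch only; the directory paths are placeholders for wherever the Hadoop client JAR files and the merged configuration file have been copied.

/* Hedged sketch: adjust the paths to your site's locations. */
option set=SAS_HADOOP_JAR_PATH="/opt/sas/hadoop/lib";
option set=SAS_HADOOP_CONFIG_PATH="/opt/sas/hadoop/conf";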
Enabling Java to process the 256-bit AES TGT requires the Java Cryptography Extension (JCE)
Unlimited Strength Jurisdiction Policy Files. These files can be downloaded from Oracle for most
operating systems. However, for AIX, because the IBM JRE is used, the Policy Files can be
downloaded from IBM. The files must then be copied into the Java Runtime Environment
lib/security subdirectory. Due to import regulations in some countries, you should verify that the use
of the Unlimited Strength Jurisdiction Policy Files is permissible under local regulations.
The Unlimited Strength Jurisdiction Policy Files are required by the SAS Private Java Runtime
Environment. This environment is used by SAS Foundation for issuing a LIBNAME statement to the
secure Hadoop environment. If the SAS Grid Manager is licensed, the JCE Unlimited Strength Policy
files will be needed on all machines. They are required for all instances of SAS Foundation that might
issue a LIBNAME statement. In addition, the SAS High-Performance Analytics Environment, when
accessing the SAS Embedded Process, uses a Java Runtime Environment. This also requires the
Unlimited Strength Jurisdiction Policy Files.
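As a sketch of the installation step, the two policy JAR files from the download are copied over the existing files in the JRE. The JRE path below is a placeholder and differs for the SAS Private JRE, the Hadoop JRE, and the SAS High-Performance Analytics Environment JRE.

# Hedged sketch: back up the existing policy files first, then copy the
# unlimited-strength versions into the JRE's lib/security directory.
JRE_SECURITY=/path/to/jre/lib/security
cp "${JRE_SECURITY}/local_policy.jar" "${JRE_SECURITY}/local_policy.jar.orig"
cp "${JRE_SECURITY}/US_export_policy.jar" "${JRE_SECURITY}/US_export_policy.jar.orig"
cp local_policy.jar US_export_policy.jar "${JRE_SECURITY}/"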
1. For Cloudera, from the Status page of the Cloudera Manager application, select the required
cluster and select View Client Configuration URLs.
2. From the pop-up window, select the link to the appropriate service to download a ZIP file that
contains the service configuration files.
• If you use MapReduce 1, merge the properties from the Hadoop core, Hadoop HDFS,
and MapReduce configuration files into a single configuration.
• If you use MapReduce 2, merge the properties from the Hadoop core, Hadoop HDFS,
MapReduce 2, and YARN configuration files into a single configuration file.
3. After the merged configuration file is available, place it in a location that can be read by all
users running either the LIBNAME statement or PROC HADOOP code. If SAS Grid
Manager is used to access the secure Hadoop environment, then every grid node that runs
SAS Code requires access to the file.
1. For Hortonworks, the Ambari interface does not provide a simple mechanism to collect the
client configuration files. The configuration files should be found under the /etc folder
structure in the Hadoop environment. Retrieve the Hadoop core, Hadoop HDFS, and MapReduce 1 configuration files from the Hadoop environment and merge them into a single configuration file.
2. After the merged configuration file is available, place it in a location that can be read by all
users running either the LIBNAME statement or PROC HADOOP code. If SAS Grid
Manager is used to access the secure Hadoop environment, then every grid node running SAS
Code requires access to the file.
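The merged file is a standard Hadoop XML configuration file. The skeleton below is a sketch only: the property shown is a placeholder, and the real entries are copied from the cluster's configuration files (typically core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml).

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <!-- Properties copied from core-site.xml, hdfs-site.xml, and the
       MapReduce/YARN configuration files are placed here. -->
  <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
  </property>
  <!-- ... remaining properties ... -->
</configuration>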
In the default setup, the USER and PASSWORD options are provided on the connection. These options are not valid for Kerberos connections and must be removed from the LIBNAME statement. Instead, the HDFS_PRINCIPAL and/or HIVE_PRINCIPAL options are specified.
/* HDFS Libname */
libname HDFS hadoop server="gatecdh01.gatehadoop.com"
HDFS_PRINCIPAL="hdfs/[email protected]"
HIVE_PRINCIPAL="hive/[email protected]"
HDFS_TEMPDIR="/user/sasdemo/temp"
HDFS_METADIR="/user/sasdemo/meta"
HDFS_DATADIR="/user/sasdemo/data";
Two types of Hadoop LIBNAME statements can be used. The first connects via HiveServer2. The second, shown above, connects directly to HDFS. Use the second form when SAS maintains an XML-based metadata description of HDFS files and tables. You can create XML-based metadata with PROC HDMD. The filetype for an XML-based metadata description that PROC HDMD produces is SASHDMD (for example, product_table.sashdmd). Another name for this metadata is a SASHDMD descriptor.
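The HiveServer2 form of the connection is not shown above; a minimal sketch, reusing the server and principal values from the HDFS example and assuming the default HiveServer2 port and schema, would look like the following:

/* Hive Libname (sketch): server, port, schema, and principal values are
   placeholders for your environment. */
libname HIVE hadoop server="gatecdh01.gatehadoop.com" port=10000
   schema=default
   HIVE_PRINCIPAL="hive/[email protected]";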
To debug issues with the LIBNAME statement or SAS/ACCESS, include the following option
statement at the beginning of your submitted code:
option SASTRACE = "d,d,d,d" sastraceloc=saslog;
To echo to the SAS log the location of the Hadoop configuration file used by SAS Foundation,
include the following two lines in your submitted code:
%let CONFIG_PATH=%sysget(SAS_HADOOP_CONFIG_PATH);
%put &CONFIG_PATH;
To echo to the SAS log the location of the Hadoop JAR files used by SAS Foundation, include the
following two lines in your submitted code:
%let JAR_PATH=%sysget(SAS_HADOOP_JAR_PATH);
%put &JAR_PATH;
For example, PROC HADOOP can submit the HDFS make directory command to create a specified directory.
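A minimal sketch follows; the target directory is a placeholder, and the Hadoop configuration file is assumed to be supplied through SAS_HADOOP_CONFIG_PATH. As with the LIBNAME statement, no USERNAME= or PASSWORD= options are specified under Kerberos.

/* Hedged sketch: create a directory in HDFS via PROC HADOOP. */
proc hadoop verbose;
   hdfs mkdir='/user/sasdemo/newdir';
run;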
This option is covered in the SAS High-Performance Analytics Infrastructure – Installation and
Configuration Guide. (Access instructions are in the Instructions.html file of your deployment.) The
option can be checked in an installed environment by reviewing the /opt/sas/TKGrid/tkmpirsh.sh
script and examining this line:
export MPI_OPTIONS="$MPI_OPTIONS -genv DISPLAY=$DISPLAY -genvlist `env | sed -e s/=.*/,/ | sed /KRB5CCNAME/d | tr -d '\n'`TKPATH,LD_LIBRARY_PATH"
The option GRIDRSHCOMMAND can be used to set the SSH command used by SAS Foundation to
initialize the connection to the SAS High-Performance Analytics Environment. By default, SAS
Foundation uses a built-in SSH command to make this connection. This option enables you to use an
alternative SSH command and allows debug options to be specified on the SSH command. To use
Kerberos for the connection to the SAS High-Performance Analytics Environment via the GSSAPI,
you must specify an alternative command. In addition, specific options can be passed to the SSH
command to prevent authentication with password or public keys:
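For example, the following setting (a sketch mirroring the debug version shown below, without the verbose flag) disables password and public-key authentication so that the GSSAPI/Kerberos method is used:

option set=GRIDRSHCOMMAND="/usr/bin/ssh -o StrictHostKeyChecking=no -o PasswordAuthentication=no -o PubkeyAuthentication=no";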
Adding -vvv after the SSH command enables verbose debugging for the SSH command, and this is returned to the SAS log:

option set=GRIDRSHCOMMAND="/usr/bin/ssh -vvv -o StrictHostKeyChecking=no -o PasswordAuthentication=no -o PubkeyAuthentication=no";
4 References
Cloudera Inc. 2014. Configuring Hadoop Security with Cloudera Manager. Palo Alto, CA: Cloudera
Inc.
Hortonworks, Inc. 2014. "Setting Up Kerberos for Hadoop 2.x." Hortonworks Data Platform:
Installing Hadoop Using Apache Ambari. Palo Alto, CA: Hortonworks, Inc.
LWN.net. 2014. "How to configure sssd on SLES 11 to resolve names and authenticate to Windows
2008 and Active Directory." Novell, Inc.
Red Hat, Inc. 2013. "Chapter 12. Configuring Authentication." Red Hat Enterprise Linux 6: Deployment Guide, 5th ed. Raleigh, NC: Red Hat, Inc.
SAS Institute Inc. 2014. "LIBNAME Statement Specifics for Hadoop." SAS/ACCESS 9.4 for
Relational Databases: Reference, 4th ed. Cary, NC: SAS Institute Inc.
SAS Institute Inc. 2014. Configuration Guide for SAS 9.4 Foundation for UNIX Environments. Cary,
NC: SAS Institute Inc.
SAS Institute Inc. 2014. SAS 9.4 In-Database Products: Administrator's Guide, 4th ed. Cary, NC:
SAS Institute Inc.
5 Recommended Reading
• SAS Institute Inc. Hadoop: What it is and why it matters. Cary, NC: SAS Institute, Inc.
• SAS Institute Inc. SAS 9.4 Support for Hadoop. Cary, NC: SAS Institute, Inc.
• SAS Institute Inc. SAS In-Memory Statistics for Hadoop. Cary, NC: SAS Institute, Inc.
• SAS Institute Inc. 2014. "Hadoop Procedure." Base SAS 9.4 Procedures Guide, Third
Edition. Cary, NC: SAS Institute, Inc.
It would have been impossible to create this paper without the invaluable input of the following
people: