Quick Install For Amazon EMR
Amazon EMR
Version: 4.2.1
Doc Build Date: 9/24/2018
Copyright © Trifacta Inc. 2018 - All Rights Reserved. CONFIDENTIAL
For third-party license information, please select About Trifacta from the User
menu.
Quick Install - Amazon EMR
Contents:
Scenario Description
Pre-requisites
Product Limitations
Install
Desktop Requirements
Pre-requisites
Install Steps
Set up EMR Cluster
Cluster options
Specify cluster roles
Authentication
EMRFS consistent view is recommended
Set up S3 Buckets
Bucket setup
Set up EMR resources buckets
Access Policies
User policies
EMR cluster policies
EMR consistent view policies
Configure Trifacta platform for EMR
Change admin password
Verify S3 as base storage layer
Set up S3 integration
Enable EMR integration
Apply EMR cluster ID
EMR Authentication for the Trifacta platform
Additional EMR configuration
Default Hadoop job results format
Configure Spark for EMR
Configure for Redshift
Configure for EC2 Role-Based Authentication
IAM roles
AWS System Mode
Additional AWS Configuration
Use of S3 Sources
Start and Stop the Platform
Verify
Documentation
Scenario Description
This scenario assumes the following about the Trifacta® platform deployment:
NOTE: This scenario does not provide information on installing and configuring optional components, including security features. It is
intended to get the Trifacta platform installed, operational, and connected to the EMR cluster.
Pre-requisites
If you are integrating the Trifacta platform with an EMR cluster, you must acquire a license first. Additional configuration is
required. For more information, please contact your Trifacta representative.
1. Read: Please read this entire document before you create the EMR cluster or install the
Trifacta platform.
Product Limitations
The EC2 instance, S3 buckets, and any connected Redshift databases must be located in the same Amazon region. Cross-region
integrations are not supported at this time.
No support for Hive integration
No support for secure impersonation or Kerberos
No support for high availability and failover
Job cancellation is not supported on EMR.
When publishing single files to S3, you cannot apply an append publishing action.
Install
NOTE: Before you install, you should review the configuration content for specific instructions on setting up the Trifacta node. See
below.
Desktop Requirements
All desktop users of the platform must have the latest version of Google Chrome installed on their desktops.
Google Chrome must have the PNaCl client installed and enabled.
PNaCl Version: 0.50.x.y or later
All desktop users must be able to connect to the EC2 instance through the enterprise infrastructure.
Pre-requisites
Before you install the platform, please verify that the following steps have been completed.
1. EULA. Before you begin, please review the End-User License Agreement. See
https://docs.trifacta.com/display/PUB/End-User+License+Agreement+-+Trifacta+Wrangler+Enterprise.
2. S3 bucket. Please create an S3 bucket to store Trifacta assets. In the bucket, the platform stores metadata in the following location:
<S3_bucket_name>/trifacta
See https://s3.console.aws.amazon.com/s3/home.
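If you prefer the command line, a sketch of creating this bucket via the AWS CLI (the bucket name and region are placeholders):
aws s3 mb s3://my-trifacta-bucket-name --region us-west-2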
3. IAM policies. Create IAM policies for access to the S3 bucket. Required permissions are the following:
The system account or individual user accounts must have full permissions for the S3 bucket:
These policies must apply to the bucket and its contents. Example:
"arn:aws:s3:::my-trifacta-bucket-name"
"arn:aws:s3:::my-trifacta-bucket-name/*"
See https://console.aws.amazon.com/iam/home#/policies
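For illustration, a minimal policy sketch granting full access to the bucket and its contents (the bucket name is a placeholder):
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:*"
            ],
            "Resource": [
                "arn:aws:s3:::my-trifacta-bucket-name",
                "arn:aws:s3:::my-trifacta-bucket-name/*"
            ]
        }
    ]
}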
4. EC2 instance role. Create an EC2 instance role for this policy. See
https://console.aws.amazon.com/iam/home#/roles.
NOTE: The local storage environment contains the Trifacta databases, the product installation, and its log files. No
source data is ever stored within the product.
f. Security group: Use a security group that exposes access to port 3005, which is the default port for the platform.
g. Create an AWS key pair for access: This key is used to provide SSH access to the platform, which may be required for some
admin tasks.
h. Save your changes.
3. Apply license key:
a. Acquire the license.json license key file that was provided to you by your Trifacta representative.
b. Transfer the license key file to the EC2 node that is hosting the Trifacta platform. Navigate to the directory where you stored it.
c. Make the Trifacta user the owner of the file. Assuming the platform user is trifacta:
chown trifacta license.json
d. Make sure that the Trifacta user has read permissions on the file:
chmod u+r license.json
e. Copy the license key file into the platform license directory:
cp license.json /opt/trifacta/license/
NOTE: From the EC2 Console, please acquire the instanceId, which is needed in a later step.
5. When the instance is spinning up for the first time, performance may be slow. When the instance is up, navigate to the following:
http://<public_hostname>:3005
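If you prefer the CLI, the instance's public hostname can be retrieved as follows (the instance ID is a placeholder):
aws ec2 describe-instances --instance-ids i-0123456789abcdef0 --query "Reservations[0].Instances[0].PublicDnsName" --output text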
NOTE: As soon as you log in as an admin for the first time, you should immediately change the password. Select the
User Profile menu item in the upper-right corner. Change the password and click Save to restart the platform.
a. In platform configuration, locate the following setting:
aws.s3.bucket.name
b. Set the value of this setting to be the bucket that you created.
9. The following setting must be specified:
"aws.mode":"system",
system: Set the mode to system to enable use of EC2 instance-based authentication for access.
user: Set the mode to user to utilize user-based credentials to access the EMR cluster.
Set up EMR Cluster
NOTE: It is recommended that you set up your cluster for exclusive use by the Trifacta platform.
You can create the cluster through the EMR console or via the AWS command line interface. For the command line method, it is assumed that you know the required steps to perform the basic configuration.
Cluster options
NOTE: Please be sure to read all of the cluster options before setting up your EMR cluster.
NOTE: Please perform your configuration through the Advanced Options workflow.
Advanced Options
Software Configuration:
Release: EMR 5.7.0
Select:
Hadoop 2.7.3
Hue 3.12.0
Spark 2.1.1
Ganglia 3.7.2
Deselect everything else.
Edit the software settings:
Copy and paste the following in "Enter Configuration":
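As a sketch of the expected format only: EMR software settings are supplied as a JSON array of classification objects. The classification and property shown below are illustrative assumptions (emrfs-site with consistent view enabled), not necessarily the required values:
[
    {
        "classification": "emrfs-site",
        "properties": {
            "fs.s3.consistent": "true"
        }
    }
]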
Auto-terminate cluster after the last step is completed: Disable this option.
Hardware configuration
NOTE: Please apply the sizing information for your EMR cluster that was recommended for you. If you have not done so, please
contact your Trifacta representative.
General Options
Logging: Enable logging, and specify the S3 bucket and path where cluster logs are written.
NOTE: Please verify that all users of the platform have read access to this location. See below for details.
Debugging: Enable.
Termination protection: Enable.
Scale down behavior: Terminate at instance hour.
Tags:
No options required.
Additional Options:
EMRFS consistent view: You should enable this setting. Doing so may incur additional costs. For more information, see
EMRFS consistent view is recommended below.
Custom AMI ID: None.
Bootstrap Actions:
If you are using the default credential provider, you must create a bootstrap action.
NOTE: This configuration must be completed before you create the EMR cluster. For more information, see
Authentication below.
Security Options
EC2 key pair: Please select a key pair to use if you wish to access EMR nodes via SSH.
Permissions: Set to Custom to reduce the scope of permissions. For more information, see EMR cluster policies below.
Encryption Options
No requirements.
EC2 Security Groups:
No requirements.
If you performed all of the configuration, including the sections below, you can create the cluster.
Specify cluster roles
EMR Role:
Read/write access to log bucket
Read access to resource bucket
EC2 instance profile:
If using instance mode:
Authentication
You can use one of two methods for authenticating the EMR cluster:
Role-based IAM authentication (recommended): This method leverages your IAM roles on the EC2 instance.
Custom credential provider JAR file: This method utilizes a JAR file provided with the platform. This JAR file must be deployed to all
nodes on the EMR cluster through a bootstrap action script.
You can leverage your IAM roles to provide role-based authentication to the S3 buckets.
NOTE: The IAM role that is assigned to the EMR cluster and to the EC2 instances on the cluster must have access to the data of all
users on S3.
If you are not using IAM roles for access, you can manage access using either of the following:
In either scenario, you must use the custom credential provider JAR provided in the installation. This JAR file must be available to all nodes of the
EMR cluster.
After you have installed the platform and configured the S3 buckets, please complete the following steps to deploy this JAR file.
NOTE: These steps must be completed before you create the EMR cluster.
NOTE: This section applies if you are using the default credential provider mechanism for AWS and are not using the IAM
instance-based role authentication mechanism.
Steps:
1. On the Trifacta node, the credential provider JAR is available at the following location:
[TRIFACTA_INST_DIR]/aws/emr/build/libs/trifacta-aws-emr-credential-provider[TIMESTAMP].jar
NOTE: Do not remove the timestamp value from the filename. This information is useful for support purposes.
2. Upload this JAR file to an S3 bucket location where the EMR cluster can access it:
a. Via AWS Console S3 UI.
b. Via AWS command line. See http://docs.aws.amazon.com/cli/latest/reference/s3/index.html.
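A sketch of the upload command (the bucket name is a placeholder; keep the timestamp in the filename):
aws s3 cp trifacta-aws-emr-credential-provider[TIMESTAMP].jar s3://<YOUR-BUCKET>/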
4. This script must be uploaded into S3 in a location that can be accessed from the EMR cluster. Retain the full path to this location.
5. Add bootstrap action to EMR cluster configuration.
a. Via AWS Console EMR UI: Create the bootstrap action to point to the script you uploaded on S3.
b. Via AWS command line:
i. Upload the configure_emrfs_lib.sh file to the accessible S3 bucket.
ii. In the command line cluster creation script, add a custom bootstrap action, such as the following:
--bootstrap-actions '[
{"Path":"s3://<YOUR-BUCKET>/configure_emrfs_lib.sh","Name":"Custom action"}
]'
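For context, a sketch of where this flag sits in a command line cluster creation call (the name, instance type, and count are placeholders):
aws emr create-cluster --name "trifacta-emr" --release-label emr-5.7.0 \
--applications Name=Hadoop Name=Spark Name=Hue Name=Ganglia \
--instance-type m4.xlarge --instance-count 3 --use-default-roles \
--bootstrap-actions '[{"Path":"s3://<YOUR-BUCKET>/configure_emrfs_lib.sh","Name":"Custom action"}]'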
When the EMR cluster is launched with the above custom bootstrap action, the custom credential provider JAR is deployed to the cluster nodes. For more information, see:
http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-credentialsprovider.html
https://aws.amazon.com/blogs/big-data/securely-analyze-data-from-another-aws-account-with-emrfs/
EMRFS consistent view is recommended
During job execution on EMR, including profiling jobs, the Trifacta platform writes files in rapid succession, and these files are quickly read back
from storage for further processing. However, Amazon S3 does not guarantee a consistent file listing until a later time.
To ensure that the Trifacta platform does not begin reading back an incomplete set of files, you should enable EMRFS consistent view.
NOTE: If EMRFS consistent view is enabled, additional policies are required. Details are below.
NOTE: If EMRFS consistent view is not enabled, profiling jobs may not get a consistent set of files at the time of execution. Jobs can
fail or generate inconsistent results.
DynamoDB
Amazon's DynamoDB is automatically enabled to store metadata for EMRFS consistent view.
NOTE: DynamoDB incurs costs while it is in use. For more information, see https://aws.amazon.com/dynamodb/pricing/.
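If consistent view is enabled, access to the EMRFS metadata table in DynamoDB is also needed. A policy statement sketch, assuming the default table name EmrFSMetadata (the broad action scope shown is an assumption; tighten it for your environment):
{
    "Effect": "Allow",
    "Action": [
        "dynamodb:*"
    ],
    "Resource": [
        "arn:aws:dynamodb:*:*:table/EmrFSMetadata"
    ]
}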
Set up S3 Buckets
Bucket setup
You must set up S3 buckets for read and write access.
NOTE: Within the Trifacta platform, you must enable use of S3 as the default storage layer. This configuration is described later.
EMR Resources bucket/path: The S3 bucket and path where resources can be stored by the Trifacta platform for execution of Spark jobs on the cluster. Access: Read/Write.
EMR Logs bucket/path: The S3 bucket and path where logs are written for cluster job execution. Access: Read.
Access Policies
User policies
Trifacta users require the following policies to run jobs on the EMR cluster:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:*"
            ],
            "Resource": [
                "arn:aws:s3:::__EMR_LOG_BUCKET__",
                "arn:aws:s3:::__EMR_LOG_BUCKET__/*",
                "arn:aws:s3:::__EMR_RESOURCE_BUCKET__",
                "arn:aws:s3:::__EMR_RESOURCE_BUCKET__/*"
            ]
        }
    ]
}
Verify S3 as base storage layer
NOTE: The base storage layer must be set during initial installation and setup of the Trifacta node.
Set up S3 integration
To integrate with S3, additional configuration is required. See Enable S3 Sources.
Enable EMR integration
Steps:
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see
Platform Configuration Methods.
1. Locate the following settings:
"webapp.runInEMR": false,
"webapp.runInHadoop": false,
"webapp.runInTrifactaServer": true,
2. Apply the following values to enable job execution on EMR:
"webapp.runInEMR": true,
"webapp.runInHadoop": false,
"webapp.runInDataflow": false,
"photon.enabled": true,
Apply EMR cluster ID
Steps:
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see
Platform Configuration Methods.
1. Locate the following setting:
"aws.emr.clusterId": "",
2. Set the above value to be the cluster ID for your EMR cluster.
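If you need to look up the cluster ID, one way is via the AWS CLI:
aws emr list-clusters --active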
EMR Authentication for the Trifacta platform
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see
Platform Configuration Methods.
Use default credential provider for all Trifacta access, including EMR:
"aws.credentialProvider":"default",
"aws.emr.forceInstanceRole":false,
Use default credential provider for all Trifacta access; EC2 role-based IAM authentication is used for EMR:
"aws.credentialProvider":"default",
"aws.emr.forceInstanceRole":true,
Additional EMR configuration
Steps:
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For more information, see
Platform Configuration Methods.
aws.emr.resource.path (required): S3 path where resources can be stored for job execution on the EMR cluster.
NOTE: Do not include leading or trailing slashes for the path value.
aws.emr.proxyUser (required): This value defines the user account that Trifacta users use to connect to the cluster.
For the EMR logs bucket/path setting:
NOTE: Do not include leading or trailing slashes for the path value.
NOTE: If this parameter is not specified, logs are written directly to the top level of the bucket.
aws.emr.maxLogPollingRetries (optional): Maximum number of retries when polling for log files from EMR after job success or failure. Minimum value is 5.
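For illustration, a resource path value without leading or trailing slashes (the value shown is a placeholder):
"aws.emr.resource.path": "trifacta/emr-resources",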
Default Hadoop job results format
For larger datasets, if the size information is unavailable, the platform recommends by default that you run the job on the Hadoop cluster. For
these jobs, the default publishing action for the job is specified to run on the Hadoop cluster, generating the output format defined by this
parameter. Publishing actions, including output format, can always be changed as part of the job specification.
As needed, you can change this default format. You can apply this change through the Admin Settings Page (recommended) or
trifacta-conf.json. For more information, see Platform Configuration Methods.
"webapp.defaultHadoopFileFormat": "csv",
Configure Spark for EMR
Through the Admin Settings page, you can specify the YARN queue to which to submit your Spark jobs. All Spark jobs from the Trifacta platform
are submitted to this queue.
Steps:
1. In platform configuration, locate the following property:
"spark.props.spark.yarn.queue"
2. Set its value to the name of the YARN queue to use.
By default, the application generates random samples from the first set of rows in the dataset, up to a limit. The volume of this sample set is
determined by a platform parameter. See Configure Application Limits.
For the Spark running environment, you can enable the generation of random samples across the entire dataset, which may increase the quality
of your samples.
NOTE: This feature cannot be enabled if relational or JDBC sources are used in your deployment.
Steps:
1. In platform configuration, locate the following property, and set its value to true:
"feature.sparkSampling.enabled": true,
Allocation properties
The following properties must be passed from the Trifacta platform to Spark for proper execution on the EMR cluster.
To apply this configuration change, log in as an administrator to the Trifacta node. Then, edit trifacta-conf.json. Some of these settings
may not be available through the Admin Settings Page. For more information, see Platform Configuration Methods.
NOTE: Do not modify these properties through the Admin Settings page. These properties must be added as extra properties through
the Spark configuration block. Ignore any references in trifacta-conf.json to these properties and their settings.
"spark": {
...
"props": {
"spark.dynamicAllocation.enabled": "true",
"spark.shuffle.service.enabled": "true",
"spark.executor.instances": "0",
"spark.executor.memory": "2048M",
"spark.executor.cores": "2",
"spark.driver.maxResultSize": "0"
}
...
}
spark.dynamicAllocation.enabled: Enable dynamic allocation on the Spark cluster, which allows Spark to dynamically adjust the number of executors. Value: true
spark.shuffle.service.enabled: Enable the Spark shuffle service, which manages the shuffle data for jobs instead of the executors. Value: true
spark.driver.maxResultSize: Enable serialized results of unlimited size by setting this parameter to zero (0). Value: 0
Configure for EC2 Role-Based Authentication
When you are running the Trifacta platform on an EC2 instance, you can leverage your enterprise IAM roles to manage permissions on the
instance for the Trifacta platform. When this type of authentication is enabled, Trifacta administrators can apply a role to the EC2 instance where
the platform is running. That role's permissions apply to all users of the platform.
IAM roles
Before you begin, your IAM roles should be defined and attached to the associated EC2 instance. For more information, see
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html.
"aws.mode": "system",
Parameter Description
aws.credentialProvider Set this value to instance. IAM instance role is used for providing access.
aws.hadoopFsUseSharedInstanceProvider Set this value to true for CDH 5.11 and later. The class information is provided below.
"com.amazonaws.auth.InstanceProfileCredentialsProvider",
"org.apache.hadoop.fs.s3a.SharedInstanceProfileCredentialsProvider"
In the future:
CDH is moving back to using the InstanceProfileCredentialsProvider class in a future release. For details, see
https://issues.apache.org/jira/browse/HADOOP-14301.
Use of S3 Sources
To access S3 for storage, additional configuration for S3 may be required.
See Enable S3 Sources.
Start and Stop the Platform
Stop:
service trifacta stop
Restart:
service trifacta restart
Verify
After you have installed or made changes to the platform, you should verify operations with the Hadoop cluster.
NOTE: The Trifacta platform is not operational until it is connected to a supported backend datastore.
NOTE: These steps verify operations with data sourced from HDFS. If you are verifying operations for other datastores, see
Verify Operations.
Steps:
5. In the Transformer page, some steps have already been added to your recipe, so you can run the job right away. Click Run Job.
a. See Transformer Page.
9. Click Export Results. In the Export Results window, click the CSV and JSON links to download the results to your local desktop.
10. Load these results into a local application to verify that the content looks correct.
To access the Trifacta node via SSH, you can use a command of the following form:
ssh -i <path_to_key_file> <userId>@<tri_node_DNS_or_IP>
where the path to the key file is on your local system and the Trifacta node's DNS or IP address is available through AWS.
Documentation
You can access complete product documentation in online and PDF format. From within the product, select Help menu > Product Docs.