Trifacta Connection Guide
Version: 7.1.2
For third-party license information, please select About Trifacta from the Help
menu.
1. Connect
1.1 Connection Types
1.1.1 Create DB2 Connections
1.1.2 Create Oracle Connections
1.1.3 Create PostgreSQL Connections
1.1.4 Create SQL Server Connections
1.1.5 Enable Teradata Connections
1.1.6 Enable Snowflake Connections
1.1.6.1 Create Snowflake Connections
1.1.7 Enable AWS Glue Access
1.1.8 Create Azure SQL Database Connections
1.1.9 Create SQL DW Connections
1.1.10 Create Databricks Tables Connections
1.1.11 Create SFTP Connections
1.1.12 Create Tableau Server Connections
1.1.13 Create Salesforce Connections
1.1.14 Enable Alation Sources
1.1.15 Enable Waterline Sources
1.2 Configure Connectivity
1.2.1 Enable Relational Connections
1.2.2 Enable Custom SQL Query
1.2.3 Configure JDBC Ingestion
1.2.4 Configure Security for Relational Connections
1.2.5 Enable SSO for Relational Connections
1.2.6 Configure Type Inference
1.3 Troubleshooting Relational Connections
Connect
This section covers how to configure JDBC integration for Trifacta® Wrangler Enterprise and connect your
working platform instance to a wide variety of JDBC-based connections.
Many of these connections can be created from the Trifacta application directly. In some cases, additional
configuration is required outside of the application.
Not Covered
This guide does not cover connection types that are deeply tied to a specific infrastructure. The following
connection types are described in the appropriate Configuration Guide.
Hadoop
AWS
Enable S3 Access
Create Redshift Connections
Configure
Disable Creating Connections for Non-Admins
Supported Environments
Trifacta Wrangler Enterprise
Amazon AWS
Microsoft Azure
Default Connections
Upload
Big Data
Apache Hadoop HDFS - Cloudera
Apache Hadoop HDFS - Hortonworks
Hive
Cloud Platforms
Amazon S3
Amazon Redshift
Snowflake
AWS Glue
Microsoft Azure WASB
Microsoft Azure ADLS Gen1
Microsoft Azure ADLS Gen2
Databricks Tables
Relational DBs
DB2
Oracle
PostgreSQL
SQL Server
Teradata
SQL DW
Azure SQL Database
Applications
Salesforce
SFTP
Tableau Server
Search Integrations
Alation
Waterline
Other Connections
JDBC relational connections
Cloud connections
Trifacta® Wrangler Enterprise supports the following types of connections. Use the links below to enable
connection to each type and, where applicable, to create new connections to individual instances of the same
type.
Configure
By default, all users are permitted to create connections. As needed, you can disable the ability to create
connections for non-admin users.
Steps:
1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json.
For more information, see Platform Configuration Methods.
2. Search for the following parameter, and set it to false:
"webapp.connectivity.nonAdminManagementEnabled": true,
Supported Environments
Amazon AWS
Microsoft Azure
Default Connections
These connections are automatically enabled and configured with the product.
Upload
Enable: Automatically enabled.
Hive
API: API Reference
Type: hive
Vendor: hive
Cloud Platforms
These connections pertain to cloud platforms with which the Trifacta platform can integrate.
Amazon S3
Supported Versions: n/a
Supported Environments:
NOTE: S3 must be set as the base storage layer. See Set Base Storage Layer.
Amazon Redshift
Supported Environments:
API: API Reference
Type: redshift
Vendor: redshift
Snowflake
Supported Environments:
NOTE: S3 must be set as the base storage layer. See Set Base Storage Layer.
API: API Reference
Type: snowflake
vendor: snowflake
AWS Glue
Supported Environments:
Microsoft Azure WASB
Supported Versions: n/a
Supported Environments:
Write: Not supported (Trifacta Wrangler Enterprise), Not supported (Amazon AWS), Supported on Microsoft Azure (only if WASB is the base storage layer)
Microsoft Azure ADLS Gen1
Supported Versions: n/a
Supported Environments:
Write: Not supported (Trifacta Wrangler Enterprise), Not supported (Amazon AWS), Supported on Microsoft Azure (only if ADLS Gen1 is the base storage layer)
Microsoft Azure ADLS Gen2
Supported Versions: n/a
Supported Environments:
Write: Not supported (Trifacta Wrangler Enterprise), Not supported (Amazon AWS), Supported on Microsoft Azure (only if ABFSS is the base storage layer)
Databricks Tables
Supported Versions: n/a
Supported Environments:
Tip: It's easier to create a connection of this type through the UI. Typically, only one connection is
needed.
API: API Reference
Type: jdbc
Vendor: databricks
Relational DBs
NOTE: Unless otherwise noted, authentication to a relational connection requires basic authentication
(username/password) credentials.
DB2
Supported Versions: v10.5.5
Supported Environments:
API: API Reference
Type: jdbc
Vendor: db2
Oracle
Supported Versions: 12.1.0.2
Supported Environments:
API: API Reference
Type: jdbc
Vendor: oracle
PostgreSQL
Supported Versions: 9.3.10
Supported Environments:
API: API Reference
Type: jdbc
SQL Server
Supported Versions: 12.0.4
Supported Environments:
UI:
API: API Reference
Type: jdbc
Vendor: sqlserver
Teradata
Supported Versions: 14.10+
Supported Environments:
Enable:
API: API Reference
Type: jdbc
Vendor: teradata
SQL DW
Supported Versions: n/a
Supported Environments:
NOTE: For Azure deployments, some additional configuration properties must be applied. See
Configure for Azure .
Azure SQL Database
Supported Versions: Azure SQL Database version 12 (other versions are not supported)
Supported Environments:
Applications
Salesforce
Supported Versions: n/a
Supported Environments:
API: API Reference
Type: jdbc
Vendor: salesforce
SFTP
Supported Environments:
API:
Type: jdbc
Vendor: sftp
Tableau Server
Supported Environments:
Type: jdbc
Vendor: tableau
Search Integrations
NOTE: Search connections cannot be created or modified through the user interface or APIs. They are
enabled and configured through platform configuration parameters.
Search capabilities are available through the application. See Import Data Page.
Alation
Waterline
Other Connections
The following connections can be created based on the available drivers.
NOTE: If you cannot create a connection of one of the following types, please contact
Trifacta Customer Success Services.
Cloud connections
FinancialForce
Force.com Applications
Pre-requisites
Configure
Use
Data Conversion
You can create connections to one or more DB2 databases from Trifacta® Wrangler Enterprise.
NOTE: Only connections to DB2 for Windows and Unix/Linux are supported.
Pre-requisites
If you haven't done so already, you must create and deploy an encryption key file for the Trifacta node to
be shared by all relational connections. For more information, see Create Encryption Key File.
Configure
To create this connection:
In the Import Data page, click the Plus sign. Then, select the Relational tab. Click the DB2 card.
You can also create connections through the Connections page. See Connections Page.
For details on values to use when creating via API, see Connection Types.
See API Reference.
Property: Description
Host: Hostname of the DB2 server. Example value: myDB2.example.com
Connect String Options: Insert any connection options as a string here.
Database Name: Enter the name of the DB2 database to which to connect.
User Name: (basic credential type only) Username to use to connect to the database.
Password: (basic credential type only) Password associated with the above username.
Default Column Data Type Inference: Set to disabled to prevent the product from applying its own type inference to each column on import. The default value is enabled.
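If you prefer to create the connection programmatically, the following request is a minimal sketch only. The /v4/connections endpoint, the request field names, the default platform port 3005, and the default DB2 port 50000 are assumptions drawn from the API Reference and typical defaults; confirm the exact schema for your release before using it.
curl -X POST "http://<TRIFACTA_HOST>:3005/v4/connections" \
  -H "Content-Type: application/json" \
  -u <admin_user>:<admin_password> \
  -d '{
        "name": "db2_example",
        "type": "jdbc",
        "vendor": "db2",
        "vendorName": "db2",
        "host": "myDB2.example.com",
        "port": 50000,
        "params": { "database": "SAMPLE" },
        "credentialType": "basic",
        "credentials": [ { "username": "<db_user>", "password": "<db_password>" } ]
      }'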
Use
For more information, see Database Browser.
Data Conversion
For more information on how values are converted during input and output with this database, see
DB2 Data Type Conversions.
Pre-requisites
Configure
Use
Data Conversion
You can create connections to one or more Oracle databases from Trifacta® Wrangler Enterprise.
Pre-requisites
NOTE: Dots (.) in the names of Oracle tables or table columns are not supported.
If you haven't done so already, you must create and deploy an encryption key file for the Trifacta node to
be shared by all relational connections. For more information, see Create Encryption Key File.
If you are connecting to the Oracle database using SSL, additional configuration is required. See
Configure Data Service.
Configure
To create this connection:
In the Import Data page, click the Plus sign. Then, select the Relational tab. Click the Oracle card.
You can also create connections through the Connections page. See Connections Page.
For details on values to use when creating via API, see Connection Types.
See API Reference.
Property: Description
Host: Hostname of the Oracle server. Example value: testsql.database.windows.net
User Name: (basic credential type only) Username to use to connect to the database.
Password: (basic credential type only) Password associated with the above username.
Test Connection: After you have defined the connection credential type, credentials, and connection string, you can validate those credentials.
Use
For more information, see Database Browser.
Data Conversion
For more information on how values are converted during input and output with this database, see
Oracle Data Type Conversions.
Pre-requisites
Configure
Use
Data Conversion
You can create connections to one or more PostgreSQL databases from Trifacta® Wrangler Enterprise. For more
information on PostgreSQL, see https://www.postgresql.org/.
Pre-requisites
If the Trifacta databases are hosted on a PostgreSQL server, do not create a connection to this
database.
If you haven't done so already, you must create and deploy an encryption key file for the Trifacta node to
be shared by all relational connections. For more information, see Create Encryption Key File.
Configure
To create this connection:
In the Import Data page, click the Plus sign. Then, select the Relational tab. Click the PostgreSQL card.
You can also create connections through the Connections page.
See Connections Page.
For details on values to use when creating via API, see Connection Types.
See API Reference.
Property: Description
Host: Hostname of the PostgreSQL server. Example value: my.postgres.server
Enable SSL: Select the checkbox to enable SSL connections to the database.
Database: Enter the name of the database on the server to which to connect.
Test Connection: After you have defined the connection credential type, credentials, and connection string, you can validate those credentials.
Default Column Data Type Inference: Set to disabled to prevent the platform from applying its own type inference to each column on import. The default value is enabled.
Use
For more information, see Database Browser.
Data Conversion
For more information on how values are converted during input and output with this database, see
Postgres Data Type Conversions.
Pre-requisites
Configure
Use
Data Conversion
You can create connections to one or more Microsoft SQL Server databases from Trifacta® Wrangler Enterprise.
Pre-requisites
If you haven't done so already, you must create and deploy an encryption key file for the Trifacta node to
be shared by all relational connections. For more information, see Create Encryption Key File.
If you plan to create an SSO connection of this type, additional configuration may be required. See
Enable SSO for Relational Connections.
Configure
To create this connection:
In the Import Data page, click the Plus sign. Then, select the Relational tab. Click the SQL Server card.
You can also create connections through the Connections page.
See Connections Page.
For additional details on creating a SQL Server connection, see Enable Relational Connections.
For details on values to use when creating via API, see Connection Types.
See API Reference.
Property: Description
Host: Hostname of the SQL Server instance. Example value: testsql.database.windows.net
Connect String Options: Insert the following as a single string (no line breaks):
;database=<DATABASE_NAME>;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;
where <DATABASE_NAME> is the name of the database to which you are connecting.
User Name: (basic credential type only) Username to use to connect to the database.
Password: (basic credential type only) Password associated with the above username.
Default Column Data Type Inference: Set to disabled to prevent the platform from applying its own type inference to each column on import. The default value is enabled.
Test Connection: After you have defined the connection credential type, credentials, and connection string, you can validate those credentials.
Use
For more information, see Database Browser.
Data Conversion
For more information on how values are converted during input and output with this database, see
SQL Server Data Type Conversions.
Limitations
Download and Install Teradata drivers
Increase Read Timeout
Create Teradata Connection
Testing
Teradata provides Datawarehousing & Analytics solutions and Marketing applications. The Teradata
database supports all of their Datawarehousing solutions. For more information, see
http://www.teradata.com.
For more information on supported versions, see Connection Types.
This connection supports reading and writing. You can create multiple Teradata connections in the Trifacta
application.
Limitations
By default, Teradata does not permit the publication of datasets containing duplicate rows. Workarounds:
Your final statement for any recipe that generates results for Teradata should include a Remove
duplicate rows transformation.
NOTE: The above transformation removes exact, case-sensitive duplicate rows. Teradata
may still prevent publication for case-insensitive duplicates.
It's possible to change the default writing method to Teradata to enable duplicate rows. For more
information, contact Trifacta Support.
When creating custom datasets using SQL from Teradata sources, the ORDER BY clause in standard SQL
does not work. This is a known issue.
Download and Install Teradata drivers
NOTE: Please download and install the Teradata driver that corresponds to your version of Teradata. For
more information on supported versions, see Connection Types.
Steps:
Append :/opt/trifacta/drivers/* to the data service classpath, so that the value reads as follows:
"data-service": {
  ...
  "classpath": "%(topOfTree)s/services/data-service/build/libs/data-service.jar:%(topOfTree)s/services/data-service/build/conf:%(topOfTree)s/services/data-service/build/dependencies/*:/opt/trifacta/drivers/*"
Increase Read Timeout
The default read timeout is 300 seconds (5 minutes). You should consider raising this limit if you are working with large
tables.
Testing
Steps:
1. After you create your connection, load a small dataset based on a table in the connected Teradata
database. See Import Data Page.
2. Perform a few simple transformations to the data. Run the job. See Transformer Page.
3. Verify the results.
Limitations
Pre-requisites
Enable
Configure
Create Stage
Create Snowflake Connection
Testing
Snowflake provides a cloud-database datawarehouse designed for big data processing and analytics. For
more information, see https://www.snowflake.com.
Limitations
NOTE: This integration is supported only for deployments of Trifacta Wrangler Enterprise in customer-
managed AWS infrastructures. These deployments must use S3 as the base storage layer. For more
information, see Supported Deployment Scenarios for AWS.
Pre-requisites
If you do not provide a stage database, then the Trifacta platform must create one for you in the default
database. In this default database, you must include a schema named PUBLIC. For more information,
please see the Snowflake documentation.
Enable
When relational connections are enabled, this connection type is automatically available. For more information,
see Enable Relational Connections.
Configure
To create a Snowflake connection, you must enable the following feature. The job manifest feature enables the
creation of a manifest file to track the set of temporary files written to S3 before publication to Snowflake.
NOTE: This feature must be enabled when the base storage layer is set to S3. Please verify the following.
Steps:
1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json.
For more information, see Platform Configuration Methods.
"feature.enableJobOutputManifest": true,
Create Stage
In Snowflake terminology, a stage is a database object that points to an external location on S3. It must be an
external stage containing access credentials.
NOTE: For read-only connections to Snowflake, you must specify a Database for Stage. The
connecting user must have write access to this database.
Tip: You can specify a separate database to use for your stage.
If a stage is not specified, a temporary stage is created using the current user's AWS credentials.
NOTE: Without a defined stage, you must have write permissions to the database from which you
import. This database is used to create the temporary stage.
In the Trifacta platform, the stage location is specified as part of creating the Snowflake connection.
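As an illustration only, a Snowflake administrator can create such an external stage with a statement like the one below, run here through the snowsql client. The schema, stage name, bucket path, and credentials are placeholders, and your security policy may call for a storage integration rather than embedded keys.
snowsql -q "CREATE STAGE MY_SCHEMA.MY_STAGE \
  URL='s3://<your-trifacta-bucket>/snowflake-staging/' \
  CREDENTIALS=(AWS_KEY_ID='<access_key_id>' AWS_SECRET_KEY='<secret_access_key>');"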
Testing
Steps:
1. After you create your connection, load a small dataset based on a table in the connected Snowflake
database.
NOTE: For Snowflake connections, you must have write access to the database from which you
are importing.
Create Snowflake Connections
Snowflake is an S3-based data warehouse service hosted in the cloud. Auto-scaling, automatic failover,
and other features simplify the deployment and management of your enterprise's data warehouse. For
more information, see https://www.snowflake.com.
Pre-requisites
S3 base storage layer: Snowflake access requires installation of Trifacta software in the AWS
infrastructure and use of S3 as the base storage layer, which must be enabled. See
Set Base Storage Layer.
Same region: The Snowflake cluster must be in the same region as the default S3 bucket.
Limitations
You cannot perform ad-hoc publication to Snowflake.
SSO connections are not supported.
Create Connection
You can create Snowflake connections through the following methods.
Steps:
Property: Description
Account Name: Name of your Snowflake account. Example value: mycompany
Stage: If you have deployed a Snowflake stage for managing file conversion to tables, you can enter its name here. A stage is a
database object that points to an external location on S3. It must be an external stage containing access credentials.
If a stage is used, then this value is typically the schema and the name of the stage. Example value: MY_SCHEMA.MY_STAGE
If a stage is not specified, a temporary stage is created using the current user's AWS credentials.
NOTE: Without a defined stage, you must have write permissions to the database from which you import. This
database is used to create the temporary stage.
Database for Stage: (optional) If you are using a Snowflake stage, you can specify a database other than the default one to host the stage.
NOTE: If you are creating a read-only connection to Snowflake, this field is required. The accessing user must have
write permission to the specified database.
By default, connections to Snowflake use SSL. To disable, please add the following string to your Connect String
Options:
;ssl=false
If you require connection to Snowflake through a proxy server, additional Connect String Options are required.
For more information, see
https://docs.snowflake.net/manuals/user-guide/jdbc-configure.html#specifying-a-proxy-server-in-the-jdbc-connection-string.
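For example, proxy settings are passed as additional Connect String Options. The parameter names below (useProxy, proxyHost, proxyPort) come from the Snowflake JDBC documentation linked above; the host and port values are placeholders.
;useProxy=true;proxyHost=proxy.example.com;proxyPort=8080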
Testing
Import a dataset from Snowflake. Add it to a flow, and specify a publishing action back to Snowflake. Run a job.
Supported Deployment
EMR Settings
Authentication
Limitations
Enable
Configure
Create Connection
Use
If you have integrated with an EMR cluster version 5.8.0 or later, you can configure your Hive instance to use
AWS Glue Data Catalog for storage and access to Hive metadata.
Tip: For metastores that are used across a set of services, accounts, and applications, AWS Glue is the
recommended method of access.
This section describes how to enable integration with your AWS Glue deployment.
Supported Deployment
AWS Glue tables can be read under the following conditions:
For HiveServer2 connectivity, the Trifacta node has direct access to the Master node of the EMR cluster.
EMR Settings
When you create the EMR cluster, please verify the following in the AWS Glue Data Catalog settings:
Each Glue table must be created with the following properties specified:
InputFormat
OutputFormat
Serde
To enable integration between the Trifacta platform and AWS Glue, a JAR file for managing the Trifacta
credentials for AWS access must be deployed to S3 in a location that is accessible to the EMR cluster.
When the EMR cluster is launched with the following custom bootstrap action, the cluster does one of the
following:
Steps:
1. On the Trifacta node, locate the JAR file that manages the Trifacta credentials for AWS access:
[TRIFACTA_INSTALL_DIR]/aws/glue-credential-provider/build/libs/trifacta-aws-glue-credential-provider.jar
2. Upload this JAR file to an S3 bucket location where the EMR cluster can access it:
a. Via AWS Console S3 UI.
b. Via AWS command line: See http://docs.aws.amazon.com/cli/latest/reference/s3/index.html. An example command appears after these steps.
4. This script must be uploaded into S3 in a location that can be accessed from the EMR cluster. Retain the
full path to this location.
5. Add a bootstrap action to EMR cluster configuration.
a. Via AWS Console S3 UI: Create the bootstrap action to point to the script that you uploaded on S3.
b. Via AWS command line:
i. Upload the configure_glue_lib.sh file to the accessible S3 bucket.
ii. In the command line cluster creation script, add a custom bootstrap action. Example:
--bootstrap-actions '[
{"Path":"s3://<YOUR-BUCKET>/configure_glue_lib.sh","Name":"Custom action"}
]'
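The AWS command-line uploads referenced in the steps above can be performed with aws s3 cp. This is a sketch only; the bucket and key names are placeholders of your choosing, and only the JAR path under the install directory comes from the steps above.
# upload the credential provider JAR and the bootstrap script to a bucket the EMR cluster can read
aws s3 cp [TRIFACTA_INSTALL_DIR]/aws/glue-credential-provider/build/libs/trifacta-aws-glue-credential-provider.jar s3://<YOUR-BUCKET>/trifacta-aws-glue-credential-provider.jar
aws s3 cp configure_glue_lib.sh s3://<YOUR-BUCKET>/configure_glue_lib.sh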
Authentication
Authentication methods and required permissions are based on the AWS authentication mode:
"aws.mode": "system",
system: The IAM role assigned to the cluster must provide access to AWS Glue. See Configure for AWS.
user: The user role must provide access to AWS Glue. See below for an example IAM role access control.
If you are using IAM roles to provide access to AWS Glue, you can review the following fine-grained access
control, which includes the permissions required to access AWS Glue tables. Please add this to the Permissions
section of your AWS Glue Catalog Settings page.
NOTE: Please verify that access is granted in the IAM policy to the default database for AWS Glue, as
noted below.
{
  "Sid": "accessToAllTables",
  "Effect": "Allow",
  "Principal": {
    "AWS": [ "arn:aws:iam::<accountId>:role/glue-read-all" ]
  },
  "Action": [ "glue:GetDatabases", "glue:GetDatabase", "glue:GetTables", "glue:GetTable", "glue:GetUserDefinedFunctions", "glue:GetPartitions" ],
  "Resource": [ "arn:aws:glue:us-west-2:<accountId>:catalog", "arn:aws:glue:us-west-2:<accountId>:database/default", "arn:aws:glue:us-west-2:<accountId>:database/global_temp", "arn:aws:glue:us-west-2:<accountId>:database/mydb", "arn:aws:glue:us-west-2:<accountId>:table/mydb/*" ]
}
S3 access
AWS Glue crawls available data that is stored on S3. When you import a dataset through AWS Glue:
Any samples of your data that are generated by the Trifacta platform are stored in S3. Sample data is read
by the platform directly from S3.
Source data is read through AWS Glue.
You should review and, if needed, apply additional read restrictions on your IAM policies so that
users are limited to reading data from their own S3 directories. If all users have access to the
same areas of the same S3 bucket, then it may be possible for users to access datasets through
the platform when it is forbidden through AWS Glue.
Limitations
Access is read-only. Publishing to Glue hosted on EMR is not supported.
When using per-user IAM role-based authentication, EMR Spark jobs on AWS Glue datasources may fail if
the job continues to run beyond the session limit defined for the IAM role after job submission.
In the AWS Console, this limit is defined in hours as the Maximum CLI/API session duration assigned to the IAM role.
In the AWS Glue catalog client for the Hive Metadata store, the temporary credentials generated for
the IAM role expire after this limit in hours and cannot be renewed.
Enable
1. Your deployment has been configured to meet the Supported Deployment guidelines above.
2. You must integrate the platform with Hive.
NOTE: For the Hive hostname and port number, use the Master public DNS values. For more
information, see
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html.
Configure
When accessing Glue using temporary per-user credentials, the credentials are given a duration of 1 hour. As
needed, you can modify this duration.
NOTE: This value cannot exceed the Maximum Session Duration value for IAM roles, as configured in
the IAM Console.
1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json.
For more information, see Platform Configuration Methods.
2. Locate the following parameter. By default, this value is set to 1:
"data-service.sqlOptions.glueTempCredentialTimeoutInHours": 1
Create Connection
You can create one or more connections to databases in your AWS Glue deployment.
Key fields:
Field: Description
EMR Master Node DNS: This DNS value can be retrieved from the EMR console.
Port: The port number through which to connect to the EMR master node.
Use
After the integration has been made between the platform and AWS Glue, you can import datasets.
Limitations
Pre-requisites
Configure
Configure for SSO
Use
Data Conversion
You can create a connection to a Microsoft Azure SQL database from the Trifacta platform. This section
describes how to create connections of this type.
This connection type supports data ingestion into ADLS/WASB. When large volumes of data are read from
an Azure SQL database during job execution, the data is stored in a temporary location in ADLS/WASB.
After the job has been executed, the data is removed from the datastore. This process is transparent to the
user.
For more information on Azure SQL database, see
https://azure.microsoft.com/en-us/services/sql-database/.
Limitations
None.
Pre-requisites
If you haven't done so already, you must create and deploy an encryption key file for the Trifacta node to
be shared by all relational connections. For more information, see Create Encryption Key File.
Configure
This connection can also be created using the following property substitutions via API.
For details on values to use when creating via API, see Connection Types.
For additional details on creating an Azure SQL Database connection, see Enable Relational Connections.
Please create an Azure SQL Database connection and then specify the following properties with the listed values:
Property: Description
Host: Hostname of the Azure SQL database server. Example value: testsql.database.windows.net
Connect String Options:
Tip: If you have access to the Azure SQL database through the Azure Portal, please copy the Connect
String from that configuration. You may omit the username and password from that version of the string.
Database: (optional) Name of the Azure SQL database to which you are connecting.
User Name: (for basic Credential Type) Username to use to connect to the database.
Password: (for basic Credential Type) Password associated with the above username.
Credential Type:
basic - Specify username and password as part of the connection.
Azure Token SSO - Use the SSO principal of the user creating the connection to authenticate to the Azure
SQL database. Additional configuration is required. See Enable SSO for Azure Relational Connections.
Default Column Data Type Inference: Set to disabled to prevent the Trifacta platform from applying its own type inference to each column on import. The default value is enabled.
Configure for SSO
If you have enabled Azure AD SSO integration for the Trifacta platform, you can create SSO connections to
Azure relational databases. See Enable SSO for Azure Relational Connections.
Use
For more information, see Database Browser.
Data Conversion
For more information on how values are converted during input and output with this database, see
SQL Server Data Type Conversions.
Limitations
Pre-requisites
Connection Types
Azure SQL DW permissions
Azure SQL DW External Data Source Name
Configure
Configure for SSO
Use
Data Conversion
This section describes how to create connections to Microsoft SQL Datawarehouse (DW).
Limitations
Microsoft SQL DW connections are available only if you have deployed the Trifacta® platform onto Azure.
SSL connections to SQL DW are required.
NOTE: In this release, this connection cannot be created through the APIs. Please create
connections of this type through the application.
Pre-requisites
If you haven't done so already, you must create and deploy an encryption key file for the Trifacta node to
be shared by all relational connections. For more information, see Create Encryption Key File.
Connection Types
The Trifacta platform supports two types of connections to an Azure SQL DW data warehouse:
SQL DW Read-Only: Read-only access to the SQL DW data warehouse. This connection is available on the Import Data page only. It requires fewer permissions on the data warehouse and its databases but is less performant.
SQL DW Read-Write: Read-write access to the SQL DW data warehouse. This connection is available for reading, direct publishing, and ad-hoc publishing. It requires more permissions. You must also specify an External Data Source Name. See below.
Azure SQL DW permissions
The authenticating DB user requires the following permissions:
CREATE TABLE**
ALTER ANY SCHEMA
ALTER ANY EXTERNAL DATA SOURCE
ALTER ANY EXTERNAL FILE FORMAT
The authenticating DB user must also have read access to the external data source.
Azure SQL DW External Data Source Name
Requirements:
The external data source must be created by the database admin on the default database defined in the
SQL DW connection. For more information:
https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-data-source-transact-sql#d-create-external-data-source-to-reference-azure-blob-storage
https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-data-source-transact-sql#g-create-external-data-source-to-reference-azure-data-lake-store
The External Data Source must point to the same storage location as the base storage layer for the Trifacta
platform. For example, if the base storage layer is WASB, the External Datasource must point to the same
storage account defined in Trifacta configuration. If this configuration is incorrect, then publishing and
ingestion of data fail.
For more information on privileges required for the authenticating DB user, see
https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql.
Configure
To create this connection:
Please create a connection of this type in the appropriate page and modify the following properties with the listed
values:
Host: Hostname of the SQL DW server. Example value: testsql.database.windows.net
External Data Source Name: For read-write connections, you must provide an External Data Source. Otherwise, the connection is read-only. See above for details.
Credential Type:
basic - Specify username and password as part of the connection.
Azure Token SSO - Use the SSO principal of the user creating the connection to authenticate to the SQL
Server database. Additional configuration is required. See Enable SSO for Azure Relational Connections.
Default Column Data Type Inference: Set to disabled to prevent the Trifacta platform from applying its own type inference to each column on import. The default value is enabled.
Configure for SSO
If you have enabled Azure AD SSO integration for the Trifacta platform, you can create SSO connections to
Azure relational databases.
NOTE: When Azure AD SSO is enabled, write operations to SQL Datawarehouse are not supported.
Use
For more information on locating data, see Database Browser.
Data Conversion
For more information on how values are converted during input and output with this database, see
SQL DW Data Type Conversions.
Limitations
Pre-requisites
Insert Databricks Access Token
Enable
Create Connection
Use
Data Conversion
Troubleshooting
Failure when importing a wide Databricks table
You can create a connection to Azure Databricks tables from the Trifacta platform. This section describes how to
create connections of this type.
Databricks Tables provides a JDBC-based interface for reading and writing datasets in ADLS or WASB.
Using the underlying JDBC connection, you can access your ADLS or WASB data like a relational
datastore, run jobs against it, and write results back to the datastore as JDBC tables.
Your connection to Databricks Tables leverages the SSO authentication that is native to Azure Databricks.
For more information on Databricks Tables, see
https://docs.microsoft.com/en-us/azure/databricks/data/tables.
Limitations
Ad-hoc publishing of generated results to Databricks Tables is not supported.
Creation of datasets with custom SQL is not supported.
Integration with Kerberos or secure impersonation is not supported.
Some table types and publishing actions are not supported. For more information, see
Using Databricks Tables.
Pre-requisites
The Trifacta platform must be installed on Azure and integrated with an Azure Databricks cluster.
See Install for Azure.
See Configure for Azure Databricks.
NOTE: For job execution on Spark, the connection must use the Spark instance on the
Azure Databricks cluster. No other Spark instance is supported. You can run jobs from this
connection through the Photon running environment. For more information, see
Running Environment Options.
This connection interacts with Databricks Tables through the Hive metastore that has been installed in
Azure Databricks.
Each user must insert a Databricks Personal Access Token into the user profile. For more information, see
Databricks Personal Access Token Page.
Enable
To enable Databricks Tables connections, please complete the following:
NOTE: Typically, you need only one connection to Databricks Tables, although you can create multiple
connections.
Steps:
1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json.
For more information, see Platform Configuration Methods.
2. Locate the following parameter and set it to true:
"feature.databricks.connection.enabled": true,
Create Connection
This connection can also be created via API. For details on values to use when creating via API, see
Connection Types.
Please create a Databricks connection and then specify the following properties with the listed values:
NOTE: Host and port number connection information is taken from Azure Databricks and does not need
to be re-entered here. See Configure for Azure Databricks.
Property Description
Connect String Options: Insert any connection string options that you need. Connect String options are not required for this connection.
Default Column Data Type Inference: Set to disabled to prevent the Trifacta platform from applying its own type inference to each column on import. The default value is enabled.
Use
For more information, see Using Databricks Tables.
Troubleshooting
Failure when importing a wide Databricks table
If you are attempting to import a table containing a large number of columns (>200), you may encounter an error
message similar to the following:
Solution:
To address this issue, you can increase the Kryo serializer buffer size.
1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json.
For more information, see Platform Configuration Methods.
2. Locate the spark.props section and add the following setting. Adjust the value of 2000 (2 GB) as needed until
your import is successful:
"spark.kryoserializer.buffer.max.mb": "2000"
For more information on passing property values into Spark, see Configure for Spark.
Limitations
Pre-requisites
SSH Keys
Enable
Configure file storage protocols and locations
Enforce authentication methods
Java VFS service
Create Connection
Create through application
Create through APIs
You can create connections to SFTP servers to upload your datasets to the Trifacta® platform.
Jobs can be executed from SFTP sources on the following running environments:
Trifacta Photon
HDFS-based Spark, which includes Cloudera and Hortonworks
Spark on EMR
Azure Databricks
Limitations
Ingest of over 500 files through SFTP at one time is not supported.
You cannot run jobs using Avro or Parquet sources uploaded via SFTP.
When you specify a parameterized output as part of your job execution, the specified output location may
include additional unnecessary information about the SFTP connection identifier. None of this information
is sensitive. This is a known issue.
You cannot publish TDE or Hyper format to SFTP destinations.
You cannot publish compressed Snappy files to SFTP destinations.
Pre-requisites
Acquire user credentials to access the SFTP server. You can use username/password credentials or SSH
keys. See below.
Verify that the credentials can access the proper locations on the server where your data is stored. Initial
directory of the user account must be accessible.
SSH Keys
If preferred, you can use SSH keys for authentication to the SFTP server.
NOTE: SSH keys must be private RSA keys. If you have OpenSSH keys, you can use the ssh-keygen
utility to convert them to private RSA keys.
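For example, the following ssh-keygen invocation rewrites an existing private key in place in PEM-encoded RSA format; the key path is a placeholder, you are prompted for the key's passphrase, and you should back up the key first.
# convert an OpenSSH-format private key to a PEM-encoded RSA key
cp ~/.ssh/sftp_key ~/.ssh/sftp_key.bak
ssh-keygen -p -m PEM -f ~/.ssh/sftp_key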
Configure file storage protocols and locations
NOTE: You must provide the protocol identifier and storage locations for the SFTP server. See below.
The Trifacta platform must be provided the list of protocols and locations for accessing SFTP.
Steps:
1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json.
For more information, see Platform Configuration Methods.
2. Locate the following parameters and set their values according to the table below:
"fileStorage.whitelist": ["sftp"],
"fileStorage.defaultBaseUris": ["sftp:///"],
Parameter Description
fileStorage.defaultBaseUris: For each supported protocol, this param must contain a top-level path to the location where Trifacta platform files
can be stored. These files include uploads, samples, and temporary storage used during job execution.
NOTE: A separate base URI is required for each supported protocol. You may only have one base URI
for each protocol.
NOTE: For SFTP, three slashes at the end are required, as the third one is the end of the path value.
This value is used as the base URI for all SFTP connections created in Trifacta Wrangler Enterprise.
Example:
sftp:///
The above example is the most common form, as it is used as the base URI for all SFTP connections that you
create. If you add a server value to the above URI, you limit all SFTP connections that you create to that specified
server (see the example after these steps).
3. Save your changes and restart the platform.
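For example, to restrict all SFTP connections to a single server, include that server in the base URI; the hostname below is a placeholder:
"fileStorage.defaultBaseUris": ["sftp://sftp.example.com/"],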
Enforce authentication methods
You can configure the Trifacta application to enforce usage of one of the following
authentication schemes. These schemes are passed to the SFTP server at connection time, which forces the
server to use the appropriate method of authentication. When the following parameter is specified, SFTP
connections can be configured using the listed methods and should work for connecting to the server.
NOTE: Enforcement applies to connections created via the APIs as well. After configuration, please be
sure to use one of the enforced authentication methods when configuring your SFTP connections
through the application or the APIs.
Steps:
1. To apply this configuration change, login as an administrator to the Trifacta node. Then, edit trifacta-
conf.json. Some of these settings may not be available through the Admin Settings Page. For more
information, see Platform Configuration Methods.
2. Locate the following parameter in the configuration file:
"batchserver.workers.filewriter.hadoopConfig.sftp.PreferredAuthentications"
Basic "password" Basic password authentication method is used to connect to the SFTP server.
Java VFS service
Use of SFTP connections requires the Java VFS service in the Trifacta platform.
For more information on configuring this service, see Configure Java VFS Service.
Create Connection
NOTE: Only an administrator can make a connection available for all users.
Steps:
Property: Description
Host: The hostname of the SFTP server to which you are connecting. Do not include any protocol identifier (sftp://).
Port: The port number to use to connect to the server. Default port number is 22.
Password: (Basic credential type) The password associated with the username.
SSH Key: (SSH Key credential type) The SSH key that applies to the username.
Test Connection: Click this button to test the connection that you have specified.
Default Directory: Absolute path on the SFTP server where users of the connection can begin browsing.
Block Size (Bytes): Fetch size in bytes for each read from the SFTP server.
NOTE: Raising this value may increase speed of read operations. However, if it is raised too high,
resources can become overwhelmed, and the read can fail.
Connection Name: The name of the connection as you want it to appear in the application.
Limitations
Enable Hyper format
Enable TDE format
Download and Install Tableau SDK
Enable TDE format
Configure Permissions
Create Tableau Server Connection
Create through application
Create through APIs
This section describes the basics of creating Tableau Server connections from within the application.
NOTE: You can export Tableau files as part of exporting results from the platform. For more information,
see Publishing Dialog.
Limitations
Enable Hyper format
Steps:
1. Login as an administrator.
2. You apply this change through the Workspace Settings Page. For more information, see
Platform Configuration Methods.
3. Locate the following setting:
4. Set it to Enabled.
5. No other configuration is required.
NOTE: The TDE format has been superseded by the Hyper format. Publication to TDE format will be
deprecated in a future release. Please switch to using Hyper format.
Download and Install Tableau SDK
To enable generation of TDE files and publication to Tableau Server, the Tableau Server SDK must be licensed,
downloaded, and installed in the Trifacta platform.
Steps:
NOTE: The above directory should be located outside of the install directory for the platform
software.
b. Retain the path to this directory. This directory is assumed to have the following name: <tableau-extract-dir>.
5. Platform configuration must be updated to point to this SDK. You can apply this change through the
Admin Settings Page (recommended) or trifacta-conf.json. For more information, see
Platform Configuration Methods.
6. Update the following property:
"batch-job-runner.env.LD_LIBRARY_PATH" = "<tableau-extract-dir>/lib64/tableausdk/"
7. Add the following Batch Job Runner entries to the current classpath (<current_classpath_values>). You
must replace <tableau-extract-dir> with the path where you extracted the Tableau Server SDK:
"batch-job-runner.classpath" ="<current_classpath_values>:<tableau-extract-dir>/lib64/tableausdk/Java
/tableaucommon.jar:<tableau-extract-dir>/lib64/tableausdk/Java/tableauserver.jar:<tableau-extract-dir>
/lib64/tableausdk/Java/tableauextract.jar"
Enable TDE format
To enable the generation of results into TDE format, please complete the following:
Steps:
1. Login as an administrator.
2. You apply this change through the Workspace Settings Page. For more information, see
Platform Configuration Methods.
3. Locate the following setting:
4. Set it to Enabled.
Configure Permissions
NOTE: These permissions must be applied for both TDE and Hyper format.
Create Tableau Server Connection
Any user can create a Tableau Server connection through the application.
NOTE: Only an administrator can make a Tableau Server connection available for all users.
Steps:
Property: Description
Server URL: The URL to the Tableau Server to which you are connecting. To specify an SSL connection, use https:// for the protocol identifier.
NOTE: By default, this connection assumes that the port number is 80. To use a different port, you must
specify it as part of the Server URL value: http://<Tableau_Server_URL>:<port_number>
Site: Enter the value that appears after /site/ in your target location. For example, if the target location is
https://tableau.example.com/#/site/MyNewTargetSite, enter MyNewTargetSite.
Test Connection: Click this button to test the connection that you have specified.
Limitations
Pre-requisites
Enable
Configure
Connect string options
Use
You can create connections to your Salesforce instance from Trifacta® Wrangler Enterprise. This connector is
designed as a wrapper around the Salesforce REST API.
Limitations
This is a read-only connection.
Single Sign-On (SSO) is not supported.
Custom domains are not supported.
Pre-requisites
The account used to login from Trifacta Wrangler Enterprise must access Salesforce through a security
token.
NOTE: Please contact your Salesforce administrator for the Server Name and the Security Token
values.
The logged-in user must have required access to the tables and schema. For more information, see
Using Salesforce.
If you haven't done so already, you must create and deploy an encryption key file for the Trifacta node to
be shared by all relational connections. For more information, see Create Encryption Key File.
Enable
For more information, see Enable Relational Connections.
Configure
To create this connection:
In the Connections page, select the Applications tab. Click the Salesforce card.
See Connections Page.
For details on values to use when creating via API, see Connection Types.
See API Reference.
Server Name: Enter the host name of your Salesforce implementation. Example value: exampleserver.salesforce.com
Connect String Options: Apply any connection string options that are part of your authentication to Salesforce. For more information, see below.
Security Token generated in account: Paste the security token associated with the account to use for this connection.
Test Connection: After you have defined the connection credential type, credentials, and connection string, you can validate those credentials.
Default Column Data Type Inference: Set to disabled to prevent the platform from applying its own type inference to each column on import. The default value is enabled.
By default, Salesforce does not include system columns generated by Salesforce in any response. To include
them, add the following value to the Connect String Options textbox:
ConfigOptions=(auditcolumns=all;mapsystemcolumnnames=0)
By default, Salesforce imposes a limit on the number of calls that can be made through the REST APIs by this
connector.
You can make the number of calls unlimited by appending the following to the Connect String Options textbox:
StmtCallLimit=0
Use
You can import datasets from Salesforce through the Import Data page. See Import Data Page.
See Salesforce Browser.
For more information on interacting with Salesforce, see Using Salesforce.
Limitations
Pre-requisites
Enable Alation Navigation Integration
Testing Alation browsing integration
Enable Open With Integration
Testing open with integration
If you have integrated the Trifacta® platform with Hive, you can integrate it with Alation to simplify finding
datasets within Hive for import. The Alation integration supports the following methods:
1. Read directly from Alation through an Alation Navigator integrated into the Import Data page.
2. Locate tables through Alation and then open them with the Trifacta platform.
Alation is a data catalog service for Hive. For more information, see www.alation.com.
Limitations
You can import only tables from Alation.
You cannot use queries or select columns for import into the platform.
Cluster security features such as secure impersonation and Kerberos are supported if both users in the
integration are authenticated and impersonated.
Pre-requisites
Alation version 4.10.0 or later
Your enterprise environment must have a deployed instance of Hive to which the Trifacta platform has
already been integrated. See Configure for Hive.
You must have credentials to access Alation. You can sign up through the Alation Catalog Navigator after
the integration is complete.
NOTE: Your Hive administrator and Alation administrator must ensure that your accounts have the
appropriate permissions to search for and access datasets within these separate deployments.
You must acquire the URL for the host of your Alation deployment.
Enable Alation Navigation Integration
NOTE: Although the integration to Alation appears as a connection in the application, the connection
cannot be created through the application. Please complete the following steps.
Property: Description
alation.sdkPath: This value identifies the path on the Alation server to where their integration SDK is stored. Do not modify this value.
alation.enabled: Set this value to true to enable the integration.
alation.catalogHost: Set this value to the URL of the web interface for the Alation deployment.
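Taken together, a minimal configuration looks like the following; the catalog URL is a placeholder for your own Alation deployment, and alation.sdkPath keeps its shipped default:
"alation.enabled": true,
"alation.catalogHost": "https://alation.example.com",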
Testing Alation browsing integration
If ad-hoc publishing to Hive has been enabled, you can export the generated results to Hive and then attempt to
re-import through Alation.
NOTE: There may be a delay before the Trifacta results appear in Alation. If necessary, you can
manually refresh the catalog from inside Alation.
Enable Open With Integration
NOTE: To support this integration, end users must disable popup blockers in the browser. For more
information, please see your browser's documentation.
NOTE: If Kerberos is enabled, you must be authenticated into the Trifacta platform and Alation at the
same time.
Steps:
b. HTTPS: Change the protocol identifier for both URLs to https and remove the platform port
number.
where:
Parameter Description
3. A successful execution of the above command logs the following JSON message:
{"id":1,"name":"Trifacta","endpoint":"http://<platform_host>:<platform_port_num>/import/data","
accept_object_types":["table"],"accept_data_source_types":["hive",
"hive2"]}
Testing open with integration
Steps:
1. Login to Alation.
2. Search for or navigate to a database table. Click the Open With... button. From the drop-down, select Trifacta.
3. The table appears as an imported dataset in the Imported Dataset page.
4. You can import the dataset into a new or existing flow.
Limitations
Pre-requisites
Enable Waterline Integration
Testing Waterline browsing integration
You can integrate the Trifacta® platform with the Waterline data catalog to simplify finding datasets within your
enterprise data lake. The Waterline integration supports the following methods:
1. Read directly from Waterline through a search box integrated into the Import Data page.
2. Locate assets through Waterline and open them with the Trifacta platform.
Waterline Data is a data catalog service for Hive. For more information, see www.waterlinedata.com.
Limitations
Pre-requisites
Waterline 4.0 and higher
Waterline must be integrated with your deployment of the Trifacta platform. For more information, please
contact your Waterline administrator.
You must have credentials to access Waterline.
NOTE: Your Waterline administrator must ensure that your account has the appropriate
permissions to search for and access datasets within Waterline and its integrated sources.
You must acquire the URL for the host of your Waterline deployment.
You must acquire the hostname and port for the Trifacta platform.
Enable Waterline Integration
NOTE: Although the integration appears as a connection in the application, the connection cannot be
created through the application. Please complete the following steps.
Property: Description
waterline.enabled: Set this value to true to enable the integration.
Enable
Data Service
Relational Features
Custom SQL Query
JDBC Ingestion
Configure Security
Enable SSO Connections
Type Inference
This section covers the following areas around general connectivity of the Trifacta® platform.
Additional configuration may be required for individual connection types. For more information, see
Connection Types.
Enable
The platform automatically enables connectivity to relational databases for reading in datasets and writing results
back out.
NOTE: Relational connectivity requires the use of an encryption key file, which must be created and
deployed before you create relational connections. For more information, see Create Encryption Key File
in the Install Guide.
Data Service
The platform streams records from relational sources through the data service. These records are applied to
transformation and sampling jobs on the Photon running environment, which is native to the Trifacta node.
Tip: In general, you should not have to modify settings for the data service. However, if you are
experiencing general performance issues or issues with specific connection types, you may experiment
with settings in the data service.
Relational Features
To enhance performance of your relational datasets, you can enable the use of custom SQL queries against your
relational datasources, which allows you to pre-filter your datasets before you ingest them into the platform. This
feature is enabled by default, but additional configuration can be applied. See Enable Custom SQL Query.
JDBC Ingestion
As needed, the platform can be configured to ingest data from your relational datasources to the base storage
layer for faster execution of Spark-based jobs. See Configure JDBC Ingestion.
Type Inference
By default, the platform applies type inferencing to all imported datasources. However, for schematized sources,
you may wish to disable type inferencing from the platform and instead rely on the types provided by the source.
Tip: You can also toggle the use of type inferencing for individual connections or for individual imported
datasets.
Enable Relational Connections
The Trifacta® platform can be configured to access data stored in relational database sources over JDBC
protocol. When this connection method is used, individual database tables and views can be imported as
datasets.
The Trifacta platform can natively connect to these relational database platforms. Natively supported versions are
the following:
Oracle 12.1.0.2
SQL Server 12.0.4
PostgreSQL 9.3.10
Teradata 14.10+
NOTE: To enable Teradata connections, you must download and install Teradata drivers first. For
more information, see Enable Teradata Connections.
Additional relational connections can be enabled and configured for the platform. For more information, see
Connection Types.
Ports
For any relational source to which you are connecting, the Trifacta node must be able to access it through the
specified host and port value.
Please contact your database administrator for the host and port information.
Enable
By default, relational connections are read/write, which means that users can create connections that enable
writing back to source databases.
When this feature is enabled, writeback is enabled for all natively supported relational connection types.
See Connection Types.
Depending on the connection type, the Trifacta platform writes its data to different field types in the target
database. For more information, see Type Conversions.
Steps:
1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json.
For more information, see Platform Configuration Methods.
2. To disable writeback, locate the following parameter and set it to false:
"webapp.connectivity.relationalWriteback.enabled": true,
Limitations
NOTE: Unless otherwise noted, authentication to a relational connection requires basic authentication
(username/password) credentials.
You cannot swap relational sources if they are from databases provided by different vendors. See
Flow View Page.
There are some differences in behavior between reading tables and views. See Using Databases.
When the relational publishing feature is enabled, it is automatically enabled for all platform-
native connection types. You cannot disable relational publishing for Oracle, SQL Server,
PostgreSQL, or Teradata connection types. Before you enable, please verify that all user
accounts accessing databases of these types have appropriate permissions.
NOTE: Writing back to the database utilizes the same user credentials and therefore permissions as
reading from it. Please verify that the users who are creating read/write relational connections have
appropriate access.
You cannot ad-hoc publish to a relational target. Relational publishing is only supported through the Run
Job page.
You can write to multiple relational outputs from the same job only if they are from the same vendor.
For example, if you have two SQL Server connections A and B, you can write one set of results to A
and another set of results to B for the same job.
If A and B are from different database vendors, you cannot write to them from the same job.
Execution at scale
Jobs for large-scale relational sources can be executed on the Spark running environment. After the data source
has been imported and wrangled, no additional configuration is required to execute at scale.
When the job is completed, any temporary files are automatically removed from HDFS.
Passwords in transit: The platform uses a proprietary encryption key that is invoked each time a
relational password is shared among platform services.
Passwords at rest: For creating connections to your relational sources, you must create and reference
your own encryption key file. This encryption key is used when accessing your relational connections from the web
application. For more information, see Create Encryption Key File.
Limitations
Enable
Use Custom SQL Queries
To improve performance of your Hive or relational connections, custom SQL queries can be enabled to push the
initial filtration of table rows and columns back to the database, which is more efficient at performing this task.
Instead of loading the entire table into the Trifacta® application and then performing the filtration through the
Transformer page, you can insert basic SQL commands as part of your relational queries to collect only the rows
and columns of interest from the source.
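For example, a custom SQL query of the following form collects only the columns and rows of interest before the data reaches the platform. The table and column names below are hypothetical and shown only as a sketch:
SELECT order_id, customer_id, order_total FROM orders WHERE order_date >= '2020-01-01'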
Limitations
See Create Dataset with SQL.
Enable
Steps:
1. You can apply this change through the Workspace Settings Page. For more information, see
Platform Configuration Methods.
2. Locate the following setting:
Setting Description
enabled Set to true to enable the SQL pushdown feature. By default, this feature is enabled.
3. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json.
For more information, see Platform Configuration Methods.
"webapp.connectivity.customSQLQuery.enableMultiStatement": false,
Setting Description
enableMultiStatement When set to true, you can insert multi-line statements in your SQL pushdown queries. The default is false.
NOTE: Use of multi-line SQL has limitations. See Create Dataset with SQL.
After a dataset has been imported using custom SQL, you can edit the SQL as needed. See Dataset Details Page.
Overview
Recommended Table Size
Performance
Limitations
Enable
Configure
Configure Ingestion
Configure Storage
Monitoring Progress
Logging
This section describes some of the configuration options for the JDBC (relational) ingestion and caching features,
which enable execution of large-scale JDBC-based jobs on the Spark running environment.
Overview
Data ingestion works by streaming a JDBC source into a temporary storage space in the base storage layer.
The job can then be run on Spark, the default running environment. When the job is complete, the temporary data
is removed from base storage or retained in the cache (if it is enabled).
Data ingestion happens only for Spark jobs. Data ingestion does not apply to Trifacta Photon jobs.
Data ingestion applies only to JDBC sources that are not native to the running environment. For example,
JDBC ingestion is not supported for Hive.
Supported for HDFS and other large-scale backend datastores.
When data is read from the source, the Trifacta® platform can populate a user-specific ingest cache, which is
maintained to limit long load times from JDBC sources and to improve overall platform performance.
The cache is allowed to remain for a predefined expiration limit, after which any new requests for the data
are pulled from the source.
If the cache fails for some reason, the platform falls back to ingest-only mode, and the related job should
complete as expected.
Job Type: transformation job
JDBC Ingestion Enabled only: Data is retrieved from the source and stored in a temporary backend location for use in sampling.
JDBC Ingestion and Caching Enabled: Data is retrieved from the source for the job and refreshes the cache where applicable. As needed, you can force an override of the cache when executing the sample. Data is collected from the source. See Samples Panel.
Recommended Table Size
Although there is no absolute limit, you should avoid executing jobs on tables larger than several hundred gigabytes. Larger data
sources can significantly impact end-to-end performance.
Performance
Rule of thumb:
For a single job with 16 ingest jobs occurring in parallel, maximum expected transfer rate is 1 GB/minute.
Scalability:
The above rate holds until the network becomes a bottleneck. Internally, this rate maxed out at about 15
concurrent sources.
The default is 16 concurrent jobs, with a connection pool size of 10 and a two-minute timeout on the pool. These defaults prevent
overloading of your database.
Adding more concurrent jobs after the network has become a bottleneck slows down all transfer jobs
simultaneously.
If processing is fully saturated (the number of workers is maxed out):
The maximum transfer rate can drop to 1/3 GB/minute.
Ingest waits for two minutes to acquire a connection. If a connection cannot be acquired after two minutes,
the job fails.
When a job is queued for processing:
The job is silently queued and appears to be in progress.
The service waits until other jobs complete.
Currently, there is no timeout for queueing based on the maximum number of concurrent ingest jobs.
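For example, under these rules of thumb, ingesting a 60 GB source would take roughly one hour at 1 GB/minute, or up to about three hours if ingest workers are fully saturated (1/3 GB/minute). These figures are estimates only; actual throughput depends on your network and database.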
Limitations
Enable
To enable JDBC ingestion and performance caching, both of the following parameters must be enabled.
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For
more information, see Platform Configuration Methods.
Parameter Description
feature.enableLongLoading When enabled, you can monitor the ingestion of long-loading JDBC datasets through the Import Data page. Default is true.
Tip: After a long-loading dataset has been ingested, importing the data and loading it in the Transformer page should perform faster.
longloading.addToFlow When long-loading is enabled, set this value to false to enable monitoring of the ingest process when large relational sources are added to a flow. Default is false.
longloading.addToLibrary When long-loading is enabled, set this value to false to enable monitoring of the ingest process when large relational sources are added to the library. Default is false.
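As a sketch, the long-loading parameters above might appear in trifacta-conf.json as follows, shown here at their default values. Verify the exact parameter names in your installation before editing:
"feature.enableLongLoading": true,
"longloading.addToFlow": false,
"longloading.addToLibrary": false,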
Configure
In the following sections, you can review the available configuration parameters for JDBC ingest and JDBC
performance caching.
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For
more information, see Platform Configuration Methods.
Configure Ingestion
Parameter Description
batchserver.workers.ingest.max Maximum number of ingester threads that can run on the Trifacta platform at the same time.
batchserver.workers.ingest.bufferSizeBytes Memory buffer size while copying to backend storage. A larger size for the buffer yields fewer network calls, which in rare cases may speed up ingest.
NOTE: This setting rarely applies if JDBC ingest caching has been enabled.
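For illustration, the following trifacta-conf.json entries set a hypothetical thread count and a 4 MB copy buffer. These values are examples only and should be tuned for your environment:
"batchserver.workers.ingest.max": 16,
"batchserver.workers.ingest.bufferSizeBytes": 4194304,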
Configure Storage
When files are ingested, they are stored in one of the following locations:
If caching is enabled:
If the global datasource cache is enabled: files are stored in a user-specific sub-folder of the path
indicated by the following parameter: hdfs.pathsConfig.globalDatasourceCache
If the global cache is disabled: files are stored in a sub-folder of the output area for each user,
named: /.datasourceCache.
If caching is disabled: files are stored in a sub-folder within the jobs area for the job group. Ingested files
are stored as .trifacta files.
NOTE: Whenever a job is run, its source files must be re-ingested. If two or more datasets in the same
job run share the same source, only one copy of the source is ingested.
Parameter Description
datasourceCaching.useGlobalDatasourceCache When set to true, the platform uses the global data source cache location for storing cached ingest data.
NOTE: When global caching is enabled, data is still stored in individual locations per user. Through the application, users cannot access the cached objects stored for other users.
When set to false, the platform uses the output directory for each user for storing cached ingest data. Within the output directory, cached data is stored in the .datasourceCache directory.
NOTE: You should verify that there is sufficient storage in each user's output directory to store the maximum cache size as well as any projected uploaded datasets.
hdfs.pathsConfig.globalDataSourceCache Specifies the path of the global datasource cache, if it is enabled. Specify the path from the root folder of HDFS or other backend datastore.
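A minimal example, assuming a hypothetical HDFS path for the global cache:
"datasourceCaching.useGlobalDatasourceCache": true,
"hdfs.pathsConfig.globalDataSourceCache": "/trifacta/datasourceCache",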
Cache sizing
Parameter Description
datasourceCaching.refreshThreshold The number of hours that an object can be cached. If the object has not been refreshed in that period of time, the next request for the datasource collects fresh data from the source.
Logging
Parameter Description
data-service.systemProperties.logging.level When the logging level is set to debug, log messages on JDBC caching are recorded in the data service log.
NOTE: Use this setting for debug purposes only, as the log files can grow quite large. Lower the setting after the issue has been debugged.
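For example, to capture caching messages while debugging an issue, you might set the logging level to debug and lower it again afterward. The exact representation of this property in trifacta-conf.json may vary by release:
"data-service.systemProperties.logging.level": "debug",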
Monitoring Progress
You can use the following methods to track progress of ingestion jobs.
Through the application: In the Jobs page, you can track the progress of all jobs, including ingestion. Where
there are errors, you can download logs for further review.
See Jobs Page.
See Logging below.
Through APIs:
You can track the status of jobType=ingest jobs through the API endpoints.
See https://api.trifacta.com/ee/es.t/index.html#operation/getJobGroup
From the getJobGroup endpoint, get the ingest jobId to track progress.
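As a sketch, you could poll the job group from the command line. The host, access token, and identifier below are placeholders, and the exact endpoint path may vary by release; see the API reference above:
curl -s -H "Authorization: Bearer <token>" "https://<trifacta-host>/v4/jobGroups/<jobGroupId>"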
Logging
Ingest:
During and after an ingest job, you can download the job logs through the Jobs page. Logs include:
Caching:
When the logging level is set to debug for the data service and caching is enabled, cache messages are logged.
These messages include:
User Security
Connection Security Levels
Credential Sharing
Technical Security
Encryption Key File
SSL
Configure long load timeout limits
Enable SSO authentication
You can apply the following Trifacta® platform features to relational connections to ensure compliance with
enterprise practices.
User Security
Connection Security Levels
Private Private connections are created by individuals and are by default accessible only to the individual who created them.
Private and shared Optionally, a private connection can be shared by its creator with other users.
Global Global connections are either created by administrators or are private connections promoted to global by administrators.
Credential Sharing
By default, users are permitted to share credentials through the application. Credentials can be shared in the
following ways:
A user can create a private connection to a relational database. Through the application, this private
connection can be shared with other users, so that they can access the creator's datasets.
When sharing a flow with another user, the owner of the flow can choose to share the credentials that are
necessary to connect to the datasets that are the sources of the flow.
NOTE: If enterprise policy is to disable the sharing of credentials, collaborators may need to be permitted
to store their source data in shared locations.
Tip: Credential sharing can be disabled by individual users when they share a connection. The
connection is shared, but the new user must provide new credentials to use the connection.
Steps:
1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json.
For more information, see Platform Configuration Methods.
2. To disable credential sharing, locate the following parameter and set it to false:
"webapp.enableCredentialSharing": true,
Technical Security
The following features enhance the security of individual and global relational connections.
Passwords in transit: The platform uses a proprietary encryption key that is invoked each time a
relational password is shared among platform services.
Passwords at rest: For creating connections to your relational sources, you must create and reference
your own encryption key file. This encryption key is used when accessing your relational connections from the web
application.
This encryption key file must be created and installed on the Trifacta node. For more information, see
Create Encryption Key File.
SSL
You can enable SSL by adding the following string to the Connect String Opts field:
?ssl=true;
Tip: Some connection windows include a Use SSL checkbox, which has the same effect.
Configure long load timeout limits
For long-loading relational sources, a timeout is applied to limit the permitted load time. As needed, you can
modify this limit to account for longer load times.
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For
more information, see Platform Configuration Methods.
"webapp.connectivity.longLoadTimeoutMillis": 120000,
Property Description
webapp.connectivity.longLoadTimeoutMillis The maximum amount of time, in milliseconds, permitted for loading a long-loading relational source. The default is 120000 (two minutes).
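For example, to allow up to ten minutes for long-loading sources, you might raise the value as follows:
"webapp.connectivity.longLoadTimeoutMillis": 600000,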
Enable SSO authentication
Relational connections can be configured to leverage your enterprise Single Sign-On (SSO) infrastructure for
authentication. Additional configuration is required. For more information, see
Enable SSO for Relational Connections.
Limitations
Pre-requisites
Configure
Configure JAAS file and path
JAAS file
Specify Kerberos configuration file
Configure vendor definition file
Example Setup
Use
Sharing
This section describes how to enable relational connections to leverage your Hadoop Single Sign-On (SSO)
infrastructure. When this feature is enabled and properly configured, users can create relational (JDBC)
connections that use SSO that you have already configured.
Connections that were created before this feature is enabled continue to operate as expected without
modification.
Limitations
For this release, this feature applies to SQL Server connections only.
Cross-realm is not supported. As a result, the SQL Server instance, service principal, and Trifacta®
principal must be in the same Kerberos realm.
Pre-requisites
Kerberos SSO: You must set up SSO authentication to the Hadoop cluster using Kerberos. This feature
uses the global Kerberos keytab. For more information, see Configure for Kerberos Integration.
Configure
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For
more information, see Platform Configuration Methods.
Parameter Description
webapp.connectivity.kerberosDelegateConfigPath Path on the Trifacta node to the location of the JAAS configuration file required by the DataDirect driver.
NOTE: The default location is listed below. You may wish to move this file to a location outside of the Trifacta installation to ensure that the file is not overwritten during upgrades.
For connections that support Kerberos-delegated authentication, the underlying driver supports a JAAS file in
which you can provide environment-specific configuration to the driver. As needed, you can modify this file.
Below is an example file, where you must apply the Kerberos global keytab and principal values that are to be
used to authenticate to use the Kerberos-delegated connections of this type:
trifacta_jaas_config {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
storeKey=true
doNotPrompt=true
keyTab="</absolute/path/to/trifacta_jdbc_sso.keytab>"
principal="<principal_name>";
};
JDBC_DRIVER_01 {
com.sun.security.auth.module.Krb5LoginModule required debug=false
useTicketCache=true;
};
where:
keyTab = the absolute path on the Trifacta node where the Kerberos global keytab is located.
principal = the service principal name of the user's service account in LDAP.
Specify Kerberos configuration file
The Kerberos configuration file is specified in the following location:
<root>/etc/krb5.conf
If this file doesn't exist, create it with the following content, some of which you must specify:
[libdefaults]
default_realm = <my_default_realm>
forwardable = true # Important that this is set!
[realms]
<my_default_realm> = {
kdc = <kdc_domain>
}
[domain_realm]
<my_domain> = <my_default_realm>
Setting Description
kdc For each realm that you create, you must create an entry in [realms]. For the kdc entry, apply the KDC domain that the JDBC connection should use.
my_domain For each domain to which the Kerberos delegation applies, you must create an entry in [domain_realm]. For example: example.com = EXAMPLE.COM
If you need to move the location of the file from the default one, please complete the following:
Steps:
1. If you haven't already done so, copy the file from its current location to its preferred location.
2. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json.
For more information, see Platform Configuration Methods.
3. Specify the path to the new location in the following parameter:
"webapp.connectivity.krb5Path": "/etc/krb5.conf";
Configure vendor definition file
For each vendor that supports SSO connections, you must modify a setting in a configuration file on the Trifacta
node. This change can only be applied for vendors that support Kerberized SSO connections.
Steps:
1. On the Trifacta node, navigate to the following directory:
/opt/trifacta/services/data-service/build/conf/vendor
2. Within the vendor directory, each JDBC vendor has its own sub-directory. Open the sub-directory for the vendor to modify.
3. Edit connection-metadata.json.
4. Locate the credentialType property. Set the value to kerberosDelegate.
5. Save your changes and restart the platform.
6. When you create your connection, select kerberosDelegate from the Credential Type drop-down.
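After step 4, the credentialType property in connection-metadata.json should contain the following value. The rest of the file varies by vendor and is not shown here:
"credentialType": "kerberosDelegate",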
Example Setup
The following example uses the default Kerberos realm to set up an SSO connection to a SQL Server instance. This
example is intended to demonstrate one way in which you can set up your SSO connections.
Steps:
NOTE: If you are using LDAP/AD SSO, you can register all of the above SPNs using AD mechanisms.
You do not have to use the delegation flags. Delegation can be managed through the UI for the service
account.
Use
When you create a new connection of a supported type, you can select the Kerberos Delegate credentials type.
When selected, no username or credentials are applied as part of the connection object. Instead, authentication
is performed via Kerberos with the cluster.
Sharing
When sharing SSO connections, the credentials for the connection cannot be shared for security reasons. The
Kerberos principal for the user with whom the connection is shared is applied. That user must have the
appropriate permissions to access any required data through the connection. See Overview of Sharing.
By default, the Trifacta® platform applies its own type inference to datasets when they are imported and again
when new steps are applied to the data. This section provides information on how you can configure where type
inference is applied in the platform.
Tip: You can use the Change Column Type transformation to override the data type inferred for a
column. However, if a new transformation step is added, the column data type is re-inferred, which may
override your specific typing. You should consider applying Change Column Type transformations as late
as possible in your recipes.
For more information on how the Trifacta platform applies data types to specific sources of data on import, see
Type Conversions.
NOTE: You cannot disable type inference for Oracle sources. This is a known issue.
These settings apply to schematized sources, including:
Hive
Redshift
Enabled "webapp.connectivity. All imported datasets from schematized sources are automatically inferred by the type
disableRelationalTypeInfe system in the Trifacta platform.
rence": false,
The inferred data types may be different from those in the source. When the dataset is
loaded, data types can be applied to individual columns through the application.
Individual connections
Individual datasets at time of import
Disabled "webapp.connectivity. For schematized data sources, type inference is not automatically inferred by Trifacta
disableRelationalTypeInfe platform.
rence": true,
Data type information is taken from the source schema and applied where applicable to
the dataset. If there is no corresponding data type in the Trifacta platform, the data is
imported as String type.
Individual connections
Individual datasets at time of import
Please perform the following configuration change to disable type inference of schematized sources at the global
level.
Steps:
1. You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json.
For more information, see Platform Configuration Methods.
2. Change the following configuration setting to true:
"webapp.connectivity.disableRelationalTypeInference": false,
Use
In the application, type inference can be applied to your imported data through the following mechanisms.
You can configure individual connections to apply or not apply Trifacta type inference when the connection is
created or edited.
NOTE: When Default Column Data Type Inference is disabled for an individual connection, Trifacta type
inference can still be applied on import of individual datasets.
When type inference has been disabled globally for schematized sources, you can choose to enable or disable it
for individual source import.
Tip: To compare how data types are imported from the schematized source or when applied by the Trifacta platform, you can import the same schematized source twice. The first instance of the source can be
imported with type inference enabled, and the second can be imported with it disabled.
In the Import Data page, click Edit Settings on the data source card.
Tip: You can override the Trifacta data type by applying a Change Column Type transformation.
When a new transformation step is applied, each column is re-inferred for its Trifacta data type.
If the publishing destination is a schematized environment, the generated results are written to the target
environment based on the environment type. These data type mappings cannot be modified.
Solution
The encryption keyfile is missing from the Trifacta® deployment, or the keyfile has been moved without updating
the platform with the new location.
You must create and deploy this keyfile, which is required for ensuring that encrypted usernames and passwords
are used in relational connections.
NOTE: This keyfile must be created and deployed before any relational connections are created.
Deployment requires access to the file system on the Trifacta node.
After you have deployed the keyfile, you must configure the platform to point to its location. A platform restart is
not required.
Problem - Retrieving sample data for large relational tables is very slow
In some cases, you may experience slow performance in reading from database tables, or previews of large
imported datasets are timing out.
Solution
In these cases, you can experiment with the number of records that are imported per database read. By default,
this value is 25000.
You can apply this change through the Admin Settings Page (recommended) or trifacta-conf.json. For
more information, see Platform Configuration Methods.
"data-service.sqlOptions.limitedReadStreamRecords": 25000,
Related articles
Backup and Recovery
Install Databases on Amazon RDS
Using Salesforce
Share Connection Window
SQL Server Data Type Conversions