PDI Vertica Integration Best Practices
Copyright Notice
Copyright 2006 - 2015 Hewlett-Packard Development Company, L.P.
Trademark Notices
Adobe is a trademark of Adobe Systems Incorporated.
Microsoft and Windows are U.S. registered trademarks of Microsoft Corporation. UNIX is a registered trademark of The Open
Group.
All Pentaho documentation referenced in this document is the property of Pentaho Corporation. Copyright 2005 - 2015 Pentaho
Corporation. All rights reserved.
Overview
This document provides guidance for configuring Pentaho Data Integration (PDI, also known as Kettle) to connect to HP Vertica.
This document covers only PDI. However, connectivity options for other Pentaho products should be similar to the options this
document provides.
The content in this document has been tested for PDI 5.1.0 and HP Vertica 7.0. Most of the information will also apply to earlier
versions of both products.
PDI provides two ways to connect to a database via JDBC. Both drivers ship with the PDI software:
HP Vertica-specific JDBC driver
Hewlett Packard recommends that you use the HP Vertica-specific JDBC connector for your ETL jobs. When creating a
new connection to HP Vertica, make sure you select the connector that matches your database.
Generic JDBC driver
The database JDBC files are located in the following folders of your PDI installation:
For client installations (Spoon, Pan, Kitchen, Carte): data-integration/lib
For server installations: data-integration-server/tomcat/lib
If the client is installed on the same machine as the server, you must copy the HP Vertica JDBC jar file to both of these folders.
Only the first two digits in the version number matter for client/server compatibility. For example, a 7.0.x client driver can talk to
any 7.0.x server. For more information about client driver/server compatibility, see the Connecting to Vertica guide.
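To confirm the server version before matching a client driver to it, you can run a quick check from vsql or any SQL client (no assumptions here beyond a working connection):

SELECT version();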
Load type           COPY statement   Target          Behavior
Large bulk load     COPY DIRECT      Writes to ROS   Each commit becomes a new ROS container
Incremental load    COPY TRICKLE     Writes to WOS   Errors when the WOS overflows
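As a minimal sketch of the two load methods (the table name and file paths are placeholders), the DIRECT and TRICKLE hints on COPY select the target storage:

-- Large bulk load: write directly to ROS
COPY public.sales FROM '/data/sales_full.txt' DELIMITER '|' DIRECT;

-- Incremental load: write to WOS; the Tuple Mover later moves the data to ROS
COPY public.sales FROM '/data/sales_delta.txt' DELIMITER '|' TRICKLE;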
By default, COPY uses the DELIMITER parser to load raw data into the database. Raw input data must be in UTF-8, delimited text
format. Data is compressed and encoded for efficient storage. If your raw data does not consist primarily of delimited text, specify
the parser that most closely matches the load data.
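For illustration only (the table, file paths, and option values are placeholders), a delimited load that states the parser options explicitly and captures problem rows might look like this:

COPY public.customer_dim FROM LOCAL '/tmp/customer_dim.csv'
DELIMITER ',' NULL ''
REJECTED DATA '/tmp/customer_dim.rejects'
EXCEPTIONS '/tmp/customer_dim.exceptions';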
Loading Data into HP Vertica with the Standard Table Output Component
The following figure shows the standard table output component:
Note: If you are testing your queries for performance, you may not see the full query speed right away. Performance can lag
because HP Vertica may spend several hours reorganizing newly loaded data. The amount of time that data reorganization
requires depends on how much data was loaded and the type of load.
Note: For more information about PDI transformations and steps, see Transformations, Steps, and Hops in the PDI
documentation.
Consider the following root causes for transformation flow problems:
Lookups
Scripting steps
Memory hogs
Lazy conversion
Blocking step
Commit size, Rowset size
Consider the following root causes for job workflow problems:
Operating system constraints: memory, network, CPU
Parallelism
Execution environment
For more information about optimizing data loads with PDI:
There is a fantastic book on optimizing PDI called Pentaho Kettle Solutions. This book describes advanced operations
like clustering PDI servers.
Pentaho blog: http://blog.pentaho.com/author/mattcasters/
Pentaho Community Wiki: http://wiki.pentaho.com/display/COM/Community+Wiki+Home
Write Optimized Store (WOS): A memory-resident data structure for storing INSERT, UPDATE, DELETE, and COPY (without
/*+DIRECT*/ hints) actions. To support very fast data load speeds, the WOS stores records without data compression or indexing.
The WOS organizes data by epoch and holds both committed and uncommitted transaction data.

Read Optimized Store (ROS): A highly optimized, read-oriented, disk storage structure. The ROS makes heavy use of compression
and indexing. You can use the COPY...DIRECT and INSERT (with /*+DIRECT*/ hints) statements to load data directly into the ROS.

Tuple Mover (TM): The database optimizer component that moves data from memory (WOS) to disk (ROS). The Tuple Mover runs
in the background, performing some tasks automatically at time intervals determined by its configuration parameters.
For more information on HP Vertica ROS and WOS, see Loading Data into the Database, in HP Vertica Best Practices for OEM
Customers.
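For example (the schema and table names are illustrative), an INSERT...SELECT statement can bypass the WOS in the same way by using the DIRECT hint:

INSERT /*+DIRECT*/ INTO public.sales_history
SELECT * FROM public.sales_staging;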
If only a few rows contain |, you can eliminate them from the load file using a WHERE clause. Then, load the rows
separately using a different delimiter.
Field mapped to a NUMERIC column with data that exceeds the scale: either the record is rejected and sent to the rejected records
log file, or the record with the value that exceeds the scale is silently rejected.
Note: The actual load rates that you obtain can be higher or lower. These rates depend on the properties of the data, the number
of columns, the number of projections, and hardware and network speeds. You can improve load speeds further by using multiple
parallel load streams.
Internationalization
HP Vertica supports the following internationalization features, described in the sections that follow:
Unicode character encoding
Locales
For more information on configuring internationalization for your database, see Internationalization Parameters in the product
documentation.
Unicode Character Encoding: UTF-8 (8-bit UCS/Unicode Transformation Format)
All input data received by the database server must be in UTF-8 format. All data output by HP Vertica must also be in UTF-8
format. The ODBC API operates on data in:
UCS-2 on Windows systems
UTF-8 on Linux systems
A UTF-16 ODBC driver is available for use with the DataDirect ODBC manager.
JDBC and ADO.NET APIs operate on data in UTF-16. The HP Vertica client drivers automatically convert data to and from UTF-8
when sending data to and receiving data from HP Vertica using API calls. The drivers do not transform data that you load by
executing a COPY or COPY LOCAL statement.
When the locale is non-binary, use the collation function to transform the input to a binary string that sorts in the proper
order. This transformation increases the number of bytes required for the input according to this formula
(CollationExpansion defaults to 5):
result_column_width = input_octet_width * CollationExpansion + 4
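For example, with the default CollationExpansion of 5, a 20-octet input value requires a result column of 20 * 5 + 4 = 104 bytes.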
The CSV file input step must be configured to run in parallel. If you have not configured this capability, each copy of the step reads
the full file, which creates duplicates. To configure this capability, double-click the CSV file input step and make sure that Running
in parallel? is selected, as shown in the following figure.
Tests indicate that when using the HP Vertica Bulk Loader component, less memory is used on the PDI machine. On average,
loads are almost twice as fast, depending on the resources of your PDI machine and source table size.
The next example shows how to use the HP Vertica Bulk Loader when the source is a table. For this transformation, PDI again only
parallelizes the write operations.
This example generates the following parallel queries to the source database. The WHERE clause defines the data chunking:
SELECT * FROM pentaho_pdi_s.h_customer WHERE mod(c_custkey,6) = 0;
SELECT * FROM pentaho_pdi_s.h_customer WHERE mod(c_custkey,6) = 1;
Note: c_custkey must be an integer and should be evenly distributed. If more keys have a mod value of 3 than the other values,
the fourth copy will process more rows and will run slower than the other copies.
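To check whether the chunks are balanced before running the transformation, you can inspect the key distribution in the source database (same schema and table as the example above):

SELECT MOD(c_custkey, 6) AS chunk, COUNT(*) AS row_count
FROM pentaho_pdi_s.h_customer
GROUP BY 1
ORDER BY 1;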
Known Issues
You may encounter the following issues when connecting to HP Vertica using PDI.
Upgrade Issues
As of PDI 5.1.0, the HP Vertica plugin is not in the Pentaho Marketplace for the PDI Community Edition. When performing an
upgrade, you may encounter the following message:
Download the HP Vertica Bulk Loader from the following location:
http://ci.pentaho.com/job/Kettle-VerticaBulkLoader/
Then unzip it into Kettle's plugins folder:
C:\<pdi-install-folder>\pentaho\plugins
The HP Vertica Bulk Loader appears as a new step in the Bulk Loaders category after restarting Spoon.
The current method for calculating the memory allocation for the HP Vertica Bulk Loader is
[Max Column Size in Bytes] * 1000 rows for each column
In the meantime, Hewlett Packard recommends that you increase the Java memory allocation for Spoon to see whether the
additional memory allows the HP Vertica Bulk Loader to complete successfully. For example, for an HP Vertica table with 20
VARCHAR columns, you should allocate 1,240 MB of memory (65,000 bytes * 20 columns * 1000 rows).
Note: This problem has been fixed in PDI 5.1.
Tracing
To enable tracing, set the following parameters:
LogLevel = Trace
LogPath = Path to the log file location
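LogLevel and LogPath are HP Vertica JDBC connection properties. One way to set them in PDI is to add them as parameters on the database connection's Options tab, or to append them to the JDBC URL; the host, port, database name, and log path below are placeholders:

jdbc:vertica://VerticaHost:5433/VMart?LogLevel=Trace&LogPath=/tmp/vertica-jdbc-trace.log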